Analyzing Cyberbullying Tweets with LSTM Networks

This project builds a tool that detects cyberbullying tweets and classifies them by type: gender, religion, age, ethnicity, or other cyberbullying. The primary objectives are:

- Utilizing the Cyberbullying Classification Dataset sourced from Kaggle.
- Conducting data cleaning procedures to enhance data quality.
- Applying data preprocessing techniques to prepare the cleaned data for analysis.
- Constructing a Recurrent Neural Network (RNN) model using Long Short-Term Memory (LSTM) layers and evaluating its performance on a separate test dataset.
- Implementing a client-facing API using Flask for seamless integration and usability.

Technologies and Resources

Python Version: 3.10
Libraries: numpy, pandas, matplotlib, seaborn, nltk, tensorflow, scikit-learn, flask, json
Flask API Setup:
- pip install -r requirements.txt
- conda env create -n <ENVNAME> -f environment.yaml (Anaconda environment)
Dataset: https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification

Data Acquisition

The project relies on the Cyberbullying Classification Dataset from Kaggle, which contains over 47,000 labeled tweets. Each tweet belongs to one of six classes:

- Not Cyberbullying
- Gender
- Religion
- Other types of cyberbullying
- Age
- Ethnicity

Data Cleaning

A custom Python script cleans the raw tweets. It performs the following steps:

- Removal of punctuation marks
- Elimination of numerical characters
- Conversion of text to lowercase
- Elimination of stop words
- Lemmatization/Stemming of words
- Removal of URLs
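
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function name `clean_tweet`, the URL pattern, and the tiny hard-coded stop-word set are assumptions made for the example (the project lists nltk, whose English stop-word corpus and lemmatizer would normally be used), and lemmatization/stemming is left out to keep the sketch dependency-free.

```python
import re
import string

# Assumption: a tiny illustrative stop-word set standing in for nltk's corpus.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "to", "of", "in", "you", "i"}

# Assumption: a simple pattern that covers http(s):// and www. links.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text: str) -> str:
    """Apply the listed cleaning steps (except lemmatization) to one tweet."""
    text = URL_PATTERN.sub(" ", text)   # remove URLs while they are still intact
    text = text.lower()                 # convert to lowercase
    text = re.sub(r"\d+", " ", text)    # drop numerical characters
    text = text.translate(PUNCT_TABLE)  # remove punctuation marks
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(clean_tweet("Check THIS out: https://t.co/abc123 you are 2 LOUD!!!"))
```

URLs are stripped first in this sketch because removing punctuation earlier would break a link into fragments that the pattern can no longer match.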

Data Preprocessing

To prepare the cleaned tweets for analysis, the TextVectorization layer from Keras is applied. This layer maps each word (or token) in the input string to an integer index in a vocabulary learned from the training text, so each tweet becomes a list of integers rather than a one-hot vector. Sequences are then padded to a uniform length.
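
To make the encoding concrete, here is a pure-Python stand-in for what integer encoding with padding does. It is not the Keras API: the helper names `build_vocab` and `encode` and the parameter values are hypothetical. It follows the convention of reserving index 0 for padding and index 1 for out-of-vocabulary tokens, which is how TextVectorization behaves in its integer output mode.

```python
from collections import Counter

PAD_ID = 0  # reserved for padding positions
OOV_ID = 1  # reserved for tokens that are not in the vocabulary


def build_vocab(texts, max_tokens=10000):
    """Map the most frequent tokens to integer ids, starting at 2."""
    counts = Counter(tok for text in texts for tok in text.split())
    most_common = counts.most_common(max_tokens - 2)
    return {tok: idx for idx, (tok, _) in enumerate(most_common, start=2)}


def encode(text, vocab, seq_len):
    """Turn one cleaned tweet into a fixed-length list of integer ids."""
    ids = [vocab.get(tok, OOV_ID) for tok in text.split()]
    ids = ids[:seq_len]                           # truncate long tweets
    return ids + [PAD_ID] * (seq_len - len(ids))  # right-pad short tweets


corpus = ["you are so dumb", "you are great"]
vocab = build_vocab(corpus)
print(encode("you are awesome", vocab, seq_len=5))
```

In the real layer, the vocabulary size and padded length would be set through the `max_tokens` and `output_sequence_length` arguments of `TextVectorization`; the values the project uses are not stated here.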

Model Building

- Train-test split: the data is divided into 80% training and 20% testing sets.
- Bidirectional LSTM model: an RNN architecture is built from Bidirectional LSTM layers.
- Compilation: the model is compiled with the "categorical_crossentropy" loss and the "RMSprop" optimizer.
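
The split and the loss can be illustrated without any deep-learning dependency. The sketch below is only an illustration under stated assumptions: `split_train_test`, `one_hot`, and `categorical_crossentropy` are hypothetical helper names written for this example, the seed is arbitrary, and the project itself most likely relies on scikit-learn and Keras for these steps. It shows an 80/20 shuffle split and the per-example categorical cross-entropy, which expects one-hot labels over the six classes.

```python
import math
import random


def split_train_test(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the samples and hold out the last fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


def one_hot(label_index, num_classes):
    """Encode a class index as a one-hot vector."""
    return [1.0 if i == label_index else 0.0 for i in range(num_classes)]


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Compute -sum(y_true * log(y_pred)) for a single example."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


train, test = split_train_test(range(100))
print(len(train), len(test))

# A prediction that puts probability 0.6 on the true class (index 2 of 6).
y_true = one_hot(2, num_classes=6)
y_pred = [0.1, 0.1, 0.6, 0.1, 0.05, 0.05]
print(categorical_crossentropy(y_true, y_pred))
```

Because the labels are one-hot, only the probability assigned to the true class contributes to the loss, so a more confident correct prediction yields a smaller value.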

Model Visualization:

Model Performance:

Productionization

A Flask-based user interface (UI) lets users submit a tweet and receive the predicted cyberbullying type in real time.
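
A prediction endpoint of this kind might look like the sketch below. Everything specific in it is an assumption for illustration: the `/predict` route, the JSON field name `tweet`, the label strings, and the placeholder `predict_label` function, which stands in for the cleaning, vectorization, and trained model and does no real inference.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumption: illustrative label strings for the six classes.
LABELS = ["not_cyberbullying", "gender", "religion",
          "other_cyberbullying", "age", "ethnicity"]


def predict_label(tweet: str) -> str:
    """Placeholder for the trained model; always returns the first label."""
    return LABELS[0]


@app.route("/predict", methods=["POST"])
def predict():
    # Parse the JSON body; treat anything that is not a JSON object as empty.
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = {}
    tweet = payload.get("tweet", "")
    if not isinstance(tweet, str) or not tweet.strip():
        return jsonify({"error": "field 'tweet' is required"}), 400
    return jsonify({"tweet": tweet, "prediction": predict_label(tweet)})
```

If this were saved as `app.py`, it could be started with `flask --app app run` and called by sending a POST request with a JSON body such as `{"tweet": "some text"}` to `/predict`.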