This project aims to develop a tool for identifying cyberbullying tweets and classifying them based on various categories such as gender, religion, age, ethnicity, and other types of cyberbullying. The primary objectives include:
- Utilizing the Cyberbullying Classification Dataset sourced from Kaggle.
- Conducting data cleaning procedures to enhance data quality.
- Applying data preprocessing techniques to prepare the cleaned data for analysis.
- Constructing a Recurrent Neural Network (RNN) model using Long Short-Term Memory (LSTM) layers and evaluating its performance on a separate test dataset.
- Implementing a client-facing API using Flask for seamless integration and usability.
Python Version: 3.10
Libraries: numpy, pandas, matplotlib, seaborn, nltk, tensorflow, scikit-learn, flask, json (standard library)
Flask API Setup (install the dependencies with either pip or Anaconda):
- pip: pip install -r requirements.txt
- Anaconda environment: conda env create -n <ENVNAME> -f environment.yaml
Dataset: https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification
The project relies on the Cyberbullying Classification Dataset obtained from Kaggle. This dataset comprises over 47,000 labeled tweets, each assigned to one of six classes:
- Not Cyberbullying
- Gender
- Religion
- Other types of cyberbullying
- Age
- Ethnicity
A custom Python script cleans the raw tweets. The cleaning steps are:
- Removal of punctuation marks
- Elimination of numerical characters
- Conversion of text to lowercase
- Elimination of stop words
- Lemmatization/Stemming of words
- Removal of URLs
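The cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual script: the function name `clean_tweet`, the tiny `STOP_WORDS` set, and the optional `stem` callback are all assumptions made for the example. In practice the stop-word list and the stemmer or lemmatizer would more likely come from nltk.

```python
import re
import string

# Assumption: a tiny illustrative stop-word set; a fuller list (for example
# from nltk) would normally be used instead.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "you", "so"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
DIGIT_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text, stop_words=STOP_WORDS, stem=None):
    """Apply the listed cleaning steps to one tweet and return the cleaned string.

    `stem` is a hypothetical hook: any callable mapping a word to its stem or
    lemma. When it is None, words are left unchanged.
    """
    text = URL_RE.sub(" ", text)        # remove URLs before punctuation is stripped
    text = text.lower()                 # convert to lowercase
    text = DIGIT_RE.sub(" ", text)      # remove numerical characters
    text = text.translate(PUNCT_TABLE)  # remove punctuation marks
    tokens = [t for t in text.split() if t not in stop_words]  # drop stop words
    if stem is not None:
        tokens = [stem(t) for t in tokens]  # stemming or lemmatization
    return " ".join(tokens)


print(clean_tweet("You are SO dumb!!! 123 http://t.co/abc"))
```

URLs are removed first in this sketch because stripping punctuation beforehand would break them into fragments that no longer match the URL pattern.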
To prepare the cleaned tweets for analysis, the TextVectorization layer from Keras is applied. This layer builds a vocabulary from the training text and maps each word (or token) in the input string to an integer index, so every tweet becomes a list of encoded integers. Sequences are then padded to a uniform length.
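The vectorization and padding step might look like the sketch below. The sample tweets and the values of `max_tokens` and `output_sequence_length` are assumptions chosen for illustration; the settings used in the project are not stated here.

```python
import tensorflow as tf

# Assumption: a few already-cleaned tweets standing in for the real dataset.
tweets = ["stop bullying people", "nobody likes bullying", "have nice day"]

# Illustrative settings only; the project's real values may differ.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10_000,         # cap on the vocabulary size
    output_mode="int",         # one integer index per token
    output_sequence_length=6,  # pad with 0 or truncate to 6 tokens per tweet
)
vectorizer.adapt(tf.constant(tweets))  # learn the vocabulary from the text

encoded = vectorizer(tf.constant(tweets)).numpy()
vocab = vectorizer.get_vocabulary()  # index 0 is padding, index 1 is the unknown token

print(encoded.shape)
print(encoded)
```

Each row of `encoded` has the same length, which is what allows the tweets to be batched and fed to the recurrent layers.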
- Train-Test Split: Data is divided into 80% training and 20% testing sets.
- Bidirectional LSTM Model: Build an RNN architecture utilizing Bidirectional LSTM layers.
- Compilation and Evaluation: Use "categorical_crossentropy" as the loss function and "RMSprop" as the optimizer, then evaluate the trained model on the test set.
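The modeling steps above can be sketched as follows. Random placeholder data stands in for the vectorized tweets, and the vocabulary size, sequence length, embedding width, LSTM units, epoch count, and the single Bidirectional layer are all assumptions for illustration. Only the 80/20 split, the Bidirectional LSTM layer type, the loss, and the optimizer come from the description above.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

NUM_CLASSES = 6      # the six dataset classes
VOCAB_SIZE = 10_000  # assumption: vocabulary size
SEQ_LEN = 50         # assumption: padded tweet length

# Placeholder data: random token indices and random one-hot labels.
rng = np.random.default_rng(0)
X = rng.integers(1, VOCAB_SIZE, size=(100, SEQ_LEN))
y = tf.keras.utils.to_categorical(
    rng.integers(0, NUM_CLASSES, size=100), num_classes=NUM_CLASSES
)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),                 # token index -> dense vector
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),   # reads the tweet in both directions
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # one probability per class
])
model.compile(
    loss="categorical_crossentropy",
    optimizer="rmsprop",
    metrics=["accuracy"],
)

model.fit(X_train, y_train, epochs=1, batch_size=32, verbose=0)
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"test loss={loss:.3f}, test accuracy={accuracy:.3f}")
```

Because "categorical_crossentropy" expects one-hot targets, the labels are converted with `to_categorical` before the split. On random data the reported accuracy is meaningless; the sketch only shows how the pieces fit together.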
Model Visualization: (figure not included)
Model Performance: (figure not included)
A Flask-based user interface (UI) allows users to submit tweets and receive cyberbullying type predictions in real time.
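A prediction endpoint for such an interface might be shaped like the sketch below. The `/predict` route, the `tweet` JSON field, and the `predict_label` helper are hypothetical names introduced for this example. The helper returns a fixed label where the real application would clean the tweet, vectorize it, and call the trained model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Class names as listed in the dataset description.
LABELS = [
    "Not Cyberbullying",
    "Gender",
    "Religion",
    "Other types of cyberbullying",
    "Age",
    "Ethnicity",
]


def predict_label(tweet):
    """Hypothetical stand-in for the trained model.

    A real implementation would clean and vectorize `tweet`, run the model,
    and return the label with the highest predicted probability.
    """
    return LABELS[0]


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = {}
    tweet = payload.get("tweet", "")
    if not isinstance(tweet, str) or not tweet.strip():
        return jsonify(error="Field 'tweet' must be a non-empty string."), 400
    return jsonify(tweet=tweet, prediction=predict_label(tweet))

# For local experimentation the app could be started with app.run().
```

A client would then POST JSON such as `{"tweet": "some text"}` to `/predict` and read the `prediction` field of the response.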