Analyzing Cyberbullying Tweets with LSTM Networks

This project builds a tool that identifies cyberbullying tweets and classifies them by category: gender, religion, age, ethnicity, or other types of cyberbullying. The primary objectives are:

  • Utilizing the Cyberbullying Classification Dataset sourced from Kaggle.
  • Conducting data cleaning procedures to enhance data quality.
  • Applying data preprocessing techniques to prepare the cleaned data for analysis.
  • Constructing a Recurrent Neural Network (RNN) model using Long Short-Term Memory (LSTM) layers and evaluating its performance on a separate test dataset.
  • Implementing a client-facing API using Flask for seamless integration and usability.

Technologies and Resources

Data Acquisition

The project relies on the Cyberbullying Classification Dataset obtained from Kaggle. The dataset contains over 47,000 labeled tweets, each assigned to one of the following classes:

  • Not Cyberbullying
  • Gender
  • Religion
  • Other types of cyberbullying
  • Age
  • Ethnicity

Data Cleaning

A custom Python script cleans the raw tweets. The cleaning steps are:

  • Removal of punctuation marks
  • Elimination of numerical characters
  • Conversion of text to lowercase
  • Elimination of stop words
  • Lemmatization/Stemming of words
  • Removal of URLs
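
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual script: the function name `clean_tweet`, the tiny `STOP_WORDS` set, and the order of the steps are assumptions. URLs are removed first so that stripping punctuation does not break them apart, and lemmatization/stemming is left as a comment because it would normally rely on a library such as NLTK.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); a real pipeline would
# likely use a fuller list such as the one shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "to", "and", "of", "so"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps to a single tweet."""
    text = URL_PATTERN.sub(" ", text)    # remove URLs first, while "://" is intact
    text = text.lower()                  # convert to lowercase
    text = text.translate(PUNCT_TABLE)   # remove punctuation marks
    text = re.sub(r"\d+", " ", text)     # remove numerical characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop words
    # Lemmatization/stemming would go here (e.g. with NLTK); it is omitted
    # to keep this sketch dependency-free.
    return " ".join(tokens)


print(clean_tweet("You are SO annoying!!! 100% http://t.co/abc123 #loser"))
# prints: annoying loser
```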

Data Preprocessing

To prepare the cleaned tweets for the model, the TextVectorization layer from Keras is applied. This layer builds a vocabulary from the training text and maps each word (token) in an input string to its integer index, producing a sequence of integers rather than one-hot vectors. Sequences are then padded (or truncated) to a uniform length.
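
To make the encoding and padding concrete, here is a dependency-free sketch of the same idea. It is not the Keras layer itself: the helpers `build_vocab` and `encode` are hypothetical, and reserving index 0 for padding and index 1 for unknown tokens is an assumption that mirrors the convention Keras uses in its integer output mode.

```python
from collections import Counter

PAD, OOV = 0, 1  # index 0 pads, index 1 marks out-of-vocabulary tokens


def build_vocab(texts, max_tokens=10_000):
    """Map the most frequent tokens to integer indices starting at 2."""
    counts = Counter(tok for text in texts for tok in text.split())
    most_common = counts.most_common(max_tokens - 2)
    return {tok: i + 2 for i, (tok, _) in enumerate(most_common)}


def encode(text, vocab, seq_len):
    """Turn a cleaned tweet into a fixed-length list of integer indices."""
    ids = [vocab.get(tok, OOV) for tok in text.split()][:seq_len]  # truncate
    return ids + [PAD] * (seq_len - len(ids))                      # pad


corpus = ["annoying loser", "nice day", "annoying day today"]
vocab = build_vocab(corpus)

print(encode("annoying day", vocab, seq_len=5))    # two word indices, then three 0s
print(encode("unseen word", vocab, seq_len=5))     # unknown tokens map to 1
```

In Keras, the equivalent behavior would come from `TextVectorization` with `output_mode="int"` and a fixed `output_sequence_length`, after calling `adapt` on the training text.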

Model Building

  1. Train-Test Split: Data is divided into 80% training and 20% testing sets.
  2. Bidirectional LSTM Model: Build an RNN architecture utilizing Bidirectional LSTM layers.
  3. Training and Evaluation: Compile the model with the "categorical_crossentropy" loss and the "RMSprop" optimizer, then evaluate it on the held-out test set.
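
Two of the steps above, the 80/20 split and the categorical cross-entropy loss, can be illustrated without any deep-learning library. The sketch below is only a worked example under stated assumptions: the label strings in `CLASSES`, the helper names, and the fixed seed are hypothetical, and the model itself (the Bidirectional LSTM) is not reproduced here. The loss for one sample is `-sum(y_true * log(y_pred))`, which is why the labels must be one-hot encoded.

```python
import math
import random

# Hypothetical label names for the six classes listed under Data Acquisition.
CLASSES = ["not_cyberbullying", "gender", "religion",
           "other_cyberbullying", "age", "ethnicity"]


def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the samples and split off the test fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def one_hot(label):
    """One-hot encode a class label, the target format this loss expects."""
    return [1.0 if c == label else 0.0 for c in CLASSES]


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: -sum(y_true * log(y_pred))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20

# A prediction that puts probability 0.7 on the true class "age".
y_pred = [0.1, 0.05, 0.05, 0.05, 0.7, 0.05]
print(round(categorical_crossentropy(one_hot("age"), y_pred), 4))  # 0.3567
```

Only the probability assigned to the true class contributes to the loss, so a more confident correct prediction drives it toward zero.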

Model Visualization:

[Figure: model architecture]

Model Performance:

[Figure: model performance]

Productionization

A Flask-based user interface (UI) lets users submit a tweet and receive the predicted cyberbullying type in real time.
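
A minimal sketch of how such an endpoint might look is shown below. The route name `/predict`, the JSON field `tweet`, and the function `predict_category` are all assumptions made for illustration; in particular, `predict_category` is a stand-in that returns a fixed label where the real application would clean the text, vectorize it, and run the trained model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_category(tweet: str) -> str:
    """Hypothetical stand-in for the trained model's prediction step."""
    # A real implementation would clean, vectorize, and classify the tweet.
    return "not_cyberbullying"


@app.route("/predict", methods=["POST"])
def predict():
    """Accept JSON like {"tweet": "..."} and return the predicted category."""
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = {}
    tweet = str(payload.get("tweet", "")).strip()
    if not tweet:
        return jsonify(error="Field 'tweet' is required."), 400
    return jsonify(tweet=tweet, category=predict_category(tweet))
```

Saved as a module, this could be served locally with the `flask run` command and called by posting JSON to `/predict`.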
