This repository contains a Jupyter Notebook program that classifies tweets using TF-IDF features and the Multinomial Naive Bayes algorithm, implemented with the scikit-learn library.
It holds the scripts used for my thesis, organized into the following workflow:
- Data Crawling – Collecting tweet data
- Data Preprocessing & Visualization – Cleaning and analyzing data
- NaN Detection (Optional) – Handling missing values caused by formatting issues (e.g., missing commas)
- Multinomial Naive Bayes Classification – Training and evaluating the model
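
As a rough illustration of the classification step, the sketch below chains scikit-learn's `TfidfVectorizer` and `MultinomialNB` in a pipeline. The sample tweets, labels, and split parameters are invented for this example and are not taken from the thesis datasets or notebooks, which may configure these steps differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the manually labeled tweets.
texts = [
    "love the new update",
    "great service today",
    "really happy with the support",
    "best experience so far",
    "terrible delay again",
    "worst app ever",
    "very disappointed with the support",
    "bad service today",
]
labels = [
    "positive", "positive", "positive", "positive",
    "negative", "negative", "negative", "negative",
]

# Hold out part of the data for evaluation (split values are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# TF-IDF turns each tweet into a weighted term vector;
# Multinomial Naive Bayes is then fitted on those vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on the held-out tweets: {accuracy:.2f}")
```

Wrapping both steps in one pipeline keeps the TF-IDF vocabulary fitted on the training split only, so the held-out tweets do not leak into the features.
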
This folder includes three datasets:
- Dataset – The raw data, manually labeled
- Preprocessed Dataset – Cleaned data, without stemming
- Preprocessed + Stemmed Dataset – Cleaned data with stemming applied
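
Before training on any of these datasets, it can help to check for rows where a missing comma left a field empty. The following is a minimal sketch using pandas, assuming a CSV with `text` and `label` columns; the column names and the in-memory sample are hypothetical and only stand in for the real files.

```python
import io

import pandas as pd

# Hypothetical CSV content; the second data row is missing its comma,
# so pandas reads the whole line as "text" and leaves "label" as NaN.
raw_csv = io.StringIO(
    "text,label\n"
    "great service today,positive\n"
    "terrible delay again negative\n"
    "love the new update,positive\n"
)

df = pd.read_csv(raw_csv)

# Rows with at least one missing value, for manual inspection or repair.
nan_rows = df[df.isna().any(axis=1)]
print(nan_rows)

# Alternatively, drop the incomplete rows before training.
clean_df = df.dropna().reset_index(drop=True)
```

Inspecting `nan_rows` first makes it possible to fix the formatting in the source file rather than silently discarding labeled tweets.
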
🚀 Feel free to explore and contribute!