This project implements a spam detection system for SMS messages using several machine learning techniques. It preprocesses the SMS data, performs exploratory analysis, and trains multiple classifiers, including ensemble methods, to label each message as spam or ham (not spam). The trained model is saved for reuse and served through an interactive Streamlit web application.
To run this project, you need to install the following libraries:
pip install pandas numpy matplotlib seaborn nltk scikit-learn wordcloud streamlit
- Import Necessary Libraries:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import nltk
- Load the Dataset:
df = pd.read_csv("spam.csv", encoding='ISO-8859-1')
- Data Cleaning: Drop unnecessary columns, rename them, and handle duplicates.
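The cleaning step could look roughly like the sketch below. The column names `v1`, `v2`, and `Unnamed: 2` are assumptions (they match a common public layout of `spam.csv`, but this README does not state them), as are the new names `target` and `text` and the 0/1 label encoding. A tiny in-memory frame stands in for the real file so the snippet runs on its own.

```python
import pandas as pd


def clean_sms_frame(df):
    """Keep the label/text columns, rename them, encode labels, drop duplicates."""
    # Assumed raw layout: v1 = label, v2 = message, the rest are empty extras.
    df = df[["v1", "v2"]].rename(columns={"v1": "target", "v2": "text"})
    # Assumed encoding: ham -> 0, spam -> 1.
    df = df.assign(target=df["target"].map({"ham": 0, "spam": 1}))
    # Remove exact duplicate rows, keeping the first occurrence.
    return df.drop_duplicates(keep="first").reset_index(drop=True)


# Toy stand-in for pd.read_csv("spam.csv", encoding="ISO-8859-1").
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam"],
    "v2": ["See you at 5", "WIN a free prize now", "See you at 5", "Claim your reward"],
    "Unnamed: 2": [None, None, None, None],
})
cleaned = clean_sms_frame(raw)
print(cleaned)
```

Encoding the labels as integers early keeps the later scikit-learn calls simple, since precision and recall then default to treating `1` (spam) as the positive class.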
- Visualizations to understand the distribution of spam and ham messages.
- Analysis of message lengths, word counts, and sentence counts.
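As an illustration of the length analysis, the sketch below derives character, word, and sentence counts per message and averages them by class. It uses simple regular expressions instead of NLTK tokenizers so that it runs without downloading any NLTK data; the column names (`num_characters`, `num_words`, `num_sentences`) and the toy messages are assumptions, not taken from the project.

```python
import re

import pandas as pd

# Toy data; in the project this would be the cleaned SMS DataFrame.
df = pd.DataFrame({
    "target": [0, 1],
    "text": [
        "Are we still on for lunch? Call me.",
        "WINNER!! Claim your free prize now. Text WIN to 80082.",
    ],
})

# Character count per message.
df["num_characters"] = df["text"].str.len()
# Rough word count: runs of word characters.
df["num_words"] = df["text"].apply(lambda t: len(re.findall(r"\w+", t)))
# Rough sentence count: non-empty pieces between ., ! and ? runs.
df["num_sentences"] = df["text"].apply(
    lambda t: len([s for s in re.split(r"[.!?]+", t) if s.strip()])
)

# Average of each feature per class (0 = ham, 1 = spam).
summary = df.groupby("target")[["num_characters", "num_words", "num_sentences"]].mean()
print(summary)
```

The same three columns can be passed to seaborn histograms or a pair plot to compare spam and ham visually.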
- Text cleaning using NLP techniques to preprocess the messages.
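The README does not list the exact cleaning steps, so the following is only a minimal sketch of a typical pipeline: lowercase the text, keep alphanumeric tokens, and drop stop words. The function name `transform_text` and the small hand-written stop-word set are hypothetical; a real run would more likely use NLTK's stop-word list and a stemmer such as `PorterStemmer`.

```python
import re

# Small illustrative stop-word set (an assumption; NLTK ships a fuller list).
STOPWORDS = {
    "a", "an", "the", "is", "are", "to", "of", "and", "in", "for",
    "you", "your", "i", "it", "on", "at", "this", "that",
}


def transform_text(text):
    """Lowercase, tokenize on alphanumeric runs, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


print(transform_text("Congratulations! You have WON a FREE ticket to the finals."))
# -> congratulations have won free ticket finals
```

Whatever steps are chosen, the same function has to be applied to training messages and to user input at prediction time, otherwise the vectorizer sees tokens it was never fitted on.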
- Converting text data into numerical vectors using CountVectorizer and TfidfVectorizer.
- Split the dataset into training and testing sets.
- Train various classifiers including:
- Naive Bayes (Gaussian, Multinomial, Bernoulli)
- Logistic Regression
- Support Vector Machines (SVM)
- Random Forest
- Extra Trees Classifier
- Evaluate models based on accuracy, precision, and recall.
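The split, train, and evaluate loop might be sketched as follows. The toy messages, the 75/25 split, `random_state=2`, and the choice of three classifiers are all assumptions made to keep the example short; SVM, Random Forest, and Extra Trees can be added to the same dictionary. The vectorizer is fitted on the training split only, so no information from the test messages leaks into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy data: 1 = spam, 0 = ham (assumed encoding).
texts = [
    "win a free prize now", "claim your free cash reward",
    "urgent call now to claim prize", "free entry in a weekly contest",
    "you have won a cash prize", "text win to claim free ticket",
    "cheap loans call now", "exclusive offer claim your reward now",
    "are we still on for lunch", "see you at the office",
    "can you call me later", "happy birthday have a great day",
    "running late be there soon", "thanks for the help yesterday",
    "let us meet at the station", "did you finish the report",
]
labels = [1] * 8 + [0] * 8

# Stratified split keeps the spam/ham ratio the same in both parts.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=2, stratify=labels
)

# Fit the vectorizer on training text only, then reuse it for the test text.
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_texts)
X_test = tfidf.transform(test_texts)

models = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        # zero_division=0 avoids a warning if a model predicts no spam at all.
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, scores)
```

For spam filtering, precision on the spam class is often watched most closely, because a false positive means a legitimate message is flagged as spam.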
- Utilize ensemble methods like Voting Classifier and Stacking Classifier to improve predictions.
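A hedged sketch of the two ensemble approaches is shown below. The base estimators, soft voting, the Random Forest meta-learner, and `cv=3` are assumptions chosen so the example runs on a dozen toy messages; the project's actual configuration is not stated here.

```python
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy data: 1 = spam, 0 = ham (assumed encoding).
texts = [
    "win a free prize now", "claim your free cash reward",
    "urgent call now to claim prize", "free entry in a weekly contest",
    "you have won a cash prize", "text win to claim free ticket",
    "are we still on for lunch", "see you at the office",
    "can you call me later", "happy birthday have a great day",
    "running late be there soon", "thanks for the help yesterday",
]
labels = [1] * 6 + [0] * 6

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# Base models; all expose predict_proba, which soft voting requires.
estimators = [
    ("nb", MultinomialNB()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=2)),
]

# Voting: average the predicted class probabilities of the base models.
voting = VotingClassifier(estimators=estimators, voting="soft")
voting.fit(X, labels)

# Stacking: a meta-learner is trained on cross-validated base predictions.
stacking = StackingClassifier(
    estimators=estimators,
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=2),
    cv=3,  # small fold count only because the toy data set is tiny
)
stacking.fit(X, labels)

new_messages = tfidf.transform(["claim your free prize", "see you at lunch"])
voting_pred = voting.predict(new_messages)
stacking_pred = stacking.predict(new_messages)
print(voting_pred, stacking_pred)
```

Voting combines the base models with a fixed rule, whereas stacking learns how much to trust each one, which usually costs more training time because every base model is refitted once per fold.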
- Save the trained model and vectorizer using pickle for future use:
    import pickle as pkl
    # Context managers close the files even if dumping fails.
    with open("Vectorizer.pkl", "wb") as f:
        pkl.dump(tfidf, f)
    with open("Model.pkl", "wb") as f:
        pkl.dump(clf, f)
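Loading the saved objects back is the mirror image of saving them. The round-trip sketch below fits a tiny stand-in vectorizer and classifier, writes them under the same file names inside a temporary directory (so it does not overwrite real artifacts), and checks that the restored pair predicts the same label. The toy data and the `MultinomialNB` stand-in are assumptions.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-ins for the project's fitted vectorizer and classifier.
texts = ["win a free prize now", "claim free cash", "see you at lunch", "call me later"]
labels = [1, 1, 0, 0]
tfidf = TfidfVectorizer()
clf = MultinomialNB().fit(tfidf.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    vec_path = Path(tmp) / "Vectorizer.pkl"
    model_path = Path(tmp) / "Model.pkl"

    # Save both objects.
    with open(vec_path, "wb") as f:
        pickle.dump(tfidf, f)
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)

    # Load them back, as the web app would at start-up.
    with open(vec_path, "rb") as f:
        loaded_tfidf = pickle.load(f)
    with open(model_path, "rb") as f:
        loaded_clf = pickle.load(f)

sample = ["claim your free prize"]
original = clf.predict(tfidf.transform(sample))
restored = loaded_clf.predict(loaded_tfidf.transform(sample))
print(original, restored)
```

Pickle files should only be loaded from trusted sources, and they are generally tied to the scikit-learn version that produced them.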
- Build a web application using Streamlit to allow users to input SMS messages for classification.
- The app preprocesses the input, vectorizes it, and provides a prediction on whether it is spam or not.
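A minimal sketch of what `app.py` could contain is given below; it is not the project's actual app. The helper names `preprocess` and `predict_label`, the widget labels, and the assumption that label `1` means spam are all hypothetical. `preprocess` is a placeholder that only lowercases; the real app should reuse the exact cleaning function applied during training.

```python
import pickle
from pathlib import Path

VECTORIZER_PATH = Path("Vectorizer.pkl")
MODEL_PATH = Path("Model.pkl")


def preprocess(text):
    """Placeholder cleaning step (assumption): reuse the training-time function here."""
    return text.lower()


def predict_label(message, vectorizer, model):
    """Preprocess, vectorize, and classify a single message."""
    features = vectorizer.transform([preprocess(message)])
    # Assumed encoding: 1 = spam, 0 = ham.
    return "Spam" if int(model.predict(features)[0]) == 1 else "Not spam"


def main():
    # Imported here so predict_label can be tested without Streamlit installed.
    import streamlit as st

    with open(VECTORIZER_PATH, "rb") as f:
        vectorizer = pickle.load(f)
    with open(MODEL_PATH, "rb") as f:
        model = pickle.load(f)

    st.title("SMS Spam Classifier")
    message = st.text_area("Enter the message")
    if st.button("Predict") and message.strip():
        st.header(predict_label(message, vectorizer, model))


# `streamlit run app.py` executes this file as __main__.
if __name__ == "__main__":
    if VECTORIZER_PATH.exists() and MODEL_PATH.exists():
        main()
    else:
        print("Vectorizer.pkl / Model.pkl not found; train and save the model first.")
```

Keeping the prediction logic in a plain function separate from the Streamlit widgets makes it possible to unit test the classification path without starting the web server.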
The spam detection system was developed using a comprehensive dataset, leveraging multiple machine learning algorithms and ensemble techniques for improved accuracy. Visualizations were used to highlight the strengths and weaknesses of each model, and the final model achieved high performance with the Stacking Classifier.
To run the Streamlit application, use the command:
streamlit run app.py
This project showcases a practical application of machine learning in natural language processing, emphasizing the importance of feature extraction and model selection. Future work could explore advanced deep learning techniques for even better performance.