This project aims to predict water quality using a dataset of various physicochemical, socio-economic, and environmental factors. By leveraging machine learning models, we classify water samples as Clean or Dirty based on their attributes. The final model is deployed using a Streamlit-based web app, providing an interactive UI for predictions.
Deployed Application: Water Quality Analysis App
- Source: Kaggle - Water Quality Dataset
- Description: The dataset contains measurements such as population density, waste management indices, development index, and other features used to assess water quality.
- Explore the dataset and perform data visualization.
- Preprocess the data by handling missing values, scaling features, and engineering new features:
- Train machine learning models to predict water quality.
- Evaluate and compare model performance.
- Deploy the final model using a Streamlit-based web app.
- Population Density: Estimated as the number of people per unit area in a given region.
- Waste Index: A derived feature that measures waste composition against recycling rates.
- Development Index: Calculated using GDP and literacy rate, reflecting socio-economic factors affecting water quality.
The preprocessing steps include:
- Handling missing values and normalizing data where necessary.
- Creating derived features such as the Waste Index and Development Index using formulas:
- Dropping unnecessary columns like 'Country', 'GDP', and 'TouristMean' to streamline the data.
- Feature engineering to ensure all relevant attributes are used for training.
Model Diagrams
The final model used in this project is an XGBoost Classifier, selected for its superior performance in handling tabular data with complex relationships. Below are the details about the model and its performance:
-
Candidate Models:
- AdaBoostClassifier
- RandomForestClassifier
- SVC (Support Vector Classifier)
- KNeighborsClassifier
- Naive Bayes Classifier (GaussianNB)
- XGBClassifier (eXtreme Gradient Boost Classifier)
-
Evaluation Criteria:
- Accuracy
- Precision, Recall, and F1-Score
- Area Under the ROC Curve (AUC-ROC)
- Model interpretability and feature importance
-
Reason for Choosing XGBoost:
- High predictive accuracy on tabular data.
- Handles missing values effectively.
- Built-in feature importance metrics.
- Optimized for speed and performance.
The model's hyperparameters were tuned using Grid Search and Cross-Validation to optimize performance. Below are the final hyperparameter settings:
learning_rate
: 0.1 --> [0.01, 0.05, 0.1]max_depth
: 5 --> [3, 5, 7]n_estimators
: 200 --> [100, 200, 300]subsample
:0.8 --> [0.8, 0.9, 1.0]
-
Training Dataset:
- The dataset was split into 80% training and 20% testing sets.
- Cross-validation (5-fold) was used to validate model performance during training.
-
Performance Metrics:
- Precision: 99.11%
- Recall: 49.01%
- F1-Score: 65.58%
- Cross-Validation Score: 67.05%
-
Confusion Matrix:
- True Positive (TP): 53 (Predicted Clean and actually Clean)
- False Negative (FN): 26 (Predicted Dirty but actually Clean)
- False Positive (FP): 3003 (Predicted Clean but actually Dirty)
- True Negative (TN): 2886 (Predicted Dirty and actually Dirty)
The XGBoost model provides insights into feature importance based on the number of splits a feature contributes to the decision tree ensemble. Below are the top features influencing predictions:
Feature | Importance (%) |
---|---|
Development Index | 59.78% |
Waste Index | 32.65% |
Population Density | 7.55% |
While XGBoost performed best overall, the following limitations were observed:
- Requires significant computational resources for training on large datasets.
- May overfit on smaller datasets without proper regularization.
- Experiment with ensemble models combining XGBoost with other classifiers for improved performance.
- Explore more advanced hyperparameter tuning methods such as Bayesian Optimization.
- Investigate additional derived features to enhance predictive accuracy.
-
Clone the repository:
git clone https://github.com/Programming-Sai/Water-Quality-Analysis-ML.git cd Water-Quality-Analysis-ML
-
Set up a virtual environment:
python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate
-
Install dependencies:
pip install -r requirements.txt
water-quality-prediction/
│
├── data/
│ └── water_quality.csv # The dataset
│ └── processed_data.csv # The Processed dataset
│
│
├── notebooks/
│ └── data_exploration.ipynb # For data exploration and visualization
│ └── model_building.ipynb # For training the model
|
├── deployment-ui/
| └── data/
│ | └── fav.ico
│ | └── XGBoost_model.joblib
| |
│ ├── app.py # Streamlit app for deployment
│ └── assets/ # Custom styles and images
|
│
├── requirements.txt # Python dependencies
├── .gitignore # Python gitignore
└── README.md # Project overview
- Explore the dataset:
- Open and run
notebooks/data_exploration.ipynb
to understand the data and visualize distributions.
- Open and run
- Train the model:
- Use
notebooks/model_building.ipynb
orsrc/train.py
to train and evaluate the machine learning models.
- Use
- Predict water quality:
- Use
src/predict.py
to make predictions on new data.
- Use
- Deploy the model:
- Run the Streamlit app locally:
streamlit run deployment-ui/app.py
- Alternatively, use the deployed version: Water Quality Analysis App.
- Run the Streamlit app locally:
- pandas: Data manipulation
- numpy: Numerical computations
- scikit-learn: Machine learning models
- xgboost: Gradient boosting for tabular data
- matplotlib & seaborn: Data visualization
- Streamlit: Model deployment and interactive UI
Below are the performance metrics and comparison of models tested during development.
Accuracy Comparison
(Insert image of accuracy comparison between different models here)
Confusion Matrix for XGBoost
(Insert confusion matrix image for the final XGBoost model here)
- Create a new repository on GitHub with the name
water-quality-prediction
. - Initialize your project folder locally and link it to the GitHub repo:
git init git remote add origin https://github.com/Programming-Sai/Water-Quality-Analysis-ML.git git branch -M main git add . git commit -m "Initial commit" git push -u origin main
- Interactive UI: Allows users to input values for Population Density, Waste Index, and Development Index.
- Tooltips for Guidance: Each input field includes descriptions and formulas to assist users in understanding the required values.
- Real-Time Predictions: Displays the prediction result (Clean or Dirty) with color-coded feedback.