Water-Quality-Analysis-ML


Project Overview

This project aims to predict water quality using a dataset of various physicochemical, socio-economic, and environmental factors. By leveraging machine learning models, we classify water samples as Clean or Dirty based on their attributes. The final model is deployed using a Streamlit-based web app, providing an interactive UI for predictions.

Deployed Application

Deployed Application: Water Quality Analysis App

Dataset

  • Source: Kaggle - Water Quality Dataset
  • Description: The dataset contains measurements such as population density, waste management indices, development index, and other features used to assess water quality.

Goals

  1. Explore the dataset and perform data visualization.
  2. Preprocess the data by handling missing values, scaling features, and engineering new features.
  3. Train machine learning models to predict water quality.
  4. Evaluate and compare model performance.
  5. Deploy the final model using a Streamlit-based web app.

Features

  • Population Density: Estimated as the number of people per unit area in a given region.
  • Waste Index: A derived feature that measures waste composition against recycling rates.
  • Development Index: Calculated using GDP and literacy rate, reflecting socio-economic factors affecting water quality.

Data Preprocessing

The preprocessing steps include:

  1. Handling missing values and normalizing data where necessary.
  2. Creating derived features such as the Waste Index and Development Index using the formulas below (a pandas sketch follows this list):

$$ \text{Waste Index} = \frac{\text{Max Waste Composition} + \text{Other Composition}}{\text{Recycling Percentage}} $$

$$ \text{Development Index} = \text{GDP} \times \text{Literacy Rate} $$


  3. Dropping unnecessary columns like 'Country', 'GDP', and 'TouristMean' to streamline the data.
  4. Feature engineering to ensure all relevant attributes are used for training.
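
Below is a minimal pandas sketch of these steps. The column names (MaxWasteComposition, OtherComposition, RecyclingPercentage, LiteracyRate, etc.) are assumptions for illustration and may not match the Kaggle dataset's actual schema.

    import pandas as pd

    df = pd.read_csv("data/water_quality.csv")

    # Waste Index = (Max Waste Composition + Other Composition) / Recycling Percentage
    df["WasteIndex"] = (df["MaxWasteComposition"] + df["OtherComposition"]) / df["RecyclingPercentage"]

    # Development Index = GDP * Literacy Rate
    df["DevelopmentIndex"] = df["GDP"] * df["LiteracyRate"]

    # Drop columns that are no longer needed once the derived features exist
    df = df.drop(columns=["Country", "GDP", "TouristMean"])

    df.to_csv("data/processed_data.csv", index=False)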

Model Documentation

Model Diagrams




Overview

The final model used in this project is an XGBoost Classifier, selected for its superior performance on tabular data with complex relationships. Details of the selection process and the model's performance are given below; a brief model-comparison sketch follows the selection list.


Model Selection Process

  1. Candidate Models:

    • AdaBoostClassifier
    • RandomForestClassifier
    • SVC (Support Vector Classifier)
    • KNeighborsClassifier
    • Naive Bayes Classifier (GaussianNB)
    • XGBClassifier (eXtreme Gradient Boosting classifier)
  2. Evaluation Criteria:

    • Accuracy
    • Precision, Recall, and F1-Score
    • Area Under the ROC Curve (AUC-ROC)
    • Model interpretability and feature importance
  3. Reason for Choosing XGBoost:

    • High predictive accuracy on tabular data.
    • Handles missing values effectively.
    • Built-in feature importance metrics.
    • Optimized for speed and performance.
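
The comparison could be run along the following lines. This is a sketch rather than the notebook's exact code, and the feature and label column names (PopulationDensity, WasteIndex, DevelopmentIndex, Quality) are assumptions.

    import pandas as pd
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from xgboost import XGBClassifier

    df = pd.read_csv("data/processed_data.csv")
    X = df[["PopulationDensity", "WasteIndex", "DevelopmentIndex"]]  # assumed feature columns
    y = df["Quality"]                                                # assumed label: 1 = Clean, 0 = Dirty

    candidates = {
        "AdaBoost": AdaBoostClassifier(),
        "RandomForest": RandomForestClassifier(),
        "SVC": SVC(),
        "KNeighbors": KNeighborsClassifier(),
        "GaussianNB": GaussianNB(),
        "XGBoost": XGBClassifier(),
    }

    # 5-fold cross-validated F1 score for each candidate model
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="f1")
        print(f"{name}: mean F1 = {scores.mean():.4f}")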

Hyperparameter Tuning

The model's hyperparameters were tuned using Grid Search with Cross-Validation to optimize performance. The final settings, together with the grids searched, are listed below (a GridSearchCV sketch follows the list):

  • learning_rate: 0.1 (searched over [0.01, 0.05, 0.1])
  • max_depth: 5 (searched over [3, 5, 7])
  • n_estimators: 200 (searched over [100, 200, 300])
  • subsample: 0.8 (searched over [0.8, 0.9, 1.0])
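
A minimal sketch of that search, continuing from the X and y defined in the comparison sketch above; the scoring metric and random seed are assumptions.

    from sklearn.model_selection import GridSearchCV, train_test_split
    from xgboost import XGBClassifier

    # 80/20 split, as described in the Training and Evaluation section below
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    param_grid = {
        "learning_rate": [0.01, 0.05, 0.1],
        "max_depth": [3, 5, 7],
        "n_estimators": [100, 200, 300],
        "subsample": [0.8, 0.9, 1.0],
    }

    # 5-fold cross-validated grid search over the values listed above
    search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring="f1", n_jobs=-1)
    search.fit(X_train, y_train)
    print(search.best_params_)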

Training and Evaluation

  1. Training Dataset:

    • The dataset was split into 80% training and 20% testing sets.
    • Cross-validation (5-fold) was used to validate model performance during training.
  2. Performance Metrics:

    • Precision: 99.11%
    • Recall: 49.01%
    • F1-Score: 65.58%
    • Cross-Validation Score: 67.05%
  3. Confusion Matrix:

$$\begin{bmatrix} 53 & 26 \\ 3003 & 2886 \end{bmatrix}$$



  • True Negative (TN): 53 (predicted Dirty and actually Dirty)
  • False Positive (FP): 26 (predicted Clean but actually Dirty)
  • False Negative (FN): 3003 (predicted Dirty but actually Clean)
  • True Positive (TP): 2886 (predicted Clean and actually Clean)

  This reading follows scikit-learn's [[TN, FP], [FN, TP]] layout and is consistent with the reported precision (2886 / (2886 + 26) ≈ 99.11%) and recall (2886 / (2886 + 3003) ≈ 49.01%).
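
Continuing from the grid-search sketch, these metrics could be reproduced on the held-out test split roughly as follows:

    from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

    final_model = search.best_estimator_      # tuned XGBClassifier from the grid search above
    y_pred = final_model.predict(X_test)

    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:   ", recall_score(y_test, y_pred))
    print("F1-Score: ", f1_score(y_test, y_pred))

    # Rows are actual classes, columns are predicted: [[TN, FP], [FN, TP]]
    print(confusion_matrix(y_test, y_pred))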

Feature Importance

The XGBoost model provides insights into feature importance based on the number of splits a feature contributes to the decision tree ensemble. Below are the top features influencing predictions:

Feature              Importance (%)
Development Index    59.78
Waste Index          32.65
Population Density    7.55
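
Split-count ("weight") importances of this kind can be read from the fitted booster; a sketch, continuing from final_model above (the notebook may use a different importance type):

    # Split-count ("weight") importances from the underlying booster, normalized to percentages
    scores = final_model.get_booster().get_score(importance_type="weight")
    total = sum(scores.values())
    for name, count in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {count / total * 100:.2f}%")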

Model Limitations

While XGBoost performed best overall, the following limitations were observed:

  • Requires significant computational resources for training on large datasets.
  • May overfit on smaller datasets without proper regularization.

Future Improvements

  1. Experiment with ensemble models combining XGBoost with other classifiers for improved performance.
  2. Explore more advanced hyperparameter tuning methods such as Bayesian Optimization (a brief sketch follows this list).
  3. Investigate additional derived features to enhance predictive accuracy.
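
One possible direction for item 2, using the Optuna library (not currently part of this project), again assuming the X and y from the earlier sketches:

    import optuna
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    def objective(trial):
        # Search ranges are illustrative, not tuned for this dataset
        params = {
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "max_depth": trial.suggest_int("max_depth", 3, 9),
            "n_estimators": trial.suggest_int("n_estimators", 100, 400),
            "subsample": trial.suggest_float("subsample", 0.7, 1.0),
        }
        return cross_val_score(XGBClassifier(**params), X, y, cv=5, scoring="f1").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    print(study.best_params)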

Installation

  1. Clone the repository:

    git clone https://github.com/Programming-Sai/Water-Quality-Analysis-ML.git
    cd Water-Quality-Analysis-ML
  2. Set up a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

Folder Structure

Water-Quality-Analysis-ML/
│
├── data/
│   ├── water_quality.csv        # The raw dataset
│   └── processed_data.csv       # The processed dataset
│
├── notebooks/
│   ├── data_exploration.ipynb   # Data exploration and visualization
│   └── model_building.ipynb     # Model training
│
├── deployment-ui/
│   ├── data/
│   │   ├── fav.ico
│   │   └── XGBoost_model.joblib
│   ├── app.py                   # Streamlit app for deployment
│   └── assets/                  # Custom styles and images
│
├── requirements.txt             # Python dependencies
├── .gitignore                   # Python gitignore
└── README.md                    # Project overview

Usage

  1. Explore the dataset:
    • Open and run notebooks/data_exploration.ipynb to understand the data and visualize distributions.
  2. Train the model:
    • Open and run notebooks/model_building.ipynb to train and evaluate the machine learning models.
  3. Predict water quality:
    • Use the Streamlit app (locally or deployed) to make predictions on new inputs.
  4. Deploy the model:
    • Run the Streamlit app locally:
      streamlit run deployment-ui/app.py
    • Alternatively, use the deployed version: Water Quality Analysis App.
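
For a one-off prediction outside the app, the serialized model can be loaded directly with joblib; a sketch with illustrative feature names and values:

    import joblib
    import pandas as pd

    model = joblib.load("deployment-ui/data/XGBoost_model.joblib")

    # Feature names, order, and label encoding are assumptions; match the training setup.
    sample = pd.DataFrame([{
        "PopulationDensity": 350.0,
        "WasteIndex": 2.4,
        "DevelopmentIndex": 1.8,
    }])

    prediction = model.predict(sample)[0]
    print("Clean" if prediction == 1 else "Dirty")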

Key Libraries

  • pandas: Data manipulation
  • numpy: Numerical computations
  • scikit-learn: Machine learning models
  • xgboost: Gradient boosting for tabular data
  • matplotlib & seaborn: Data visualization
  • Streamlit: Model deployment and interactive UI

Model Performance

Below are the performance metrics and comparison of models tested during development.

Accuracy Comparison
(Insert image of accuracy comparison between different models here)

Confusion Matrix for XGBoost
(Insert confusion matrix image for the final XGBoost model here)


Contributors


Steps to Set It Up

  1. Create a new repository on GitHub with the name Water-Quality-Analysis-ML.
  2. Initialize your project folder locally and link it to the GitHub repo:
    git init
    git remote add origin https://github.com/Programming-Sai/Water-Quality-Analysis-ML.git
    git branch -M main
    git add .
    git commit -m "Initial commit"
    git push -u origin main

Application Features

  1. Interactive UI: Allows users to input values for Population Density, Waste Index, and Development Index.
  2. Tooltips for Guidance: Each input field includes descriptions and formulas to assist users in understanding the required values.
  3. Real-Time Predictions: Displays the prediction result (Clean or Dirty) with color-coded feedback.
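
A minimal Streamlit sketch of this flow (the actual deployment-ui/app.py adds tooltips, custom styling, and other polish; feature names and label encoding are assumptions):

    import joblib
    import pandas as pd
    import streamlit as st

    st.title("Water Quality Analysis")

    model = joblib.load("deployment-ui/data/XGBoost_model.joblib")

    # Input fields for the three engineered features
    population_density = st.number_input("Population Density", min_value=0.0)
    waste_index = st.number_input("Waste Index", min_value=0.0)
    development_index = st.number_input("Development Index", min_value=0.0)

    if st.button("Predict"):
        sample = pd.DataFrame([{
            "PopulationDensity": population_density,
            "WasteIndex": waste_index,
            "DevelopmentIndex": development_index,
        }])
        prediction = model.predict(sample)[0]
        if prediction == 1:          # assumes 1 = Clean, 0 = Dirty
            st.success("Clean")
        else:
            st.error("Dirty")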
