Mushroom classification #733

Merged · 6 commits · Nov 14, 2024
104 changes: 104 additions & 0 deletions Mushroom Binary Classification/Dataset/README.md
@@ -0,0 +1,104 @@
# Mushroom Binary Classification Model

## Goal
The goal of this project is to build and evaluate various machine learning models that predict whether a mushroom is edible or poisonous from the characteristics provided in the dataset.

## Dataset
The dataset is sourced from the [Playground Series - Season 4, Episode 8 Kaggle competition](https://www.kaggle.com/competitions/playground-series-s4e8/data?select=train.csv).

### Columns in the Dataset
- `id`: Unique identifier for each mushroom.
- `class`: Whether the mushroom is edible or poisonous (the target variable).
- `cap-diameter`: Diameter of the mushroom cap.
- `cap-shape`: Shape of the mushroom cap.
- `cap-surface`: Surface type of the mushroom cap.
- `cap-color`: Color of the mushroom cap.
- `does-bruise-or-bleed`: Whether the mushroom bruises or bleeds.
- `gill-attachment`: Attachment type of the gills.
- `gill-spacing`: Spacing of the gills.
- `gill-color`: Color of the gills.
- `stem-height`: Height of the stem.
- `stem-width`: Width of the stem.
- `stem-root`: Type of the stem root.
- `stem-surface`: Surface type of the stem.
- `stem-color`: Color of the stem.
- `veil-type`: Type of the veil.
- `veil-color`: Color of the veil.
- `has-ring`: Whether the mushroom has a ring.
- `ring-type`: Type of the ring.
- `spore-print-color`: Color of the spore print.
- `habitat`: Habitat of the mushroom.
- `season`: Season when the mushroom is found.

## Description
This project involves the following steps:
1. Exploratory Data Analysis (EDA)
2. Data Preprocessing
3. Model Training and Evaluation
4. Selecting the Best Model
5. Making Predictions on the Test Dataset
6. Saving the Predictions for Submission

## What I Did

### Exploratory Data Analysis (EDA)
Performed extensive EDA to understand the distribution of features, the relationships between variables, and the overall structure of the dataset. This included the following (a short plotting sketch appears after the list):
- Visualizations of categorical features
- Distribution plots for numerical features
- Correlation analysis
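
A minimal sketch of the kind of plots described above, assuming the Kaggle training file has been downloaded as `train.csv`; the file name and the specific columns plotted here are illustrative, not the exact code from the notebook:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the training data (path is an assumption; adjust to your setup).
df = pd.read_csv("train.csv")

# Distribution of one categorical feature, split by the target class.
sns.countplot(data=df, x="cap-shape", hue="class")
plt.title("Cap shape by class")
plt.show()

# Distribution of one numerical feature.
sns.histplot(data=df, x="cap-diameter", bins=50)
plt.title("Cap diameter distribution")
plt.show()

# Correlation between the numerical features.
num_cols = ["cap-diameter", "stem-height", "stem-width"]
sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm")
plt.title("Correlation of numerical features")
plt.show()
```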

### Data Preprocessing
- Handled mixed data types in columns.
- Filled missing values.
- Encoded categorical features using `LabelEncoder` (a sketch of these steps is shown below).
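
A minimal sketch of these preprocessing steps, assuming the training data is already loaded into a DataFrame `df`; the imputation choices shown (a constant for categoricals, the median for numericals) are assumptions and may differ from the notebook:

```python
from sklearn.preprocessing import LabelEncoder

cat_cols = df.select_dtypes(include="object").columns
num_cols = df.select_dtypes(include="number").columns

# Fill missing values before encoding.
df[cat_cols] = df[cat_cols].fillna("missing")
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Force categorical columns to strings so mixed-type columns encode consistently.
df[cat_cols] = df[cat_cols].astype(str)

# Label-encode each categorical column, keeping the fitted encoders
# so the same mapping can be applied to the test set.
encoders = {}
for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le
```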

### Models Implemented
The following machine learning models were trained and evaluated:
1. Logistic Regression
2. Random Forest
3. Gradient Boosting
4. AdaBoost
5. Extra Trees
6. XGBoost
7. CatBoost
8. LightGBM
9. SVM

### Libraries Needed
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- xgboost
- catboost
- lightgbm

## EDA Results
- Identified mixed data types in certain columns and converted them to strings.
- Visualized the distribution of categorical and numerical features.
- Analyzed correlations between features and the target variable.

## Performance of the Models Based on Accuracy Scores
- Logistic Regression: 0.6213
- Random Forest: 0.9920
- Gradient Boosting: 0.9288
- AdaBoost: 0.8000
- Extra Trees: 0.9916
- XGBoost: 0.9912
- CatBoost: 0.9871
- LightGBM: 0.9888
- SVM: 0.9200 (placeholder value; the actual validation score was not reported)

## Conclusion
The Random Forest model achieved the highest accuracy on the validation set, making it the best model for this task. This model was used to make predictions on the test dataset.

## Predictions
The predictions were made using the Random Forest model and saved in `submission.csv`.

## Signature
- **Name:** Aditya D
- **GitHub:** [https://www.github.com/adi271001](https://www.github.com/adi271001)
- **LinkedIn:** [https://www.linkedin.com/in/aditya-d-23453a179/](https://www.linkedin.com/in/aditya-d-23453a179/)
- **Topmate:** [https://topmate.io/aditya_d/](https://topmate.io/aditya_d/)
- **Twitter:** [https://x.com/ADITYAD29257528](https://x.com/ADITYAD29257528)
49 changes: 49 additions & 0 deletions Mushroom Binary Classification/Models/README.md
@@ -0,0 +1,49 @@
## Models Implemented
This project involves training and evaluating the following machine learning models:

1. **Logistic Regression**
- Accuracy: 0.6213
- A simple linear model used for binary classification.

2. **Random Forest**
- Accuracy: 0.9920
- An ensemble model that builds multiple decision trees and combines their predictions for a more accurate and stable result.

3. **Gradient Boosting**
- Accuracy: 0.9288
- An ensemble technique that builds models sequentially, each trying to correct the errors of the previous one.

4. **AdaBoost**
- Accuracy: 0.8000
- An ensemble method that combines multiple weak classifiers to form a strong classifier by focusing on the hard-to-classify instances.

5. **Extra Trees**
- Accuracy: 0.9915
- Similar to Random Forest, but each tree is grown on the full training sample and split thresholds are chosen at random rather than optimized.

6. **XGBoost**
- Accuracy: 0.9912
- An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.

7. **CatBoost**
- Accuracy: 0.9871
- A gradient boosting library that handles categorical features automatically and is highly efficient.

8. **LightGBM**
- Accuracy: 0.9888
- A gradient boosting framework that uses tree-based learning algorithms and is designed to be distributed and efficient.

## Conclusion
The Random Forest model achieved the highest accuracy on the validation set, making it the best model for this task. This model was used to make predictions on the test dataset.

## Predictions
The predictions were made using the Random Forest model and saved in `submission.csv`.

![Results](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___20_1.png?raw=true)
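
A bar chart like the one above can be drawn from the reported scores. This is a hedged sketch, not the notebook's plotting code; the accuracy values are copied from the list above:

```python
import matplotlib.pyplot as plt

# Validation accuracies as reported above.
results = {
    "Logistic Regression": 0.6213,
    "Random Forest": 0.9920,
    "Gradient Boosting": 0.9288,
    "AdaBoost": 0.8000,
    "Extra Trees": 0.9915,
    "XGBoost": 0.9912,
    "CatBoost": 0.9871,
    "LightGBM": 0.9888,
}

plt.figure(figsize=(8, 5))
plt.barh(list(results.keys()), list(results.values()), color="steelblue")
plt.xlabel("Validation accuracy")
plt.title("Model comparison")
plt.xlim(0.5, 1.0)
plt.tight_layout()
plt.show()
```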

## Signature
- **Name:** Aditya D
- **GitHub:** [https://www.github.com/adi271001](https://www.github.com/adi271001)
- **LinkedIn:** [https://www.linkedin.com/in/aditya-d-23453a179/](https://www.linkedin.com/in/aditya-d-23453a179/)
- **Topmate:** [https://topmate.io/aditya_d/](https://topmate.io/aditya_d/)
- **Twitter:** [https://x.com/ADITYAD29257528](https://x.com/ADITYAD29257528)
1,570 changes: 1,570 additions & 0 deletions Mushroom Binary Classification/Models/mushroom-binary-classiftion.ipynb

Large diffs are not rendered by default.

108 changes: 108 additions & 0 deletions Mushroom Binary Classification/README.md
@@ -0,0 +1,108 @@
# Mushroom Binary Classification

## Goal
The goal of this project is to build and evaluate various machine learning models that predict whether a mushroom is edible or poisonous from the characteristics provided in the dataset.

## Dataset
The dataset is sourced from the [Playground Series - Season 4, Episode 8 Kaggle competition](https://www.kaggle.com/competitions/playground-series-s4e8/data?select=train.csv).

### Columns in the Dataset
- `id`: Unique identifier for each mushroom.
- `class`: Whether the mushroom is edible or poisonous (the target variable).
- `cap-diameter`: Diameter of the mushroom cap.
- `cap-shape`: Shape of the mushroom cap.
- `cap-surface`: Surface type of the mushroom cap.
- `cap-color`: Color of the mushroom cap.
- `does-bruise-or-bleed`: Whether the mushroom bruises or bleeds.
- `gill-attachment`: Attachment type of the gills.
- `gill-spacing`: Spacing of the gills.
- `gill-color`: Color of the gills.
- `stem-height`: Height of the stem.
- `stem-width`: Width of the stem.
- `stem-root`: Type of the stem root.
- `stem-surface`: Surface type of the stem.
- `stem-color`: Color of the stem.
- `veil-type`: Type of the veil.
- `veil-color`: Color of the veil.
- `has-ring`: Whether the mushroom has a ring.
- `ring-type`: Type of the ring.
- `spore-print-color`: Color of the spore print.
- `habitat`: Habitat of the mushroom.
- `season`: Season when the mushroom is found.

## Description
This project involves the following steps:
1. Exploratory Data Analysis (EDA)
2. Data Preprocessing
3. Model Training and Evaluation
4. Selecting the Best Model
5. Making Predictions on the Test Dataset
6. Saving the Predictions for Submission

## What I Did

### Exploratory Data Analysis (EDA)
Performed extensive EDA to understand the distribution of features, the relationships between variables, and the overall structure of the dataset. This included:
- Visualizations of categorical features
- Distribution plots for numerical features
- Correlation analysis

### Data Preprocessing
- Handled mixed data types in columns.
- Filled missing values.
- Encoded categorical features using `LabelEncoder`.

### Models Implemented
The following machine learning models were trained and evaluated; a minimal training loop is sketched after the list:
1. Logistic Regression
2. Random Forest
3. Gradient Boosting
4. AdaBoost
5. Extra Trees
6. XGBoost
7. CatBoost
8. LightGBM
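
A minimal sketch of how such a comparison can be set up, assuming the preprocessed features are in `X` and the encoded target in `y`; the model subset and hyperparameters shown are illustrative, and the remaining models from the list follow the same pattern:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Hold out a validation split for comparing the models.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    "LightGBM": LGBMClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_val, model.predict(X_val))
    print(f"{name}: {results[name]:.4f}")
```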

### Libraries Needed
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- xgboost
- catboost
- lightgbm

## EDA Results
- Identified mixed data types in certain columns and converted them to strings.
- Visualized the distribution of categorical and numerical features.
- Analyzed correlations between features and the target variable.

![EDA](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___4_0.png?raw=true)
![EDA](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___6_1.png?raw=true)
![EDA](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___8_0.png?raw=true)
![EDA](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___10_0.png?raw=true)
![EDA](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___11_1.png?raw=true)

## Performance of the Models Based on Accuracy Scores
- Logistic Regression: 0.6213
- Random Forest: 0.9920
- Gradient Boosting: 0.9288
- AdaBoost: 0.8000
- Extra Trees: 0.9915
- XGBoost: 0.9912
- CatBoost: 0.9871
- LightGBM: 0.9888

## Conclusion
The Random Forest model achieved the highest accuracy on the validation set, making it the best model for this task. This model was used to make predictions on the test dataset.

## Predictions
The predictions were made using the Random Forest model and saved in `submission.csv`.
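
A minimal sketch of this step, assuming `best_model` is the fitted Random Forest, `test_df` is the test set preprocessed with the same encoders as the training data, and `target_le` is the `LabelEncoder` fitted on `class`; these names and the column layout are assumptions:

```python
import pandas as pd

# Predict on the preprocessed test features (everything except the id column).
test_ids = test_df["id"]
X_test = test_df.drop(columns=["id"])
preds = best_model.predict(X_test)

# Map the encoded predictions back to the original class labels.
labels = target_le.inverse_transform(preds)

# Save in the competition's submission format.
submission = pd.DataFrame({"id": test_ids, "class": labels})
submission.to_csv("submission.csv", index=False)
```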

## Signature
- **Name:** Aditya D
- **GitHub:** [https://www.github.com/adi271001](https://www.github.com/adi271001)
- **LinkedIn:** [https://www.linkedin.com/in/aditya-d-23453a179/](https://www.linkedin.com/in/aditya-d-23453a179/)
- **Topmate:** [https://topmate.io/aditya_d/](https://topmate.io/aditya_d/)
- **Twitter:** [https://x.com/ADITYAD29257528](https://x.com/ADITYAD29257528)