Merge pull request #733 from adi271001/Mushroom-Classification
Mushroom classification
abhisheks008 authored Nov 14, 2024
2 parents 537ba2e + 08b89ae commit 6185c4c
Showing 15 changed files with 2,079,804 additions and 0 deletions.
104 changes: 104 additions & 0 deletions Mushroom Binary Classification/Dataset/README.md
@@ -0,0 +1,104 @@
# Mushroom Binary Classification Model

## Goal
The goal of this project is to build and evaluate various machine learning models to predict the class of mushrooms based on their characteristics provided in the dataset.

## Dataset
The dataset is sourced from the [Playground Series - Season 4, Episode 8 Kaggle competition](https://www.kaggle.com/competitions/playground-series-s4e8/data?select=train.csv).

### Columns in the Dataset
- `id`: Unique identifier for each mushroom.
- `class`: The class of the mushroom (target variable).
- `cap-diameter`: Diameter of the mushroom cap.
- `cap-shape`: Shape of the mushroom cap.
- `cap-surface`: Surface type of the mushroom cap.
- `cap-color`: Color of the mushroom cap.
- `does-bruise-or-bleed`: Whether the mushroom bruises or bleeds.
- `gill-attachment`: Attachment type of the gills.
- `gill-spacing`: Spacing of the gills.
- `gill-color`: Color of the gills.
- `stem-height`: Height of the stem.
- `stem-width`: Width of the stem.
- `stem-root`: Type of the stem root.
- `stem-surface`: Surface type of the stem.
- `stem-color`: Color of the stem.
- `veil-type`: Type of the veil.
- `veil-color`: Color of the veil.
- `has-ring`: Whether the mushroom has a ring.
- `ring-type`: Type of the ring.
- `spore-print-color`: Color of the spore print.
- `habitat`: Habitat of the mushroom.
- `season`: Season when the mushroom is found.

## Description
This project involves the following steps:
1. Exploratory Data Analysis (EDA)
2. Data Preprocessing
3. Model Training and Evaluation
4. Selecting the Best Model
5. Making Predictions on the Test Dataset
6. Saving the Predictions for Submission

## What I Did

### Exploratory Data Analysis (EDA)
Performed extensive EDA to understand the distribution of features, the relationships between variables, and the overall structure of the dataset. This included (a short code sketch follows the list):
- Visualizations of categorical features
- Distribution plots for numerical features
- Correlation analysis
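
A minimal sketch of this kind of EDA, assuming the training file is `train.csv` and using the pandas/seaborn/matplotlib libraries listed below (the column names are taken from the dataset description above; the exact plots in the notebook may differ):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")  # path is an assumption

# Class balance of the target variable
sns.countplot(data=train, x="class")
plt.title("Class distribution")
plt.show()

# Distribution of a numerical feature
sns.histplot(data=train, x="cap-diameter", bins=50)
plt.title("Cap diameter distribution")
plt.show()

# Correlation between the numerical features
sns.heatmap(train.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.title("Correlation matrix (numerical features)")
plt.show()
```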

### Data Preprocessing
- Handled mixed data types in columns.
- Filled missing values.
- Encoded categorical features using `LabelEncoder` (see the sketch below).
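
A minimal sketch of these preprocessing steps; the fill strategies ("missing" token for categoricals, median for numericals) are assumptions, and the notebook may handle them differently:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv("train.csv")  # path is an assumption

# Categorical / mixed-type columns: fill missing values, then force everything to string
cat_cols = train.select_dtypes(include="object").columns
train[cat_cols] = train[cat_cols].fillna("missing").astype(str)

# Numerical columns: fill missing values with the column median (assumed strategy)
num_cols = train.select_dtypes(include="number").columns.drop("id", errors="ignore")
train[num_cols] = train[num_cols].fillna(train[num_cols].median())

# Label-encode every categorical column, keeping the fitted encoders so the
# same mapping can be applied to the test set later
encoders = {}
for col in cat_cols:
    le = LabelEncoder()
    train[col] = le.fit_transform(train[col])
    encoders[col] = le
```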

### Models Implemented
The following machine learning models were trained and evaluated (a comparison sketch follows the list):
1. Logistic Regression
2. Random Forest
3. Gradient Boosting
4. AdaBoost
5. Extra Trees
6. XGBoost
7. CatBoost
8. LightGBM
9. SVM
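
A minimal sketch of the training-and-evaluation loop for the scikit-learn models, assuming `train` has been preprocessed as in the sketch above and that `class` is the encoded target; the train/validation split and hyperparameters are assumptions, not values taken from the notebook:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              AdaBoostClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X = train.drop(columns=["id", "class"])
y = train["class"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Extra Trees": ExtraTreesClassifier(n_estimators=200, n_jobs=-1, random_state=42),
    "SVM": SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_val, model.predict(X_val)):.4f}")
```

The XGBoost, CatBoost, and LightGBM classifiers expose the same `fit`/`predict` interface and can be added to the same dictionary.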

### Libraries Needed
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- xgboost
- catboost
- lightgbm

## EDA Results
- Identified mixed data types in certain columns and converted them to strings.
- Visualized the distribution of categorical and numerical features.
- Analyzed correlations between features and the target variable.

## Performance of the Models Based on Accuracy Scores
- Logistic Regression: 0.6213
- Random Forest: 0.9920
- Gradient Boosting: 0.9288
- AdaBoost: 0.8000
- Extra Trees: 0.9916
- XGBoost: 0.9912
- CatBoost: 0.9871
- LightGBM: 0.9888
- SVM: 0.9200 (placeholder value; the actual score was not reported)

## Conclusion
The Random Forest model achieved the highest accuracy on the validation set, making it the best model for this task. This model was used to make predictions on the test dataset.

## Predictions
The predictions were made using the Random Forest model and saved in `submission.csv`.
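
A minimal sketch of this final step, assuming `X`, `y`, and `encoders` come from the sketches above and that the test set has been preprocessed and encoded the same way as the training data (paths and names are assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

test = pd.read_csv("test.csv")        # path is an assumption
X_test = test.drop(columns=["id"])    # assumed to be already encoded like X

best_model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
best_model.fit(X, y)                  # refitting on all training data is assumed

# Map encoded predictions back to the original class labels before writing the file
preds = encoders["class"].inverse_transform(best_model.predict(X_test))

submission = pd.DataFrame({"id": test["id"], "class": preds})
submission.to_csv("submission.csv", index=False)
```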

## Signature
- **Name:** Aditya D
- **GitHub:** [https://www.github.com/adi271001](https://www.github.com/adi271001)
- **LinkedIn:** [https://www.linkedin.com/in/aditya-d-23453a179/](https://www.linkedin.com/in/aditya-d-23453a179/)
- **Topmate:** [https://topmate.io/aditya_d/](https://topmate.io/aditya_d/)
- **Twitter:** [https://x.com/ADITYAD29257528](https://x.com/ADITYAD29257528)
49 changes: 49 additions & 0 deletions Mushroom Binary Classification/Models/README.md
@@ -0,0 +1,49 @@
## Models Implemented
This project involves training and evaluating the following machine learning models (a quick plot of the reported scores is sketched after the list):

1. **Logistic Regression**
- Accuracy: 0.6213
- A simple linear model used for binary classification.

2. **Random Forest**
- Accuracy: 0.9920
- An ensemble model that builds many decision trees and combines their predictions for a more accurate and stable result.

3. **Gradient Boosting**
- Accuracy: 0.9288
- An ensemble technique that builds models sequentially, each trying to correct the errors of the previous one.

4. **AdaBoost**
- Accuracy: 0.8000
- An ensemble method that combines multiple weak classifiers to form a strong classifier by focusing on the hard-to-classify instances.

5. **Extra Trees**
- Accuracy: 0.9915
- Similar to Random Forest, but by default each tree is trained on the whole training set (no bootstrap sampling) and split thresholds are chosen at random.

6. **XGBoost**
- Accuracy: 0.9912
- An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.

7. **CatBoost**
- Accuracy: 0.9871
- A gradient boosting library that handles categorical features natively (without manual encoding) and is highly efficient.

8. **LightGBM**
- Accuracy: 0.9888
- A gradient boosting framework that uses tree-based learning algorithms and is designed to be distributed and efficient.
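
The reported validation accuracies can be compared at a glance with a simple bar chart. A minimal matplotlib sketch that hard-codes the scores listed above; this is an illustration, not necessarily how the linked results figure was produced:

```python
import matplotlib.pyplot as plt

# Validation accuracies as reported above
scores = {
    "Logistic Regression": 0.6213,
    "Random Forest": 0.9920,
    "Gradient Boosting": 0.9288,
    "AdaBoost": 0.8000,
    "Extra Trees": 0.9915,
    "XGBoost": 0.9912,
    "CatBoost": 0.9871,
    "LightGBM": 0.9888,
}

plt.figure(figsize=(9, 4))
plt.bar(list(scores.keys()), list(scores.values()))
plt.ylabel("Validation accuracy")
plt.ylim(0.5, 1.0)
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```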

## Conclusion
The Random Forest model achieved the highest accuracy on the validation set, making it the best model for this task. This model was used to make predictions on the test dataset.

## Predictions
The predictions were made using the Random Forest model and saved in `submission.csv`.

![Results](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___20_1.png?raw=true)

## Signature
- **Name:** Aditya D
- **GitHub:** [https://www.github.com/adi271001](https://www.github.com/adi271001)
- **LinkedIn:** [https://www.linkedin.com/in/aditya-d-23453a179/](https://www.linkedin.com/in/aditya-d-23453a179/)
- **Topmate:** [https://topmate.io/aditya_d/](https://topmate.io/aditya_d/)
- **Twitter:** [https://x.com/ADITYAD29257528](https://x.com/ADITYAD29257528)
1,570 changes: 1,570 additions & 0 deletions Mushroom Binary Classification/Models/mushroom-binary-classiftion.ipynb

Large diffs are not rendered by default.

108 changes: 108 additions & 0 deletions Mushroom Binary Classification/README.md
@@ -0,0 +1,108 @@
# Mushroom Binary Classification

## Goal
The goal of this project is to build and evaluate various machine learning models to predict the class of mushrooms based on their characteristics provided in the dataset.

## Dataset
The dataset is sourced from the [Playground Series - Season 4, Episode 8 Kaggle competition](https://www.kaggle.com/competitions/playground-series-s4e8/data?select=train.csv).

### Columns in the Dataset
- `id`: Unique identifier for each mushroom.
- `class`: The class of the mushroom (target variable).
- `cap-diameter`: Diameter of the mushroom cap.
- `cap-shape`: Shape of the mushroom cap.
- `cap-surface`: Surface type of the mushroom cap.
- `cap-color`: Color of the mushroom cap.
- `does-bruise-or-bleed`: Whether the mushroom bruises or bleeds.
- `gill-attachment`: Attachment type of the gills.
- `gill-spacing`: Spacing of the gills.
- `gill-color`: Color of the gills.
- `stem-height`: Height of the stem.
- `stem-width`: Width of the stem.
- `stem-root`: Type of the stem root.
- `stem-surface`: Surface type of the stem.
- `stem-color`: Color of the stem.
- `veil-type`: Type of the veil.
- `veil-color`: Color of the veil.
- `has-ring`: Whether the mushroom has a ring.
- `ring-type`: Type of the ring.
- `spore-print-color`: Color of the spore print.
- `habitat`: Habitat of the mushroom.
- `season`: Season when the mushroom is found.

## Description
This project involves the following steps:
1. Exploratory Data Analysis (EDA)
2. Data Preprocessing
3. Model Training and Evaluation
4. Selecting the Best Model
5. Making Predictions on the Test Dataset
6. Saving the Predictions for Submission

## What I Did

### Exploratory Data Analysis (EDA)
Performed extensive EDA to understand the distribution of features, the relationships between variables, and the overall structure of the dataset. This included:
- Visualizations of categorical features
- Distribution plots for numerical features
- Correlation analysis

### Data Preprocessing
- Handled mixed data types in columns.
- Filled missing values.
- Encoded categorical features using `LabelEncoder`.

### Models Implemented
The following machine learning models were trained and evaluated (the gradient-boosting libraries are sketched briefly after this list):
1. Logistic Regression
2. Random Forest
3. Gradient Boosting
4. AdaBoost
5. Extra Trees
6. XGBoost
7. CatBoost
8. LightGBM
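
XGBoost, CatBoost, and LightGBM all expose a scikit-learn-style estimator interface, so they fit into the same train-and-score loop as the other models. A minimal sketch, assuming a preprocessed train/validation split (`X_train`, `y_train`, `X_val`, `y_val`) as described in the preprocessing and evaluation steps above; the variable names and hyperparameters are illustrative, not taken from the notebook:

```python
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

# Each library provides a classifier with fit/predict, so they can share one loop
boosters = {
    "XGBoost": XGBClassifier(n_estimators=300, learning_rate=0.1, eval_metric="logloss"),
    "CatBoost": CatBoostClassifier(iterations=300, learning_rate=0.1, verbose=0),
    "LightGBM": LGBMClassifier(n_estimators=300, learning_rate=0.1),
}

for name, model in boosters.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_val, model.predict(X_val)):.4f}")
```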

### Libraries Needed
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- xgboost
- catboost
- lightgbm

## EDA Results
- Identified mixed data types in certain columns and converted them to strings.
- Visualized the distribution of categorical and numerical features.
- Analyzed correlations between features and the target variable.

![EDA](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___4_0.png?raw=true)
![EDA](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___6_1.png?raw=true)
![EDA](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___8_0.png?raw=true)
![EDA](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___10_0.png?raw=true)
![EDA](https://github.com/adi271001/ML-Crate/blob/Mushroom-Classification/Mushroom%20Binary%20Classification/Images/__results___11_1.png?raw=true)

## Performance of the Models Based on Accuracy Scores
- Logistic Regression: 0.6213
- Random Forest: 0.9920
- Gradient Boosting: 0.9288
- AdaBoost: 0.8000
- Extra Trees: 0.9915
- XGBoost: 0.9912
- CatBoost: 0.9871
- LightGBM: 0.9888

## Conclusion
The Random Forest model achieved the highest accuracy on the validation set, making it the best model for this task. This model was used to make predictions on the test dataset.

## Predictions
The predictions were made using the Random Forest model and saved in `submission.csv`.

## Signature
- **Name:** Aditya D
- **GitHub:** [https://www.github.com/adi271001](https://www.github.com/adi271001)
- **LinkedIn:** [https://www.linkedin.com/in/aditya-d-23453a179/](https://www.linkedin.com/in/aditya-d-23453a179/)
- **Topmate:** [https://topmate.io/aditya_d/](https://topmate.io/aditya_d/)
- **Twitter:** [https://x.com/ADITYAD29257528](https://x.com/ADITYAD29257528)