From c8a82c22b046e2a4ad2dbeac5fa8dc18da705f58 Mon Sep 17 00:00:00 2001
From: VARUNSHIYAM <138989960+Varunshiyam@users.noreply.github.com>
Date: Tue, 5 Nov 2024 12:03:15 +0530
Subject: [PATCH 1/2] Fixes Mushroom Classification

---
 .../mushroom-classification-notebook.ipynb    | 1020 +++++++++++++++++
 1 file changed, 1020 insertions(+)
 create mode 100644 Prediction Models/Mushroom_Classification/mushroom-classification-notebook.ipynb

diff --git a/Prediction Models/Mushroom_Classification/mushroom-classification-notebook.ipynb b/Prediction Models/Mushroom_Classification/mushroom-classification-notebook.ipynb
new file mode 100644
index 00000000..4860f39c
--- /dev/null
+++ b/Prediction Models/Mushroom_Classification/mushroom-classification-notebook.ipynb	
@@ -0,0 +1,1020 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Mushroom Classification\n",
+    "\n",
+    "## Modelling Objective\n",
+    "Build a **simple** and **interpretable** model to perform **binary classification** of the edibility of mushrooms from the *Agaricus and Lepiota family*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:45.398443Z",
+     "iopub.status.busy": "2021-06-04T17:10:45.398122Z",
+     "iopub.status.idle": "2021-06-04T17:10:45.403054Z",
+     "shell.execute_reply": "2021-06-04T17:10:45.401917Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:45.398415Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import scipy.stats as ss\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns\n",
+    "from collections import Counter\n",
+    "from tqdm import tqdm"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Reading Dataset\n",
+    "\n",
+    "The [mushroom dataset](https://archive.ics.uci.edu/ml/datasets/mushroom) includes descriptions of hypothetical samples corresponding to 23\n",
+    "species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). \n",
+    "\n",
+    "Each species is identified as definitely *edible, definitely poisonous, or of unknown\n",
+    "edibility and not recommended*. This latter class was combined with the poisonous\n",
+    "one. \n",
+    "\n",
+    "Hence, the task is a binary classification problem: \n",
+    "given the features of a mushroom, we classify it \n",
+    "as **p=poisonous** or **e=edible**.\n",
+    "\n",
+    "## Data Dictionary\n",
+    "| Columns | Descriptions |\n",
+    "| :--- | :--- |\n",
+    "| class                   | poisonous=p, edible=e| \n",
+    "| cap-shape               | bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s|\n",
+    "| cap-surface             | fibrous=f,grooves=g,scaly=y,smooth=s |\n",
+    "| cap-color               | brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y|\n",
+    "| bruises                 | bruises=t,no=f |\n",
+    "| odor                    | almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s |\n",
+    "| gill-attachment         | attached=a,descending=d,free=f,notched=n|\n",
+    "| gill-spacing            | close=c,crowded=w,distant=d|\n",
+    "| gill-size               | broad=b,narrow=n |\n",
+    "| gill-color              | black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y|\n",
+    "| stalk-shape             | enlarging=e,tapering=t |\n",
+    "| stalk-root              | bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,**missing=?** |\n",
+    "| stalk-surface-above-ring| fibrous=f,scaly=y,silky=k,smooth=s|\n",
+    "| stalk-surface-below-ring| fibrous=f,scaly=y,silky=k,smooth=s|\n",
+    "| stalk-color-above-ring  | brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y|\n",
+    "| stalk-color-below-ring  | brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y|\n",
+    "| veil-type               | partial=p,universal=u|\n",
+    "| veil-color              | brown=n,orange=o,white=w,yellow=y|\n",
+    "| ring-number             | none=n,one=o,two=t|\n",
+    "| ring-type               | cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z|\n",
+    "| spore-print-color       | black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y|\n",
+    "| population              | abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y|\n",
+    "| habitat                 | grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d|"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:45.410235Z",
+     "iopub.status.busy": "2021-06-04T17:10:45.409894Z",
+     "iopub.status.idle": "2021-06-04T17:10:45.458726Z",
+     "shell.execute_reply": "2021-06-04T17:10:45.457527Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:45.410206Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "mushroom_df = pd.read_csv(\"../input/mushroom-classification/mushrooms.csv\", \n",
+    "    na_values=\"?\", # masking \"?\" with Null Values\n",
+    "    )\n",
+    "mushroom_df.rename(columns = {\"class\":\"is-edible\"}, inplace = True)\n",
+    "mushroom_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Exploratory Data Analysis\n",
+    "Understand the dataset and flag any flaws in it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Descriptive Summaries\n",
+    "Running `.info()` on our dataframe gives the following initial observations\n",
+    "about the dataset.\n",
+    "\n",
+    "**Observations**\n",
+    "\n",
+    "1. The shape of the dataset is `(8124, 23)`: there are 8124 observations and 23 columns \n",
+    "(22 features + 1 target variable: `\"is-edible\"`).\n",
+    "2. The datatype of every column is `object`. However, according to the documentation, some numerical features are encoded as strings **(e.g. ring-number)**.\n",
+    "3. Missing values are observed in the `\"stalk-root\"` column: 2480 missing values, or around 30.5% of the observations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:45.461125Z",
+     "iopub.status.busy": "2021-06-04T17:10:45.460712Z",
+     "iopub.status.idle": "2021-06-04T17:10:45.48465Z",
+     "shell.execute_reply": "2021-06-04T17:10:45.483776Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:45.461082Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "mushroom_df.info()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Unique Values Exploration\n",
+    "Since all our columns are encoded as strings, one way to explore the values is \n",
+    "to count the number of unique values in each column.\n",
+    "\n",
+    "**Observations**\n",
+    "\n",
+    "1. Constant Value Column(1 Unique Value): \n",
+    "\n",
+    "    `\"veil-type\"`\n",
+    "    \n",
+    "    Since every data point has the constant value `p` in `\"veil-type\"`, the column provides \n",
+    "    no information about the target variable.\n",
+    "    \n",
+    "    > One approach is to **Drop the `\"veil-type\"` column entirely**.\n",
+    "\n",
+    "2. Binary Columns(2 Unique Values): \n",
+    "\n",
+    "    `[\"is-edible\"(label), \"bruises\", \"gill-attachment\", \"gill-spacing\", \"gill-size\", \"stalk-shape\"]`\n",
+    "\n",
+    "3. Nominal Categorical Columns(>2 Unique Values):\n",
+    "    \n",
+    "    `[\"cap-shape\", \"cap-surface\", \"cap-color\", \"odor\", \"gill-color\",\n",
+    "       \"stalk-root\", \"stalk-surface-above-ring\", \"stalk-surface-below-ring\",\n",
+    "       \"stalk-color-above-ring\", \"stalk-color-below-ring\", \"veil-color\", \"ring-type\", \n",
+    "       \"spore-print-color\", \"population\", \"habitat\"]`\n",
+    "\n",
+    "    There are 15 nominal categorical columns which describe the characteristics of mushrooms, \n",
+    "    including texture, color, population and habitat.\n",
+    "\n",
+    "    > As there is an abundance of nominal categorical features, creating one-hot variables \n",
+    "    for all of them might create an excessively high-dimensional feature space, which can be \n",
+    "    computationally expensive and prone to overfitting ([aka \"The Curse of Dimensionality\"](https://towardsdatascience.com/the-curse-of-dimensionality-50dc6e49aa1e)).\n",
+    "\n",
+    "    > There might be a need to explore feature selection and/or [dimensionality reduction](https://towardsdatascience.com/5-must-know-dimensionality-reduction-techniques-via-prince-e6ffb27e55d1) to mitigate the curse of dimensionality.\n",
+    "4. Discrete Numerical Columns(Countable Values):\n",
+    "\n",
+    "    `\"ring-number\"`\n",
+    "\n",
+    "    Although it is technically a numerical column, the number of unique values is low (`nunique() == 3`), so we can treat it as a categorical column during encoding. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:45.485734Z",
+     "iopub.status.busy": "2021-06-04T17:10:45.485493Z",
+     "iopub.status.idle": "2021-06-04T17:10:45.518231Z",
+     "shell.execute_reply": "2021-06-04T17:10:45.517145Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:45.485711Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "mushroom_df.nunique().sort_values()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Countplot\n",
+    "A countplot is a convenient tool to quickly explore the count of each unique value of a categorical variable.\n",
+    "Since the entire dataset is made up of categorical variables, we can generate a countplot for every column.\n",
+    "\n",
+    "**Observations**\n",
+    "\n",
+    "1. Balanced Label\n",
+    "\n",
+    "     The class frequency of the target variable `is-edible` is relatively balanced with 4208 instances classified as edible and 3916 instances classified as poisonous.\n",
+    "     \n",
+    "2. High Cardinality for Categorical Features\n",
+    "\n",
+    "     Most features with more than 2 unique values suffer from high cardinality with rare minority categories. This would make the resulting matrix sparse if we were to perform one-hot encoding without any feature selection or dimensionality reduction.\n",
+    "     "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:45.52001Z",
+     "iopub.status.busy": "2021-06-04T17:10:45.519665Z",
+     "iopub.status.idle": "2021-06-04T17:10:52.34298Z",
+     "shell.execute_reply": "2021-06-04T17:10:52.341965Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:45.519967Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(15,10))\n",
+    "for i, col in enumerate(mushroom_df.columns):\n",
+    "    sns.set_palette(sns.color_palette(\"Paired\"))\n",
+    "    ax = plt.subplot(6,4,i+1)\n",
+    "    sns.countplot(\n",
+    "        x=col, data = mushroom_df, ax = ax, \n",
+    "        order = mushroom_df[col].value_counts(ascending=True).index\n",
+    "    )\n",
+    "    sns.set_style('whitegrid')\n",
+    "    plt.xticks(rotation=90)\n",
+    "    plt.ylabel(\"Count\")\n",
+    "    plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Summary:**\n",
+    "\n",
+    "1. A constant-value column exists (`veil-type`), which shall be dropped as it carries no information about the target variable.\n",
+    "2. Data cleaning or imputation is needed to treat the missing values in `stalk-root`.\n",
+    "3. Encode binary columns (unique values = 2) as dummy variables and apply one-hot encoding to nominal categorical columns (unique values > 2).\n",
+    "4. Feature Selection might be required to reduce the dimension of the dataset."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Data Preprocessing\n",
+    "Preprocess the dataset into a format that the model can consume."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Drop Constant Value Column\n",
+    "\n",
+    "The `veil-type` column is dropped as it has a constant value of \"p\", which carries no information about the target variable."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:52.345878Z",
+     "iopub.status.busy": "2021-06-04T17:10:52.345595Z",
+     "iopub.status.idle": "2021-06-04T17:10:52.3518Z",
+     "shell.execute_reply": "2021-06-04T17:10:52.350764Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:52.34585Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "mushroom_df.drop(columns=\"veil-type\", inplace = True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data Cleaning and Imputation\n",
+    "\n",
+    "Since around 30.5% of the values of the `stalk-root` feature are missing, we can try out the following approaches:\n",
+    "\n",
+    "1. Drop the entire `stalk-root` column\n",
+    "2. Impute with the central tendency (most frequent = \"b\")\n",
+    "3. Impute with a more advanced algorithm from scikit-learn (e.g. IterativeImputer, KNNImputer)\n",
+    "\n",
+    "We will go with the first approach since it is the simplest solution and does not alter the distribution of the remaining features. We will revisit the decision later based on the model's performance."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:52.355466Z",
+     "iopub.status.busy": "2021-06-04T17:10:52.355108Z",
+     "iopub.status.idle": "2021-06-04T17:10:52.375412Z",
+     "shell.execute_reply": "2021-06-04T17:10:52.373877Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:52.355429Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "mushroom_df.dropna(axis=1, inplace=True)\n",
+    "\n",
+    "print(\"stalk-root\" in mushroom_df.columns)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:52.377259Z",
+     "iopub.status.busy": "2021-06-04T17:10:52.376897Z",
+     "iopub.status.idle": "2021-06-04T17:10:52.382523Z",
+     "shell.execute_reply": "2021-06-04T17:10:52.381458Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:52.377228Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.feature_selection import chi2, RFECV\n",
+    "from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV\n",
+    "from sklearn.dummy import DummyClassifier\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.neighbors import KNeighborsClassifier\n",
+    "from sklearn.svm import SVC, LinearSVC\n",
+    "from sklearn.tree import DecisionTreeClassifier, plot_tree\n",
+    "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
+    "from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, auc, roc_curve, plot_roc_curve"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Train-Test-Split\n",
+    "We split the data into training and testing sets before in-depth EDA and preprocessing to avoid data leakage. This ensures that all decisions are based on the training set and that the testing set is left untouched."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:52.383831Z",
+     "iopub.status.busy": "2021-06-04T17:10:52.38358Z",
+     "iopub.status.idle": "2021-06-04T17:10:52.400667Z",
+     "shell.execute_reply": "2021-06-04T17:10:52.399538Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:52.383807Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "train_df, test_df = train_test_split(mushroom_df, test_size = 0.3, random_state = 12)\n",
+    "print(train_df.shape)\n",
+    "print(test_df.shape)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Feature Selection\n",
+    "Although we have a total of 107 features after feature encoding, some of the features might not be useful for modelling because they occur too rarely, or because they are just noise that carries no information about the target variable.\n",
+    "\n",
+    "For that reason, feature selection is needed to investigate the strong and weak features and to decide how to engineer features before we start modelling.\n",
+    "\n",
+    "*All investigation and inference are made on the training set to minimize data leakage, which would bias the results of model evaluation.*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Cramer's V Correlation Matrix\n",
+    "Cramer's V measures the strength of association between two categorical variables, including contingency tables larger than 2x2. It is used as a post-test to determine the strength of association after a chi-square test has determined significance. The implementation below additionally applies a bias correction to the statistic. \n",
+    "\n",
+    "$$\n",
+    "V = \\sqrt{\\frac{\\chi^2/n}{\\min(k-1,\\ r-1)}}\\\\\n",
+    "\\chi^2 : \\text{chi-square statistic}\\\\\n",
+    "r,\\ k : \\text{number of rows and columns in the contingency table}\\\\\n",
+    "n : \\text{number of observations}\n",
+    "\\\\\n",
+    "\\text{(weak association) } 0 \\le V \\le 1 \\text{ (strong association)}\n",
+    "$$\n",
+    "\n",
+    "Reference : [Cramer's V correlation matrix](https://www.kaggle.com/chrisbss1/cramer-s-v-correlation-matrix)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:52.402699Z",
+     "iopub.status.busy": "2021-06-04T17:10:52.402339Z",
+     "iopub.status.idle": "2021-06-04T17:10:52.677575Z",
+     "shell.execute_reply": "2021-06-04T17:10:52.676686Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:52.402656Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "def cramers_V(var1,var2) :\n",
+    "    crosstab = np.array(pd.crosstab(var1, var2, rownames=None, colnames=None)) # Build the contingency table\n",
+    "    chi2 = ss.chi2_contingency(crosstab)[0] # Keep only the chi-square test statistic\n",
+    "    n = np.sum(crosstab) # Number of observations\n",
+    "    phi2 = chi2/n\n",
+    "    r, k = crosstab.shape\n",
+    "    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))\n",
+    "    rcorr = r - ((r-1)**2)/(n-1)\n",
+    "    kcorr = k - ((k-1)**2)/(n-1)\n",
+    "    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))\n",
+    "rows= []\n",
+    "for var in mushroom_df.columns:\n",
+    "    cramers =cramers_V(train_df['is-edible'], train_df[var]) # Cramer's V test\n",
+    "    rows.append(round(cramers,2))\n",
+    "cramers_results = np.array(rows)\n",
+    "cramers_V_matrix = pd.DataFrame(cramers_results,columns = ['is-edible'] ,index = mushroom_df.columns)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:52.679079Z",
+     "iopub.status.busy": "2021-06-04T17:10:52.678655Z",
+     "iopub.status.idle": "2021-06-04T17:10:52.989401Z",
+     "shell.execute_reply": "2021-06-04T17:10:52.988709Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:52.679037Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "fig, ax = plt.subplots(figsize = (9,7))\n",
+    "sns.barplot(y=\"index\", x=\"is-edible\", data = cramers_V_matrix[['is-edible']].sort_values('is-edible').reset_index(), color = \"red\")\n",
+    "plt.title(\"Cramer's V Correlation to is-edible\")\n",
+    "plt.show()\n",
+    "display(cramers_V_matrix[['is-edible']].sort_values('is-edible'))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "From the Cramer's V results, the following are the observations:\n",
+    "\n",
+    "1. Odor seems to be a strong predictor, as its association with the target variable `is-edible` is very high.\n",
+    "2. There seems to be little association between `['stalk-shape', 'gill-attachment', 'veil-color', 'cap-surface', 'ring-number', 'cap-color', 'cap-shape']` and the target variable `\"is-edible\"`.\n",
+    "       \n",
+    "    This might be due to the presence of some minority categories (values with few observations), or there may simply be no association.\n",
+    "\n",
+    "To test the hypothesis above, we print out the contingency table for each feature with a low Cramer's V score and inspect it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:52.990773Z",
+     "iopub.status.busy": "2021-06-04T17:10:52.990381Z",
+     "iopub.status.idle": "2021-06-04T17:10:53.117315Z",
+     "shell.execute_reply": "2021-06-04T17:10:53.11634Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:52.990733Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "cols = ['stalk-shape', 'gill-attachment', 'veil-color', 'cap-surface', 'ring-number', 'cap-color', 'cap-shape']\n",
+    "for col in cols:\n",
+    "    display(pd.crosstab(train_df['is-edible'], train_df[col]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Analysing the printed cross-tabs gives the following observations:\n",
+    "\n",
+    "1. Among the features flagged as weakly associated with the target variable, some values seem to give a strong split and can make a strong feature once one-hot encoded (e.g. gill-attachment, veil-color, ring-number, cap-shape).\n",
+    "2. However, there are also features whose values are all ambiguous for classifying the target variable (e.g. stalk-shape, cap-surface). Although there might be a hidden relationship when several features are combined, we first **drop the `[\"stalk-shape\", \"cap-surface\"]` features** from our dataset and revisit the decision after modelling."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:53.118678Z",
+     "iopub.status.busy": "2021-06-04T17:10:53.118423Z",
+     "iopub.status.idle": "2021-06-04T17:10:53.126782Z",
+     "shell.execute_reply": "2021-06-04T17:10:53.125799Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:53.118654Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "train_df.drop(columns = [\"stalk-shape\", \"cap-surface\"], inplace = True)\n",
+    "test_df.drop(columns = [\"stalk-shape\", \"cap-surface\"], inplace = True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Feature Encoding (Dummies Encoding + One-Hot Encoding)\n",
+    "To start, we perform one-hot encoding on the nominal categorical columns (>2 unique values) to encode them into values of 0 and 1.\n",
+    "Dummy encoding is then used to encode the binary columns (unique values = 2), including the target variable, with the `drop_first=True` flag to avoid perfectly correlated encoded features."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:53.128683Z",
+     "iopub.status.busy": "2021-06-04T17:10:53.128305Z",
+     "iopub.status.idle": "2021-06-04T17:10:53.244812Z",
+     "shell.execute_reply": "2021-06-04T17:10:53.243611Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:53.128643Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "binary_col = train_df.columns[train_df.nunique() == 2]\n",
+    "categorical_col = [col for col in train_df.columns if col not in binary_col]\n",
+    "\n",
+    "dummy_encode_train = pd.get_dummies(train_df[binary_col], drop_first= True, prefix_sep=\"-\") #dummy encode for binary features\n",
+    "onehot_encode_train = pd.get_dummies(train_df[categorical_col], drop_first= False, prefix_sep=\"-\") #onehot encode for categorical features\n",
+    "\n",
+    "dummy_encode_test = pd.get_dummies(test_df[binary_col], drop_first= True, prefix_sep=\"-\") #dummy encode for binary features\n",
+    "onehot_encode_test = pd.get_dummies(test_df[categorical_col], drop_first= False, prefix_sep=\"-\") #onehot encode for categorical features\n",
+    "\n",
+    "train_onehot = pd.concat([dummy_encode_train, onehot_encode_train], axis = 1)\n",
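+    "test_onehot = pd.concat([dummy_encode_test, onehot_encode_test], axis = 1).reindex(columns = train_onehot.columns, fill_value = 0) # align test columns with the training columns\n",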
+    "\n",
+    "display(train_onehot)\n",
+    "display(test_onehot)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Splitting Features and Target Variable\n",
+    "After one-hot encoding, we split our training and testing set into X_train_raw, X_test_raw, y_train, y_test."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:53.246621Z",
+     "iopub.status.busy": "2021-06-04T17:10:53.246243Z",
+     "iopub.status.idle": "2021-06-04T17:10:53.290873Z",
+     "shell.execute_reply": "2021-06-04T17:10:53.290059Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:53.246576Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "X_train_raw, y_train = train_onehot.drop(columns = 'is-edible-p'), train_onehot['is-edible-p']\n",
+    "X_test_raw, y_test = test_onehot.drop(columns = 'is-edible-p'), test_onehot['is-edible-p']\n",
+    "display(X_train_raw)\n",
+    "display(X_test_raw)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Recursive Feature Elimination\n",
+    "As we have 101 features after feature encoding, we use Recursive Feature Elimination (RFE) with a linear-kernel Support Vector Machine to obtain a simpler model. RFE ranks the features by the weights of the fitted linear SVM, which finds the linear split with the maximum margin. This is helpful for binary classification.\n",
+    "\n",
+    "After running the feature selection process, RFE reduced the number of recommended features to just 10, which we will use for modelling later on."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:10:53.292388Z",
+     "iopub.status.busy": "2021-06-04T17:10:53.291941Z",
+     "iopub.status.idle": "2021-06-04T17:11:35.783886Z",
+     "shell.execute_reply": "2021-06-04T17:11:35.782844Z",
+     "shell.execute_reply.started": "2021-06-04T17:10:53.292344Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "svm = SVC(kernel='linear') # SVM with a linear kernel, which exposes the coef_ weights RFE needs\n",
+    "min_features_to_select = 1  # Minimum number of features to consider\n",
+    "rfecv = RFECV(estimator=svm, step=1, cv=5,\n",
+    "              scoring='accuracy',\n",
+    "              min_features_to_select=min_features_to_select)\n",
+    "rfecv.fit(X_train_raw, y_train)\n",
+    "\n",
+    "print(\"Optimal number of features : %d\" % rfecv.n_features_)\n",
+    "\n",
+    "# Plot number of features VS. cross-validation scores\n",
+    "plt.figure()\n",
+    "plt.xlabel(\"Number of features selected\")\n",
+    "plt.ylabel(\"Cross-validation score (accuracy)\")\n",
+    "plt.plot(range(min_features_to_select,\n",
+    "               len(rfecv.grid_scores_) + min_features_to_select),\n",
+    "         rfecv.grid_scores_)\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:11:35.785522Z",
+     "iopub.status.busy": "2021-06-04T17:11:35.785245Z",
+     "iopub.status.idle": "2021-06-04T17:11:35.792305Z",
+     "shell.execute_reply": "2021-06-04T17:11:35.791206Z",
+     "shell.execute_reply.started": "2021-06-04T17:11:35.785495Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "X_train = X_train_raw.loc[:,rfecv.ranking_==1]\n",
+    "X_test = X_test_raw.loc[:,rfecv.ranking_==1]\n",
+    "print(X_train.shape)\n",
+    "print(X_test.shape)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Modeling\n",
+    "Having performed feature selection to reduce the number of features from 107 to just 10, we are ready to use some sklearn models to see whether we can find a simple yet interpretable model that classifies whether a mushroom is poisonous.\n",
+    "\n",
+    "We first explore several simple statistical models with default parameters and decide based on the evaluation outcome and on interpretability."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Baseline Classifier\n",
+    "We use a Dummy Classifier that predicts the most dominant class (Poisonous = False) and evaluate the score of this baseline predictor as a reference point for model selection."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:11:35.793841Z",
+     "iopub.status.busy": "2021-06-04T17:11:35.793577Z",
+     "iopub.status.idle": "2021-06-04T17:11:35.807774Z",
+     "shell.execute_reply": "2021-06-04T17:11:35.806753Z",
+     "shell.execute_reply.started": "2021-06-04T17:11:35.793816Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "dummy = DummyClassifier(strategy=\"most_frequent\") # always predict the majority class\n",
+    "dummy.fit(X_train, y_train)\n",
+    "print(\"Baseline Accuracy Score :{:.4f}\".format(dummy.score(X_test, y_test)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Model Selection"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:11:35.809229Z",
+     "iopub.status.busy": "2021-06-04T17:11:35.808929Z",
+     "iopub.status.idle": "2021-06-04T17:11:35.821032Z",
+     "shell.execute_reply": "2021-06-04T17:11:35.819972Z",
+     "shell.execute_reply.started": "2021-06-04T17:11:35.809199Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Defining Utility Function \n",
+    "def evaluate_model(models, X_train, X_test, y_train, y_test):\n",
+    "    hist = {}\n",
+    "    n_models = len(models)\n",
+    "    fig, axes = plt.subplots(1 + n_models // 4 ,4, figsize = (4*4, 2 + 4*(n_models//4)))\n",
+    "    \n",
+    "    for idx, model in tqdm(enumerate(models), total=n_models):\n",
+    "        try:\n",
+    "            clf = model(random_state=12) # Set random_state for models that accept it\n",
+    "        except TypeError: # e.g. KNeighborsClassifier takes no random_state\n",
+    "            clf = model()\n",
+    "        clf.fit(X_train, y_train)\n",
+    "        yhat_train = clf.predict(X_train)\n",
+    "        acc_train = accuracy_score(y_train, yhat_train)\n",
+    "        f1_train = f1_score(y_train, yhat_train)\n",
+    "        \n",
+    "        # 5-Fold CV\n",
+    "        cv_hist = cross_validate(clf, X_train, y_train, scoring=['accuracy', 'f1', 'roc_auc'])\n",
+    "\n",
+    "        # Record down the performance\n",
+    "        hist[model.__name__] = dict(\n",
+    "            train_acc = acc_train,\n",
+    "            cv_acc = cv_hist['test_accuracy'].mean(),\n",
+    "            train_f1_score = f1_train,\n",
+    "            cv_f1_score = cv_hist['test_f1'].mean(),\n",
+    "            cv_auc = cv_hist['test_roc_auc'].mean()\n",
+    "        )\n",
+    "\n",
+    "        # Plotting AUC ROC Curve with Test Set *Without taking any reference for Model Selection\n",
+    "        plot_roc_curve(clf, X = X_test, y = y_test, ax=axes[idx//4,idx%4])\n",
+    "       \n",
+    "    plt.tight_layout()\n",
+    "    display(pd.DataFrame(hist).T)\n",
+    "    plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:11:35.822711Z",
+     "iopub.status.busy": "2021-06-04T17:11:35.822411Z",
+     "iopub.status.idle": "2021-06-04T17:11:41.564327Z",
+     "shell.execute_reply": "2021-06-04T17:11:41.563308Z",
+     "shell.execute_reply.started": "2021-06-04T17:11:35.822682Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "models = [DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier, LogisticRegression, LinearSVC, KNeighborsClassifier]\n",
+    "evaluate_model(models, X_train, X_test, y_train, y_test)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "From the accuracy report, all the models perform well out of the box (without hyperparameter tuning): most reach an accuracy and AUC of 1.0, with Logistic Regression the exception.\n",
+    "\n",
+    "Since the goal of modelling is to find the simplest, most interpretable model that still describes the relationship between the features and the target label, we continue with `DecisionTreeClassifier` and tune its hyperparameters to see whether a simpler tree can still generalize well to the sample data."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Model Evaluation : Logistic Regression & Decision Tree Classifier"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:11:41.565624Z",
+     "iopub.status.busy": "2021-06-04T17:11:41.565351Z",
+     "iopub.status.idle": "2021-06-04T17:11:41.854789Z",
+     "shell.execute_reply": "2021-06-04T17:11:41.853654Z",
+     "shell.execute_reply.started": "2021-06-04T17:11:41.565597Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "logex = LogisticRegression()\n",
+    "logex.fit(X_train, y_train)\n",
+    "yhat = logex.predict(X_test)\n",
+    "hist = cross_validate(logex,X_train,y_train,cv=5)\n",
+    "acc = accuracy_score(y_test, yhat)\n",
+    "print(\"Logistic Regression\")\n",
+    "print(\"Cross Validation Score:\", pd.DataFrame(hist)[[\"test_score\"]].describe().T[[\"mean\",\"std\"]], sep=\"\\n\")\n",
+    "print(\"Test Set Accuracy Score: {:.4f}\".format(acc))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:11:41.856877Z",
+     "iopub.status.busy": "2021-06-04T17:11:41.856256Z",
+     "iopub.status.idle": "2021-06-04T17:11:41.957307Z",
+     "shell.execute_reply": "2021-06-04T17:11:41.956121Z",
+     "shell.execute_reply.started": "2021-06-04T17:11:41.856829Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "tree = DecisionTreeClassifier(random_state=12)\n",
+    "tree.fit(X_train, y_train)\n",
+    "yhat = tree.predict(X_test)\n",
+    "hist = cross_validate(tree,X_train,y_train,cv=5)\n",
+    "acc = accuracy_score(y_test, yhat)\n",
+    "print(\"Cross Validation Score:\", pd.DataFrame(hist)[[\"test_score\"]].describe().T[[\"mean\",\"std\"]], sep=\"\\n\")\n",
+    "print(\"Test Set Accuracy Score: {:.4f}\".format(acc))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "From the evaluation metrics, the decision tree reaches an accuracy of 1.0 both in K-fold cross-validation and on the hold-out test set, which the algorithm has not seen. Hence, we continue with hyperparameter tuning to simplify the model further for better interpretability."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Hyperparameter Tuning"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:11:41.959373Z",
+     "iopub.status.busy": "2021-06-04T17:11:41.958947Z",
+     "iopub.status.idle": "2021-06-04T17:11:44.696522Z",
+     "shell.execute_reply": "2021-06-04T17:11:44.695381Z",
+     "shell.execute_reply.started": "2021-06-04T17:11:41.959331Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "model = LogisticRegression()\n",
+    "space = dict(\n",
+    "    penalty = ['l2', 'l1'],\n",
+    "    C = np.logspace(-4,4,5),\n",
+    "    solver = ['liblinear']  # liblinear supports both the l1 and l2 penalties (the default lbfgs rejects l1)\n",
+    ")\n",
+    "clf = GridSearchCV(model, space, n_jobs=-1, cv = 5)\n",
+    "clf.fit(X_train, y_train)\n",
+    "print(\"Best Parameters for Logistic Regression Classifier:\\n{}\".format(clf.best_params_))\n",
+    "print(\"Best Score : \\n{}\".format(clf.best_score_))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:11:44.699065Z",
+     "iopub.status.busy": "2021-06-04T17:11:44.698372Z",
+     "iopub.status.idle": "2021-06-04T17:11:46.12012Z",
+     "shell.execute_reply": "2021-06-04T17:11:46.119238Z",
+     "shell.execute_reply.started": "2021-06-04T17:11:44.699018Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "model = DecisionTreeClassifier(random_state=12)\n",
+    "space = dict(\n",
+    "    criterion = [\"gini\",\"entropy\"],\n",
+    "    max_depth = np.arange(1,11),\n",
+    "    max_features = [None, \"sqrt\", \"log2\"]  # \"auto\" was removed in scikit-learn 1.3 (it was an alias for \"sqrt\")\n",
+    ")\n",
+    "clf = GridSearchCV(model, space, n_jobs=-1, cv = 5)\n",
+    "clf.fit(X_train, y_train)\n",
+    "print(\"Best Parameters for Decision Tree Classifier:\\n{}\".format(clf.best_params_))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since our modelling objective is a simple and interpretable model, I decided to go with the Decision Tree Classifier because it produces an easily interpretable result in the form of a decision map."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Model Interpretation\n",
+    "Using the best parameters found by grid-search cross-validation, we train our final model and gain some insight by visualising the decision rules of the decision tree."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:11:46.122186Z",
+     "iopub.status.busy": "2021-06-04T17:11:46.12153Z",
+     "iopub.status.idle": "2021-06-04T17:11:46.137192Z",
+     "shell.execute_reply": "2021-06-04T17:11:46.135661Z",
+     "shell.execute_reply.started": "2021-06-04T17:11:46.122136Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "final_model = DecisionTreeClassifier(max_depth= 5, max_features= None,random_state=12)\n",
+    "final_model.fit(X_train, y_train)\n",
+    "print(\"Test Accuracy Score : {:.4f}\".format(final_model.score(X_test,y_test)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Feature Importance\n",
+    "The feature importances show that, of the 10 selected features, only 8 are used in the decision rules, and `odor-n`, which represents `No-odor`, has the highest importance.\n",
+    "\n",
+    "Meanwhile, \"Black Stalk Surface Above Ring\" and \"Woods Habitat\" are not used by the decision rules.\n",
+    "\n",
+    "Hence, we can conclude that only 8 true-or-false values are enough to classify effectively whether a mushroom is poisonous or edible."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:11:46.139614Z",
+     "iopub.status.busy": "2021-06-04T17:11:46.13915Z",
+     "iopub.status.idle": "2021-06-04T17:11:46.165258Z",
+     "shell.execute_reply": "2021-06-04T17:11:46.163655Z",
+     "shell.execute_reply.started": "2021-06-04T17:11:46.139571Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pd.DataFrame(list(zip(X_train.columns, final_model.feature_importances_)), columns = ['features', 'importance']).sort_values('importance', ascending = False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Visualising Decision Tree"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2021-06-04T17:11:59.566534Z",
+     "iopub.status.busy": "2021-06-04T17:11:59.566191Z",
+     "iopub.status.idle": "2021-06-04T17:12:01.657884Z",
+     "shell.execute_reply": "2021-06-04T17:12:01.653921Z",
+     "shell.execute_reply.started": "2021-06-04T17:11:59.566505Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "fig, axes = plt.subplots(figsize = (12,9), dpi=500)\n",
+    "plot_tree(final_model, feature_names = X_train.columns, fontsize=7)\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Conclusion\n",
+    "Starting from a dataset with 22 categorical columns and more than 100 dummy features, we spotted and rectified the errors in the dataset (single-valued columns, null values) and performed feature selection to narrow the features down to just 10. The final model uses only 8 of them and achieves perfect classification according to the cross-validation and hold-out test set scores."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}

From edf76f884c96fe92f1f3ef9fa15a488801c4682f Mon Sep 17 00:00:00 2001
From: VARUNSHIYAM <138989960+Varunshiyam@users.noreply.github.com>
Date: Tue, 5 Nov 2024 12:04:33 +0530
Subject: [PATCH 2/2] Create Readme.md

---
 .../Mushroom_Classification/Readme.md         | 62 +++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 Prediction Models/Mushroom_Classification/Readme.md

diff --git a/Prediction Models/Mushroom_Classification/Readme.md b/Prediction Models/Mushroom_Classification/Readme.md
new file mode 100644
index 00000000..d1c17a07
--- /dev/null
+++ b/Prediction Models/Mushroom_Classification/Readme.md	
@@ -0,0 +1,62 @@
+# Mushroom Classification using Machine Learning
+
+This repository contains a machine learning model to classify mushrooms as either edible or poisonous based on their features. By analyzing attributes like color, shape, and odor, the model aims to identify toxic mushrooms, providing a practical tool for foragers and researchers.
+
+## Table of Contents
+
+- [Problem Statement](#problem-statement)
+- [Project Overview](#project-overview)
+- [Classification Models](#classification-models)
+- [Dataset](#dataset)
+- [Preprocessing](#preprocessing)
+- [Training and Evaluation](#training-and-evaluation)
+- [Results](#results)
+- [Usage](#usage)
+- [Future Work](#future-work)
+- [Contributing](#contributing)
+- [License](#license)
+
+## Problem Statement
+
+Mushroom poisoning can have serious health consequences. Identifying poisonous mushrooms traditionally requires expertise, making it challenging for laypersons. This project aims to develop a machine learning model that predicts whether a mushroom is poisonous or edible based on its physical characteristics, providing an accessible tool for safe foraging and ecological studies.
+
+## Project Overview
+
+This project applies several machine learning models to:
+1. **Train on a dataset of mushroom features**: Features like cap color, gill size, and odor are used as predictors.
+2. **Classify mushrooms as edible or poisonous**: Output is a binary classification indicating toxicity.
+
+## Classification Models
+
+The project compares multiple classification algorithms to determine the most effective:
+- **Logistic Regression**: Serves as a baseline binary classifier.
+- **Decision Trees**: Captures non-linear relationships among features.
+- **Random Forest**: An ensemble model enhancing decision tree performance.
+- **K-Nearest Neighbors (KNN)**: Classifies based on feature similarity.
+
+## Dataset
+
+The dataset contains mushroom attributes relevant to classification, such as:
+- **Cap shape and color**
+- **Gill attachment and size**
+- **Odor and habitat**
+
+Each entry is labeled as either **edible** or **poisonous**.
+
+## Preprocessing
+
+Preprocessing includes:
+1. **Encoding**: Transforming categorical variables into numerical format.
+2. **Normalization**: Scaling features for uniform model input.
+3. **Splitting**: Dividing data into training and test sets; cross-validation on the training set takes the place of a separate validation set.
+
+## Training and Evaluation
+
+- **Training**: Each model is fit on the labeled training data with its scikit-learn default settings and compared using 5-fold cross-validation.
+- **Evaluation Metrics**: Accuracy, F1-score, and ROC AUC are computed during cross-validation, and accuracy is reported on the held-out test set.
+
+## Results
+
+In the notebook, most models reach a cross-validated accuracy and AUC of 1.0 without tuning, with Logistic Regression the exception. The final depth-5 Decision Tree uses only 8 binary features and still classifies the held-out test set perfectly.
+
+