Wine certification includes physicochemical tests such as the determination of density, pH, alcohol content, and fixed and volatile acidity. This project uses a large dataset of Vinho Verde wines that records the results of these tests together with a quality score on a scale of 1 to 10 (further transformed into three classes: 0, 1, and 2). A model that predicts quality from these measurements can be used not only by certification bodies but also by wine producers, to improve quality based on physicochemical properties, and by consumers, to estimate the quality of a wine.
The dataset is available from the UCI Machine Learning Repository (and also on Kaggle). The wines are divided into two categories, white and red. This analysis is concerned with white wine and is based on the 12 variables presented in the dataset:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
- Output variable (based on sensory data): quality (transformed score of 0, 1, or 2)
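The transformation of the raw quality score into the classes 0, 1, and 2 could be sketched as below. This is a minimal illustration, not the project's actual code: the cut points in `QUALITY_BINS` and the helper name `bin_quality` are assumptions, since the thresholds are not stated here, and a small in-memory sample stands in for the real file.

```python
import pandas as pd

# Hypothetical cut points: the README does not state how raw scores map to 0/1/2.
# With these edges: scores up to 4 -> 0, scores 5-6 -> 1, scores 7-10 -> 2.
QUALITY_BINS = [0, 4, 6, 10]
QUALITY_LABELS = [0, 1, 2]


def bin_quality(quality: pd.Series) -> pd.Series:
    """Map raw quality scores to three ordered classes (assumed thresholds)."""
    binned = pd.cut(
        quality,
        bins=QUALITY_BINS,
        labels=QUALITY_LABELS,
        include_lowest=True,  # keep a score of 0 inside the first bin
    )
    return binned.astype(int)


# Tiny in-memory sample standing in for the real data. With the UCI file on
# disk, something like pd.read_csv("winequality-white.csv", sep=";") would be
# used instead (file name and location are assumptions).
sample = pd.DataFrame(
    {"alcohol": [8.8, 9.5, 10.1, 12.4], "quality": [3, 5, 6, 8]}
)
sample["quality_class"] = bin_quality(sample["quality"])
print(sample)
```

Keeping the thresholds in one named constant makes it easy to try a different split of the classes without touching the rest of the analysis.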
The goal is to explore the Wine Quality dataset, extract its main features and characteristics, and predict wine quality. We treat this problem as a classification task. The tasks performed are:
- Initial Visual Analysis
- Data preprocessing
- Gathering Training and Testing Data
- Exploratory Data Analysis
- Steps to improve prediction results
- Quality prediction
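The preprocessing and train/test steps listed above might look like the following sketch. It uses randomly generated data with 11 feature columns as a stand-in for the wine measurements, and the 80/20 split, the stratification, and the use of `StandardScaler` are assumptions rather than choices confirmed by this README.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 samples, 11 features, 3 quality classes (0, 1, 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 11))
y = rng.integers(0, 3, size=300)

# Hold out 20% for testing; stratify so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits,
# so no information from the test set leaks into preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Scaling matters mostly for distance-based and gradient-based models; tree-based models are largely insensitive to it.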
Before you continue, ensure the following Python libraries are installed:
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
The following classification algorithms are used:
- K Nearest Neighbors (KNN)
- Random Forest Classifier
- Decision Tree
- Stochastic Gradient Descent (SGD)
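As a rough illustration of how the listed classifiers could be fitted and compared with Scikit-learn, here is a minimal sketch. The data comes from `make_classification` rather than the wine dataset, and every hyperparameter shown (`n_neighbors=5`, `n_estimators=100`, the fixed random seeds) is an assumption, not a setting taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem with 11 features, standing in for the wine data.
X, y = make_classification(
    n_samples=600, n_features=11, n_informative=6, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# KNN and SGD are sensitive to feature scale, so they are wrapped in a
# pipeline with a scaler; the tree-based models are used as-is.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SGD": make_pipeline(StandardScaler(), SGDClassifier(random_state=42)),
}

# Fit each model and record its accuracy on the held-out test split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from best to worst test accuracy.
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Accuracy is used here only for simplicity; with imbalanced quality classes, a per-class metric such as the F1 score may be more informative.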