- Group Project Presentation — Google Slides
- GitHub
- Machine Learning Model — Ordinary Least Squares
- Database — PostgreSQL
- Dashboard — Tableau
Diamonds have a variety of measurable characteristics that contribute to the overall value of the diamond itself. This study is a supervised learning problem because each diamond and its features are associated with a target price. As a group of data scientists, we are interested in analyzing the measurements and classifications of a diamond to understand its monetary value in 2022.
The presentation outlines a synopsis of what our overall project signifies. Our reasoning for topic selection, source of data, questions to answer, and data exploration/analysis are explained below.
- Analyzing the measurements and classifications of diamonds
- Provide insight into how a diamond's characteristics affect its price
- Intended audience:
  - Mining Company (Raw Supplier)
  - Jeweler/Cutter
  - Jewelry Appraisers (prices the diamond)
  - Distributors (Jeweler)
  - Consumer (YOU!)
- We chose this topic because it gives us a better understanding of how a diamond's characteristics drive changes in its price
- Diamonds are a valuable resource and are in high demand
- Round cut diamonds are the most heavily requested diamond cut
- Link to DiamondSE.info
- A search engine that allows for diamond price comparisons. You can specify the features you want (say, a heart-shaped, 5-carat diamond with a D color grade) and the site will populate the jeweler, price, etc. for you
- Link to Kaggle
- The diamond price CSV dataset was used within our project to provide the criteria we needed
- The Kaggle site was the best source, with the most features and data on round-cut diamonds within the specific time frame we were looking for.
- Why are diamonds so expensive?
- What is the purpose of having so many features to justify the price of a diamond?
- How does the measurement and classifications of a diamond impact its numerical value?
- The dataset obtained from Kaggle was sufficient, as each column within it was crucial to our overall process
- As a team, we aligned on using the following as model features: carat, cut, color, clarity, table_depth, table_width, x, y, and z.
- All of the following features in the dataset are well defined and are not dependent on each other.
- The get_dummies() function was used to convert the categorical features (cut, color, and clarity) to numerical features for model input.
- The data is well formatted.
- It is recommended that the field 'depth' is renamed 'table_depth' for clarity.
- It is recommended that the field 'table' is renamed 'table_width' for clarity.
- All the features in the dataset are well delineated and independent of each other. As a team, we agreed to use `carat`, `cut`, `color`, `clarity`, `table_depth`, `table_width`, `x`, `y`, and `z` as model features.
- We utilized the sklearn.model_selection library to split the data into training and testing sets: 75% of our data was training data and 25% was test data.
- Utilized an Ordinary Least Squares (OLS) multiple linear regression model; the features were not scaled prior to modeling (see the Machine Learning section below).
- Linear regression is a classic method to model our dataset, it is more understandable to our intended audience, and it is less expensive computationally than other modeling methods.
- Languages/tools used: Python (Jupyter Notebook), PostgreSQL
- Algorithms/libraries used: sklearn.model_selection train_test_split() to split the data into training and testing sets, and an Ordinary Least Squares (OLS) multiple linear regression model evaluated with R-squared
- Mentioned in Machine Learning section below.
- Mentioned in Machine Learning section below.
- An issue we ran into concerned our use of a PostgreSQL database through pgAdmin: the database was local. To correct this on any future project we would want to move the database to AWS, which would open it up to better communication options.
- Within our data source DiamondSE.info, we would have scraped more round-cut diamonds, and also scraped diamond shape (e.g. Marquise, Heart, or Princess) as an additional feature.
- Additional model refinement is mentioned in the Machine Learning section below.
Link to Google Slides.
Through the course of this project we had to merge four separate GitHub accounts. Chauntel created the repository to host the code for the final project. We aligned on naming the repository final-project. After Chauntel created the repository, she invited Graeme, Suchitra, and Matthew to collaborate. In total we had 17 branches and 4 commits per branch (per team member). Through this process we learned to merge on the master branch, perform quality assurance, test the code, and navigate merge conflicts as a team.
Individual Branches (12 total commits per person)
Provisional data was obtained from the Kaggle dataset Diamonds Prices created by Ms. Nancy Al Aswad, which was scraped from the loose-diamond search engine DiamondSearchEngine on July 9, 2022. The dataset contains 53,943 records of round-cut diamonds. (Round-cut diamonds represent about 72% of all diamonds listed on DiamondSE.info.) Each record has nine features (`carat`, `cut`, `color`, `clarity`, `depth`, `table`, `x`, `y`, `z`) and one target (`price`).
- `carat` ranges from 0.2 to 5.01. 1 carat = 200 mg.
- `cut` grade has five categories: Fair, Good, Very Good, Premium, and Ideal.
- `color` has seven categories: D/E/F/G/H/I/J.
- `clarity` has eight categories: IF/VVS1/VVS2/VS1/VS2/SI1/SI2/I1. There are no flawless (FL) diamonds in this dataset.
- `depth` is the table depth, which ranges from 0% to 90%.
- `table` is the table width, which ranges from 0% to 90%.
- Regarding the `depth` and `table` features, please see Understanding Diamond Table and Depth.
- `x` is the length of the diamond in mm.
- `y` is the width of the diamond in mm.
- `z` is the depth of the diamond in mm.
- `price` is the price of the diamond in $USD set by the jeweler.
All of the EDA was done via the pandas_profiling ProfileReport module. An interactive report was generated in EDA.ipynb, which provided an efficient and thorough means to explore the dataset.
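A minimal sketch of how such a report can be generated is shown below; it assumes the CSV has been loaded into a DataFrame named `diamonds_df`, and the file name and report title are placeholders rather than the project's exact values.

```python
import pandas as pd
from pandas_profiling import ProfileReport

# Load the Kaggle CSV (file name is a placeholder for the project's actual file)
diamonds_df = pd.read_csv("diamond_prices.csv")

# Build the interactive profiling report and render it inline in the notebook
profile = ProfileReport(diamonds_df, title="Diamond Prices EDA", explorative=True)
profile.to_notebook_iframe()

# Optionally save a standalone HTML copy of the report
profile.to_file("diamonds_eda_report.html")
```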
One of the main observations to come out of the EDA was that price varies quite linearly with carat.
- From here through the rest of the Machine Learning section the machine learning analysis was performed in OLS_modeling.ipynb.
- There are no missing values to address.
- There are zeros in `x`, `y`, and `z`, which can be considered missing values (no diamond has a zero length, width, or depth). All `y` zeros have `x` zeros, and all `x` zeros have `z` zeros.
- All rows (8 rows) with `x` zeros were removed.
- All rows with non-zero `x` and `y` values but zero `z` values (12 rows) had their `z` values calculated as `z = (depth / 100) * (x + y) / 2` (a combined sketch of these preprocessing steps is shown after the dummy-column list below).
- Feature selection by binning occurs at this step (see details below).
- For modelling, the categorical variables need to be converted to numerical variables. Here we used the get_dummies() function, which replaces the original categorical column with as many new numerical columns as there were categories.
- For example, `cut` has five categories: Fair, Good, Very Good, Premium, and Ideal. The column `cut` is removed, and five new columns are added: `cut_Fair`, `cut_Good`, `cut_Very Good`, `cut_Premium`, and `cut_Ideal`. If in row `i` the original value of `cut` was Good, then in row `i` the new column `cut_Good` is assigned a value of 1, and the other columns are assigned a value of 0.
- The categorical columns `color` and `clarity` are treated likewise.
- In this way, the three original categorical columns are replaced by 20 new numerical columns, increasing the model feature count by 17.
- The new numerical columns generated by get_dummies() were reordered by diamond trait quality in descending order for ease of analysis:
  - `cut_Ideal`, `cut_Premium`, `cut_Very Good`, `cut_Good`, `cut_Fair`
  - `color_D`, `color_E`, `color_F`, `color_G`, `color_H`, `color_I`, `color_J`
  - `clarity_IF`, `clarity_VVS1`, `clarity_VVS2`, `clarity_VS1`, `clarity_VS2`, `clarity_SI1`, `clarity_SI2`, `clarity_I1`
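The following is a minimal sketch of these preprocessing steps, assuming the raw data is already in a DataFrame named `diamonds_df`; the variable names and column ordering are illustrative, not the exact project code.

```python
import pandas as pd

# Column names follow the original Kaggle fields; the project later recommends
# renaming 'depth' -> 'table_depth' and 'table' -> 'table_width'.

# Drop the rows where x is zero (these rows also have zero y and/or z values)
diamonds_df = diamonds_df[diamonds_df["x"] != 0]

# For rows with non-zero x and y but zero z, back-calculate z from the table depth:
# z = (depth / 100) * (x + y) / 2
zero_z = diamonds_df["z"] == 0
diamonds_df.loc[zero_z, "z"] = (
    diamonds_df.loc[zero_z, "depth"] / 100
    * (diamonds_df.loc[zero_z, "x"] + diamonds_df.loc[zero_z, "y"]) / 2
)

# Replace the three categorical columns with 0/1 dummy columns (20 new columns)
diamonds_df = pd.get_dummies(diamonds_df, columns=["cut", "color", "clarity"], dtype=int)

# Reorder the dummy columns by trait quality, best to worst
ordered_dummies = (
    ["cut_Ideal", "cut_Premium", "cut_Very Good", "cut_Good", "cut_Fair"]
    + [f"color_{c}" for c in "DEFGHIJ"]
    + ["clarity_IF", "clarity_VVS1", "clarity_VVS2", "clarity_VS1",
       "clarity_VS2", "clarity_SI1", "clarity_SI2", "clarity_I1"]
)
numeric_cols = ["carat", "depth", "table", "x", "y", "z"]
diamonds_df = diamonds_df[numeric_cols + ordered_dummies + ["price"]]
```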
Description of our feature engineering and feature selection, including our decision-making process
Initially there is no feature selection. All features (dummy variables are considered features) are included in the model, as there is no reason to exclude any feature.
There are several feature selections that can be made by binning. The purpose of binning is to reduce feature complexity in the model by combining relatively rare categories into a single category.
- Although there are only 5 cut categories, one could bin together the two lowest-frequency categories, Good (9.1%) and Fair (3.0%).
- Although there are only 7 color categories, one could bin together the two lowest-frequency color categories, I (10.1%) and J (5.2%).
- Although there are only 8 clarity categories, one could bin together the two lowest-frequency clarities, I1 (1.4%) and IF (3.3%), not with each other, but with their nearest clarity category. That is, bin I1 with SI2 (17.0%), and bin IF with VVS1 (6.8%). A hedged sketch of this binning is shown below.
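A sketch of how this binning could be done with pandas `replace()`, applied before the get_dummies() conversion; the combined category labels are illustrative choices, not the project's final ones.

```python
# Assumes diamonds_df before the get_dummies() conversion.

# Bin the two lowest-frequency cut grades into one category
diamonds_df["cut"] = diamonds_df["cut"].replace({"Good": "Good/Fair", "Fair": "Good/Fair"})

# Bin the two lowest-frequency color grades into one category
diamonds_df["color"] = diamonds_df["color"].replace({"I": "I/J", "J": "I/J"})

# Bin the rarest clarities with their nearest clarity grade
diamonds_df["clarity"] = diamonds_df["clarity"].replace(
    {"I1": "SI2/I1", "SI2": "SI2/I1", "IF": "IF/VVS1", "VVS1": "IF/VVS1"}
)
```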
After model training, it may be the case that some features do not contribute meaningfully to the model (see below as to how to assess significance of fit), and removal of those features may improve model performance.
In order to assess the performance of a machine learning model, all the data is split into a training set and a testing set. The two sets are mutually exclusive. The model is then trained with the training set data. The important consideration is the accuracy of predictions obtained when the trained model is exposed to unseen test data. Why?
One can choose a very flexible model that, when trained, fits the training data very well. In some cases one could choose a "perfect model" that passes through every training data point! However, this "perfect model" does poorly at predicting unseen test data. This condition is known as overfitting.
In the end, we want a model that can most accurately predict new or unseen data.
In practice, experience has shown that the best results can be obtained by having the size of the training data set be larger than that of the test data set. Here we used the sklearn.model_selection train_test_split() function to split the data as 75% training and 25% testing.
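A minimal sketch of the split, assuming the preprocessed `diamonds_df` from the sketches above with `price` as the target; the `random_state` value is an arbitrary illustrative choice.

```python
from sklearn.model_selection import train_test_split

# Separate the features (X) from the target (y = price)
X = diamonds_df.drop(columns=["price"])
y = diamonds_df["price"]

# 75% training / 25% testing split (random_state shown only for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```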
The data in this study was not scaled for the ordinary least squares (OLS) multiple linear regression model.
For some machine learning models, data scaling improves model training and subsequent predictive performance.
This is not the case for OLS multiple linear regression modelling. Moreover, by not scaling the data, the feature coefficients of the trained model can be used to interpret the trained model fit.
The data was modelled using ordinary least-squares (OLS) multiple linear regression. This model was chosen because the data presents a supervised learning problem (diamond price is the model target), the target price is a continuous variable (not a discrete classification), there are no anticipated complicated patterns to the data that require a more complex/deep model, the model is well-understood, the model allows understandable predictions to be made, and the model is computationally fast.
The equations for the ordinary least-squares multiple linear regression model are:

$$ y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + ... + \beta_j x_{i,j} + ... + \beta_p x_{i,p} + \epsilon_i $$

- there are $i = 1$ to $n$ observations (rows, data points)
- there are $p$ features
- observation $i$ is the data point $(y_i, x_{i,1}, x_{i,2}, ..., x_{i,p})$
- the $\beta_j$ are feature coefficients
- $x_{i,j}$ is observation $i$ of feature $j$
- $\epsilon_i = y_i - \hat{y}_i$ is the error (residual) of observation $i$

$$ R^2 = 1 - RSS/TSS $$

- $R^2$ is the coefficient of determination. It is the measure of accuracy for OLS models, which predict continuous target estimates ($\hat{y}_i$). $R^2$ is the fraction of the variance in the data that can be explained by the OLS model.

$$ TSS = \Sigma (y_i - \overline{y})^2 $$

- $TSS$ is the total sum of squares, i.e. the sum of the squares of the distances between the observed $y_i$ values and the mean $\overline{y}$ value. The mean $\overline{y}$ can be considered the estimate of the data without OLS modelling.

$$ RSS = \Sigma (y_i - \hat{y}_i)^2 $$

$$ RSS = \epsilon_1^2 + \epsilon_2^2 + ... + \epsilon_i^2 + ... + \epsilon_n^2 $$

- the two $RSS$ expressions above are equivalent
- $RSS$ is the residual sum of squares, i.e. the sum of the squares of the distances between the observed $y_i$ values and the $\hat{y}_i$ values estimated by the OLS model.
- the OLS model finds the $\beta_j$ that minimize $RSS$
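As a concrete illustration of these formulas, the quantities can be computed directly from observed and predicted values; the arrays below are made-up toy numbers, not project results.

```python
import numpy as np

y_obs = np.array([500.0, 1500.0, 3000.0, 4500.0])   # toy observed prices (not real data)
y_hat = np.array([600.0, 1400.0, 3100.0, 4400.0])   # toy model predictions

tss = np.sum((y_obs - y_obs.mean()) ** 2)   # total sum of squares
rss = np.sum((y_obs - y_hat) ** 2)          # residual sum of squares
r_squared = 1 - rss / tss                   # coefficient of determination
print(tss, rss, r_squared)
```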
The initial choice of the OLS multiple linear regression model did not change.
Description of how we have trained the model thus far, and any additional training that will take place
The model was trained (with the training data) on all 26 features described above, with diamond `price` as the target. Two linear regression implementations were used, because each offers features that the other does not.

- The statsmodels.api OLS() function was used as it allows easy calculation of p-values for the training feature coefficients.
- The sklearn.linear_model LinearRegression() function was used as it allows easy calculation of the testing $R^2$ quantity.

Both implementations gave the same interpretations of the data.
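A hedged sketch of training with both implementations, assuming the `X_train`/`X_test`/`y_train`/`y_test` split from above; this mirrors the approach described rather than reproducing the exact project code.

```python
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# statsmodels OLS: convenient p-values for each training feature coefficient
X_train_sm = sm.add_constant(X_train)        # add the intercept term (beta_0)
sm_results = sm.OLS(y_train, X_train_sm).fit()
print(sm_results.summary())                  # coefficients, p-values, training R^2

# sklearn LinearRegression: convenient testing R^2 via .score()
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
print("Training R^2:", lin_reg.score(X_train, y_train))
print("Testing  R^2:", lin_reg.score(X_test, y_test))
```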
A confusion matrix is used to tell you how many predictions were classified correctly or incorrectly. The OLS linear regression used to model the data in this study predicts a continuous target output, not a discrete classification. Therefore, a confusion matrix is not applicable. The accuracy of the OLS linear regression model is instead characterized by the $R^2$ statistic.
There are two important considerations from the modelling. Do the features used contribute significantly to the model? What is the model accuracy?
Each training feature coefficient has an associated p-value. A p-value above 0.05 means the null hypothesis (the feature does not significantly contribute to the model) is not rejected.
It can be seen that every feature significantly contributes to the model except for `y` and `color_I`.
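For example, the non-significant features can be listed directly from the statsmodels results; this small sketch assumes the `sm_results` object from the training sketch above.

```python
# List features whose coefficients do not contribute significantly (p-value > 0.05)
insignificant = sm_results.pvalues[sm_results.pvalues > 0.05]
print(insignificant)
```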
The database we decided to use was PostgreSQL, managed through pgAdmin. pgAdmin is simple to use for manipulating the data provided by our CSV, and it gives us a clean, clear interface that helps in building the necessary queries. PostgreSQL was chosen over other databases because it was more manageable to move information into and out of.
The database interfaces with the rest of the project: the cleaned data is stored in PostgreSQL and connects to the machine learning model via SQLAlchemy.
- After we created a main database within PostgreSQL 11 (named "Diamonds_Price_Data"), we completed the following:
- We ran SELECT DISTINCT queries on the following features: cut, clarity, and color. These queries allowed us to get the distinct, non-numerical categories within those columns.
- We were able to determine that:
- Cut column has 5 categories
- Color column has 7 categories
- Clarity column has 8 categories
- We concluded that, since cut has a higher impact on diamond price and fewer category options, it would be the best choice for table comparisons.
- Reminder, cut choices are: Fair, Good, Very Good, Ideal, and Premium
- Created 5 tables, based on the cut of the diamond
- Used FULL OUTER JOIN to merge tables together: fair with premium, and good with very good.
- Decided to use FULL OUTER JOIN because it returns all records from both tables, whether or not the values in the other table match. Even though the categories have different values within the different features, the columns are the same (making it easier to merge).
- Used a SQLAlchemy connection string to migrate our data (see the sketch at the end of this section).
- We chose SQLAlchemy because it is very simple to implement and easily allows communication between Python programs and the PostgreSQL database.
- Utilizing the online ER-diagram tool from QuickData, we produced a schema for our single-table database, and we used pgAdmin with a local PostgreSQL database to store the data.
- Starting with two separate tables based on diamond numeric values (i.e. carat, dimensions, price, etc.) and characteristic values (i.e. cut quality, color designation, and clarity designation), these tables were joined, or combined, into a single full diamond data table.
- The two images below are the ERDs related to the tables created in our chosen database, PostgreSQL (managed through pgAdmin). The tables displayed within each of the charts are centered around the "diamond_prices.csv" file. Each of the tables created is based on the cut of the diamond: fair, good, very good, ideal, and premium.
Fig. This image is of the multiple tables created. Note that the primary key used is ID.
Fig. This image showcases the tables created utilizing the full outer join.
- Taking advantage of SQLAlchemy, the full table "" is pulled into the machine learning script (a minimal sketch of this connection is shown below).
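A minimal sketch of the SQLAlchemy connection described above; the credentials are placeholders, and the table names (`full_diamonds_data`, `fair`, `premium`) are illustrative assumptions rather than the project's exact names.

```python
import pandas as pd
from sqlalchemy import create_engine

# Connection string to the local PostgreSQL 11 database managed through pgAdmin
# (user name and password are placeholders)
db_string = "postgresql://postgres:password@localhost:5432/Diamonds_Price_Data"
engine = create_engine(db_string)

# Example of a SELECT DISTINCT query used to list the categories in a column
distinct_cuts = pd.read_sql("SELECT DISTINCT cut FROM full_diamonds_data;", engine)

# Example of the FULL OUTER JOIN used to combine two of the cut-based tables,
# joining on the ID primary key
joined = pd.read_sql(
    """
    SELECT *
    FROM fair AS f
    FULL OUTER JOIN premium AS p
    ON f.id = p.id;
    """,
    engine,
)

# Pull the full table into pandas for the machine learning script
diamonds_df = pd.read_sql_table("full_diamonds_data", engine)
```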
The image above is of our dashboard, which can be found at the following link. The dashboard contains interactive tables built from the original pre-analysis data, as well as the training and testing R-squared figures and the coefficient table from the ML model. The tables can be adjusted by selecting "Cut", "Carat", or "Price".