Predicting the aggregate rating of Zomato restaurants using Machine Learning.
Tools used: Python (Numpy, Pandas, Matplotlib, Seaborn, Scikit-learn, Dython)
Summary:
We visualize the data using Seaborn and Dython, engineer new features from columns like Cuisines and Restaurant Name, train a Random Forest Regressor from Scikit-learn, visualize and evaluate the model's performance, and finally tune it with RandomizedSearchCV to improve performance further.
The final model achieves a Mean Absolute Error (MAE) of around 0.183. We also evaluate other metrics such as MSE, RMSE, and the median absolute error (MEDAE).
Value: This model can help restaurant aggregators and food delivery companies like Zomato in predicting the aggregate rating of new restaurants, or existing unrated ones. That can in turn increase customer satisfaction.
First, we import the libraries that we will be using:
import numpy as np
import pandas as pd
import seaborn as sns
from dython import nominal
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.core.display import HTML # To center the plot output
HTML("""
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
""")
Importing the data:
df = pd.read_csv('zomato.csv', encoding='ISO-8859-1') # Specifying the encoding matters: the default UTF-8 decoding raises a UnicodeDecodeError on this file
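If the encoding of a file were unknown, a guarded read could fall back to Latin-1; a small sketch (not part of the original notebook):
try:
    df = pd.read_csv('zomato.csv')                            # pandas decodes as UTF-8 by default
except UnicodeDecodeError:
    df = pd.read_csv('zomato.csv', encoding='ISO-8859-1')     # fall back to Latin-1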
Let's get to know our data:
df.shape
(9551, 21)
So we have 9551 rows and 21 columns.
Let's see the columns:
df.columns
Index(['Restaurant ID', 'Restaurant Name', 'Country Code', 'City', 'Address',
'Locality', 'Locality Verbose', 'Longitude', 'Latitude', 'Cuisines',
'Average Cost for two', 'Currency', 'Has Table booking',
'Has Online delivery', 'Is delivering now', 'Switch to order menu',
'Price range', 'Aggregate rating', 'Rating color', 'Rating text',
'Votes'],
dtype='object')
Let's take a look at the first 5 rows of the dataset, to get an idea of the data:
pd.set_option('display.max_columns',21)
df.head()
Restaurant ID | Restaurant Name | Country Code | City | Address | Locality | Locality Verbose | Longitude | Latitude | Cuisines | Average Cost for two | Currency | Has Table booking | Has Online delivery | Is delivering now | Switch to order menu | Price range | Aggregate rating | Rating color | Rating text | Votes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6317637 | Le Petit Souffle | 162 | Makati City | Third Floor, Century City Mall, Kalayaan Avenu... | Century City Mall, Poblacion, Makati City | Century City Mall, Poblacion, Makati City, Mak... | 121.027535 | 14.565443 | French, Japanese, Desserts | 1100 | Botswana Pula(P) | Yes | No | No | No | 3 | 4.8 | Dark Green | Excellent | 314 |
1 | 6304287 | Izakaya Kikufuji | 162 | Makati City | Little Tokyo, 2277 Chino Roces Avenue, Legaspi... | Little Tokyo, Legaspi Village, Makati City | Little Tokyo, Legaspi Village, Makati City, Ma... | 121.014101 | 14.553708 | Japanese | 1200 | Botswana Pula(P) | Yes | No | No | No | 3 | 4.5 | Dark Green | Excellent | 591 |
2 | 6300002 | Heat - Edsa Shangri-La | 162 | Mandaluyong City | Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal... | Edsa Shangri-La, Ortigas, Mandaluyong City | Edsa Shangri-La, Ortigas, Mandaluyong City, Ma... | 121.056831 | 14.581404 | Seafood, Asian, Filipino, Indian | 4000 | Botswana Pula(P) | Yes | No | No | No | 4 | 4.4 | Green | Very Good | 270 |
3 | 6318506 | Ooma | 162 | Mandaluyong City | Third Floor, Mega Fashion Hall, SM Megamall, O... | SM Megamall, Ortigas, Mandaluyong City | SM Megamall, Ortigas, Mandaluyong City, Mandal... | 121.056475 | 14.585318 | Japanese, Sushi | 1500 | Botswana Pula(P) | No | No | No | No | 4 | 4.9 | Dark Green | Excellent | 365 |
4 | 6314302 | Sambo Kojin | 162 | Mandaluyong City | Third Floor, Mega Atrium, SM Megamall, Ortigas... | SM Megamall, Ortigas, Mandaluyong City | SM Megamall, Ortigas, Mandaluyong City, Mandal... | 121.057508 | 14.584450 | Japanese, Korean | 1500 | Botswana Pula(P) | Yes | No | No | No | 4 | 4.8 | Dark Green | Excellent | 229 |
Let's take a closer look at the columns:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9551 entries, 0 to 9550
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Restaurant ID 9551 non-null int64
1 Restaurant Name 9551 non-null object
2 Country Code 9551 non-null int64
3 City 9551 non-null object
4 Address 9551 non-null object
5 Locality 9551 non-null object
6 Locality Verbose 9551 non-null object
7 Longitude 9551 non-null float64
8 Latitude 9551 non-null float64
9 Cuisines 9542 non-null object
10 Average Cost for two 9551 non-null int64
11 Currency 9551 non-null object
12 Has Table booking 9551 non-null object
13 Has Online delivery 9551 non-null object
14 Is delivering now 9551 non-null object
15 Switch to order menu 9551 non-null object
16 Price range 9551 non-null int64
17 Aggregate rating 9551 non-null float64
18 Rating color 9551 non-null object
19 Rating text 9551 non-null object
20 Votes 9551 non-null int64
dtypes: float64(3), int64(5), object(13)
memory usage: 1.5+ MB
Observation: It seems `Cuisines` has some null values. We'll take a look at those.
df.describe() # Looking at just the numerical columns
Restaurant ID | Country Code | Longitude | Latitude | Average Cost for two | Price range | Aggregate rating | Votes | |
---|---|---|---|---|---|---|---|---|
count | 9.551000e+03 | 9551.000000 | 9551.000000 | 9551.000000 | 9551.000000 | 9551.000000 | 9551.000000 | 9551.000000 |
mean | 9.051128e+06 | 18.365616 | 64.126574 | 25.854381 | 1199.210763 | 1.804837 | 2.666370 | 156.909748 |
std | 8.791521e+06 | 56.750546 | 41.467058 | 11.007935 | 16121.183073 | 0.905609 | 1.516378 | 430.169145 |
min | 5.300000e+01 | 1.000000 | -157.948486 | -41.330428 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 3.019625e+05 | 1.000000 | 77.081343 | 28.478713 | 250.000000 | 1.000000 | 2.500000 | 5.000000 |
50% | 6.004089e+06 | 1.000000 | 77.191964 | 28.570469 | 400.000000 | 2.000000 | 3.200000 | 31.000000 |
75% | 1.835229e+07 | 1.000000 | 77.282006 | 28.642758 | 700.000000 | 2.000000 | 3.700000 | 131.000000 |
max | 1.850065e+07 | 216.000000 | 174.832089 | 55.976980 | 800000.000000 | 4.000000 | 4.900000 | 10934.000000 |
Looks like no restaurant has a full 5-star rating. Interesting.
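As a quick sanity check (a small addition, not in the original notebook), we can confirm this directly; the maximum matches the describe() output above:
print(df['Aggregate rating'].max())         # 4.9
print((df['Aggregate rating'] == 5).sum())  # 0 restaurants with a perfect score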
Now let's take a look at the null values of our columns:
sns.heatmap(df.isnull().sum().values.reshape(-1,1), \
annot=True, cmap=plt.cm.Blues, yticklabels=df.columns)
plt.xlabel('Null Values')
plt.show()
Observation: `Cuisines` has 9 null values.
Since we can't determine what cuisines a restaurant has from the other features in the dataset, we will just drop these null values.
df.dropna(inplace=True)
There. Let's take a look at the null counts again, just to check:
sns.heatmap(df.isnull().sum().values.reshape(-1,1), \
annot=True, cmap=plt.cm.Blues, yticklabels=df.columns)
plt.xlabel('Null Values')
plt.show()
Perfect.
There is something interesting about the `Switch to order menu` column:
df['Switch to order menu']
0 No
1 No
2 No
3 No
4 No
..
9546 No
9547 No
9548 No
9549 No
9550 No
Name: Switch to order menu, Length: 9542, dtype: object
df['Switch to order menu'].value_counts()
No 9542
Name: Switch to order menu, dtype: int64
Observation: `Switch to order menu` has no value other than 'No'.
Since that is of no use to us, we are going to drop it.
df.drop('Switch to order menu', axis=1, inplace = True)
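As a quick sanity check (a small sketch, not part of the original notebook), a one-liner can confirm that no other single-valued columns remain:
constant_cols = df.columns[df.nunique() <= 1]  # columns holding only one distinct value
print(constant_cols)                           # should be empty after the drop above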
Since one of the categorical columns turned out to be useless for us, it makes sense to also take a look at the rest of them:
df.columns
Index(['Restaurant ID', 'Restaurant Name', 'Country Code', 'City', 'Address',
'Locality', 'Locality Verbose', 'Longitude', 'Latitude', 'Cuisines',
'Average Cost for two', 'Currency', 'Has Table booking',
'Has Online delivery', 'Is delivering now', 'Price range',
'Aggregate rating', 'Rating color', 'Rating text', 'Votes'],
dtype='object')
df['Restaurant Name'].value_counts()
Cafe Coffee Day 83
Domino's Pizza 79
Subway 63
Green Chick Chop 51
McDonald's 48
..
The Town House Cafe 1
The G.T. Road 1
The Darzi Bar & Kitchen 1
Smoke On Water 1
Walter's Coffee Roastery 1
Name: Restaurant Name, Length: 7437, dtype: int64
df.Locality.value_counts().value_counts() # Remember, we can specify a column both as df['column'] and df.column
1 550
2 172
3 103
4 51
5 42
...
44 1
45 1
50 1
51 1
122 1
Name: Locality, Length: 82, dtype: int64
df['Has Table booking'].value_counts()
No 8384
Yes 1158
Name: Has Table booking, dtype: int64
df['Has Online delivery'].value_counts()
No 7091
Yes 2451
Name: Has Online delivery, dtype: int64
df['Is delivering now'].value_counts()
No 9508
Yes 34
Name: Is delivering now, dtype: int64
df.City.value_counts()
New Delhi 5473
Gurgaon 1118
Noida 1080
Faridabad 251
Ghaziabad 25
...
Lincoln 1
Lakeview 1
Lakes Entrance 1
Inverloch 1
Panchkula 1
Name: City, Length: 140, dtype: int64
Observation: So, all of these columns do have more than one value. That means they could actually be useful.
Now we are going to use the Dython library to make a correlation plot of all the features. What I like about this library is that it lets you easily plot the correlation between both categorical and continuous features, something that is not easy to do with Pandas.
nominal.associations(df,figsize=(20,10),mark_columns=True,title="Correlation Matrix") # correlation matrix
plt.show()
If we look at the `Aggregate rating (con)` row, we can see how correlated it is with the rest of the features.
The first highly correlated feature is the `Restaurant Name (nom)` column, at 95%. Let's take a look at this column and see what we can do.
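For nominal-versus-continuous pairs, Dython reports the correlation ratio by default; the helper below is a minimal re-implementation of that measure (my own sketch, not part of the library or the original notebook) to sanity-check the figure:
def correlation_ratio(categories, values):
    # eta = sqrt(between-group sum of squares / total sum of squares)
    overall_mean = values.mean()
    ss_total = ((values - overall_mean) ** 2).sum()
    ss_between = sum(len(group) * (group.mean() - overall_mean) ** 2
                     for _, group in values.groupby(categories))
    return np.sqrt(ss_between / ss_total)

print(correlation_ratio(df['Restaurant Name'], df['Aggregate rating']))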
print( f"Total number of restaurants: {df['Restaurant Name'].value_counts().shape[0]}")
print(f"Restaurants with 1 value count: {(df['Restaurant Name'].value_counts() == 1).sum()}")
Total number of restaurants: 7437
Restaurants with 1 value count: 6703
That's a lot of restaurants, and most of them appear only once.
We won't be able to include all of these in a model. So let's just pick the top 10.
df['Restaurant Name'].value_counts().head(10)
Cafe Coffee Day 83
Domino's Pizza 79
Subway 63
Green Chick Chop 51
McDonald's 48
Keventers 34
Pizza Hut 30
Giani 29
Baskin Robbins 28
Barbeque Nation 26
Name: Restaurant Name, dtype: int64
Now we are going to define a function to get dummies just for these 10 restaurants. Dummies are columns with values 0 and 1; 0 meaning false and 1 meaning true.
So, for example, if we make a dummy column for "Cafe Coffee Day", the rows in the dummy column will have 1 as the value if the restaurant's name is 'Cafe Coffee Day', and 0 if not.
def dummy(rest_name, column):
    # 1 if the restaurant's name matches rest_name exactly (ignoring surrounding whitespace), else 0
    df[column] = df['Restaurant Name'].apply(lambda x: 1 if str(x).strip() == rest_name else 0)
dummy('Cafe Coffee Day','cafe_coffee_day')
Here is a visual example of how the new column looks:
df.loc[df['cafe_coffee_day']==1].head(3)
Restaurant ID | Restaurant Name | Country Code | City | Address | Locality | Locality Verbose | Longitude | Latitude | Cuisines | Average Cost for two | Currency | Has Table booking | Has Online delivery | Is delivering now | Price range | Aggregate rating | Rating color | Rating text | Votes | cafe_coffee_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
932 | 9650 | Cafe Coffee Day | 1 | Faridabad | SCF 42, Shopping Centre, Main Huda Market, Sec... | Sector 15 | Sector 15, Faridabad | 77.323611 | 28.395267 | Cafe | 450 | Indian Rupees(Rs.) | No | No | No | 1 | 3.3 | Orange | Average | 67 | 1 |
1126 | 8590 | Cafe Coffee Day | 1 | Ghaziabad | 1st Floor, Shipra Mall, Gulmohar Road, Indirap... | Shipra Mall, Indirapuram | Shipra Mall, Indirapuram, Ghaziabad | 77.370208 | 28.634047 | Cafe | 450 | Indian Rupees(Rs.) | No | No | No | 1 | 3.2 | Orange | Average | 63 | 1 |
1283 | 631 | Cafe Coffee Day | 1 | Gurgaon | Upper Ground Floor, DLF Mega Mall, DLF Phase 1... | DLF Mega Mall, DLF Phase 1 | DLF Mega Mall, DLF Phase 1, Gurgaon | 77.093595 | 28.475489 | Cafe | 450 | Indian Rupees(Rs.) | No | No | No | 1 | 2.6 | Orange | Average | 27 | 1 |
Wherever the `Restaurant Name` column's value is "Cafe Coffee Day", the value of the `cafe_coffee_day` column is 1.
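As a side note, the same column could also be built without a helper function, using a vectorized comparison (an equivalent sketch, not the notebook's original code):
df['cafe_coffee_day'] = (df['Restaurant Name'].str.strip() == 'Cafe Coffee Day').astype(int)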
We will apply this function for all of the 10 most frequent restaurants:
def dum_col(x):
    # Turn a restaurant name into a snake_case column name, e.g. 'Pizza Hut' -> 'pizza_hut'
    return x.strip().lower().replace(' ', '_')

def dummy(lst, column):
    # One boolean column per name; note this checks substring containment rather than exact equality
    for i in lst.index:
        df[dum_col(i)] = df[column].apply(lambda x: i in x)

restaurants = df['Restaurant Name'].value_counts().head(10)
dummy(restaurants, 'Restaurant Name')
df.head()
Restaurant ID | Restaurant Name | Country Code | City | Address | Locality | Locality Verbose | Longitude | Latitude | Cuisines | ... | cafe_coffee_day | domino's_pizza | subway | green_chick_chop | mcdonald's | keventers | pizza_hut | giani | baskin_robbins | barbeque_nation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6317637 | Le Petit Souffle | 162 | Makati City | Third Floor, Century City Mall, Kalayaan Avenu... | Century City Mall, Poblacion, Makati City | Century City Mall, Poblacion, Makati City, Mak... | 121.027535 | 14.565443 | French, Japanese, Desserts | ... | False | False | False | False | False | False | False | False | False | False |
1 | 6304287 | Izakaya Kikufuji | 162 | Makati City | Little Tokyo, 2277 Chino Roces Avenue, Legaspi... | Little Tokyo, Legaspi Village, Makati City | Little Tokyo, Legaspi Village, Makati City, Ma... | 121.014101 | 14.553708 | Japanese | ... | False | False | False | False | False | False | False | False | False | False |
2 | 6300002 | Heat - Edsa Shangri-La | 162 | Mandaluyong City | Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal... | Edsa Shangri-La, Ortigas, Mandaluyong City | Edsa Shangri-La, Ortigas, Mandaluyong City, Ma... | 121.056831 | 14.581404 | Seafood, Asian, Filipino, Indian | ... | False | False | False | False | False | False | False | False | False | False |
3 | 6318506 | Ooma | 162 | Mandaluyong City | Third Floor, Mega Fashion Hall, SM Megamall, O... | SM Megamall, Ortigas, Mandaluyong City | SM Megamall, Ortigas, Mandaluyong City, Mandal... | 121.056475 | 14.585318 | Japanese, Sushi | ... | False | False | False | False | False | False | False | False | False | False |
4 | 6314302 | Sambo Kojin | 162 | Mandaluyong City | Third Floor, Mega Atrium, SM Megamall, Ortigas... | SM Megamall, Ortigas, Mandaluyong City | SM Megamall, Ortigas, Mandaluyong City, Mandal... | 121.057508 | 14.584450 | Japanese, Korean | ... | False | False | False | False | False | False | False | False | False | False |
5 rows × 30 columns
Now we have True or False values for each of the top 10 restaurant names. In Python, True and False behave as 1 and 0 in numeric contexts, so these boolean columns work just like 0/1 dummies.
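A quick illustration (added here for clarity, not part of the original notebook):
print(int(True), int(False))        # 1 0
print(df['cafe_coffee_day'].sum())  # summing the booleans counts the True rows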
Let's take a look at how many restaurants are named 'Cafe Coffee Day', using our new column:
print(f"Number of Cafe Coffee Day's: {df.loc[df['cafe_coffee_day']==1].size}")
Number of Cafe Coffee Day's: 2730
df.shape
(9542, 30)
Observation: Note that `.size` counts cells (rows × columns), so with 30 columns this corresponds to 2730 / 30 = 91 rows. Out of our 9542 restaurants, 91 match 'Cafe Coffee Day', slightly more than the 83 exact name matches seen earlier, likely because the dummy check uses substring containment and also catches longer variants of the name.
Now let's take a look at the correlation between `Aggregate rating` and the new columns that we have created.
features = ['Price range','Votes','Country Code','Restaurant ID','Longitude',
'Has Table booking','Has Online delivery','cafe_coffee_day',
"domino's_pizza",'subway','green_chick_chop',"mcdonald's",'keventers',
'pizza_hut','giani','baskin_robbins','barbeque_nation',
'Aggregate rating']# --> Only added to see correlation, must be removed later
nominal.associations(df[features],figsize=(20,10),mark_columns=True,\
title="Correlation Matrix (features)")
plt.show()
Observation: Except for `barbeque_nation`, the created features seem to have extremely low correlations with the rating.
Since it is best practice to keep the model simple and use only the strongest features, we are going to drop all of the created restaurant-name features except for this one.
features = ['Price range','Votes','Country Code','Restaurant ID','Longitude',
'Has Table booking','Has Online delivery','barbeque_nation']
This is going to be our final list of features for training and testing our model.
Important Note: We are not going to include the `Rating color` and `Rating text` features in this list. They are derived directly from the aggregate rating itself, so including them would leak the target, and the resulting model would be of no use for restaurants that have not been rated yet.
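To see why, a quick groupby (added here for illustration) shows that each rating text corresponds to a fixed band of the aggregate rating:
# Each 'Rating text' value maps onto a band of 'Aggregate rating', i.e. it is derived from the target.
print(df.groupby('Rating text')['Aggregate rating'].agg(['min', 'max', 'count']))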
Now, using these features, we are going to build models to predict our target variable.
We know that predicting the `Aggregate rating` feature is a regression problem. Since its correlations with the individual features are mostly weak, a purely linear model like Linear Regression is unlikely to be optimal.
Instead, we are going to use a Random Forest Regressor for this problem.
First, we are going to split the data into independent variables (Features) and a dependent variable (Target).
So, our features (the columns we will use to predict):
X = pd.get_dummies(df[features])
X
Price range | Votes | Country Code | Restaurant ID | Longitude | barbeque_nation | Has Table booking_No | Has Table booking_Yes | Has Online delivery_No | Has Online delivery_Yes | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 314 | 162 | 6317637 | 121.027535 | False | 0 | 1 | 1 | 0 |
1 | 3 | 591 | 162 | 6304287 | 121.014101 | False | 0 | 1 | 1 | 0 |
2 | 4 | 270 | 162 | 6300002 | 121.056831 | False | 0 | 1 | 1 | 0 |
3 | 4 | 365 | 162 | 6318506 | 121.056475 | False | 1 | 0 | 1 | 0 |
4 | 4 | 229 | 162 | 6314302 | 121.057508 | False | 0 | 1 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9546 | 3 | 788 | 208 | 5915730 | 28.977392 | False | 1 | 0 | 1 | 0 |
9547 | 3 | 1034 | 208 | 5908749 | 29.041297 | False | 1 | 0 | 1 | 0 |
9548 | 4 | 661 | 208 | 5915807 | 29.034640 | False | 1 | 0 | 1 | 0 |
9549 | 4 | 901 | 208 | 5916112 | 29.036019 | False | 1 | 0 | 1 | 0 |
9550 | 2 | 591 | 208 | 5927402 | 29.026016 | False | 1 | 0 | 1 | 0 |
9542 rows × 10 columns
Our target (the column we want to predict):
y = df['Aggregate rating']
Now, we want to split them into train and test sets.
We will use the train set to train the model, and the test set to test the performance of the model.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)
Next, we will import the model that we want to use, i.e., RandomForestRegressor:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state = 2)
Now we fit the model on the training set and use it to predict the test set:
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
And check its performance by comparing the predictions with the test targets:
from sklearn import metrics
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = metrics.mean_squared_error(y_test,y_pred,squared=False)
mae = metrics.mean_absolute_error(y_test, y_pred)
medae = metrics.median_absolute_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Median Absolute Error (MEDAE): {medae}")
print(f'Test variance: {np.var(y_test)}')
Mean Squared Error (MSE): 0.08397072865374543
Root Mean Squared Error (RMSE): 0.28977703265397936
Mean Absolute Error (MAE): 0.18649607124148773
Median Absolute Error (MEDAE): 0.11999999999999966
Test variance: 2.2502005690560023
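The test variance is printed for context. To make that comparison concrete, a naive baseline that always predicts the mean training rating can be scored the same way (a sketch, not part of the original notebook):
from sklearn.dummy import DummyRegressor

baseline = DummyRegressor(strategy='mean')   # always predicts the mean of y_train
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
print(f"Baseline MAE: {metrics.mean_absolute_error(y_test, baseline_pred)}")
print(f"Baseline RMSE: {metrics.mean_squared_error(y_test, baseline_pred, squared=False)}")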
Let's plot the residuals:
residuals = y_test - y_pred
# plot the residuals against the true ratings (the x axis is labelled 'Aggregate Rating' below)
plt.scatter(y_test, residuals, c=residuals, cmap='magma', edgecolors='black', linewidths=.1)
plt.colorbar(label="Residual", orientation="vertical")
# plot a horizontal line at y = 0
plt.hlines(y = 0,
xmin = 0, xmax=5,
linestyle='--',colors='black')
# set xlim
plt.xlim((0, 5))
plt.xlabel('Aggregate Rating'); plt.ylabel('Residuals')
plt.show()
A residual is the difference between the observed value of the target and the predicted value. The closer the residual is to 0, the better job our model is doing.
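One intuitive way to read these residuals (a small addition, not part of the original analysis) is the share of predictions that land within a given distance of the true rating:
for tol in (0.25, 0.5):
    within = (residuals.abs() <= tol).mean()
    print(f"Share of predictions within {tol} of the true rating: {within:.1%}")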
print(f"Error range: {residuals.max()-residuals.min()}")
Error range: 2.7820000000000076
So our prediction's error range is around 2.782.
Now we are going to run RandomizedSearchCV to tune the model by improving the hyperparameters.
# from sklearn.model_selection import RandomizedSearchCV
# n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# max_features = ['auto', 'sqrt']
# max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
# max_depth.append(None)
# min_samples_split = [2, 5, 10]
# min_samples_leaf = [1, 2, 4]
# bootstrap = [True, False]
# random_grid = {'n_estimators': n_estimators,
# 'max_features': max_features,
# 'max_depth': max_depth,
# 'min_samples_split': min_samples_split,
# 'min_samples_leaf': min_samples_leaf,
# 'bootstrap': bootstrap}
# rf2 = RandomForestRegressor(random_state=2)
# rf_rscv = RandomizedSearchCV(estimator=rf2, param_distributions=random_grid,\
# n_iter = 100, cv = 3, verbose=2, random_state=2, n_jobs = -1)
# rf_rscv.fit(X_train,y_train)
# print(rf_rscv.best_params_)
# Output:
# n_estimators= 1200,
# min_samples_split= 10,
# min_samples_leaf= 1,
# max_depth = 30,
# bootstrap= True,
# random_state=2
A hyperparameter is a parameter whose value is set before the learning algorithm is trained (for example, the number of trees in the forest), rather than learned from the data, and it can have a significant impact on the model's performance.
Now we are going to use these hyperparameters to build a new Random Forest model, fit it on the training data, and then score it:
rf_random = RandomForestRegressor(
n_estimators= 1200,
min_samples_split= 10,
min_samples_leaf= 1,
max_depth = 30,
max_features='sqrt',
bootstrap= True,
random_state=2) # Best RandomizedSearch parameters
rf_random.fit(X_train,y_train)
random_pred = rf_random.predict(X_test)
random_mse = metrics.mean_squared_error(y_test, random_pred)
random_rmse = metrics.mean_squared_error(y_test, random_pred, squared=False)
random_mae = metrics.mean_absolute_error(y_test, random_pred)
random_medae = metrics.median_absolute_error(y_test, random_pred)
print(f"Mean Squared Error (MSE): {random_mse}")
print(f"Root Mean Squared Error (RMSE): {random_rmse}")
print(f"Mean Absolute Error (MAE): {random_mae}")
print(f"Median Absolute Error (MEDAE): {random_medae}")
print(f'Test variance: {np.var(y_test)}')
Mean Squared Error (MSE): 0.07950896506171087
Root Mean Squared Error (RMSE): 0.2819733410478921
Mean Absolute Error (MAE): 0.18367410616812943
Median Absolute Error (MEDAE): 0.1146495007615349
Test variance: 2.2502005690560023
print('Improvements:')
print(f"Mean Squared Error (MSE): {mse} => {random_mse}")
print(f"Root Mean Squared Error (RMSE): {rmse} => {random_rmse}")
print(f"Mean Absolute Error (MAE): {mae} => {random_mae}")
print(f"Median Absolute Error (MEDAE): {mae} => {random_medae}")
print(f'Test variance: {np.var(y_test)}')
Improvements:
Mean Squared Error (MSE): 0.08397072865374543 => 0.07950896506171087
Root Mean Squared Error (RMSE): 0.28977703265397936 => 0.2819733410478921
Mean Absolute Error (MAE): 0.18649607124148773 => 0.18367410616812943
Median Absolute Error (MEDAE): 0.11999999999999966 => 0.1146495007615349
Test variance: 2.2502005690560023
There is a decrease across all of the model's error metrics.
We could also run GridSearchCV over a narrower grid around these values to tune the model further, but we are done with model tuning for this project; a sketch of what that could look like is shown below.
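For reference, such a follow-up search might look roughly like this (a sketch only; it is not run here, and the parameter ranges are illustrative, centred on the RandomizedSearchCV winners):
# from sklearn.model_selection import GridSearchCV
# param_grid = {'n_estimators': [1000, 1200, 1400],
#               'min_samples_split': [8, 10, 12],
#               'min_samples_leaf': [1, 2],
#               'max_depth': [20, 30, 40],
#               'max_features': ['sqrt'],
#               'bootstrap': [True]}
# rf_gscv = GridSearchCV(RandomForestRegressor(random_state=2), param_grid,
#                        cv=3, n_jobs=-1, verbose=2)
# rf_gscv.fit(X_train, y_train)
# print(rf_gscv.best_params_)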
Let's plot the residuals for this final model:
f_residuals = y_test - random_pred
# plot the residuals of the tuned model against the true ratings
plt.scatter(y_test, f_residuals, c=f_residuals, cmap='magma', edgecolors='black', linewidths=.1)
plt.colorbar(label="Residual", orientation="vertical")
# plot a horizontal line at y = 0
plt.hlines(y = 0, xmin = 0, xmax = 5, linestyle = '--', colors = 'black')
# set xlim
plt.xlim((0, 5))
plt.xlabel('Aggregate Rating'); plt.ylabel('Residuals')
plt.show()
print(f"Errors range of first model: {residuals.max() - residuals.min()}")
print(f"Errors range of second model: {f_residuals.max() - f_residuals.min()}")
print(f"Error difference of models: {(residuals.max() - residuals.min()) - (f_residuals.max() - f_residuals.min())}")
Errors range of first model: 2.7820000000000076
Errors range of second model: 2.554883167428941
Error difference of models: 0.2271168325710664
Compared to the previous model (with default hyperparameters), our final model reduces the range of its errors by about 0.227.
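Finally, to illustrate the value proposition from the summary, the tuned model can score a hypothetical new or unrated restaurant. All feature values below are made up for the example, and the row must follow the same column layout as X:
new_restaurant = pd.DataFrame([{
    'Price range': 3,                 # hypothetical values, for illustration only
    'Votes': 150,
    'Country Code': 1,
    'Restaurant ID': 99999999,        # placeholder ID
    'Longitude': 77.20,
    'barbeque_nation': False,
    'Has Table booking_No': 0,
    'Has Table booking_Yes': 1,
    'Has Online delivery_No': 0,
    'Has Online delivery_Yes': 1,
}])[X.columns]                        # keep the same column order used in training

print(f"Predicted aggregate rating: {rf_random.predict(new_restaurant)[0]:.2f}")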