# -*- coding: utf-8 -*-
"""SLF_Project_LearnerNotebook_FullCode.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1Ah-sv5d7CYdtA7lhgWxduVwKqJ1XbvQe
# Supervised Learning - Foundations Project: ReCell
## Problem Statement
### Business Context
Buying and selling used phones and tablets used to happen on only a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and an IDC (International Data Corporation) forecast predicts that the used phone market will be worth \$52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets that offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives for both consumers and businesses looking to save money on a purchase. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
### Objective
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
### Data Description
The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.
- brand_name: Name of manufacturing brand
- os: OS on which the device runs
- screen_size: Size of the screen in cm
- 4g: Whether 4G is available or not
- 5g: Whether 5G is available or not
- main_camera_mp: Resolution of the rear camera in megapixels
- selfie_camera_mp: Resolution of the front camera in megapixels
- int_memory: Amount of internal memory (ROM) in GB
- ram: Amount of RAM in GB
- battery: Energy capacity of the device battery in mAh
- weight: Weight of the device in grams
- release_year: Year when the device model was released
- days_used: Number of days the used/refurbished device has been used
- normalized_new_price: Normalized price of a new device of the same model in euros
- normalized_used_price: Normalized price of the used/refurbished device in euros
## Importing necessary libraries
"""
# These libraries help with data manipulation and reading
import numpy as np
import pandas as pd

# These libraries help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Library to split the data into train and test sets
from sklearn.model_selection import train_test_split

# Library for building the linear regression model
from sklearn.linear_model import LinearRegression

# Libraries to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Library to build the linear regression model using statsmodels
import statsmodels.api as sm

# Library to compute VIF for the multicollinearity check
from statsmodels.stats.outliers_influence import variance_inflation_factor
"""## Loading the dataset"""
#Connecting Google drive with Google colab
# Reading the data-set into Google colab
from google.colab import drive
drive.mount('/content/drive')
#Reading the "used_device_data.csv" dataset into a dataframe (i.e.loading the data)
path="/content/drive/My Drive/used_device_data.csv"
data=pd.read_csv(path)
"""## Data Overview
This will be achieved by doing the following:
- Viewing the first and last few rows of the dataset
- Checking the shape of the dataset
- Checking the data types to ensure the data is stored in the correct format
- Getting the statistical summary for the variables
- Checking for missing values
- Checking for duplicates
### Showing the first and last five rows of the dataset
"""
# returning the first 5 rows using the dataframe head method
data.head()
# returning the last 5 rows using dataframe tail method
data.tail()
"""### Checking the dataset shape"""
# checking the shape of the dataframe (number of rows and columns)
print("There are", data.shape[0], 'rows and', data.shape[1], "columns.")
"""### Checking the columns data types for the dataset"""
# Using the dataframe info() method to print a concise summary of the DataFrame
data.info()
"""**Observation**
* The dataset contains 15 series (columns) of which nine of the series are of the float datatype (screen_size, main_camera_mp, selfie_camera_mp, int_memory, ram, battery, weight, normalized_used_price, and normalized_new_price), two(2) of the series are of the integer datatype (release_year and days_used) and four(4) of the series are of the object datatype (brand_name, os, 4g, and 5g).
* Total memory usage is approximately 404.9 KB.
### Statistical summary of the dataset
"""
# checking the statistical summary of the data using describe command and transposing.
data.describe().T
"""**Observation**
* The size of phone screen ranges from 5.08 to 30.71 cm, with an average size around 13.71 cm and a standard deviation of 3.81 cm. The screen size of 75% of the phones are below 15.34cm. This indicates that most of the buyers of phones prefer screen sizes below 16 inches.
* The pixel of the phone main camera ranges from 0.08 to 48 mega pixels, with an average pixel around 9.48 megapixels and a standard deviation of 4.82 megapixels. The pixel of 75% of the phones are below 13.00. This indicates that most of the buyers of phones buy phones with main camera pixel below 13.00 megapixels.
* The RAM of the phones ranges from 0.02 to 12 gb, with an average byte around 4.036gb and a standard deviation of 1.37gb. The RAM of 75% of the phones are below 4gb.
* The Weight of the phones ranges from 69 to 855 grams, with average grams of 182.75grams and a standard deviation of 88.41grams. The Weight of 75% of the phones are below 185grams.
* The normalized price of used phones ranges from 1.53 to 6.62 euros, with average price of 4.36euros and a standard deviation of 0.589 euros. 75% of the price of used phones are below 4.76 euros.
* The normalized price of new phones ranges from 2.901 to 7.85 euros, with average price of 5.23 euros and a standard deviation of 0.68 euros. 75% of the price of new phones are below 5.67 euros.
### Checking for missing values
"""
# Checking for missing values
data.isnull().sum()
"""**Observation**
* Six(6) of the series (main_camera_mp(179), selfie_camera_mp (2), int_memory (4), ram (4), battery(6), and weight(7)) contains missing values.
### Checking for duplicate values
"""
# checking for duplicates using the duplicated() method
duplicates =data.duplicated().value_counts()
duplicates
"""**Observation**
* There are no duplicate values in the dataset
## Exploratory Data Analysis (EDA)
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
**Questions**:
1. What does the distribution of normalized used device prices look like?
2. What percentage of the used device market is dominated by Android devices?
3. The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?
4. A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?
5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?
6. A lot of devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?
7. Which attributes are highly correlated with the normalized price of a used device?
### Univariate Analysis
### Screen Size
"""
# working on a copy of the data so the original dataframe stays unchanged
df = data.copy()
# check screen_size
df['screen_size'].nunique()
df['screen_size'].mode()
"""**Observation**:
There 142 uniques screen sizes in the market
"""
# creating a combined histogram and boxplot through a reusable function
def histobox_plot(df, column, figsize=(15, 10), kde=False, bins=None):
    # set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above)
    sns.set(style="darkgrid")
    # creating a figure composed of two matplotlib.Axes objects (ax_box and ax_hist)
    f, (ax_box, ax_hist) = plt.subplots(
        2, sharex=True, gridspec_kw={"height_ratios": (0.15, 0.85)}, figsize=figsize
    )
    # assigning a graph to each ax
    sns.boxplot(data=df, x=column, ax=ax_box, showmeans=True, color="violet")
    sns.histplot(data=df, x=column, ax=ax_hist, kde=kde, bins=bins if bins else "auto")
    # add the mean (green dashed line) and median (black solid line) to the histogram
    ax_hist.axvline(df[column].mean(), color="green", linestyle="--")
    ax_hist.axvline(df[column].median(), color="black", linestyle="-")
    # remove the x-axis name for the boxplot
    ax_box.set(xlabel="")
    # label each histogram bar with its count
    for p in ax_hist.patches:
        height = p.get_height()  # height (count) of each bar
        ax_hist.text(
            x=p.get_x() + (p.get_width() / 2),  # centered horizontally on the bar
            y=height + 0.2,  # padded slightly above the bar
            s="{:.0f}".format(height),  # count, formatted without decimals
            ha="center",  # horizontal alignment
        )

histobox_plot(df, "screen_size")
"""**Observation**
* There are 142 unique screen sizes in the market.
* 50% of the screen sizes are below 12.83cm
* The average screen size is higher than the median indicating that the distribution is a bit right-skewed
* The most frequent screen size in the market is 12.7cm.
* There are outliers
**`normalized_used_price`**
"""
# check normalized_used_price
df['normalized_used_price'].nunique()
histobox_plot(df, "normalized_used_price")
"""**Observation**
* There are 3094 unique normalized used prices in the market.
* 50% of the normalized used prices are below 4.41 euros
* The average normalized used prices is approximately equal to the median indicating that the distribution is normally distributed
* The column contains outliers
**`normalized_new_price`**
"""
# check normalized_new_price
df['normalized_new_price'].nunique()
histobox_plot(df, "normalized_new_price")
"""**Observation**
* There are 2988 unique normalized new prices in the market.
* 50% of the normalized new prices are below 5.25 euros
* The average normalized new prices is approximately equal to the median indicating that the distribution is normally distributed
* The column contains outliers
**`main_camera_mp`**
"""
# check main_camera_mp
df['main_camera_mp'].nunique()
histobox_plot(df, "main_camera_mp")
"""**Observation**
* There are 41 unique pixels of the main camera in the market.
* 50% of the main cameras are below 8.00 megapixels
* The average main camera is higher than the median indicating that the distribution is a bit right-skewed
* The column contains outliers
**`selfie_camera_mp`**
"""
# check selfie_camera_mp
df['selfie_camera_mp'].nunique()
histobox_plot(df, "selfie_camera_mp")
"""**Observation**
* There are 37 unique pixels of the selfie camera in the market.
* 50% of the selfie cameras are below 5.00 megapixels
* The average selfie camera is higher than the median indicating that the distribution is a bit right-skewed
* The column contains outliers
**`int_memory`**
"""
# check int_memory
df['int_memory'].nunique()
histobox_plot(df, "int_memory")
"""**Observation**
* There are 15 unique internal memories in the market.
* 50% of the interna memories are below 32gb
* The average internal memory is higher than the median indicating that the distribution is a bit right-skewed
* There are outliers
**`ram`**
"""
# check ram
df['ram'].nunique()
histobox_plot(df, "ram")
"""**Observation**
* There are 15 unique RAMs in the market.
* 50% of the RAMs are below 4gb
* The average RAM is approximately equal to the median indicating that the distribution is normally distributed
* There are outliers
**`weight`**
"""
# check weight
df['weight'].nunique()
histobox_plot(df, "weight")
"""**Observation**
* There are 555 unique phone weights in the market.
* 50% of the phone weights are below 160grams
* The average weight of phones is higher than the median indicating that the distribution is a bit right-skewed
* There are outliers
**`battery`**
"""
# check battery
df['battery'].nunique()
histobox_plot(df, "battery")
"""**Observation**
* There are 324 unique battery capacity in the market.
* 50% of the battery capacity are below 3000MAh
* The average battery capacity is higher than the median indicating that the distribution is a bit right-skewed
* There are outliers
**`days_used`**
"""
# check days_used
df['days_used'].nunique()
histobox_plot(df, "days_used")
"""**Observation**
* 50% of the number of days the used/refurbished device has been used is below 690 days
* The average number of days the used/refurbished device has been used is lesser than the median indicating that the distribution is a bit left-skewed
* There are outliers
"""
# function to create labeled bar plots for categorical columns
def barplot(data, column):
    plt.figure(figsize=(10, 5))
    bxp = sns.countplot(data=data, x=column)
    bxp.set_xlabel(column, fontsize=14)
    bxp.axes.set_title("Bar Chart Plot of " + column.upper(), fontsize=16)
    plt.xticks(rotation=90)
    total = len(data[column])  # number of rows in the column
    # label each bar in the countplot with its count and percentage
    for p in bxp.patches:
        height = p.get_height()  # height (count) of each bar
        # count label, centered above the bar
        bxp.text(
            x=p.get_x() + (p.get_width() / 2),
            y=height + 0.2,
            s="{:.0f}".format(height),
            ha="center",
        )
        # percentage label, placed at the right edge of the bar, just above it
        bxp.text(
            x=p.get_x() + p.get_width(),
            y=height + 0.2,
            s="{:.0f}%".format(100 * height / total),
            ha="left",
        )
    plt.show()
"""**OS**"""
# check os
df['os'].nunique()
barplot(df,"os")
"""#### Observations:
* There are 4 unique Operating system (OS) in the dataset.
* The distribution of OS types show that OS are not equally distributed.
* The most frequent OS type is the Android OS
* iOS appears to be the least OS type used by the phones.
**`brand_name`**
"""
# check brand_name
df['brand_name'].nunique()
barplot(df,"brand_name")
"""#### Observations:
* There are 34 unique Brand Names in the dataset.
* The distribution of Brand names show that Brand names are not equally distributed.
* The most frequent Brand names is Others followed by Samsung
* Infinix appears to be the least Brand name of the used phones.
**`4g`**
"""
# check 4g
df['4g'].nunique()
barplot(df,"4g")
"""#### Observations:
* There are 2 unique values in the 4g column in the dataset.
* The distribution of phones with 4g and phones without 4g are not equally distributed.
* Majority of the phones 2335 (68%) has 4g available in the phone
**`5g`**
"""
# check 5g
df['5g'].nunique()
barplot(df,"5g")
"""#### Observations:
* There are 2 unique values in the 5g column in the dataset.
* The distribution of phones with 5g and phones without 5g are not equally distributed.
* Majority of the phones 3302 (96%) has 5g available in the phone
**`release_year`**
"""
# check release_year
df['release_year'].nunique()
barplot(df,"release_year")
"""#### Observations:
* There are 8 unique years in the dataset.
* The distribution of release years show that release years are not equally distributed.
* Most of the phones released were released in the year 2014(642 phones) follwed by 2013 (570 phones).
### Bivariate Analysis
**Checking Correlation between variables**
"""
plt.figure(figsize=(15,10))
# drop the 'release_year' column since it is time-related
newdf1 = df.drop(["release_year"], axis=1)
# numeric_only=True restricts the correlation matrix to the numeric columns
ax = sns.heatmap(newdf1.corr(numeric_only=True), annot=True, cmap="Spectral", fmt=".2f", vmin=-1, vmax=1)
ax.set(title='Correlation between numeric variables')
plt.show()
"""**Observation**
* Normalized used price is strongly correlated with main camera (r=0.59), ram(r=0.52), screen size (r=0.61) battery(r=0.61) and normalized new price (r=0.83)
**The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?**
"""
plt.figure(figsize=(15,5))
bxp=sns.boxplot(data=df,x='brand_name', y='ram')
bxp.set_xlabel("Brand Name", fontsize=14)
bxp.axes.set_title("How RAM varies with Brand name ", fontsize=16)
plt.xticks(rotation=90)
plt.show()
"""** Observation**
* The variation of RAM accross the different brand names is not equally distributed
* Brands like the Celkon and Nokia phones has ram Below 4gb
* Brand names like Google and OnePlus all has RAM above 4gb
* OnePlus has 75% of its Phone RAM below 8gb
* Honor has maximum of 8gb RAM and minimum of 2gb RAM
**A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?**
"""
# creating a new dataframe for devices with batteries exceeding 4500 mAh
largebattery = df[df.battery > 4500]
# checking the number of rows and columns for devices with batteries over 4500 mAh
largebattery.shape
plt.figure(figsize=(15, 5))
# plotting weight by brand for the large-battery devices only
bxp = sns.boxplot(data=largebattery, x="brand_name", y="weight")
bxp.set_xlabel("Brand Name", fontsize=14)
bxp.axes.set_title("Weight by brand for devices with batteries over 4500 mAh", fontsize=16)
plt.xticks(rotation=90)
plt.show()
"""**observation**
* The variation of weight for phones and tablets offering large batteries (more than 4500 mAh) is not equally distributed
* Among the brand names with battery above 4500mAh, Apple has the highest weight.
* Karbon is one of the brand of phones with the least weight that has battery above 4500mAh
**Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?**
"""
# screen_size is in cm, so the 6-inch threshold is converted to cm (6 * 2.54)
largescreen = df[df.screen_size > 6 * 2.54]
largescreen.shape
ax=sns.countplot(data = largescreen, x = 'brand_name')
plt.xticks(rotation=90)
ax.set(title='Distribution of Brand Name with screen size greater than 6 inches')
ax.set_xlabel("Brand Name", fontsize=14)
# label each bar in the countplot
for p in ax.patches:
    height = p.get_height()  # height (count) of each bar
    ax.text(
        x=p.get_x() + (p.get_width() / 2),  # centered horizontally on the bar
        y=height + 0.2,  # padded slightly above the bar
        s="{:.0f}".format(height),  # count, formatted without decimals
        ha="center",
    )
"""**Observation**
* There are 1099 phones and tablets are available across different brands with a screen size larger than 6 inches.
* Huawei has the highest number of phones and tablet with screen size larger than 6 inches followed by samsung.
**A lot of devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?**
"""
selfiecamera = df[df.selfie_camera_mp > 8]
selfiecamera.shape
ax=sns.countplot(data = selfiecamera, x = 'brand_name')
plt.xticks(rotation=90)
ax.set(title='Distribution of Brand Name with selfie camera greater than 8mp')
ax.set_xlabel("Brand Name", fontsize=14)
# label each bar in the countplot
for p in ax.patches:
    height = p.get_height()  # height (count) of each bar
    ax.text(
        x=p.get_x() + (p.get_width() / 2),
        y=height + 0.2,
        s="{:.0f}".format(height),
        ha="center",
    )
"""**observation**
* The distribution of devices offering greater than 8MP selfie cameras across brands is not equally distributed.
* Huawei has the highest number of of devices offering greater than 8MP selfie cameras followed by Vivo.
"""
# for the main (back) camera we use a 16 MP threshold
backcamera = df[df.main_camera_mp > 16]
backcamera.shape
ax=sns.countplot(data = backcamera, x = 'brand_name')
plt.xticks(rotation=90)
ax.set(title='Distribution of Brand Name with back camera greater than 16mp')
ax.set_xlabel("Brand Name", fontsize=14)
# label each bar in the countplot
for p in ax.patches:
    height = p.get_height()  # height (count) of each bar
    ax.text(
        x=p.get_x() + (p.get_width() / 2),
        y=height + 0.2,
        s="{:.0f}".format(height),
        ha="center",
    )
"""**observation**
* The distribution of devices offering greater than 16MP main cameras across brands is not equally distributed.
* Sony has the highest number of of devices offering greater than 16MP main cameras followed by Motorola.
**Variation of price of used devices across the years.**
"""
plt.figure(figsize=(12, 5))
sns.lineplot(data=df, x="release_year", y="normalized_used_price")
plt.show()
"""**observation**
* There is a linear variation of the price of used/refurbished devices accross the years. The line plot above shows that as the year advances the normalized price of used devices increases.
**The variations in the prices of used tablets and phones that support 5G and 4G networks.**
"""
plt.figure(figsize=(10, 4))
plt.subplot(121)
sns.boxplot(data=df, x="4g", y="normalized_used_price")
plt.subplot(122)
sns.boxplot(data=df, x="5g", y="normalized_used_price")
plt.show()
"""**observation**
* Price of phones with 5g network are higher than price of phones with 4g network
## Data Preprocessing
- Missing value treatment
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
**Missing value treatment**
"""
#copying df dataframe to create another dataframe
newdf = df.copy()
# Checking for missing values
newdf.isnull().sum()
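# Imputation proceeds in three stages: first by the median within each
# (release_year, brand_name) group, then by the median within each brand_name
# for columns that are still missing, and finally by the overall column median.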
cols_impute = [
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
]
for col in cols_impute:
    newdf[col] = newdf[col].fillna(
        value=newdf.groupby(["release_year", "brand_name"])[col].transform("median")
    )
# checking for missing values
newdf.isnull().sum()
cols_impute = [
"main_camera_mp",
"selfie_camera_mp",
"battery",
"weight",
]
for col in cols_impute:
    newdf[col] = newdf[col].fillna(
        value=newdf.groupby(["brand_name"])[col].transform("median")
    )
# checking for missing values
newdf.isnull().sum()
newdf["main_camera_mp"] = newdf["main_camera_mp"].fillna(newdf["main_camera_mp"].median())
# checking for missing values
newdf.isnull().sum()
# rechecking the column datatypes using the .info() method
newdf.info()
"""## EDA
- It is a good idea to explore the data once again after manipulating it.
### Checking for Outliers
"""
# outlier detection using boxplot
num_cols = newdf.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 15))
for i, variable in enumerate(num_cols):
    plt.subplot(4, 3, i + 1)
    sns.boxplot(data=newdf, x=variable)
    plt.tight_layout(pad=2)
plt.show()
"""## Model Building - Linear Regression"""
# independent variables
X = newdf.drop(["normalized_used_price"], axis=1)
# dependent variable
y = newdf[["normalized_used_price"]]
print(X.head())
print(y.head())
# let's add the intercept to data
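# (statsmodels' OLS does not include an intercept by default, so a constant column is added explicitly)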
X = sm.add_constant(X)
# creating dummy variables
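# drop_first=True drops one level per categorical variable to avoid the
# dummy-variable trap (perfect multicollinearity with the intercept)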
X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
    dtype=float,  # ensure numeric dummies so statsmodels OLS receives a float matrix
)
X.head()
newdf['main_camera_mp'].unique()
"""## Split Data
**We will now split X and y into train and test sets in a 70:30 ratio.**
"""
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1
)
print(X_train.shape[0])#Number of rows in train data
print(X_test.shape[0])#Number of rows in test data
"""## Fit Linear Model"""
olsmod = sm.OLS(y_train, X_train)
olsres = olsmod.fit()
# let's print the regression summary
print(olsres.summary())
"""## Model Performance Check
Let's check the performance of the model using different metrics.
* We will be using metric functions defined in sklearn for RMSE, MAE, and $R^2$.
* We will define a function to calculate MAPE and adjusted $R^2$.
- The mean absolute percentage error (MAPE) measures the accuracy of predictions as a percentage. It is the average, over all observations, of the absolute difference between the predicted and actual values divided by the actual value. It works best when there are no extreme values in the data and none of the actual values are 0.
* We will create a function which will print out all the above metrics in one go.
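For reference, the two metrics we define ourselves are computed as
$$\text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|, \qquad R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1},$$
where $n$ is the number of observations, $k$ is the number of predictors, $y_i$ are the actual values, and $\hat{y}_i$ are the predictions.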
"""
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]  # number of observations
    k = predictors.shape[1]  # number of predictors
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

# function to compute MAPE (assumes none of the actual values are 0)
def mape_score(targets, predictions):
    # flatten to 1-D arrays so a single-column DataFrame and a Series align correctly
    targets = np.asarray(targets).ravel()
    predictions = np.asarray(predictions).ravel()
    return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )
    return df_perf
# checking model performance on the train set (seen 70% of the data)
print("Training Performance\n")
olsmodel_train_perf = model_performance_regression(olsres, X_train, y_train)
olsmodel_train_perf

# checking model performance on the test set (unseen 30% of the data)
print("Test Performance\n")
olsmodel_test_perf = model_performance_regression(olsres, X_test, y_test)
olsmodel_test_perf
"""## Checking Linear Regression Assumptions
- In order to make statistical inferences from a linear regression model, it is important to ensure that the assumptions of linear regression are satisfied.
We will be checking the following Linear Regression assumptions:
1. **No Multicollinearity**
2. **Linearity of variables**
3. **Independence of error terms**
4. **Normality of error terms**
5. **No Heteroscedasticity**
### Test for Multicollinearity
* **Variance Inflation Factor**: Variance inflation factors measure the inflation in the variances of the regression coefficient estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient $\beta_k$ is "inflated" by the existence of correlation among the predictor variables in the model.
* **General Rule of Thumb**:
    - If VIF is 1, then there is no correlation among the $k$th predictor and the remaining predictor variables, and hence, the variance of $\beta_k$ is not inflated at all.
    - If VIF exceeds 5, we say there is moderate multicollinearity, and if it is 10 or above, it shows signs of high multicollinearity.
- The purpose of the analysis should dictate which threshold to use.
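
For the $k$th predictor, the VIF can be computed as $VIF_k = \frac{1}{1 - R_k^2}$, where $R_k^2$ is the $R^2$ obtained by regressing the $k$th predictor on all the remaining predictors.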
"""
# let's check the VIF of the predictors using variance_inflation_factor (imported earlier)
vif_series1 = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))
"""#### Let's remove/drop multicollinear columns one by one and observe the effect on our predictive model."""
X_train2 = X_train.drop(["brand_name_Apple"], axis=1)
olsmod_1 = sm.OLS(y_train, X_train2)
olsres_1 = olsmod_1.fit()
print(
"R-squared:",
np.round(olsres_1.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_1.rsquared_adj, 3),
)
"""**Since there is no effect on adj. R-squared after dropping the 'brand_name_Apple' column, we can remove it from the training set.**"""
X_train = X_train.drop(["brand_name_Apple"], axis=1)
olsmod_2 = sm.OLS(y_train, X_train)
olsres_2 = olsmod_2.fit()
print(olsres_2.summary())
"""### Checking if multicollinearity is still present in the data."""
vif_series2 = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series2))
X_train3 = X_train.drop(["brand_name_Huawei"], axis=1)
olsmod_2b = sm.OLS(y_train, X_train3)
olsres_2b = olsmod_2b.fit()
print(
    "R-squared:",
    np.round(olsres_2b.rsquared, 3),
    "\nAdjusted R-squared:",
    np.round(olsres_2b.rsquared_adj, 3),
)
"""**Since there is no effect on adj. R-squared after dropping the 'brand_name_Huawei' column, we can remove it from the training set.**"""
X_train = X_train.drop(["brand_name_Huawei"], axis=1)
olsmod_3 = sm.OLS(y_train, X_train)
olsres_3 = olsmod_3.fit()
print(olsres_3.summary())
"""### Checking if multicollinearity is still present in the data."""
vif_series3 = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series3))
X_train4 = X_train.drop(["screen_size"], axis=1)
olsmod_3b = sm.OLS(y_train, X_train4)
olsres_3b = olsmod_3b.fit()
print(
    "R-squared:",
    np.round(olsres_3b.rsquared, 3),
    "\nAdjusted R-squared:",
    np.round(olsres_3b.rsquared_adj, 3),
)
X_train = X_train.drop(["screen_size"], axis=1)
olsmod_4 = sm.OLS(y_train, X_train)
olsres_4 = olsmod_4.fit()
print(olsres_4.summary())
vif_series4 = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series4))
"""### Now that we do not have multicollinearity in our data, the p-values of the coefficients have become reliable and we can remove the non-significant predictor variables."""
# initial list of columns
predictors = X_train.copy()
cols = predictors.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # defining the train set
    x_train_aux = predictors[cols]
    # fitting the model
    model = sm.OLS(y_train, x_train_aux).fit()
    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)
    # name of the variable with the maximum p-value
    feature_with_p_max = p_values.idxmax()
    # drop the feature if its p-value exceeds the 0.05 significance level
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features = cols
print(selected_features)
"""The selected columns are: 'main_camera_mp', 'selfie_camera_mp', 'ram', 'weight', 'release_year', 'normalized_new_price', 'brand_name_Karbonn', 'brand_name_Lenovo', 'brand_name_Xiaomi', 'os_Others', '4g_yes', '5g_yes']"""
x_train3 = X_train[selected_features]
olsmod_77 = sm.OLS(y_train, x_train3)
olsres_77 = olsmod_77.fit()
print(olsres_77.summary())
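"""As a quick check (a minimal sketch reusing the `model_performance_regression` utility defined earlier), we can recompute the performance metrics for the final model on the train and test sets; note that the test set must be restricted to the same selected features."""
# performance of the final model on the train set
olsmodel_final_train_perf = model_performance_regression(olsres_77, x_train3, y_train)
print(olsmodel_final_train_perf)
# performance of the final model on the unseen test set
olsmodel_final_test_perf = model_performance_regression(olsres_77, X_test[selected_features], y_test)
print(olsmodel_final_test_perf)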
"""**After dropping the features causing strong multicollinearity and the statistically insignificant ones, our model performance hasn't dropped sharply (adj. R-squared has dropped from 0.842 to 0.838). This shows that these variables did not have much predictive power.**
**Test for linearity**
"""
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train.values.flatten() # actual values
df_pred["Fitted Values"] = olsres_77.fittedvalues.values # predicted values
df_pred["Residuals"] = olsres_77.resid.values # residuals
df_pred.head()
# let us plot the fitted values vs residuals
sns.set_style("whitegrid")
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
"""**observation**
* The residuals have no pattern, meaning that the data is linear
"""