
Tanwar-12/AUTO-MOBILE-PRICE-PREDICTION


MACHINE LEARNING PROJECT

AUTO-MOBILE PRICE PREDICTION


1. BUSINESS CASE: THE AIM IS TO PREDICT THE PRICE OF A CAR USING ALL THE GIVEN FEATURES

2. IMPORTING THE PYTHON LIBRARIES

3. LOADING THE DATASET

4. DOMAIN ANALYSIS

  1. symboling: This rating corresponds to the degree to which the auto is more risky than its price indicates.

    • Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale.
    • Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe. (-3, -2, -1, 0, 1, 2, 3.)
  2. normalized-losses: This factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc.), and represents the average loss per car per year. (continuous from 65 to 256)

  3. make: This represents the maker/make of auto. (alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo)

  4. fuel-type: Type of fuel used in engine. (diesel, gas)

  5. aspiration: Whether the auto has a turbo or naturally aspirated engine. (std, turbo)

  6. num-of-doors: Number of doors in auto. (four, two)

  7. body-style: Type/style of auto body. (hardtop, wagon, sedan, hatchback, convertible)

  8. drive-wheels: The wheel drive system which transmits force causing the auto to move. (4wd, fwd, rwd)

  9. engine-location: Placement of engine in auto. (front, rear)

  10. wheel-base: The distance between the wheel axles - centers of front and rear wheels. (continuous from 86.6 to 120.9)

  11. length: Length of auto. (continuous from 141.1 to 208.1)

  12. width: Width of auto. (continuous from 60.3 to 72.3)

  13. height: Height of auto. (continuous from 47.8 to 59.8)

  14. curb-weight: Weight of the car with standard components. (continuous from 1488 to 4066)

  15. engine-type: Type of engine. (dohc (double overhead cam), dohcv, l, ohc, ohcf, ohcv (overhead cam valve), rotor)

  16. num-of-cylinders: Number of cylinders used in auto. (eight, five, four, six, three, twelve, two)

  17. engine-size: Size of auto engine. (continuous from 61 to 326)

  18. fuel-system: The system that helps transfer fuel to the engine. (1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi)

  19. bore: Hollow part inside engine/inner diameter of the cylinder. (continuous from 2.54 to 3.94)

  20. stroke: The full travel of piston along the cylinder. (continuous from 2.07 to 4.17)

  21. compression-ratio: The ratio of maximum to minimum volume in the cylinder of an internal combustion engine. (continuous from 7 to 23)

  22. horsepower: Power that an engine produces. (continuous from 48 to 288)

  23. peak-rpm: RPM that the engine produces at highest horsepower. (continuous from 4150 to 6600)

  24. city-mpg: Lowest mpg rating for an auto. (continuous from 13 to 49)

  25. highway-mpg: Highest/average mpg rating of an auto while driving on an open stretch road. (continuous from 16 to 54)

  26. price: Cost of auto. (continuous from 5118 to 45400)
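
The 26 attributes listed above can serve as column names when the data is loaded. The sketch below is only an illustration: it assumes a headerless, comma-separated file in which missing values are written as `?`, which this README does not confirm. The `COLUMNS` name and the two sample rows are placeholders with made-up values, not rows quoted from the project's dataset.

```python
import io

import pandas as pd

# Column names taken from the domain analysis list (26 attributes).
COLUMNS = [
    "symboling", "normalized-losses", "make", "fuel-type", "aspiration",
    "num-of-doors", "body-style", "drive-wheels", "engine-location",
    "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
    "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
    "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg",
    "price",
]

# Two illustrative rows standing in for the real file (values are made up).
SAMPLE = io.StringIO(
    "3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,"
    "2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495\n"
    "1,118,audi,diesel,turbo,four,sedan,fwd,front,99.8,176.6,66.2,54.3,"
    "2337,ohc,four,109,idi,3.19,3.40,21.0,102,5500,24,30,13950\n"
)

# header=None: assume no header row; na_values="?": treat "?" as missing.
df = pd.read_csv(SAMPLE, header=None, names=COLUMNS, na_values="?")

print(df.shape)
# Count of missing values per column, as in a "missing or null values" check.
print(df.isnull().sum()["normalized-losses"])
```

With a real file, `SAMPLE` would be replaced by the path to the dataset.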

BASIC CHECKS

5. EXPLORATORY DATA ANALYSIS (EDA)

5.1 UNIVARIATE ANALYSIS:

DATA INSIGHTS:

  • More than 30% of cars have a risk value of 0, which means they are of average risk.
  • 15% of the cars are of the make Toyota.
  • 90% of cars have a gas engine and only 10% have a diesel engine.
  • 82% of cars have a naturally aspirated engine and 18% have a turbo engine.
  • 56% of cars have 4 doors, 43% have 2 doors.
  • More than 45% of cars are sedans and around 34% are hatchbacks.
  • Almost 60% of cars have front-wheel drive.
  • Just 1% of cars have the engine located at the rear of the car.
  • For most of the cars, the wheel base is around 95 to 100.
  • Around 25% of cars have a length of 170 units. This feature shows an almost normal distribution.
  • Around 25% of cars have a width in the range of 66 to 68 units.
  • Most of the cars have a height in the range of 55 to 56 units.
  • 25% of cars have a curb weight of 2500 units.
  • More than 70% of cars have an ohc (overhead cam) engine.
  • Almost 80% of cars have 4 cylinders.
  • 45% of cars have an engine size of 100.
  • Around 45% of cars have an mpfi fuel system.
  • Most cars have a bore of more than 3.8 units.
  • Most cars have a stroke length of more than 3.5 units.
  • 65% of cars have a compression ratio of 10.
  • More than 18% of cars have a horsepower of 68 hp.
  • Most of the cars have a peak rpm of 5500 or 4800.
  • Around 30% of cars give a mileage of 25 mpg in the city.
  • More than 20% of cars give a mileage in the range of 25 to 35 mpg on the highway.
  • 40% of cars have a price in the range of 5k to 10k.
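
Percentages like the ones above come from counting how often each category occurs in a column. A minimal sketch of that calculation, using a hypothetical `fuel_type` column and a helper name (`category_shares`) chosen for illustration; pandas users would typically get the same shares from `Series.value_counts(normalize=True)`:

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the column as a percentage."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}


# Hypothetical column: 9 gas cars and 1 diesel car.
fuel_type = ["gas"] * 9 + ["diesel"]
shares = category_shares(fuel_type)
print(shares)
```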

5.2 BIVARIATE ANALYSIS:


DATA INSIGHTS:

  • Jaguar cars have the highest price.
  • Diesel engine cars are priced higher than gas engine cars.
  • Turbo engine cars are pricier than naturally aspirated cars.
  • Cars with 4 doors are costlier than 2 door cars.
  • Convertible and hardtop cars have a higher price than other body styles.
  • Rear-wheel drive cars have a higher price than front-wheel or 4-wheel drive cars.
  • Rear engine cars are costlier than front engine cars.
  • Cars with an overhead cam valve (ohcv) engine have a higher price.
  • 8 cylinder engine cars have the highest price.
  • Cars with an mpfi fuel system have the highest price.

DATA INSIGHTS:

  • At an average normalized loss of 142, the price of cars is 35k.

  • Cars with a bore of 3.8 units have the highest price.

  • Cars with a stroke length of 2.76 units have the highest price.

  • Cars with a horsepower of 184 have the highest price.

  • The car with a peak RPM of 5900 is the costliest.

  • A car with a symboling value of -1 has a higher price than one with a value of 3.

  • The car with a wheel base of 112 has the highest price.

  • The car with a length of 199.2, width of 72 and height of 55.4 is highly priced.

  • The car with a curb weight of 3715 is highly priced.

  • An engine size of 304 has the highest price.

  • A compression ratio of 11.5 has a higher price.

  • A city mpg of 14 has the highest price.

  • A highway mpg of 16 has the highest price.

5.3 MULTIVARIATE ANALYSIS:

  • Data Insight: Mercedes-Benz cars with gas fuel type have the highest price.
  • Data Insight: With an increase in the weight of the vehicle, city mpg decreases.

6. FEATURE ENGINEERING / DATA PREPROCESSING

6.1 Checking for missing or null values

6.2 Converting categorical data to numerical data

6.3 Using Label Encoder
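
As a rough illustration of what label encoding does to a categorical column, the sketch below applies scikit-learn's `LabelEncoder` to a few made-up `fuel-type` values. The README does not show which columns were encoded or how, so the input list here is purely hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of a categorical column.
fuel_type = ["gas", "diesel", "gas", "gas"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(fuel_type)

# classes_ holds the distinct labels in sorted order; each label is
# replaced by its index in that array.
print(list(encoder.classes_))
print(encoded.tolist())

# The original labels can be recovered from the codes.
print(list(encoder.inverse_transform(encoded)))
```

Label encoding assigns integers by sorted label order, so the numeric order carries no meaning for unordered categories such as make or body-style; tree-based models are usually less sensitive to this than linear or distance-based ones.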

6.4 HANDLING OUTLIERS
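
The README does not say which outlier treatment was used. One common option is to cap values outside the interquartile-range (IQR) fences, sketched below under that assumption. The function name `cap_outliers_iqr`, the multiplier `k=1.5`, and the sample prices are all illustrative choices.

```python
import statistics


def cap_outliers_iqr(values, k=1.5):
    """Clip each value to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


# Made-up prices with one extreme value at the end.
prices = [5118, 6000, 7000, 8000, 9000, 10000, 45400]
capped = cap_outliers_iqr(prices)
print(capped)
```

Here Q1 is 6500 and Q3 is 9500, so the upper fence is 14000 and only the last value is changed.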

7. FEATURE SELECTION

7.1 Checking for correlation

  • USE OF HEAT MAP
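
A correlation heat map is a colored view of pairwise correlation coefficients. The sketch below computes the Pearson coefficient by hand on made-up columns to show what each cell of such a map contains; it is not the project's code, and the `pearson` helper and sample values are hypothetical. In pandas, `DataFrame.corr()` computes the same Pearson coefficient by default.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up columns: price rises linearly with engine size, city mpg falls.
engine_size = [90, 110, 130, 150, 170]
city_mpg = [38, 31, 26, 22, 17]
price = [6000, 8000, 10000, 12000, 14000]

features = {"engine-size": engine_size, "city-mpg": city_mpg}
for name, column in features.items():
    print(name, round(pearson(column, price), 3))
```

Features whose correlation with the target is close to zero, or pairs of features that are almost perfectly correlated with each other, are the usual candidates for removal at this step.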

8. MODEL BUILDING & EVALUATION

LINEAR REGRESSION:

  • Training R2 accuracy using Linear Regression is: 86.53144036257537
  • Testing R2 accuracy using Linear Regression is: 76.72938159080243
  • Testing Adjusted R2 score is: 48.28751464622761
  • MSE score is: 28470850.70003938
  • RMSE score is: 5335.808345512363
  • MAE score is: 3406.7596001032775
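
The metrics reported for each model are related by simple formulas: RMSE is the square root of MSE, and adjusted R2 penalizes R2 for the number of features. The sketch below shows these relationships with the Linear Regression figures. The test-set size (41 rows) and feature count (22) are assumptions that are not stated in this README; they were chosen because they reproduce the reported adjusted R2.

```python
import math


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def regression_errors(y_true, y_pred):
    """Return MSE, RMSE and MAE for paired true and predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


# Reported testing R2 is a percentage, so divide by 100 before adjusting.
r2_test = 76.72938159080243 / 100

# n_samples=41 and n_features=22 are assumed, not taken from the README.
print(100 * adjusted_r2(r2_test, n_samples=41, n_features=22))

# RMSE recomputed from the reported MSE.
print(math.sqrt(28470850.70003938))
```

The large gap between R2 and adjusted R2 follows from having many features relative to the number of test rows.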

K-Neighbors Regressor:

  • Training R2 accuracy using KNN Regression is: 82.50133136918122
  • Testing R2 accuracy using KNN Regression is: 64.93533173518867
  • Testing Adjusted R2 score is: 22.078514967085916
  • MSE score is: 42900490.114146344
  • RMSE score is: 6549.846571801995
  • MAE score is: 3946.4146341463415

SVM:

  • Training R2 accuracy using SVM Regression is: -9.315945709087071
  • Testing R2 accuracy using SVM Regression is: -21.91990376994277
  • Testing Adjusted R2 score is: -170.9331194887617
  • MSE score is: 149165068.0080445
  • RMSE score is: 12213.315193183402
  • MAE score is: 7943.122270031737

DECISION TREE REGRESSION:

  • Training R2 accuracy using Decision tree Regression is: 99.89831862870213
  • Testing R2 accuracy using Decision tree Regression is: 89.4528958266559
  • Testing Adjusted R2 score is: 76.56199072590202
  • MSE score is: 12904041.609756097
  • RMSE score is: 3592.219593754827
  • MAE score is: 1950.3902439024391

GRADIENT BOOSTING:

  • The training R2 accuracy using Gradient Boosting is: 99.22053302005716
  • The testing R2 accuracy using Gradient Boosting is: 94.9681530034094
  • Testing Adjusted R2 score is: 88.81811778535422
  • MSE score is: 6156302.426786809
  • RMSE score is: 2481.1897200308586
  • MAE score is: 1656.7798100138448

XGBOOST REGRESSOR:

  • The training R2 accuracy using XG Boost is: 99.89824021512386
  • The testing R2 accuracy using XG Boost is: 92.59952491032328
  • Testing Adjusted R2 score is: 83.5544998007184
  • MSE score is: 9054242.465007683
  • RMSE score is: 3009.026830223965
  • MAE score is: 1908.5585580221036

RANDOM FOREST REGRESSOR

  • The training R2 accuracy using Random Forest is: 96.98578103166696
  • The testing R2 accuracy using Random Forest is: 86.21079174165337
  • Testing Adjusted R2 score is: 69.35731498145194
  • MSE score is: 16870651.33773875
  • RMSE score is: 4107.389844869701
  • MAE score is: 2489.5789674796747

AFTER HYPERPARAMETER TUNING

  • Training R2 score after Hyperparameter tuning on Random forest is: 99.89831862870213
  • Testing R2 score after Hyperparameter tuning on Random forest is: 94.451462613851
  • Testing Adjusted R2 score is: 87.6699169196689
  • MSE score is: 6788456.445239862
  • RMSE score is: 2605.466646349529
  • MAE score is: 1553.0564634146342
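
The README does not state how the Random Forest was tuned. A minimal sketch of one common approach, a grid search with cross-validation, is shown below on synthetic data. The parameter grid, the synthetic dataset, and all variable names are assumptions made for illustration; they are not the project's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded car features and prices.
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid; real searches usually cover more values.
param_grid = {"n_estimators": [10, 30], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
# R2 of the refitted best model on the held-out split.
test_r2 = search.best_estimator_.score(X_test, y_test)
print(test_r2)
```

Scoring the refitted model on a split that was not used during the search gives a testing R2 comparable to the one reported above.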

So far, Gradient Boosting and Random Forest algorithms have given the best scores:

GRADIENT BOOSTING

  • The training R2 accuracy using Gradient Boosting is: 99.22053302005716

  • The testing R2 accuracy using Gradient Boosting is: 94.9681530034094

  • Testing Adjusted R2 score is: 88.81811778535422

  • MSE score is: 6156302.426786809

  • RMSE score is: 2481.1897200308586

  • MAE score is: 1656.7798100138448

RANDOM FOREST

  • Training R2 score after Hyperparameter tuning on Random forest is: 99.89831862870213

  • Testing R2 score after Hyperparameter tuning on Random forest is: 94.451462613851

  • Testing Adjusted R2 score is: 87.6699169196689

  • MSE score is: 6788456.445239862

  • RMSE score is: 2605.466646349529

  • MAE score is: 1553.0564634146342

CROSS VALIDATION SCORES:

  • Cross validation score of Linear Regression model is: 0.041664079961660404 (fold scores: [0.6154544 0.36697642 -0.85743858])

  • Cross validation score of KNN model is: 0.05309638233246209 (fold scores: [0.30571883 0.55311535 -0.69954503])

  • Cross validation score of SVR model is: -0.15636737650881907 (fold scores: [-0.20935183 -0.2448066 -0.01494369])

  • Cross validation score of Decision Tree model is: 0.2979735956409533 (fold scores: [0.35543453 0.4146508 0.12383546])

  • Cross validation score of Gradient Boost model is: 0.5378544577413296 (fold scores: [0.56014578 0.65703396 0.39638363])

  • Cross validation score of XG Boost model is: 0.6335748203051849 (fold scores: [0.62451125 0.61963064 0.65658258])

  • Cross validation score of Random Forest model is: 0.6737430824341969 (fold scores: [0.76522458 0.62552531 0.63047936])
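
Each reported cross validation score is the mean of a three-value array of per-fold scores, the kind of array that scikit-learn's `cross_val_score` returns with `cv=3` (the README does not show the actual call). The sketch below pairs each printed array with the model whose reported score it reproduces and recomputes the means.

```python
from statistics import mean

# Per-fold scores as printed in this section, paired with models by
# matching each array's mean to the reported score.
fold_scores = {
    "Linear Regression": [0.6154544, 0.36697642, -0.85743858],
    "KNN": [0.30571883, 0.55311535, -0.69954503],
    "SVR": [-0.20935183, -0.2448066, -0.01494369],
    "Decision Tree": [0.35543453, 0.4146508, 0.12383546],
    "Gradient Boost": [0.56014578, 0.65703396, 0.39638363],
    "XG Boost": [0.62451125, 0.61963064, 0.65658258],
    "Random Forest": [0.76522458, 0.62552531, 0.63047936],
}

mean_scores = {model: mean(scores) for model, scores in fold_scores.items()}
for model, score in mean_scores.items():
    print(f"{model}: {score:.4f}")

# Model with the highest mean cross validation score.
best_model = max(mean_scores, key=mean_scores.get)
print(best_model)
```

The negative fold scores for Linear Regression, KNN and SVR show that these models performed worse than predicting the mean price on at least one fold.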


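RESULT: The tuned Random Forest model has the highest cross validation score (0.674) and the lowest MAE (1553.06) of all the models, and its testing R2 (94.45) and RMSE (2605.47) are close to those of Gradient Boosting (94.97 and 2481.19). We therefore choose Random Forest for this problem.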