---
title: "Simple Linear Regression Analysis"
author: "Mohammad Ali Momen"
date: "05/03/2023"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 4
    number_sections: true
    self_contained: true
    code_download: true
    code_folding: show
    df_print: paged
  md_document:
    toc: true
    toc_depth: 2
    toc_float: true
    number_sections: true
    variant: markdown_github
  html_notebook: default
  pdf_document: default
  word_document: default
---

```{css, echo=FALSE}
pre {
  max-height: 300px;
  overflow-y: auto;
}

pre[class] {
  max-height: 200px;
}
```

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, attr.source = '.numberLines')
```

***

**Data Analysis methodology**: CRISP-DM

**Dataset**: Certified features and sale (sold) prices of used Toyota cars in Europe

**Case Goal**: Price recommendation system for a European trading platform for used Toyota cars

***

# Required Libraries
```{r}
library('ggplot2')
library('car')
library('corrplot')
library('moments')
```

***


# Read Data from File
```{r}
data <- read.csv('CS_03.csv', header = TRUE)
dim(data)  # 1325 records, 10 variables
```

***


# Business Understanding

* Understand the business process and its issues
* Understand the context of the problem
* Know the typical order of magnitude of the numbers in this business

***


# Data Understanding
## Data Inspection
Understanding the data from a general (non-statistical) perspective

### Dataset variables definition
```{r}
colnames(data)
```

* **Price**: Sale (sold) price in euros -> the variable we want to predict
* **Age**: Age of the used car in months
* **KM**: Distance driven in kilometers
* **FuelType**: Petrol, Diesel, CNG -> Categorical (factor)
* **HP**: Horsepower
* **MetColor**: -> Categorical (factor)
  + 1 = Metallic color
  + 0 = Not metallic
* **Automatic**: -> Categorical (factor)
  + 1 = Automatic
  + 0 = Not automatic
* **CC**: Engine displacement in cc
* **Doors**: Number of doors -> Categorical (factor)
* **Weight**: Weight in kilograms


## Data Exploring
Understanding the data from a statistical perspective

### Overview of Dataframe
```{r}
class(data)
head(data)
tail(data)
str(data)
summary(data)
```

### Categorical variables should be stored as factor
```{r}
cat_var <- c('FuelType', 'MetColor', 'Automatic', 'Doors')
data[, cat_var] <- lapply(data[, cat_var], factor)

summary(data)
```

> The **CNG** category of **FuelType** has very few observations, which can make price predictions for CNG cars unreliable

> Category **1** of **Automatic** has few observations, which can make price predictions for automatic cars unreliable

### Univariate Profiling
Check each variable individually

#### Categorical variables
Check that each category contains enough cars
```{R}
table(data$FuelType)  # CNG cars sample size is very small -> 17/nrow(data) < 0.05
table(data$MetColor)
table(data$Automatic)
table(data$Doors)  # 2-door cars sample size is very small -> 2/nrow(data) < 0.05

data[data$Doors == 2,]  # abnormality (error in data recording process)
```
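
The 5% rule of thumb used in the comments above can be applied to any factor in one step. The following is a minimal sketch on made-up counts; `rare_levels()` and its `min_share` argument are hypothetical names introduced here for illustration and are not part of the original analysis.

```r
# Return the levels of a factor whose relative frequency is below min_share.
rare_levels <- function(x, min_share = 0.05) {
  shares <- prop.table(table(x))                  # relative frequency per level
  names(shares)[as.vector(shares) < min_share]
}

# Made-up counts for illustration (not the Toyota data)
fuel_demo <- factor(rep(c("Petrol", "Diesel", "CNG"), times = c(80, 18, 2)))
rare_fuel <- rare_levels(fuel_demo)
print(rare_fuel)
```

Levels flagged this way are candidates for merging with another level or for a cautious interpretation of their coefficients.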

#### Continuous variables
```{r}
par(mfrow = c(2, 3))
for(i in c(1, 2, 3, 5, 8, 10)){
	hist(data[,i], xlab='', main=paste('Histogram of', colnames(data)[i]))  # plot 6 histogram of continuous variables in one chart
}  # 'Price' is skewed to right

par(mfrow = c(1, 1))

boxplot(data$Price, main='Price Distribution')  # outlier detection by Tukey method in Price
```
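
The boxplot marks outliers with Tukey's rule: values more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule on made-up prices follows; `tukey_outliers()` is a hypothetical helper, and it uses `quantile()` whereas `boxplot()` uses hinges, so the two can differ slightly on small samples.

```r
# Tukey's rule: flag values more than k * IQR outside the quartiles.
tukey_outliers <- function(x, k = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x[x < q[1] - k * iqr | x > q[2] + k * iqr]
}

price_box <- c(9, 10, 10, 11, 11, 12, 12, 13, 30)  # made-up prices
out_box   <- tukey_outliers(price_box)
print(out_box)
```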

### Bivariate Profiling
Measure pairwise relationships between variables

#### Two Continuous variables (Correlation Analysis)
```{r}
cor(data$Price, data$KM)  # high correlation for this context (Used Car price)
cor_table <- round(cor(data[,c(1, 2, 3, 5, 8, 10)]), 2)  # correlation table between price and continuous variables
cor_table   # choose continuous variables which have high corr with price and consider them as feature in regression model (which variable is important for price prediction)
corrplot(cor_table)
```

> **CC** has a very small correlation with **Price**, so it is unlikely to be a good predictor on its own

__*Multicollinearity*__ (high correlation between predictor variables):

Rule of thumb used here: abs(corr) >= 0.30 signals a possible multicollinearity problem.

* **Weight** has a correlation of 0.66 with **CC**
* **KM** has a correlation of 0.39 with **Age**
* **KM** has a correlation of 0.39 with **CC**
* **KM** has a correlation of -0.33 with **HP**
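
Reading such pairs off the correlation table by eye is error-prone, so the screening can be automated. The sketch below uses synthetic predictors rather than the Toyota data; `high_cor_pairs()` and its `threshold` argument are hypothetical names, and the 0.30 default simply mirrors the rule of thumb stated above.

```r
# List predictor pairs whose absolute correlation reaches a threshold.
high_cor_pairs <- function(df, threshold = 0.30) {
  cm  <- cor(df)
  idx <- which(abs(cm) >= threshold & upper.tri(cm), arr.ind = TRUE)  # each pair once
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             corr = round(cm[idx], 2))
}

# Synthetic predictors for illustration
set.seed(1)
a_syn    <- rnorm(200)
pred_syn <- data.frame(a = a_syn,
                       b = a_syn + rnorm(200, sd = 0.5),  # strongly related to a
                       c = rnorm(200))                     # unrelated noise
pairs_syn <- high_cor_pairs(pred_syn)
print(pairs_syn)
```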

Scatter plots of Price against each of the other continuous variables
```{r}
par(mfrow = c(2, 3))
for(i in c(2, 3, 5, 8, 10)){
	plot(data[,i], data$Price, xlab='', main=paste('Price vs.', colnames(data)[i]))
}
par(mfrow = c(1, 1))
```

***

# Data PreProcessing
## Divide Dataset into Train and Test randomly

Fit the model on the Train dataset

Evaluate model performance on the Test dataset

```{r}
set.seed(123456)
train_cases <- sample(seq_len(nrow(data)), floor(nrow(data) * 0.7))  # 70% train, 30% test (chosen for this dataset size)
train <- data[train_cases, ]
test <- data[-train_cases, ]
```

The distribution of the Train data should be similar to that of the Test data
```{r}
dim(train)
summary(train)
dim(test)
summary(test)
```

***

# Modeling
## Train Simple Linear Regression Model 1 (Univariate Regression)

> Based on the previous analysis, **KM** seems important for explaining the variance of **Price** (corr = -0.52)

Regress Price on KM
```{r}
m1 <- lm(Price ~ KM, data = train)
m1  # regression equation
summary(m1)  # results of m1 regression model
```

> R^2^ = 0.2624: **KM** explains 26% of the variance of **Price**

> In the context of price prediction, R^2^ = 0.26 is not a good model; we aim for at least 0.70
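
In a regression with a single predictor, R^2^ equals the squared correlation between the predictor and the response, which is why a correlation of roughly -0.5 gives an R^2^ of roughly 0.26. The sketch below checks this identity on simulated data; all numbers and names (`km_r2`, `price_r2`) are made up for illustration.

```r
# Simulated data (hypothetical numbers, not the Toyota data)
set.seed(42)
km_r2    <- runif(300, 0, 200000)
price_r2 <- 15000 - 0.04 * km_r2 + rnorm(300, sd = 2500)

fit_r2 <- lm(price_r2 ~ km_r2)

sse <- sum(residuals(fit_r2)^2)               # variation left unexplained
sst <- sum((price_r2 - mean(price_r2))^2)     # total variation of the response
r2_manual <- 1 - sse / sst

# All three values agree
c(manual  = r2_manual,
  summary = summary(fit_r2)$r.squared,
  cor_sq  = cor(price_r2, km_r2)^2)
```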

```{r}
ggplot(train, aes(x=KM, y=Price)) +
	geom_point() +
	geom_smooth(method='lm', se=FALSE)  # 'Price' varies widely around the regression line
```

**Main Question**: can we generalize this line to the population? -> F-test, then t-test

Check Assumptions of Regression

1. Normality of residuals (Errors)
```{r}
m1$residuals

hist(m1$residuals, probability = T)  # skewed to right (have a tail along right)
lines(density(m1$residuals), col='red')

qqnorm(m1$residuals, main='QQ Plot of residuals', pch = 20)  # we have serious deviations from normal distribution
qqline(m1$residuals, col='red')

#p-value < 0.05 reject normality assumption
jarque.test(m1$residuals)

#p-value < 0.05 reject normality assumption
anscombe.test(m1$residuals)
```

> **result**: Residuals are not normally distributed -> the first regression assumption is violated
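
The Jarque-Bera test combines sample skewness and kurtosis into one statistic, JB = n/6 * (S^2 + (K - 3)^2 / 4), which is compared with a chi-squared distribution with 2 degrees of freedom. The sketch below computes it in base R on simulated samples; `jb_manual()` is a hypothetical helper written for illustration and is not the implementation used by `jarque.test()`.

```r
# Jarque-Bera statistic from sample skewness and kurtosis (base R only).
jb_manual <- function(x) {
  n    <- length(x)
  d    <- x - mean(x)
  m2   <- mean(d^2)
  skew <- mean(d^3) / m2^1.5          # 0 for a symmetric distribution
  kurt <- mean(d^4) / m2^2            # 3 for a normal distribution
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(statistic = stat,
    p.value   = pchisq(stat, df = 2, lower.tail = FALSE))
}

set.seed(7)
jb_norm <- jb_manual(rnorm(500))      # symmetric sample
jb_skew <- jb_manual(rexp(500))       # right-skewed sample
rbind(normal = jb_norm, skewed = jb_skew)
```

A right tail like the one in the histogram above inflates the skewness term, which drives the p-value toward zero.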

2. Constant variance of residuals (homoscedasticity)
```{r}
plot(m1)  # Diagnostic Plots
```

> **result**: The diagnostic plots show a heteroscedasticity problem in the model
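
The diagnostic plots are judged by eye. A formal check, which the original analysis does not use, is a Breusch-Pagan-style test: regress the squared residuals on the fitted values and compare n * R^2^ with a chi-squared distribution. The sketch below is a base-R illustration on simulated data; `bp_manual()` is a hypothetical helper. The `car` package loaded earlier also offers `ncvTest()` for a similar purpose.

```r
# Breusch-Pagan-style check: do squared residuals depend on the fitted values?
bp_manual <- function(fit) {
  e2   <- residuals(fit)^2
  f    <- fitted(fit)
  aux  <- lm(e2 ~ f)                               # auxiliary regression
  stat <- length(e2) * summary(aux)$r.squared      # LM statistic = n * R^2
  c(statistic = stat,
    p.value   = pchisq(stat, df = 1, lower.tail = FALSE))
}

set.seed(21)
x_bp     <- runif(500, 1, 10)
y_const  <- 2 + 3 * x_bp + rnorm(500, sd = 2)      # constant error variance
y_spread <- 2 + 3 * x_bp + rnorm(500, sd = x_bp)   # error variance grows with x

bp_const  <- bp_manual(lm(y_const ~ x_bp))
bp_spread <- bp_manual(lm(y_spread ~ x_bp))
rbind(constant = bp_const, growing = bp_spread)
```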

```{r}
plot(data$KM, data$Price)
```

> In this sample and data range, the relationship between the two variables appears to be non-linear

So this model is inadequate, and its t-test results are not yet reliable!

## Train Linear Regression Model 2 (Quadratic Regression)
```{r}
m2 <- lm(Price ~ KM + I(KM^2), data = train)  # Quadratic Regression
summary(m2)
```

Check Assumptions of Regression

1. Normality of residuals (Errors)
```{r}
hist(m2$residuals, freq = F)
lines(density(m2$residuals), col = 'red')  # right skewed

qqnorm(m2$residuals, main = 'QQ Plot of residuals', pch = 20)
qqline(m2$residuals, col = 'red')

#p-value < 0.05 reject normality assumption
jarque.test(m2$residuals)

#p-value < 0.05 reject normality assumption
anscombe.test(m2$residuals)
```

> **result**: Residuals are not normally distributed -> the first regression assumption is violated

2. Constant variance of residuals (homoscedasticity)
```{r}
plot(m2)  # Diagnostic Plots
```

Linear vs. Quadratic Regression
```{r}
ggplot(train, aes(KM, Price)) +
	geom_point() +
	geom_smooth(method = 'lm', formula = y ~ x + I(x^2)) +
	geom_smooth(method = 'lm', formula = y ~ x, color = 'red') +
	ggtitle('Price vs. KM')
```
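
The visual comparison of the two fitted curves can be backed by a partial F-test, since the linear model is nested in the quadratic one. This is a minimal sketch on simulated data with a deliberately curved relationship; the variable names and coefficients are made up.

```r
set.seed(11)
x_sim <- runif(400, 0, 10)
y_sim <- 50 - 6 * x_sim + 0.4 * x_sim^2 + rnorm(400, sd = 2)  # truly curved

lin_sim  <- lm(y_sim ~ x_sim)
quad_sim <- lm(y_sim ~ x_sim + I(x_sim^2))

cmp_sim <- anova(lin_sim, quad_sim)   # partial F-test for the added squared term
print(cmp_sim)
```

A small p-value in the second row indicates that the squared term explains variance the straight line misses.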

Back to the non-normality of the errors: a few outlying observations may be causing it
(in other words, the residuals may not deviate systematically from the normal distribution; only some outliers make them look non-normal)
```{r}
plot(m2)
train2 <- train[!(rownames(train) %in% c(7, 32, 1325)), ]  # remove three outliers (identified on the residuals QQ-plot)
```

## Model 2_2
```{r}
m2_2 <- lm(Price ~ KM + I(KM^2), data = train2)
summary(m2_2)
```

Normality of residuals test
```{r}
hist(m2_2$residuals, probability = TRUE)
lines(density(m2_2$residuals), col = "red")  # skewed to right

qqnorm(m2_2$residuals, main = "QQ Plot of residuals", pch = 20)
qqline(m2_2$residuals, col = "red")

#p-value < 0.05 reject normality assumption
jarque.test(m2_2$residuals)

#p-value < 0.05 reject normality assumption
anscombe.test(m2_2$residuals)
```

> **result**: errors are not normally distributed

> **Note**: The deviation from normality decreased (because we removed 3 outliers)

So the deviation from normality may be due to only a few observations (removing more outliers might make the residuals normal)

```{r}
plot(m2_2)  # Diagnostic Plots
```

How do we detect multicollinearity? Use the VIF (Variance Inflation Factor)

If VIF > 10, multicollinearity is severe (a large VIF is bad)
```{r}
car::vif(m2)
car::vif(m2_2)
```

Scale 'KM' to reduce the VIF
```{r}
train$KM_scaled <- scale(train$KM)  # Z-Normalization
head(train)
```
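
Centering helps because a positive variable and its square move together almost perfectly, while a centered variable and its square do not. With two predictors the VIF reduces to 1 / (1 - r^2), where r is their correlation, so the effect can be sketched in base R. The mileage values below are simulated and `vif_two()` is a hypothetical helper; with strongly skewed data the improvement is smaller than in this symmetric example.

```r
# With exactly two predictors, VIF = 1 / (1 - r^2), r being their correlation.
vif_two <- function(a, b) 1 / (1 - cor(a, b)^2)

set.seed(3)
km_vif <- runif(500, 0, 200000)            # made-up mileage values
km_z   <- as.numeric(scale(km_vif))        # z-scores: (x - mean) / sd

vif_raw    <- vif_two(km_vif, km_vif^2)    # raw term and its square
vif_scaled <- vif_two(km_z, km_z^2)        # centered term and its square
c(raw = vif_raw, scaled = vif_scaled)
```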

## Model 2_3
```{r}
m2_3 <- lm(Price ~ KM_scaled + I(KM_scaled^2), data = train)
summary(m2_3)
plot(m2_3)
car::vif(m2_3)
```

* Model m2 summary:
  + KM and KM^2^ are important variables for explaining 'Price' (together they explain 30% of its variance)
  + R^2^ improved
  + the heteroscedasticity problem was solved
  + the violation of the normality assumption was not solved -> it may be caused by a few outliers (if so, we can remove them to solve the problem)
  + we use KM_scaled and KM_scaled^2^ to prevent the multicollinearity problem

## Model 3: add all variables of the dataset to the model
```{r}
summary(train[,c(4,6,7,9)])
train$FuelType <- factor(train$FuelType, levels = c("Petrol", "Diesel", "CNG"))
train$MetColor <- factor(train$MetColor, levels = c("1", "0"))
train$Automatic <- factor(train$Automatic, levels = c("0", "1"))
train$Doors <- factor(train$Doors, levels = c("3", "5", "4", "2"))

m3 <- lm(Price ~ KM_scaled + I(KM_scaled^2) + Age + FuelType + HP + MetColor + Automatic + CC + Doors + Weight, data = train)
summary(m3)

#1- Heteroscedasticity -> a minor problem
#2- QQ-plot -> serious skewness to the left
#3- same as 1
#4- one observation lies beyond the Cook's distance contour (it has a large effect on the model)
plot(m3)
```

## Model 3_2: 
Revise model m3, keeping the significant variables for predicting 'Price':
```{r}
m3_2 <- lm(Price ~ KM_scaled + I(KM_scaled^2) + Age + FuelType + Automatic + CC + Weight + Doors, data = train)
summary(m3_2)
plot(m3_2)
car::vif(m3_2)
```

## Model 4
Create another variable:
```{r}
train$IfPetrol <- train$FuelType == 'Petrol'  # TRUE for Petrol, FALSE for Diesel and CNG
head(train)
summary(train$IfPetrol)  # merging Diesel and CNG works around the small CNG sample and balances the categories
```

* Remove **Doors** from model (does not have significant effect on Price according to t-test results)
* Put **IfPetrol** instead of **FuelType**
* Remove **HP**
* Remove **MetColor** (p-value is too high)
```{r}
m4 <- lm(Price ~ KM_scaled + I(KM_scaled^2) + Age + IfPetrol + Automatic + CC + Weight, data = train)

summary(m4)
```

> **Adjusted R^2^** did not change -> this confirms that the removed variables were not important for explaining Price

> **F-test**: p-value < 0.05 -> at least one predictor has a significant linear relationship with Price
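
Both tests can be read programmatically from the model summary instead of the printed output. The sketch below uses the built-in `mtcars` dataset purely for illustration; the overall F-test p-value is recomputed with `pf()` from the stored statistic, and the per-coefficient t-test p-values come from the coefficient table.

```r
fit_mt <- lm(mpg ~ wt + hp, data = mtcars)   # built-in dataset, illustration only
s_mt   <- summary(fit_mt)

# Overall F-test: H0 says all slope coefficients are zero
f_mt     <- s_mt$fstatistic
f_pvalue <- unname(pf(f_mt["value"], f_mt["numdf"], f_mt["dendf"],
                      lower.tail = FALSE))

# Per-coefficient t-tests: H0 says this one coefficient is zero
t_pvalues <- s_mt$coefficients[, "Pr(>|t|)"]

f_pvalue
t_pvalues
```

The F-test only says that the predictors are jointly useful; the t-tests then indicate which individual terms carry that effect.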

Check Assumptions of Regression

1. Normality of residuals
```{r}
hist(m4$residuals, freq = F, breaks = 25)
lines(density(m4$residuals), col='red')  # a little skewed to left

qqnorm(m4$residuals, main='QQ Plot of residuals', pch = 20)
qqline(m4$residuals, col = 'red')

#p-value < 0.05 reject normality assumption
jarque.test(m4$residuals)

#p-value < 0.05 reject normality assumption
anscombe.test(m4$residuals)
```

> **result**: Residuals are not normally distributed! (the normality assumption is not confirmed)

Diagnostic Plots
```{r}
#1- Heteroscedasticity: a little
#2- QQ-plot
#3- same as 1
#4- two outliers based on Cook's distance (removing them would noticeably change the regression line)
plot(m4)
```
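
The influential points in the fourth diagnostic plot can also be listed numerically with `cooks.distance()`. The sketch below uses the built-in `mtcars` dataset for illustration; the cutoff of 4 / n is a common rule of thumb assumed here, not a threshold taken from the original analysis.

```r
fit_cd <- lm(mpg ~ wt + hp, data = mtcars)   # built-in dataset, illustration only

cd     <- cooks.distance(fit_cd)             # influence of each observation on the fit
cutoff <- 4 / nrow(mtcars)                   # rule of thumb, not a strict test
influential <- names(cd)[cd > cutoff]
influential
```

The returned names are row names, so they can be used directly to filter the training data.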


> **Model 4_2**: remove some outlier observations to improve the model (to bring the distribution of the model errors closer to normal)

A few outlier observations may be the only cause of the violation of the normality assumption.

```{r}
plot(m4)
```

* Remove observations 112, 491 and 850 (high influence on the regression line based on Cook's distance)
* Remove observations 491, 284 and 293 (based on the QQ-plot)
```{r}
train2 <- train[-which(rownames(train) %in% c(112, 284, 293, 491, 850)), ]  # iteration 1) remove some outliers based-on Diagnostic Plots

m4_2 <- lm(Price ~ KM_scaled + I(KM_scaled^2) + Age + IfPetrol + Automatic + CC + Weight, data = train2)
summary(m4_2)
```

* **R^2^** improved
* **KM^2^** is marginally significant
* **Automatic1** is not significant
* **CC** is not significant

```{r}
plot(m4_2)

train2 <- train[-which(rownames(train) %in% c(112, 114, 284, 293, 491, 544, 850, 948, 1325)),]  # iteration 2) remove some outliers based-on Diagnostic Plots
m4_2 <- lm(Price ~ KM_scaled + I(KM_scaled^2) + Age + IfPetrol + Automatic + CC + Weight, data = train2)
summary(m4_2)
plot(m4_2)

train2 <- train[-which(rownames(train) %in% c(77, 112, 114, 284, 293, 491, 544, 586, 803, 850, 881, 944, 948, 1241, 1325)), ] #iteration 3) remove some outliers based-on Diagnostic Plots
m4_2 <- lm(Price ~ KM_scaled + I(KM_scaled^2) + Age + IfPetrol + Automatic + CC + Weight, data = train2)
summary(m4_2)
```

> **Automatic1** is not significant (has high p-value)

```{r}
plot(m4_2)
```

## Model 4_3: remove Automatic
```{r}
m4_3 <- lm(Price ~ KM_scaled + I(KM_scaled^2) + Age + CC + IfPetrol + Weight, data = train2)
summary(m4_3)
```

* **F-test**: p-value < 0.05
* **R^2^**: 83%
* **t-test**: all variables are significant

Normality of residuals (Check Normality Assumption)
```{r}
hist(m4_3$residuals, breaks = 20, freq = F)
lines(density(m4_3$residuals), col = 'red')

qqnorm(m4_3$residuals, main='QQ Plot of residuals', pch = 20)
qqline(m4_3$residuals, col='red')

#p-value < 0.05 reject normality assumption
jarque.test(m4_3$residuals)

#p-value < 0.05 reject normality assumption
anscombe.test(m4_3$residuals)
```

> **result**: the tests do not reject normality, so the residuals can be treated as normally distributed

Diagnostic plots
```{r}
plot(m4_3)  # all four diagnostic plots look acceptable
```

Check Multicollinearity
```{r}
car::vif(m4_3)  # no value over 10 (or even over 5) -> no multicollinearity problem
```

Compare Two models
```{r}
coef(m4)  # model with outliers
coef(m4_3)  # model without outliers
```

Number of removed cases:
```{r}
dim(train)
dim(train2)
nrow(train)-nrow(train2)  # 15 observations
(nrow(train)-nrow(train2))/nrow(train)*100  # 1.6% of the observations were removed
```

***

# Model Evaluation
## Test the Model 4_3 performance

Data Preparation
```{r}
head(test)
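test$KM_scaled <- scale(test$KM, center = mean(train$KM), scale = sd(train$KM))  # reuse the training mean and sd so KM is on the scale the model was fit on
test$IfPetrol <- test$FuelType == 'Petrol'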
head(test)
dim(test)
```

Prediction
```{r}
test$pred <- predict(m4_3, test)  # input test data into m4_3 model (use model m4_3 to predict Price on Test dataset)
head(test)
```

## Evaluate model performance in Test dataset:

Actual vs. Prediction
```{r}
plot(x = test$Price, y = test$pred, xlab = 'Actual', ylab = 'Prediction')
abline(a = 0, b = 1, col = 'red', lwd = 3)  # compare with the 45-degree line
```

Absolute Error mean, median, sd, max, min
```{r}
abs_err <- abs(test$Price - test$pred)  # absolute error of each prediction

hist(abs_err, breaks = 25)  # distribution of the absolute errors
mean(abs_err)
median(abs_err)
sd(abs_err)
max(abs_err)
min(abs_err)
```

Boxplot (which observations are outliers?)
```{r}
boxplot(abs_err, main = 'Error distribution')
```

Error Percentage median, sd, mean, max, min
```{r}
e_percent <- abs(test$Price - test$pred)/test$Price * 100
summary(e_percent)

sum(e_percent <= 15)/length(e_percent)*100  # share of predicted prices within ±15% of the actual price (84.17% in the reported run)
```
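
The error measures used in this section can be collected in one function so that different models are compared on the same terms. The sketch below is an illustration on made-up prices; `error_report()` and its `tol` argument are hypothetical names, and RMSE is added here as a common companion metric that the original analysis does not report.

```r
# Summarise prediction error; `tol` is the accepted percentage error band.
error_report <- function(actual, pred, tol = 15) {
  abs_err <- abs(actual - pred)
  pct_err <- abs_err / actual * 100
  c(MAE        = mean(abs_err),                    # mean absolute error
    RMSE       = sqrt(mean((actual - pred)^2)),    # root mean squared error
    MAPE       = mean(pct_err),                    # mean absolute percentage error
    within_tol = mean(pct_err <= tol) * 100)       # % of predictions inside the band
}

actual_err <- c(10000, 12000, 8000, 15000)   # made-up actual prices
pred_err   <- c(11000, 11400, 9600, 15000)   # made-up predictions
report_err <- error_report(actual_err, pred_err)
print(report_err)
```

For these four cars the absolute errors are 1000, 600, 1600 and 0, so the mean absolute error is 800 and three of the four predictions (75%) fall within the 15% band.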

> **Conclusion**: Model 4_3 generalizes well to the Test dataset, so we consider it a robust model

***

For more information, check the [GitHub](https://github.com/mamomen1996/R_CS_03) repository.