---
title: "R Notebook"
output: html_notebook
---
```{r}
# set the working directory to the folder containing the CSV files (machine-specific path)
setwd("C:/Users/Veeru/Desktop/ML/linear regression")
getwd()
```
```{r}
train2 <- read.csv("TrainData_Group2.csv",header=TRUE, sep = ",")
str(train2)
```
##### The training dataset is saved in "train2"; its structure shows 1000 rows and 6 variables. A linear model will be constructed using "train2" as the training set and then used to predict the values of the Y variable in the testing dataset.
```{r}
sum(is.na(train2))
```
##### checking for NA's in the dataset
```{r}
library(ggplot2)
library(cowplot)
plot1 <- ggplot(train2, aes(X1, Y)) + geom_point(alpha=0.1)
plot2 <- ggplot(train2, aes(X2, Y)) + geom_point(alpha=0.1)
plot3 <- ggplot(train2, aes(X3, Y)) + geom_point(alpha=0.1)
plot4 <- ggplot(train2, aes(X4, Y)) + geom_point(alpha=0.1)
plot5 <- ggplot(train2, aes(X5, Y)) + geom_point(alpha=0.1)
cowplot::plot_grid(plot1,plot2,plot3,plot4,plot5)
```
##### From the plots above it is visible that only X1 shares a linear relationship with Y; the remaining variables X2 to X5 are scattered with no obvious pattern.
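##### As a quick numeric complement to the plots, the correlation of each predictor with Y can be checked; a minimal sketch, assuming the columns are named X1 to X5 and Y as shown by str(train2):
```{r}
# correlation of each predictor with Y; only X1 is expected to show a strong value
round(cor(train2[, paste0("X", 1:5)], train2$Y), 3)
```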
```{r}
x <- as.matrix(train2[-ncol(train2)])
int <- rep(1, nrow(x))
x <- cbind(int, x)
y <- train2$Y
#checking for x matrix variables
colnames(x)
# checking y is a vector having 1000 observations
length(y)
```
##### "x" and "y" has to be in matrix form for matrix multiplication "%*%"
##### "x" matrix has now only variables from X1 to X5 of "train2"
##### "y" vector has only y variable from "train2"
##### The linear regression equation is y=B0 + B1x1 +B2x2 + B3X3 + B4x4 + B5x5. Betas are the co-efficients, one for each variable in matrix "x", thus i have B1,B2,B3,B4,B5 computed from below formula and B0 is the intercept. These parameters are the linear relationship between response variable which is "y" and predictor variables which are from "X1" to "X5".
```{r, error=TRUE}
# this first attempt fails with a "non-conformable arguments" error:
# t(x) %*% y is (6, 1) and solve(t(x) %*% x) is (6, 6), so they cannot be multiplied in this order
betas <- t(x) %*% y %*% solve(t(x) %*% x)
```
##### As per http://faculty.cas.usf.edu/mbrannick/regression/Reg2IV.html, the coefficients of a multivariable regression follow the generic form:
##### (summation of x.y) / (summation of x^2)
##### which in matrix notation is the normal equation betas = (X'X)^-1 X'y.
##### The rule for matrix multiplication is that the number of columns of the first matrix must match the number of rows of the second matrix; otherwise the error shown above is generated.
##### Breaking down the equation:
##### A.) summation of x.y: this is t(x) %*% y. The dimension of matrix "x" is (1000, 6), i.e. rows by columns, and that of "y" is (1000, 1), so their product in that order is not possible. Transposing "x" changes its dimension to (6, 1000), and the product t(x) %*% y then has dimension (6, 1). The element-wise products are added up during the multiplication, which is where the summation happens.
##### B.) summation of x^2: this is solve(t(x) %*% x). The solve() function solves a system of linear equations and, with a single argument, returns the inverse of the given matrix. Mathematically the inverse of a number is 1 divided by that number, e.g. the inverse of 10 is 1/10 or 10^-1.
##### Since the summation of x^2 appears as the denominator in the betas formula, solve() is used to produce the inverse of the x'x matrix.
##### x.x is x^2, and for the matrix version the transpose is used: t(x) %*% x has dimension (6, 1000) x (1000, 6) = (6, 6), so solve(t(x) %*% x) is also (6, 6).
##### Putting the parts together, dim(A.) x dim(B.) = (6, 1) x (6, 6); since the number of columns of part A. does not match the number of rows of part B., the multiplication is not possible in that order. Transposing part A. to (1, 6) makes the multiplication possible.
```{r}
betas <- t(t(x) %*% y) %*% solve(t(x) %*% x)
colnames(betas) <- c("B0", "B1", "B2", "B3", "B4", "B5")
```
##### Changing the column names for better understanding.
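##### For reference, the same coefficients can also be obtained in the more common column-vector form of the normal equation; a minimal sketch reusing the "x" and "y" objects built above:
```{r}
# conventional ordering: (X'X)^-1 X'y gives a (6, 1) column vector of coefficients
betas_col <- solve(t(x) %*% x) %*% (t(x) %*% y)
# this is simply the transpose of the (1, 6) "betas" computed above
all.equal(as.numeric(betas_col), as.numeric(betas))
```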
```{r}
linearmodel <- lm(Y~.,data=train2)
```
##### Comparing the beta values with those generated by lm().
```{r}
comparison <- data.frame(B = t(betas), lm = linearmodel$coefficients)
print(comparison)
```
##### The coefficients are the same.
```{r}
B0 <- mean(y)-(betas[,2]*mean(x[,2]))-(betas[,3]*mean(x[,3]))-(betas[,4]*mean(x[,4]))-(betas[,5]*mean(x[,5]))-(betas[,6]*mean(x[,6]))
B0_values <- data.frame(B0_formula = B0[1], lm_intercept = linearmodel$coefficients[1], betas_B0 = betas[,1])
row.names(B0_values) <- NULL
B0_values
```
##### Calculating B0 using the formula B0 = mean(Y) - B1*mean(X1) - B2*mean(X2) - B3*mean(X3) - B4*mean(X4) - B5*mean(X5), which is explained in the same website link above. With each slope multiplied by the mean of its own predictor, the B0 obtained this way matches both the intercept produced by the linear model and the "B0" column of the betas matrix, as the table above shows.
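##### The same centering formula can be written more compactly with colMeans(); a minimal sketch reusing "x", "y" and "betas" from above:
```{r}
# B0 = mean(Y) - sum of Bj * mean(Xj) over the five predictors
B0_compact <- mean(y) - sum(betas[, 2:6] * colMeans(x[, 2:6]))
B0_compact
```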
```{r}
y_model_train <- rep(0,nrow(train2))
for (i in 1:nrow(train2)){
y_model_train[i] <- betas[1] + betas[2]*train2$X1[i] + betas[3]*train2$X2[i] + betas[4]*train2$X3[i] + betas[5]*train2$X4[i] + betas[6]*train2$X5[i]
}
str(y_model_train)
```
##### The calculation above applies the linear regression equation Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 + B5*X5. The beta values computed above are the same as the output of R's lm() function. These betas are used to predict the "Y" variable of the same train2 dataset. The intention is to compare the actual "Y" values with "y_model_train" and judge the accuracy of the betas by computing R squared.
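##### The explicit loop above can also be replaced by a single matrix multiplication of the design matrix "x" with the coefficient vector; a minimal sketch:
```{r}
# each row of x (intercept + X1..X5) times the coefficients gives the fitted value
y_model_train_vec <- as.numeric(x %*% t(betas))
all.equal(y_model_train_vec, y_model_train)
```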
```{r}
res2 <- train2$Y - y_model_train
ss_res2 <- sum(res2**2)
ss_tot2 <- sum((y-mean(y))**2)
r_sq2 <- 1-(ss_res2/ss_tot2)
r_sq2
```
##### The residuals "res2" are found by subtracting the predicted values "y_model_train" from the actual values train2$Y; these are the errors made by fitting a line through the data points.
##### R squared is computed as 1 - SS_res/SS_tot and should be close to one for a good fit of the linear model.
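##### As an independent check, for a least-squares fit with an intercept R squared also equals the squared correlation between the fitted and actual values; a minimal sketch:
```{r}
# should match r_sq2 computed above
cor(y_model_train, train2$Y)^2
```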
```{r}
summary(linearmodel)
```
##### summary(linearmodel) shows the coefficients and also the R squared value.
```{r}
comparison_rsq <- data.frame(r_sq2, r.squared = summary(linearmodel)$r.squared)
comparison_rsq
```
##### r_sq2 is the R squared calculated by hand, without lm(), whereas r.squared comes from the lm() summary. Both values are the same, which shows the calculations are accurate, and since R squared is close to 1 the linear model fits well. The same beta values will therefore be used to predict the "Y" variable of the "test2" dataset.
```{r}
plot(y_model_train,train2$Y)
```
##### There is a strong correlation between the model's predictions and the actual Y values of train2.
```{r}
test2 <- read.csv("TestData_Group2.csv",header=TRUE, sep = ",")
str(test2)
```
```{r}
sum(is.na(test2))
```
##### Checking for NA values, if any.
```{r}
nrow(test2)
```
##### The coefficient values B0...B5 were calculated from the training dataset; these values are now used with the testing dataset to predict the unknown values. Here the known values are the predictors, variables X1 to X5 of the testing dataset, and "Y" is predicted with the formula Y = B0 + B1*X1 + B2*X2 + ... + B5*X5, where B0 is the y-intercept and B1 to B5 are the slopes. The formula is written inside a for loop so that each observation is computed at every iteration and the predictions are stored in y_model_test2.
```{r}
y_model_test2 <- rep(0,nrow(test2))
for (i in 1:nrow(test2)){
y_model_test2[i] <- betas[1] + betas[2]*test2$X1[i] + betas[3]*test2$X2[i] + betas[4]*test2$X3[i] + betas[5]*test2$X4[i] + betas[6]*test2$X5[i]
}
y_model_test2
```
```{r}
predict2 <- predict(linearmodel,test2)
str(predict2)
```
##### The test2 dataset does not have Y values. Using the predict() function, the Y values are predicted by first training "linearmodel" on the train2 dataset and then applying this model to the test2 dataset.
```{r}
plot(predict2,y_model_test2)
```
##### The plot above shows that the "Y" values predicted with predict() are the same as the "Y" values computed with the hand-coded betas.
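##### Beyond the visual check, the two sets of predictions can also be compared numerically; a minimal sketch:
```{r}
# predict() returns a named vector, so drop the names before comparing
all.equal(as.numeric(predict2), y_model_test2)
```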
##### Putting it all together: constructing the linear model and predicting "Y" for the Group1 training and testing datasets.
```{r}
# initializing the required libraries
library(ggplot2)
library(cowplot)
# loading the Group1 training and testing datasets
train1 <- read.csv("TrainData_Group1.csv",header=TRUE, sep = ",")
test1 <- read.csv("TestData_Group1.csv",header=TRUE, sep = ",")
# plotting Y vs X variables to look at their linear relationship
plota <- ggplot(train1, aes(X1, Y)) + geom_point(alpha=0.1)
plotb <- ggplot(train1, aes(X2, Y)) + geom_point(alpha=0.1)
plotc <- ggplot(train1, aes(X3, Y)) + geom_point(alpha=0.1)
plotd <- ggplot(train1, aes(X4, Y)) + geom_point(alpha=0.1)
plote <- ggplot(train1, aes(X5, Y)) + geom_point(alpha=0.1)
plots <- cowplot::plot_grid(plota,plotb,plotc,plotd,plote)
print(plots)
# creating X and Y matrix
x <- as.matrix(train1[-ncol(train1)])
int <- rep(1, nrow(x))
x <- cbind(int, x)
y <- train1$Y
# calculating the betas/coefficients by hand, fitting the linear model with lm(), and comparing the betas from both methods
betas <- t(t(x) %*% y) %*% solve(t(x) %*% x)
colnames(betas) <- c("B0", "B1", "B2", "B3", "B4", "B5")
linearmodel <- lm(Y~.,data=train1)
comparison <- data.frame(B = t(betas), lm = linearmodel$coefficients)
print(comparison)
# The newly predicted "y" variable is compared with actual "y" variable of train1 dataset.
y_model_train <- rep(0,nrow(train1))
for (i in 1:nrow(train1)){
y_model_train[i] <- betas[1] + betas[2]*train1$X1[i] + betas[3]*train1$X2[i] + betas[4]*train1$X3[i] + betas[5]*train1$X4[i] + betas[6]*train1$X5[i]
}
res1 <- train1$Y - y_model_train
ss_res1 <- sum(res1**2)
ss_tot1 <- sum((y-mean(y))**2)
r_sq1 <- 1-(ss_res1/ss_tot1)
comparison_rsq <- data.frame(r_sq1, r.squared = summary(linearmodel)$r.squared)
print(comparison_rsq)
plot(y_model_train,train1$Y)
# using beta values to predict "y" variable for test1 dataset.
y_model_test1 <- rep(0,nrow(test1))
for (i in 1:nrow(test1)){
y_model_test1[i] <- betas[1] + betas[2]*test1$X1[i] + betas[3]*test1$X2[i] + betas[4]*test1$X3[i] + betas[5]*test1$X4[i] + betas[6]*test1$X5[i]
}
predict1 <- predict(linearmodel,test1)
plot(predict1,y_model_test1)
```
Reference links:

- https://datascienceplus.com/linear-regression-from-scratch-in-r/ (linear regression from scratch algorithm)
- https://www.youtube.com/watch?v=NUXdtN1W1FE (linear regression)
- https://www.youtube.com/watch?v=J_LnPL3Qg70&t=408s (linear regression in Python)