---
title: "Classification of Exercise Quality based on Machine Learning Model"
author: "Hafiz Mohammad Sohaib Naim Nizami"
output:
  html_document:
    keep_md: yes
---
<hr>
<!-- defining writing style -->
<style>
body {
  text-align: justify;
}
</style>
## 1. Executive Summary
This study analyzes data generated by the sensors of wearable accessories while an exercise is performed, in order to identify the quality of the exercise. The exercise monitored is dumbbell lifting. Several participants volunteered for the study and were instructed and supervised to perform the exercise in five different ways: one according to standard practice, which is the correct way of performing the exercise, and four others that deviate from the standard and are intentionally flawed. The aim is to classify the quality of the exercise by applying machine learning algorithms to the sensor data, so that such models can later be deployed in wearable devices to help people identify whether they are performing an exercise correctly. The model built here identifies the quality with up to 99% accuracy.
## 2. Investigation
This section presents the elements of the study and details the investigation carried out for model building and prediction.
### 2.1 Problem Statement
As stated in the requirements in Coursera's project board: "The goal of your project is to predict the manner in which they did the exercise. This is the "classe" variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases."
Questions to be answered:
1. How did you build your model?
2. How did you use cross-validation?
3. What do you think the expected out-of-sample error is?
4. Why did you make the choices you did?
### 2.2 Exploratory Data Analysis
Loading the necessary libraries and data.
```{r loadingNecData, message=FALSE, warning=FALSE, cache=TRUE, results='hide'}
# loading necessary libraries
library(dplyr)
library(caret)
library(DataExplorer)
library(ggplot2)
library(data.table)
# downloading datasets, if not downloaded already
if (!file.exists("./pml-training.csv")) {
  download.file(url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                destfile = "./pml-training.csv", method = "curl")
}
if (!file.exists("./pml-testing.csv")) {
  download.file(url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                destfile = "./pml-testing.csv", method = "curl")
}
```
Checking and removing NAs.
```{r preProcessing, cache = TRUE}
pml <- read.csv("./pml-training.csv", na.strings = c("NA", "", " ", "#DIV/0!"))
# removing columns that contained just NAs
nas <- sapply(pml, function(x) sum(is.na(x)))
summary(nas[nas > 0])
nacols <- which(nas > 0)
names(nacols) <- NULL
# checking how much of the data is missing from these columns
(nas[nas > 0]/nrow(pml))*100
# it appears these columns consist almost entirely of NAs
# removing columns that contain NAs throughout
pml <- pml[, -nacols]
summary(sapply(pml, function(x) sum(is.na(x))))
# since the classification doesn't depend on timestamp data (as noted in the study paper), we omit it from our dataset
# removing unnecessary covariates
pml <- pml[, -c(1:7)]
pml$classe <- as.factor(pml$classe)
str(pml)
```
Here, the first seven columns are removed from the dataset because they contain information that does not contribute to the classification problem.
- The timestamp data is irrelevant because we analyze the problem based on sensor measurements rather than performing a time-based analysis; keeping it would only distort the results.
- The user_name and window columns merely identify the study subject and mark the exercise window, respectively. A name-based alternative for dropping these columns is sketched below.
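For illustration, the same columns could also be dropped by name rather than by position; a minimal sketch (not part of the chunk above, and assuming the raw column names of the dataset) is:
```{r dropByName, eval=FALSE}
# alternative sketch (not run): drop the bookkeeping columns by name rather than by position
idCols <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
            "cvtd_timestamp", "new_window", "num_window")
pml <- pml[, !(names(pml) %in% idCols)]
```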
### 2.3 Modelling
In this section, we will evaluate different models and select the one that best suits our data.
Splitting the data into training and testing sets.
```{r dataPartition, cache = TRUE}
set.seed(4)
inTrain <- createDataPartition(pml$classe,
                               p = 0.8,
                               list = FALSE)
training <- pml[inTrain,]
testing <- pml[-inTrain,]
```
Next, we plot the correlations between the features in the dataset.
```{r corrPlot, fig.align='center', fig.height=8, fig.width=8, cache=TRUE}
plot_correlation(training)
```
This plot shows that the correlation between features varies considerably, and that some pairs of features are highly correlated. This can be detrimental to the analysis, since uncorrelated covariates generally give better results. There are several ways to deal with this problem: we can identify the highly correlated variables and omit one from each pair (a sketch follows below), or we can use tree-based algorithms, which are largely unaffected by correlated predictors.
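For instance, the findCorrelation() function in caret flags one member of each highly correlated pair; a sketch of that optional filtering step (not applied to the models fitted below) might look like:
```{r corrFilter, eval=FALSE}
# sketch (not used below): keep only predictors whose pairwise |correlation| is <= 0.9
predCols <- setdiff(names(training), "classe")
corMat   <- cor(training[, predCols])
highCorr <- predCols[findCorrelation(corMat, cutoff = 0.9)]
trainingFiltered <- training[, c(setdiff(predCols, highCorr), "classe")]
dim(trainingFiltered)
```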
Let's train several models on our data to identify which one performs best.
Fitting a Decision Tree model.
```{r decisionTree, cache = TRUE}
# decision tree model
modDC <- train(classe ~ ., data = training, method = "rpart")
```
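A side note on trainControl(allowParallel = TRUE), used in several chunks below: it only speeds up training if a parallel backend has been registered beforehand. A minimal sketch using the doParallel package (an assumption; it is not loaded elsewhere in this report) would be:
```{r parallelSetup, eval=FALSE}
# sketch: register a parallel backend so that allowParallel = TRUE takes effect
library(doParallel)
cl <- makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)
# ... fit the models ...
stopCluster(cl)
```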
Fitting a Random Forest model.
```{r randomForest, cache = TRUE}
# random forest model
modRF <- train(classe ~ ., data = training, method = "rf", trControl = trainControl(allowParallel = TRUE))
```
Fitting another Random Forest model based on "ranger" package.
```{r ranger, cache = TRUE}
modRANGER <- train(classe ~ ., data = training, method = "ranger", trControl = trainControl(allowParallel = TRUE))
```
Fitting a Gradient Boosted Machine model.
```{r gardientBoosted, cache = TRUE}
# gradient boosting model
modGBM <- train(classe ~ ., data = training, method = "gbm", verbose = FALSE, trControl = trainControl(allowParallel = TRUE))
```
Fitting a Logistic Regression model.
```{r logisticRegression, message = FALSE, warning = TRUE, cache = TRUE, results='hide'}
# logistic regression
modGLM <- train(classe ~ ., data = training, method = "multinom", trControl = trainControl(allowParallel = TRUE))
```
Fitting a Naive Bayes model.
```{r naiveBayes, message=FALSE, warning=FALSE, cache = TRUE, results='hide'}
# naive bayes model
modNB <- train(classe ~ ., data = training, method = "nb", verbose = FALSE, trControl = trainControl(allowParallel = TRUE))
```
Fitting a Linear Discriminant model.
```{r linearDA, cache = TRUE}
# linear discriminant model
modLDA <- train(classe ~ ., data = training, method = "lda")
```
Now, checking the accuracies of these models on the training data set.
```{r accuracies, warning=FALSE, cache=TRUE}
models <- list(modDC, modRF, modRANGER, modGBM, modGLM, modNB, modLDA)
accuracies <- sapply(models, function(x) {
  confusionMatrix(predict(x, training), training$classe)$overall[1]
})
names(accuracies) <- c("modDC", "modRF", "modRANGER", "modGBM", "modGLM", "modNB", "modLDA")
accuracies
```
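Because these accuracies are computed on the same data the models were fit on, they may be optimistic; random forests in particular can fit the training set almost perfectly. A quick cross-check on the held-out testing set, assuming the fitted models above are still in memory, could be:
```{r accuraciesTest, eval=FALSE}
# sketch: the same comparison, evaluated on the held-out testing set
accuraciesTest <- sapply(models, function(x) {
  confusionMatrix(predict(x, testing), testing$classe)$overall[1]
})
names(accuraciesTest) <- names(accuracies)
accuraciesTest
```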
Evidently, the random forest models perform best on the training dataset, so we will build our final model with the random forest algorithm. Since fitting these models is computationally intensive, we will move forward with just the "ranger"-based random forest model.
Following this, we will rebuild the model using repeated 10-fold cross-validation as the resampling method.
```{r ranger2, cache = TRUE, message=FALSE, results='hide', warning=FALSE}
# random forest model with 10-fold CV
modRANGER2 <- train(classe ~ .,
                    data = training,
                    method = "ranger",
                    importance = 'impurity',
                    trControl = trainControl(method = "repeatedcv",
                                             number = 10,
                                             repeats = 10,
                                             allowParallel = TRUE))
```
This is the model we will use on the test data. Let's have a look at it.
```{r ranger2Overview, cache = TRUE}
modRANGER2
```
Cross-validation has automatically tuned the model and selected near-optimal parameters. The selected parameters are:
```{r bestTune, cache = TRUE}
modRANGER2$bestTune
```
These parameters were selected from the cross-validated tuning results shown in the following plot.
```{r modelPlot, cache = TRUE}
plot(modRANGER2)
```
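If one preferred to skip the tuning search on later runs, the selected combination could be fixed directly via the tuneGrid argument; a hypothetical sketch reusing bestTune from above is:
```{r fixedTune, eval=FALSE}
# sketch: refit with the single best parameter combination, skipping the tuning search
modRANGERfixed <- train(classe ~ .,
                        data = training,
                        method = "ranger",
                        importance = 'impurity',
                        tuneGrid = modRANGER2$bestTune,
                        trControl = trainControl(method = "none"))
```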
To investigate which covariates are the most important, we use the varImp() function.
```{r varImp, cache = TRUE}
varImp(modRANGER2)
```
This can be viewed graphically as:
```{r varImpplot, cache = TRUE, fig.align='center', fig.height=7}
varimp <- varImp(modRANGER2)$importance
varimp$variable <- rownames(varimp)
names(varimp)[1] <- "importance"
g <- ggplot(varimp, aes(x = importance, y = reorder(variable, importance), fill = importance))
g + geom_bar(stat = "identity", position = "dodge") +
xlab("Importance") +
ylab('Features') +
labs(title = "Variable Importance Plot") +
guides(fill = F) +
scale_fill_gradient(low = "red", high = "dodgerblue")
```
### 2.4 Performance Evaluation
In this section we examine the performance of our model on the test data that we set aside at the start of the previous section.
```{r testData, cache = TRUE}
confusionMatrix(predict(modRANGER2, testing), testing$classe)
```
This shows that our model classifies the observations in the test data quite accurately.
<br>
Let's compare the error estimate reported by the model itself with the error we obtain on the test data.
```{r errorsEval, cache = TRUE}
# the error estimated by the model itself: the out-of-bag (OOB) error,
# which is already a good estimate of out-of-sample error
modelEstError <- modRANGER2$finalModel$prediction.error
print(paste(round(modelEstError*100, 4), "%", sep = ""), quote = FALSE)
# now, getting the error we get by applying model on our test data
testError <- confusionMatrix(predict(modRANGER2, testing), testing$classe)$overall[1]
print(paste(round((1-testError)*100, 4), "%", sep = ""), quote = FALSE)
```
Notably, the test-set error is quite close to the error the model itself estimated, which further supports the reliability of the model.
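The project brief also asks for predictions on 20 held-out cases. Applying the final model to pml-testing.csv (a sketch only; those graded cases are not evaluated in this report) could look like:
```{r quizPredictions, eval=FALSE}
# sketch: apply the final model to the 20 graded cases, using the same NA handling as before
pmlQuiz <- read.csv("./pml-testing.csv", na.strings = c("NA", "", " ", "#DIV/0!"))
predict(modRANGER2, newdata = pmlQuiz)
```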
## 3. Conclusion
To summarize, on the basis of our model fitting, we can conclude that the following variables are the most significant for quantifying the quality of the exercise.
```{r impVars, cache = TRUE, echo=FALSE}
varimp[order(varimp$importance, decreasing = TRUE),][1:20, 2]
```
Furthermore, the Random Forest algorithm performs quite well on data with highly correlated covariates.
<hr>