---
title: "Assignment 2 - Language Development in ASD - Making predictions"
author: "Riccardo Fusaroli"
date: "August 9, 2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Welcome to the second exciting part of the Language Development in ASD exercise
In this exercise we will delve more deeply into different practices of model comparison and model selection, first evaluating your models from last time against some new data. Does the model generalize well?
Then we will learn to do better by cross-validating models and systematically comparing them.
The questions to be answered (in a separate document) are:
1- Discuss the differences in performance of your model in training and testing data
2- Which individual differences should be included in a model that maximizes your ability to explain/predict new data?
3- Predict a new kid's performance (Bernie) and discuss it against expected performance of the two groups
## Learning objectives
- Critically appraise the predictive framework (contrasted to the explanatory framework)
- Learn the basics of machine learning workflows: training/testing, cross-validation, feature selections
## Let's go
N.B. There are several datasets for this exercise, so pay attention to which one you are using!
1. The (training) dataset from last time (the awesome one you produced :-) ).
2. The (test) datasets on which you can test the models from last time:
* Demographic and clinical data: https://www.dropbox.com/s/ra99bdvm6fzay3g/demo_test.csv?dl=1
* Utterance Length data: https://www.dropbox.com/s/uxtqqzl18nwxowq/LU_test.csv?dl=1
* Word data: https://www.dropbox.com/s/1ces4hv8kh0stov/token_test.csv?dl=1
### Exercise 1) Testing model performance
How did your models from last time perform? In this exercise you have to compare the results on the training data and on the test data. Report both, compare them, and discuss why they differ.
- recreate the models you chose last time (just write the model code again and apply it to your training data (from the first assignment))
- calculate performance of the model on the training data: root mean square error is a good measure. (Tip: google the function rmse())
- create the test dataset (apply the code from assignment 1 to clean up the 3 test datasets)
- test the performance of the models on the test data (Tip: google the function "predict()")
- optional: predictions are never certain; can you identify the uncertainty of the predictions? (e.g. google the function predictInterval() from the merTools package)
```{r, include = FALSE}
setwd("C:/Users/louis/OneDrive - Aarhus universitet/AU Onedrive - RIGTIG/- 3. Semester/Experimental Methods III/Classes/Assignment 2/Assignment-2")
# Loading packages (lme4/lmerTest are needed for lmer() below)
library(pacman)
pacman::p_load(tidyverse, lme4, lmerTest, caret, cvms, ModelMetrics)
## Clean up function, included to inspire you
CleanUpData <- function(Demo, LU, Word){
  Speech <- merge(LU, Word) %>%
    rename(
      Child.ID = SUBJ,
      Visit = VISIT) %>%
    mutate(
      Visit = as.numeric(str_extract(Visit, "\\d")),
      Child.ID = gsub("\\.", "", Child.ID) # Replaces dots with nothing
    ) %>%
    dplyr::select(
      Child.ID, Visit, MOT_MLU, CHI_MLU, types_MOT, types_CHI, tokens_MOT, tokens_CHI
    )
  Demo <- Demo %>%
    dplyr::select(
      Child.ID, Visit, Ethnicity, Diagnosis, Gender, Age, ADOS, MullenRaw, ExpressiveLangRaw, Socialization
    ) %>%
    mutate(
      Child.ID = gsub("\\.", "", Child.ID) # Replaces dots with nothing
    )
  Data <- merge(Demo, Speech, all = T)
  Data1 <- Data %>%
    subset(Visit == "1") %>%
    dplyr::select(Child.ID, ADOS, ExpressiveLangRaw, MullenRaw, Socialization) %>%
    rename(Ados1 = ADOS,
           verbalIQ1 = ExpressiveLangRaw,
           nonVerbalIQ1 = MullenRaw,
           Socialization1 = Socialization)
  Data <- merge(Data, Data1, all = T) %>%
    mutate(
      Child.ID = as.numeric(as.factor(as.character(Child.ID))), # Anonymising participants
      Visit = as.numeric(as.character(Visit)),
      Gender = recode(Gender,
                      "1" = "M",
                      "2" = "F"),
      Diagnosis = recode(Diagnosis,
                         "A" = "TD",
                         "B" = "ASD")
    )
  return(Data)
}
# Load training Data
Demo_train <- read.csv("demo_train.csv")
LU_train <- read.csv("LU_train.csv")
Word_train <- read.csv("token_train.csv")
# Using the function
df_train <- CleanUpData(Demo_train, LU_train, Word_train)
#- recreate the models you chose last time (just write the code again and apply it to Train Data)
m1 <- lmer(CHI_MLU ~ Visit * Diagnosis * verbalIQ1 + (1 + Visit|Child.ID), df_train, REML = F)
summary(m1)
#- calculate performance of the model on the training data: root mean square error is a good measure. (Tip: google the function rmse())
actual <- na.omit(df_train$CHI_MLU) # ignores the NAs
prediction <- predict(m1) # m1 predicts CHI_MLU in this model
prediction
ModelMetrics::rmse(actual, prediction) # How is the performance? 0.36 (the closer to zero, the better)
# Average error on a kid (that we have seen already) is 0.36 morphemes. You could assess the SD(CHI_MLU) as well...
#- create the test dataset (apply the code from assignment 1 or my function to clean up the 3 test datasets)
# Test data
# Loading test data
Demo_test <- read.csv("demo_test.csv")
LU_test <- read.csv("LU_test.csv")
Word_test <- read.csv("token_test.csv")
# Cleaning the data
df_test <- CleanUpData(Demo_test, LU_test, Word_test)
#- test the performance of the models on the test data (Tips: google the functions "predict()")
actual <- na.omit(df_test$CHI_MLU) # ignores the NAs
prediction <- predict(m1, subset(df_test, !is.na(CHI_MLU))) # m1 predicts CHI_MLU in this model
ModelMetrics::rmse(actual, prediction) # How is the performance? (the closer to zero, the better; this is okay (0.66), but still worse than on the training set)
# This value (root mean squared error) is bigger than on the training set, meaning that the model might be overfitted to the training set.
#- optional: predictions are never certain; can you identify the uncertainty of the predictions? (e.g. google predictInterval())
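#   One option: a sketch using merTools::predictInterval() (assumption: the merTools package is
#   available; it simulates from the fitted model to put a lower/upper bound around each prediction).
#   Rows with missing predictors may need to be dropped first.
pacman::p_load(merTools)
pred_int <- merTools::predictInterval(m1,
                                      newdata = subset(df_test, !is.na(CHI_MLU)),
                                      n.sims = 1000, level = 0.95)
head(pred_int) # Columns fit, upr, lwr: predicted CHI_MLU with a 95% interval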
```
[HERE GOES YOUR ANSWER]
### Exercise 2) Model Selection via Cross-validation (N.B: ChildMLU!)
One way to reduce bad surprises when testing a model on new data is to train the model via cross-validation.
In this exercise you have to use cross-validation to calculate the predictive error of your models and use this predictive error to select the best possible model.
- Use cross-validation to compare your model from last week with the basic model (Child MLU as a function of Time and Diagnosis, and don't forget the random effects!)
- (Tip: google the function "createFolds"; loop through each fold, train both models on the other folds and test them on the held-out fold)
- Now try to find the best possible predictive model of ChildMLU, that is, the one that produces the best cross-validated results.
- Bonus Question 1: What is the effect of changing the number of folds? Can you plot RMSE as a function of number of folds?
- Bonus Question 2: compare the cross-validated predictive error against the actual predictive error on the test data
```{r}
# Cross-validation: cut the training dataset into a number of slices (folds). Train on all slices except one, test on the held-out slice, and save the test performance. Then repeat for the next slice (train and test again). Five folds is a standard choice. This gives a good measure of how well the model will generalise to new data without actually using new data. If the five values are very far from each other, the model is not doing too well.
# The cvms package does the cross-validation and the calculation of performance for you (a sketch follows below). Still, try to start by making your own loop - you learn from that - but do not spend too much time on it; it is primarily a learning exercise.
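# A sketch of the cvms workflow (assumption: recent versions of cvms and groupdata2;
# argument names may differ in older releases):
pacman::p_load(groupdata2)
folded <- df_train %>%
  filter(!is.na(CHI_MLU)) %>%
  mutate(Child.ID = as.factor(Child.ID)) %>%      # fold() expects a factor ID column
  groupdata2::fold(k = 5, id_col = "Child.ID")    # Adds a .folds column, keeping each child in a single fold
cv_res <- cvms::cross_validate(folded,
                               formulas = "CHI_MLU ~ Visit * Diagnosis + (1 + Visit|Child.ID)",
                               fold_cols = ".folds",
                               family = "gaussian")
cv_res$RMSE # Cross-validated RMSE, comparable to the hand-rolled loop below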
#- Create the basic model of ChildMLU as a function of Time and Diagnosis (don't forget the random effects!).
m0 <- lmer(CHI_MLU ~ Visit * Diagnosis + (1 + Visit|Child.ID), df_train, REML = F)
summary(m0)
#- Make a cross-validated version of the model. (Tips: google the function "createFolds"; loop through each fold, train a model on the other folds and test it on the fold)
set.seed(1)
folds <- createFolds(unique(df_train$Child.ID), 5) # Returns indices into the unique-ID vector; after anonymisation the IDs are 1:N, so indices and IDs coincide
rmse1 <- c() # Empty vector to collect the RMSE of each fold
for (f in folds){
  train <- subset(df_train, !(Child.ID %in% f)) # Train data
  model <- lmer(CHI_MLU ~ Visit * Diagnosis + (1 + Visit|Child.ID), data = train, REML = F,
                control = lmerControl(optCtrl = list(xtol_abs = 1e-8, ftol_abs = 1e-8)))
  actual <- na.omit(subset(df_train, (Child.ID %in% f))) # Test data (held-out fold)
  predictions <- predict(model, actual, allow.new.levels = T)
  print(ModelMetrics::rmse(actual$CHI_MLU, predictions))
  rmse1 <- c(rmse1, ModelMetrics::rmse(actual$CHI_MLU, predictions)) # Collect the fold RMSEs
}
#- Report the results and comment on them.
mean(rmse1)
# The fold RMSEs are relatively close to each other, which is a good sign. The average RMSE of this particular cross-validation is 0.38.
sd(na.omit(df_train$CHI_MLU))
# Standard deviation = 0.93, meaning that the model predicts better than simply using the mean of the data (the RMSE of a constant mean prediction is roughly the SD).
#- Now try to find the best possible predictive model of ChildMLU, that is, the one that produces the best cross-validated results.
rmse0 <- c() # Empty vector to collect the RMSE of each fold
for (f in folds){
  train <- subset(df_train, !(Child.ID %in% f)) # Train data
  model <- lmer(CHI_MLU ~ Visit + Diagnosis + (1 + Visit|Child.ID), data = train, REML = F,
                control = lmerControl(optCtrl = list(xtol_abs = 1e-8, ftol_abs = 1e-8)))
  actual <- na.omit(subset(df_train, (Child.ID %in% f))) # Test data (held-out fold)
  predictions <- predict(model, actual, allow.new.levels = T)
  print(ModelMetrics::rmse(actual$CHI_MLU, predictions))
  rmse0 <- c(rmse0, ModelMetrics::rmse(actual$CHI_MLU, predictions)) # Collect the fold RMSEs
}
# Mean
mean(rmse0)
# Bonus Question 1: What is the effect of changing the number of folds? Can you plot RMSE as a function of number of folds?
# Bonus Question 2: compare the cross-validated predictive error against the actual predictive error on the test data
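# A sketch for Bonus Question 1 (assumes the same data and packages as above): wrap the
# cross-validation in a function of the number of folds k and plot the mean RMSE.
cv_rmse <- function(k){
  folds_k <- createFolds(unique(df_train$Child.ID), k)
  out <- c()
  for (f in folds_k){
    train_k <- subset(df_train, !(Child.ID %in% f))
    model_k <- lmer(CHI_MLU ~ Visit * Diagnosis + (1 + Visit|Child.ID), data = train_k, REML = F)
    test_k <- na.omit(subset(df_train, Child.ID %in% f))
    out <- c(out, ModelMetrics::rmse(test_k$CHI_MLU, predict(model_k, test_k, allow.new.levels = T)))
  }
  mean(out)
}
ks <- 2:10
plot(ks, sapply(ks, cv_rmse), type = "b", xlab = "Number of folds", ylab = "Mean cross-validated RMSE")
# For Bonus Question 2, compare mean(rmse1) (the cross-validated error) with the
# test-set RMSE from Exercise 1 (0.66 here).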
```
[HERE GOES YOUR ANSWER]
### Exercise 3) Assessing the single child
Let's get to business. This new kiddo - Bernie - has entered your clinic. This child has to be assessed according to his group's average and his expected development.
Bernie is one of the six kids in the test dataset, so make sure to extract that child alone for the following analysis.
You want to evaluate:
- how does the child fare in ChildMLU compared to the average TD child at each visit? Define the distance in terms of absolute difference between this Child and the average TD.
- how does the child fare compared to the model predictions at Visit 6? Is the child below or above expectations? (tip: use the predict() function on Bernie's data only and compare the prediction with the actual performance of the child)
```{r}
# Bernie = Child.ID: 2
# Extracting Bernie from the rest of the children
Bernie <-
df_test %>%
filter(Child.ID == 2)
# Checking the mean CHI_MLU of the TD children at each visit
TDs <-
  df_train %>%
  filter(Diagnosis == "TD") %>%
  group_by(Visit) %>%
  summarise(mean = mean(CHI_MLU, na.rm = T)) %>%
  mutate(difference = Bernie$CHI_MLU - mean) # Refer to the mean column just created, not to TDs (which does not exist yet inside this pipe)
TDs
# That is, Bernie has a longer MLU than the average TD child throughout all the visits.
# Bernie compared to predictions for visit 6 - is he below or above expectations?
prediction <- predict(m0, Bernie)
prediction
# Anna and I get 3.78; the boys get 1.72.
# Bernie's actual performance at visit 6
Bernie %>%
filter(Visit == 6) %>%
select(CHI_MLU)
# 3.45
# That is, according to Anna's and my prediction, Bernie does worse than expected; according to Oliver and Malte's prediction, he does better than expected.
```
[HERE GOES YOUR ANSWER]
### OPTIONAL: Exercise 4) Model Selection via Information Criteria
Another way to reduce bad surprises when testing a model on new data is to pay close attention to the relative information criteria of the models you are comparing. Let's learn how to do that!
Re-create a selection of possible models explaining ChildMLU (the ones you tested for exercise 2, but now trained on the full dataset and not cross-validated).
Then try to find the best possible predictive model of ChildMLU, that is, the one that produces the lowest information criterion.
- Bonus question for the optional exercise: are information criteria correlated with cross-validated RMSE? That is, if you take AIC for Model 1, Model 2 and Model 3, do they co-vary with their cross-validated RMSE?
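As a starting point, a minimal sketch of such a comparison is shown below (the two formulas are just the ones used in the chunks above; swap in the models you actually compared in Exercise 2):
```{r}
# Refit candidate models on the full training data (ML fits, i.e. REML = F, so the information criteria are comparable)
ic_m0 <- lmer(CHI_MLU ~ Visit * Diagnosis + (1 + Visit|Child.ID), df_train, REML = F)
ic_m1 <- lmer(CHI_MLU ~ Visit * Diagnosis * verbalIQ1 + (1 + Visit|Child.ID), df_train, REML = F)
AIC(ic_m0, ic_m1)   # Lower AIC suggests better expected out-of-sample fit
BIC(ic_m0, ic_m1)   # BIC penalises extra parameters more heavily
anova(ic_m0, ic_m1) # Likelihood-ratio test plus AIC/BIC in one table
```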
### OPTIONAL: Exercise 5): Using Lasso for model selection
Welcome to the last secret exercise. If you have already solved the previous exercises, and still there's not enough for you, you can expand your expertise by learning about penalizations. Check out this tutorial: http://machinelearningmastery.com/penalized-regression-in-r/ and make sure to google what penalization is, with a focus on L1 and L2-norms. Then try them on your data!
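If you get that far, here is a minimal lasso sketch to start from. It assumes the glmnet package and, as a simplification, ignores the random-effect structure (it only penalises fixed effects); the chosen predictors are just examples.
```{r}
# Minimal lasso sketch (assumption: glmnet is available; random effects are ignored here)
pacman::p_load(glmnet)
lasso_data <- na.omit(df_train[, c("CHI_MLU", "Visit", "Diagnosis", "verbalIQ1", "MOT_MLU", "tokens_CHI")])
x <- model.matrix(CHI_MLU ~ ., data = lasso_data)[, -1] # Predictor matrix without the intercept
y <- lasso_data$CHI_MLU
cv_fit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 gives the L1 (lasso) penalty
coef(cv_fit, s = "lambda.min")         # Coefficients at the best cross-validated lambda; zeroed-out coefficients are dropped predictors
```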