A2_P1_LanguageASD_instructions.Rmd

---
title: "Assignment 2 - Language Development in ASD - Part 1 - Explaining development"
author: "Malte Højmark-Bertelsen"
date: "12/09/19"
output: 
  md_document:
    variant: markdown_github
---
    
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(include = FALSE)
```

# Assignment 2

In this assignment you will have to discuss a few important questions (given the data you have). More details below. The assignment submitted to the teachers consists of:
- a report answering and discussing the questions (so we can assess your conceptual understanding and ability to explain and critically reflect)
- a link to a git repository with all the code (so we can assess your code)

Part 1 - Basic description of language development
- Describe your sample (n, age, gender, clinical and cognitive features of the two groups) and critically assess whether the groups (ASD and TD) are balanced
- Describe linguistic development (in terms of MLU over time) in TD and ASD children (as a function of group). 
- Describe how parental use of language (in terms of MLU) changes over time. What do you think is going on?
- Include individual differences in your model of language development (in children). Identify the best model.

Part 2 - Model comparison
- Discuss the differences in performance of your model in training and testing data
- Which individual differences should be included in a model that maximizes your ability to explain/predict new data?
- Predict a new kid's performance (Bernie) and discuss it against expected performance of the two groups

Part 3 - Simulations to plan a new study
- Report and discuss a power analysis identifying how many new kids you would need to replicate the results

The following involves only Part 1.

## Learning objectives

- Summarize and report data and models
- Critically apply mixed effects (or multilevel) models
- Explore the issues involved in feature selection


# Quick recap
Autism Spectrum Disorder is often related to language impairment. However, this phenomenon has not been empirically traced in detail:
i) relying on actual naturalistic language production, ii) over extended periods of time.

We therefore videotaped circa 30 kids with ASD and circa 30 comparison kids (matched by linguistic performance at visit 1) for ca. 30 minutes of naturalistic interactions with a parent. We repeated the data collection 6 times per kid, with 4 months between each visit. We transcribed the data and counted: 
i) the number of words that each kid uses in each video. Same for the parent.
ii) the number of unique words that each kid uses in each video. Same for the parent.
iii) the mean number of morphemes per utterance (Mean Length of Utterance, MLU) produced by each child in each video. Same for the parent.

This data is in the file you prepared in the previous class. 

NB. A few children have been excluded from your datasets. We will be using them next week to evaluate how good your models are in assessing the linguistic development in new participants.

This RMarkdown file includes:
1) questions (see above). Questions have to be answered/discussed in a separate document that you have to send directly to the teachers.
2) A breakdown of the questions into a guided template full of hints for writing the code to solve the exercises. Fill in the code and the paragraphs as required. Then report your results in the doc for the teachers.

REMEMBER that you will have to have a github repository for the code and send the answers to Kenneth and Riccardo without code (but a link to your github/gitlab repository). This way we can check your code, but you are also forced to figure out how to report your analyses :-)

Before we get going, here is a reminder of the issues you will have to discuss in your report:

1- Describe your sample (n, age, gender, clinical and cognitive features of the two groups) and critically assess whether the groups (ASD and TD) are balanced
2- Describe linguistic development (in terms of MLU over time) in TD and ASD children (as a function of group). 
3- Describe how parental use of language (in terms of MLU) changes over time. What do you think is going on?
4- Include individual differences in your model of language development (in children). Identify the best model.

# Let's go

### Loading the relevant libraries

Load the necessary libraries: what will you need?
- e.g. something to deal with the data
- e.g. mixed effects models
- e.g. something to plot with

```{r Load Libraries, include = FALSE}
library(pacman)
p_load(tidyverse, lme4, lmerTest, MuMIn) # MuMIn provides r.squaredGLMM(), used in Exercise 4
```

### Define your working directory and load the data
If you created a project for this class and opened this Rmd file from within that project, your working directory is your project directory.

If you opened this Rmd file outside of a project, you will need some code to find the data:
- Create a new variable called locpath (localpath)
- Set it to be equal to your working directory
- Move to that directory (setwd(locpath))
- Load the data you saved last time (use read_csv(fileName))

```{r Load Data, include = FALSE}
df <- read_csv("LangDevASD.csv")
df <- df[,-1]
```

### Characterize the participants (Exercise 1)

Identify relevant variables: participants' demographic characteristics, diagnosis, ADOS, Verbal IQ, Non Verbal IQ, Socialization, Visit, Number of words used, Number of unique words used, mean length of utterance in both child and parents.

Make sure the variables are in the right format.

Describe the characteristics of the two groups of participants and whether the two groups are well matched.

```{r descriptive stats, include = FALSE}
summary(df)
str(df)
# Convert variables to the correct classes
df$SUBJ <- as.factor(df$SUBJ)
df$DIAGNOSIS <- as.factor(df$DIAGNOSIS)
df$VISIT <- as.integer(df$VISIT)

# Summarise baseline (visit 1) characteristics per diagnostic group
df %>% 
  dplyr::filter(VISIT == 1) %>% 
  group_by(DIAGNOSIS) %>% 
  dplyr::summarise(
    N = n(),
    meanAGE = mean(AGE, na.rm = TRUE),
    sdAGE = sd(AGE, na.rm = TRUE),
    femaleN = sum(GENDER == "F"),
    verbalIQ = mean(EXPRESSIVELANGRAW1, na.rm = TRUE),
    sdverbalIQ = sd(EXPRESSIVELANGRAW1, na.rm = TRUE),
    nonverbalIQ = mean(MULLENRAW1, na.rm = TRUE),
    sdnonverbalIQ = sd(MULLENRAW1, na.rm = TRUE)
  )

```
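
One way to assess whether the groups are balanced is to test each visit-1 characteristic against diagnosis. The sketch below runs on a small simulated data frame (`toy`, with arbitrary values that are not taken from the study) so that it is self-contained; with the real data you would use the visit-1 rows of `df` instead. The choice of a Welch t-test and a chi-squared test is an assumption here, not something the assignment prescribes.

```{r balance sketch}
# Toy visit-1 data: all values are simulated for illustration only
set.seed(1)
toy <- data.frame(
  DIAGNOSIS = factor(rep(c("ASD", "TD"), each = 30)),
  AGE = c(rnorm(30, mean = 33, sd = 5), rnorm(30, mean = 20, sd = 2)),
  GENDER = factor(sample(c("F", "M"), 60, replace = TRUE), levels = c("F", "M"))
)

# Continuous variable: Welch two-sample t-test of age by diagnosis
age_test <- t.test(AGE ~ DIAGNOSIS, data = toy)
age_test$estimate  # group means
age_test$p.value

# Categorical variable: gender counts per group and a chi-squared test
gender_table <- table(toy$DIAGNOSIS, toy$GENDER)
gender_table
gender_test <- chisq.test(gender_table)
gender_test$p.value
```

A small p-value for a baseline variable suggests the groups differ on it, which is worth discussing even when the groups were matched on another measure.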

The sample included mostly young (<20) white males ...

[REPORT THE RESULTS]

## Let's test hypothesis 1: Children with ASD display a language impairment  (Exercise 2)

### Hypothesis: The child's MLU changes: i) over time, ii) according to diagnosis

Let's start with a simple mixed effects linear model

Remember to plot the data first and then to run a statistical test.
- Which variable(s) should be included as fixed factors?
- Which variable(s) should be included as random factors?

```{r ex2, include = FALSE}
ggplot(df, aes(VISIT, CHI_MLU, color = DIAGNOSIS))+
  geom_point()+
  geom_smooth(method = lm)+
  ylab("Children Mean Length Utterances")+
  xlab("Visit")+
  labs(color = "Diagnosis")
# It seems that the TD children show steeper language development over time

# We expect development per visit to differ between diagnoses (interaction) and
# each child to start and develop differently, hence random intercepts and slopes.
h1 <- lmer(CHI_MLU ~ VISIT * DIAGNOSIS + (1+VISIT|SUBJ), data = df, REML = F)
summary(h1)

```

How would you evaluate whether the model is a good model?

```{r ex2 evaluate, include = FALSE}
model_null <- lmer(CHI_MLU ~ VISIT + DIAGNOSIS + (1+VISIT|SUBJ), data = df, REML = F)
summary(model_null)
# Likelihood ratio test: does adding the VISIT:DIAGNOSIS interaction improve the fit?
anova(model_null, h1)
# logLik of model_null = -295.71, logLik of h1 = -278.23 (a difference of 17.48).
# The increase is significant, so h1 (with the interaction) explains the data better than model_null.

```

Not too good, right? Let's check whether a growth curve model is better.
Remember: a growth curve model assesses whether changes in time can be described by linear, or quadratic, or cubic (or... etc.) components.
First build the different models, then compare them to see which one is better.

```{r ex2 growth curve, include = FALSE}

```
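
A growth curve comparison could be sketched as follows. The example uses simulated data (`sim`, with made-up parameters) so that it runs on its own, and the model names `gc_linear`, `gc_quadratic` and `gc_cubic` are hypothetical. With the real data you would swap in `df` and its `CHI_MLU`, `VISIT`, `DIAGNOSIS` and `SUBJ` columns. Only random intercepts are used here to keep the toy fit stable, which is a simplification relative to the random-slope model above.

```{r growth curve sketch}
library(lme4)

# Simulated longitudinal data: 40 children, 6 visits, curved growth
set.seed(42)
n_subj <- 40
sim <- expand.grid(SUBJ = factor(1:n_subj), VISIT = 1:6)
sim$DIAGNOSIS <- factor(ifelse(as.integer(sim$SUBJ) <= 20, "ASD", "TD"))
subj_intercept <- rnorm(n_subj, sd = 0.4)
sim$CHI_MLU <- 1 + 0.6 * sim$VISIT - 0.05 * sim$VISIT^2 +
  0.2 * (sim$DIAGNOSIS == "TD") * sim$VISIT +
  subj_intercept[as.integer(sim$SUBJ)] + rnorm(nrow(sim), sd = 0.3)

# Nested models with linear, quadratic and cubic time terms (ML fits)
gc_linear    <- lmer(CHI_MLU ~ VISIT * DIAGNOSIS + (1 | SUBJ),
                     data = sim, REML = FALSE)
gc_quadratic <- lmer(CHI_MLU ~ (VISIT + I(VISIT^2)) * DIAGNOSIS + (1 | SUBJ),
                     data = sim, REML = FALSE)
gc_cubic     <- lmer(CHI_MLU ~ (VISIT + I(VISIT^2) + I(VISIT^3)) * DIAGNOSIS +
                       (1 | SUBJ), data = sim, REML = FALSE)

# Likelihood ratio tests between successive models
gc_comparison <- anova(gc_linear, gc_quadratic, gc_cubic)
gc_comparison
```

Because the models are nested and fitted with maximum likelihood, each row of the comparison tests whether the added polynomial term improves the fit over the previous model.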

Exciting, right? Let's check whether the model does a reasonable job of fitting the data. Plot the actual CHI_MLU data against the model's predictions, `fitted(model)`.

```{r}
df2 <- df[!is.na(df$CHI_MLU), ] # keep only rows with an observed CHI_MLU
plot(fitted(h1), df2$CHI_MLU, col = c("red", "blue")[as.integer(factor(df2$DIAGNOSIS))]) # one color per diagnosis level
# Hard to say more than that observed and fitted values track each other, so we run a correlation test to get the coefficient.
cor.test(fitted(h1), df2$CHI_MLU) #0.92 - observed and fitted values are strongly correlated
```
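
A correlation between observed and fitted values does not say how large the errors are, so it can help to add an error measure in the units of the outcome. The sketch below is a minimal example under stated assumptions: it uses the `sleepstudy` data that ships with lme4 as a stand-in for the assignment data, and the names `check_fit`, `used`, `observed`, `predicted` and `rmse` are hypothetical.

```{r fit check sketch}
library(lme4)

# sleepstudy ships with lme4 and stands in for the assignment data
check_fit <- lmer(Reaction ~ Days + (1 + Days | Subject),
                  data = sleepstudy, REML = FALSE)

# model.frame() returns exactly the rows used in the fit, so observed and
# fitted values line up even when the raw data contain missing values
used <- model.frame(check_fit)
observed <- used$Reaction
predicted <- fitted(check_fit)

# Root mean squared error: typical size of a prediction error
rmse <- sqrt(mean((observed - predicted)^2))
rmse

plot(predicted, observed)
abline(0, 1) # points on this line are predicted perfectly
```

Taking the observed values from `model.frame()` rather than from a separately filtered data frame avoids length mismatches between the data and `fitted()`.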

Now it's time to report our results.
Remember to report:
- the estimates for each predictor (beta estimate, standard error, p-value)
- A plain word description of the results
- A plot of your model's predictions (and some comments on whether the predictions are sensible)
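
The estimates listed above can be pulled out of a fitted model as a table. This is a minimal sketch, assuming lmerTest is installed; it uses the `sleepstudy` data that ships with lme4 as a stand-in for the assignment data, and `report_fit` and `report_table` are hypothetical names.

```{r report sketch}
library(lmerTest) # also attaches lme4 and adds df and p-values to summary()

# sleepstudy ships with lme4 and stands in for the assignment data
report_fit <- lmer(Reaction ~ Days + (1 + Days | Subject),
                   data = sleepstudy, REML = FALSE)

# Fixed-effects table: estimate, standard error, df, t value, p-value
report_table <- as.data.frame(coef(summary(report_fit)))
round(report_table, 3)
```

Each row of the table can then be reported in the text as a beta estimate with its standard error and p-value.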

[REPORT THE RESULTS]
Linguistic development of children MLU is affected by ... [COMPLETE]

## Let's test hypothesis 2: Parents speak equally to children with ASD and TD  (Exercise 3)

### Hypothesis: Parental MLU changes: i) over time, ii) according to diagnosis

```{r ex3, include = FALSE}
h2.0 <- lmer(MOT_MLU ~ VISIT + DIAGNOSIS + (1+VISIT|SUBJ), df2, REML = F)
h2.1 <- lmer(MOT_MLU ~ VISIT * DIAGNOSIS + (1+VISIT|SUBJ), df2, REML = F, control = lmerControl(optCtrl=list(xtol_abs=1e-8, ftol_abs=1e-8)))
anova(h2.0, h2.1)
#Since the model with the interaction effect is not better, we will just use the one without it.

summary(h2.0)
#Report this
```

Parent MLU is affected by ... but probably not ...
[REPORT THE RESULTS]

### Adding new variables (Exercise 4)

Your task now is to figure out how best to describe the children's linguistic trajectory. The dataset contains a number of additional demographic, cognitive and clinical variables (e.g. verbal and non-verbal IQ). Try them out and identify the statistical model that best describes your data (that is, the children's MLU). Describe how you selected the best model and send the code to run the model to Riccardo and Kenneth.


```{r ex4, include = FALSE}
# Baseline model (comments after r.squaredGLMM give marginal R^2 & conditional R^2)
m0 <- lmer(CHI_MLU ~ 1 + (1 + VISIT | SUBJ), df2, REML = F, control = lmerControl(optCtrl=list(xtol_abs=1e-8, ftol_abs=1e-8)))
MuMIn::r.squaredGLMM(m0) #0 & 0.8506

# Verbal IQ
m1 <- lmer(CHI_MLU ~ VISIT * DIAGNOSIS * EXPRESSIVELANGRAW1 + (1 + VISIT|SUBJ), df2, REML = F)
summary(m1)
# R^2
MuMIn::r.squaredGLMM(m1) #0.6740 & 0.8138

# Non-verbal IQ
m2 <- lmer(CHI_MLU ~ VISIT * DIAGNOSIS * MULLENRAW1 + (1 + VISIT|SUBJ), df2, REML = F, control = lmerControl(optCtrl=list(xtol_abs=1e-8, ftol_abs=1e-8)))
MuMIn::r.squaredGLMM(m2) #0.5098 & 0.8153

# Socialisation
m3 <- lmer(CHI_MLU ~ VISIT * DIAGNOSIS * SOCIALIZATION1 + (1 + VISIT|SUBJ), df2, REML = F)
MuMIn::r.squaredGLMM(m3) #0.4959 & 0.8161

# ADOS - indicates the severity of the autistic symptoms
m4 <- lmer(CHI_MLU ~ VISIT * DIAGNOSIS * ADOS1 + (1 + VISIT|SUBJ), df2, REML = F)
MuMIn::r.squaredGLMM(m4) #0.5001 & 0.8157

# Gender
m5 <- lmer(CHI_MLU ~ VISIT * DIAGNOSIS * GENDER + (1 + VISIT|SUBJ), df2, REML = F, control = lmerControl(optCtrl=list(xtol_abs=1e-8, ftol_abs=1e-8)))
MuMIn::r.squaredGLMM(m5) #0.3973 & 0.8148

# m1-m5 are not nested in one another, so read the AIC/BIC columns here
# rather than the sequential chi-square tests.
anova(m0, m1, m2, m3, m4, m5)
```
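
Candidate models that each add a different predictor are not nested in one another, so information criteria are a more natural basis for comparison than likelihood ratio tests, and every model has to be fitted to the same rows. The sketch below illustrates this on simulated data; the data frame `toy`, the predictors `VERBAL` and `NONVERBAL`, and the model names are all hypothetical and chosen only for illustration.

```{r aic sketch}
library(lme4)

# Simulated data: two candidate baseline predictors, one with missing values
set.seed(7)
n_subj <- 30
toy <- expand.grid(SUBJ = factor(1:n_subj), VISIT = 1:6)
verbal <- rnorm(n_subj)     # hypothetical baseline score that affects MLU
nonverbal <- rnorm(n_subj)  # hypothetical baseline score with no effect
toy$VERBAL <- verbal[as.integer(toy$SUBJ)]
toy$NONVERBAL <- nonverbal[as.integer(toy$SUBJ)]
toy$NONVERBAL[toy$SUBJ == "1"] <- NA
toy$MLU <- 1 + 0.3 * toy$VISIT + 0.5 * toy$VERBAL +
  rnorm(n_subj, sd = 0.3)[as.integer(toy$SUBJ)] + rnorm(nrow(toy), sd = 0.3)

# Keep only rows complete on every candidate predictor so all fits share data
keep <- complete.cases(toy[, c("MLU", "VERBAL", "NONVERBAL")])
toy_complete <- droplevels(toy[keep, ])

cand_verbal <- lmer(MLU ~ VISIT + VERBAL + (1 | SUBJ),
                    data = toy_complete, REML = FALSE)
cand_nonverbal <- lmer(MLU ~ VISIT + NONVERBAL + (1 | SUBJ),
                       data = toy_complete, REML = FALSE)

# Non-nested candidates: rank by AIC, lower is better
aic_table <- AIC(cand_verbal, cand_nonverbal)
aic_table[order(aic_table$AIC), ]
```

Filtering to complete cases first matters because a model fitted to fewer rows has a likelihood that is not comparable with the others.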

In addition to ..., the MLU of the children is also correlated with ...
Using AIC / nested F-tests as a criterion, we compared models of increasing complexity and found that ...

[REPORT THE RESULTS]