---
title: "Session 2: Modelling/Analysis"
author: "Tanya Flynn"
output: html_document
---
### A note on P values
A p value tests how well your data fit your null hypothesis - in other words, how likely you would be to get a result like yours by chance alone. It is a statistical tool, not a definitive answer.
**null hypothesis** - your null hypothesis will always be "there is nothing exciting going on"
**alternate hypothesis** - what you actually think is going on
#### Statistical Significance
Convention makes p < 0.05 the threshold for statistical significance
- this value means that if your null hypothesis is true you would see a result at least as extreme as yours in 5 out of every 100 tests
- because 5 out of 100 is not often, you assume your null hypothesis is not true
- without replication/validation you cannot say whether you are right in this assumption
For instance, say I have a population of 1000 people in which 300 have brown hair and 400 have blue eyes, and I want to test whether having blue eyes makes you more or less likely to have brown hair (for some reason).
- blue eyes and brown hair being *unrelated* is my **null hypothesis**
- blue eyes and brown hair being *related* is my **alternate hypothesis**
- I choose one hundred people randomly from the group
- I would expect to get approximately 12 people with both blue eyes and brown hair
- This expectation assumes that having blue eyes and having brown hair are independent of each other
- What if in this population most people with blue eyes don't have brown hair?
- When I pick the people randomly I get fewer people with both blue eyes and brown hair than I expect, say I get 6
- 12 and 6 are quite different to each other, p = 0.01
- This p value says that if the null hypothesis were true and I ran this same test 100 times, I would expect to see a result at least this extreme only once.
- That means it is unlikely my **null hypothesis** (of blue eyes/brown hair being unrelated) is correct, so I can reject it and *assume* the **alternate hypothesis** is true - the sketch below shows where the expected count comes from.
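As a quick sanity check, the expected count comes straight from multiplying the two population proportions. A minimal sketch using the numbers from the example above:
```{r}
# under the null hypothesis the two traits are independent, so the
# probability of having both is the product of the two proportions
p_brown = 300 / 1000    # 30% of the population has brown hair
p_blue = 400 / 1000     # 40% of the population has blue eyes
100 * p_blue * p_brown  # expected number with both in a sample of 100
```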
#### Clinical Significance
Clinical significance is when something has an effect large enough that it has some kind of relevance to real life.
- statistical significance does not imply clinical significance
- something could have an effect significantly different from what is expected, but still not an effect large enough to matter in a clinical setting
Be wary of claiming something is of importance solely on the basis of a p value < 0.05
# Lesson 5
Covered in this lesson:
- chi-squared
- t tests
- correlations
### Chi-squared
- tests whether what you *expect to get* fits with what you *actually got*
- you need either:
- a 2x2 contingency table
- or two data columns with factor variables
2x2 CONTINGENCY
```{r}
# build the 2x2 table from the eyes/hair example above
eyes_hair = data.frame("Brown"=c(6, 24), "Other"=c(34, 36) )
row.names(eyes_hair) = c("Blue", "Other")
print(eyes_hair)
```
```{r}
x = chisq.test(eyes_hair)
print(x)
x$observed   # the counts you actually got
x$expected   # the counts the null hypothesis predicts
```
TWO FACTOR COLUMNS
```{r}
mydata = read.delim(file = "Tanya_Data.txt", header = TRUE, sep = "\t", stringsAsFactors=FALSE)
# chisq.test() cross-tabulates the two columns for you
y = chisq.test(mydata$SNP, mydata$GOUT)
print(y)
y$observed
round(y$expected, 0)
```
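If you want to see the cross-tabulation that *chisq.test* builds from two columns, *table* shows the same counts (a quick sketch using the columns above):
```{r}
# cross-tabulate the two factor columns - these are the observed
# counts that chisq.test() compares against its expected counts
table(mydata$SNP, mydata$GOUT)
```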
### t tests
- A way to test whether the means of two sets of measurements differ
- this could be between two groups (cases vs controls)
- or within the same group measured twice (before and after)
TWO GROUPS
```{r}
# formula form: compare URATE between the two GOUT groups
t.test(mydata$URATE ~ mydata$GOUT)
```
SAME GROUP
```{r}
vinha = read.delim(file = "Vinha_Data.txt", header = TRUE, sep = "\t", stringsAsFactors=FALSE)
# paired = TRUE matches the two measurements from the same person
t.test(y = vinha$UricAcid_B, x = vinha$UricAcid_30, paired=TRUE)
```
*if you have data from multiple time points on the same samples your t test needs to be paired - this removes any between-sample variation. Compare the paired result with the unpaired version below:*
```{r}
# the same comparison ignoring the pairing between measurements
t.test(y = vinha$UricAcid_B, x = vinha$UricAcid_30, paired=FALSE)
```
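A paired t test is the same as a one-sample t test on the within-person differences, which you can check directly (a minimal sketch assuming the two columns have no missing values):
```{r}
# the paired test above is equivalent to testing whether the
# within-person differences have a mean of zero
differences = vinha$UricAcid_30 - vinha$UricAcid_B
t.test(differences, mu = 0)
```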
### Correlations
- testing whether two measures are linearly related to each other, or how similar the values in two variables are for each person/data point
- a correlation coefficient ranges from -1 to 1
- [-1] - as one variable goes up the other goes down
- [ 0] - no correlation at all
- [ 1] - as one variable goes up the other one does too
```{r}
# cor.test() gives the Pearson correlation by default, plus a CI and p value
cor.test(mydata$STANCESTRY, mydata$GPANCESTRY)
```
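As with regression, a quick scatter plot helps you see the strength of a correlation. A minimal sketch using the same two columns:
```{r}
# each point is one person - a tight diagonal band means a strong correlation
plot(x = mydata$STANCESTRY, y = mydata$GPANCESTRY, xlab = "STANCESTRY", ylab = "GPANCESTRY")
```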
# Lesson 6
Covered in this lesson:
- linear regression
- logistic regression
- covariates
- interactions
### Linear regression
- tests whether one measure has a linear relationship to another
- builds an equation representing this relationship
- simple linear regression equation: **y = B~0~ + B~1~x~1~ + error**
**y** - is the response variable, what you want to predict
**B~0~** - the intercept of the line
**B~1~** - the slope of the line
**x~1~** - the explanatory variable
```{r}
simple = lm(formula = URATE ~ BMI, data = mydata)
summary(simple)
confint(simple)   # 95% confidence intervals for the estimates
```
- Estimate = the beta value for the variable; it centres around 0 when there is no relationship
- [< 0] - as the explanatory variable goes up the response variable goes down
- [ 0 ] - no relationship at all
- [> 0] - as the explanatory variable goes up the response variable does too
- if the confidence interval of the estimate includes zero there is no significant relationship between your two variables, *check that your p value and confidence interval are saying the same thing!!*
- you can plot this information easily to help yourself see the relationship
```{r}
plot(x = mydata$BMI, y = mydata$URATE, xlab = "BMI (kg/m^2)", ylab = "Serum Urate (mmol/L)")
abline(simple, col = "red" )   # overlay the fitted regression line
```
- multiple linear regression **y = B~0~ + B~1~x~1~ + B~2~x~2~ + ... + B~n~x~n~ + error**
- when you have more than one explanatory variable
```{r}
# add AGE as a second explanatory variable
multiple = lm(formula = URATE ~ BMI + AGE, data = mydata)
summary(multiple)
confint(multiple)
```
- notice that the estimate for BMI is now slightly different from before: 0.0044121 vs 0.0046408
- this is because a multiple regression adjusts the answer of every variable by the effect of every other variable
- sometimes this adjustment can result in a *loss* of significance for a variable
- sometimes this adjustment can result in a *gain* of significance for a variable
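You can pull the two BMI estimates out side by side to see the size of the adjustment, using the *simple* and *multiple* models fitted above:
```{r}
# compare the BMI estimate before and after adjusting for AGE
coef(simple)["BMI"]
coef(multiple)["BMI"]
```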
### Logistic regression
- tests whether a variable influences an outcome
- works in a similar way to linear regression, but with a binary (yes/no) categorical variable as the response variable
- simple logistic regression equation: **log(yes/no) = B~0~ + B~1~x~1~**
**log(yes/no)** - the log odds of success, where yes and no are the presence or absence of the outcome you are interested in
**B~0~** - the intercept of the line
**B~1~** - the slope of the line
**x~1~** - the explanatory variable
```{r, error=TRUE}
# this call fails: glm(family = binomial) needs the response coded
# as 0/1 (or a factor), but GOUTAFFSTAT is coded 1/2
simpleLOG = glm(formula = GOUTAFFSTAT ~ BMI, data = mydata, family = binomial)
```
Recoding the response to 0/1 by subtracting 1 fixes this:
```{r}
mydata$GOUTAFFSTAT = mydata$GOUTAFFSTAT - 1
```
```{r}
simpleLOG = glm(formula = GOUTAFFSTAT ~ BMI, data = mydata, family = binomial)
summary(simpleLOG)
```
- the estimates produced by a *glm* are just like those produced by a *lm*
- because our response variable is the log(yes/no) we need to convert our estimates from the log scale back to the response scale
- this is how we get an **odds ratio**
```{r}
exp(coef(simpleLOG) )     # convert the log odds estimates to odds ratios
exp(confint(simpleLOG) )  # and their confidence intervals
```
- exp(Estimate) = the odds ratio; it centres around 1 when there is no relationship
- [< 1] - as the explanatory variable goes up the chance of "yes" happening goes down: protective response
- [ 1 ] - no relationship at all
- [> 1] - as the explanatory variable goes up the chance of "yes" happening goes up: risk response
- if the confidence interval of the exp(estimate) includes 1 there is no significant relationship between your two variables, *check that your p value and confidence interval are saying the same thing!!*
- you can also run multiple logistic regression just the same as in a linear model
- you can also run regressions on only a subset of data within the regression command
```{r}
# family = binomial is needed for logistic regression - without it
# glm() defaults to fitting a linear model
Euro_multiple = glm(formula = GOUTAFFSTAT ~ BMI + AGE + SNP, data = mydata, family = binomial, subset = ETHCLASS == "European")
summary(Euro_multiple)
exp(coef(Euro_multiple) )
exp(confint(Euro_multiple) )
```
### Interactions
- testing whether the effect of one variable is modified by another variable
- for instance a gene increases gout risk 2-fold in a population.
- eating pies also increases gout risk by 2-fold.
- you would assume having the gene and eating pies together would increase your risk ~ 4-fold (2 x 2, if the two effects multiplied independently)
- when you test this interaction it turns out eating pies and having the gene decreases your risk 1.5-fold
- what is going on? I dunno, but pies are delicious.
- you can test for interactions by adding an interaction term to your regression model
```{r}
# BMI*SEX expands to BMI + SEX + BMI:SEX - the BMI:SEX term tests
# whether the effect of BMI differs between the sexes
interaction = glm(formula = GOUTAFFSTAT ~ BMI*SEX, data = mydata, family = binomial)
summary(interaction)
exp(coef(interaction))
exp(confint(interaction))
```
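With an interaction in the model the BMI effect is no longer a single number - it depends on SEX. A minimal sketch of extracting the two BMI odds ratios; the "BMI:SEX" name below assumes SEX is coded numerically (e.g. 0/1), so adjust it to match the coefficient names your own *summary(interaction)* prints:
```{r}
b = coef(interaction)
exp(b["BMI"])                 # BMI odds ratio when SEX is at its reference level
exp(b["BMI"] + b["BMI:SEX"])  # BMI odds ratio at the other level of SEX
```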