---
title: "Public health data science"
author: "Julian Flowers"
date: "`r Sys.Date()`"
output:
  ioslides_presentation:
    fig_caption: yes
    incremental: yes
    smaller: yes
    widescreen: yes
bibliography: bib.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, cache = TRUE)
```
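```{r, cache=TRUE, message=FALSE, warning=FALSE}
library(tidyverse)
library(fingertipsR)
## download the Health Profiles data (ProfileID 26) for area types 101 and 102
data <- fingertips_data(ProfileID = 26,
                        AreaTypeID = c(101, 102), inequalities = FALSE)
```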
## What is public health data science?
* Data science
    + New analytical thinking - from describing -> doing
    + New data thinking - big data/ unstructured/ open/ linked/ text/ graph data
    + Mainstreaming novel analytical techniques - text mining and NLP, machine learning, models, network analysis, time series...
    + Blending software engineering, ICT, statistical and digital skills and tools with analysis
* Public health data science (PHDS) is all of this + epidemiology/ public health domain knowledge (topic and data)/ population perspective -> applications: monitor (past), surveillance (present), model (future)
## Public Health Intelligence 2.0
PHI 1.0 (now) | PHI 2.0 (next)
--------------------------------|-------------------------------
Profiling => | Analysis and insight
Collation and description => | Prediction and prescription
Excel/ stats packages => | R/ Python/ PowerBI
Static => | Interactive
Manual => | Automated
Waterfall => | Agile
User feedback => | User need
Epidemiology and stats => | Epidemiology + models + machine learning
Structured/ small data => | Unstructured/ big data
## Reproducibility
* Reproducibility is the ability of an entire analysis of an experiment or study to be duplicated, either by the same researcher or by someone else working independently. Repeating the experiment itself is called replicating it. <https://en.wikipedia.org/wiki/Reproducibility>
* Requires sharing method, results, data and code
* From an analyst's point of view, QA is much easier if the analysis is made reproducible from the outset and is written as code
* GDS are promoting the idea of a 'reproducible analytical pipeline' - see for example <https://github.com/ukgovdatascience/eesectorsmarkdown> (using R Markdown)
* This approach reduces the number of steps in the production process, automates production and QA, improves collaboration and speeds up delivery
## Version control
* Reproducibility requires **version control**
* There are software systems for doing this. PHE uses:
    + GitHub for external sharing <https://github.com/PublicHealthEngland>
    + GitLab for internal sharing <https://gitlab.phe.gov.uk>
* GitLab is available via PHE username and password. The GitHub account is managed by PHE Digital
* This kind of version control system has a number of features:
    + A database of all your work (known as a repository or repo)
    + A system for storing any changes you make (known as *commits* and *pushes*)
    + Backup of all your work
    + Ability to easily rewind to any point and undo any change
    + Ability for others to make changes (*branching*) and collaborate
    + Tools to publish prototypes, demos, web pages
## Tidy data
* 'Tidy' data is an important data science concept [@Wickham2014]
* It refers to a data format which is normalised, i.e. there is one row per observation, one column per variable, one table per concept, and data cells contain only values
* `tidyverse` is a series of *R* packages to help get data into a tidy format
* Tidy data is much easier to share, manage and analyse
* Data which is not tidy is *messy*
* Tidy data is usually in 'long' format
* `ggplot2` works best with data in tidy format
* Wide data is needed for column-by-column comparison
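
As a minimal sketch of moving between the two layouts, the snippet below reshapes a small made-up table with `tidyr`. The area names and values are hypothetical; the column names `Timeperiod` and `Value` simply mirror those used in the Fingertips data elsewhere in these slides.

```r
library(tidyr)

# Hypothetical 'wide' table: one row per area, one column per year
wide <- data.frame(
  AreaName = c("Area A", "Area B"),
  `2014` = c(4.1, 3.8),
  `2015` = c(3.9, 3.7),
  check.names = FALSE
)

# Wide -> long (tidy): one row per area-year observation
long <- pivot_longer(wide, cols = -AreaName,
                     names_to = "Timeperiod", values_to = "Value")

# Long -> wide again, e.g. for column-by-column comparison
wide_again <- pivot_wider(long, names_from = Timeperiod, values_from = Value)
```

The long version is the shape that `ggplot2` and grouped summaries expect; the wide version is often easier to read in a printed table.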
## Shiny
* *Shiny* is a web application framework for R. It requires **no** knowledge of web languages like HTML or JavaScript
* You can use Shiny to create interactive web apps, interactive documents and "gadgets" which can be embedded in web pages
* It does require some knowledge of R, and the syntax for Shiny takes a little learning, but it's possible to build powerful tools in days rather than weeks or months
* Shiny apps can be hosted
    + externally on GitHub, shinyapps.io or other sites, or
    + internally on PHE's Shiny server
* Examples:
    + <https://github.com/rstudio/shiny-examples>
    + <http://pct.bike/>
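
To give a feel for the syntax, here is a minimal sketch of a Shiny app: a user interface with one slider and one plot, and a server function that redraws the plot when the slider moves. It uses the built-in `faithful` dataset purely for illustration; the input and output names (`bins`, `hist`) are arbitrary choices.

```r
library(shiny)

# User interface: one slider and one plot
ui <- fluidPage(
  titlePanel("Histogram demo"),
  sliderInput("bins", "Approximate number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: redraws the histogram whenever the slider value changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Old Faithful waiting times", xlab = "Minutes")
  })
}

# Bundle the two parts into an app object
app <- shinyApp(ui = ui, server = server)
# To launch it interactively: runApp(app)
```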
## Analytical approaches
* *Summarise-visualise* - a first step in analysing any large dataset is to create summary statistics and visualise the data, to check distributions (density plots, box plots) and look for missing data
* *Split-apply-combine* - where data is grouped: split it into groups, apply the same analysis to each group and recombine the results
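
The three steps of split-apply-combine can be sketched in base R as follows. The data frame is made up for illustration (two hypothetical areas, four years each); the same pattern is what `group_by()` followed by a summary does in `dplyr`.

```r
# Hypothetical data: an indicator value for two areas over four years
df <- data.frame(
  area  = rep(c("North", "South"), each = 4),
  year  = rep(2011:2014, times = 2),
  value = c(5.0, 4.8, 4.6, 4.4, 4.0, 4.1, 4.2, 4.3)
)

# Split: one data frame per area
pieces <- split(df, df$area)

# Apply: fit a linear trend to each piece and keep the annual change (slope)
slopes <- vapply(pieces,
                 function(d) coef(lm(value ~ year, data = d))[["year"]],
                 numeric(1))

# Combine: collect the results into a single table
result <- data.frame(area = names(slopes), slope = unname(slopes))
```

Here `result` has one row per area, with the fitted change in the indicator per year.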
## Infant mortality trends - split-apply-combine example
```{r, echo = FALSE, eval = TRUE}
data %>%
  ## select indicator
  filter(IndicatorName == "Infant mortality", AreaType == "Region") %>%
  ## convert time to numeric
  mutate(time = as.numeric(substr(Timeperiod, 1, 4))) %>%
  ## split by Region
  group_by(AreaName) %>%
  ggplot(aes(time, Value)) +
  geom_line() +
  geom_ribbon(aes(ymin = LowerCIlimit, ymax = UpperCIlimit), fill = "blue", alpha = 0.3) +
  geom_smooth(method = "lm", colour = "red", lty = "dotted", lwd = 0.5) +
  facet_wrap(~AreaName) +
  labs(title = "Trends in infant mortality",
       y = "Infant mortality (deaths per 1000 births)",
       x = "Year",
       caption = "Source: Fingertips")
```
## Regional infant mortality trends - using split-apply-combine
- Fitting a linear model to each region
- Linear models fit the regional trends in infant mortality well
```{r, echo= FALSE, results="asis"}
options(digits = 2)
data %>%
  ## select indicator
  filter(IndicatorName == "Infant mortality", AreaType == "Region") %>%
  ## convert time to numeric
  mutate(time = as.numeric(substr(Timeperiod, 1, 4))) %>%
  ## split by Region
  group_by(AreaName) %>%
  ## apply: fit a linear model of infant mortality on time for each region
  ## and extract the key fit statistics
  do(broom::glance(lm(Value ~ time, data = .))) %>%
  ## combine results: keep region, R-squared, adjusted R-squared and p-value
  select(AreaName, r.squared, adj.r.squared, p.value) %>%
  knitr::kable()
```
## Machine learning
- Training computers to perform tasks without explicit programming
- In data terms, 5 types of problem can be answered with machine learning techniques (algorithms)
    + *Classification* - yes/no outcomes - does someone have a disease? Was an objective achieved?
        - The algorithms include logistic regression, neural networks and deep learning, random forests and support vector machines
    + *How much?* - regression type analysis. Algorithms include linear and non-linear regression, and penalised regression (e.g. Lasso)
    + *Is it unusual?* - identifying anomalies, unexpected results, outliers. Algorithms include time series analysis, anomaly detection algorithms and regression models
    + *Is there structure or pattern in the data?* - this is sometimes called *unsupervised* machine learning, where the analysis is completely data driven. Algorithms include clustering and principal components analysis. The previous examples are *supervised* - we train models where we already know the answer, and see how well they apply where we don't
    + *Learning from data* - recommender systems (e.g. Amazon) and reinforcement learning, where models are constantly tuned with new information and feedback. Algorithms include association rule mining (e.g. `apriori` in the `arules` package)
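
To make the supervised/unsupervised distinction concrete, here is a minimal sketch using only base R and its built-in datasets (`mtcars` and `iris`, chosen for convenience rather than public health relevance). The 0.5 probability cut-off and the choice of 3 clusters are illustrative assumptions.

```r
# Supervised (classification): predict transmission type (am: 0/1) from
# car weight using logistic regression; we know the answer for every row
fit <- glm(am ~ wt, data = mtcars, family = binomial)
predicted <- as.integer(predict(fit, type = "response") > 0.5)
# In-sample accuracy only; in practice, assess on held-out data
accuracy <- mean(predicted == mtcars$am)

# Unsupervised (clustering): group the iris measurements without using the
# species labels, then compare the clusters with the known species
set.seed(1)
clusters <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)
comparison <- table(cluster = clusters$cluster, species = iris$Species)
```

In the first case the model is trained against a known outcome; in the second the algorithm only sees the measurements, and the species labels are used afterwards to see how well the clusters line up.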
## R and RStudio
* R is a *statistical programming language*
* RStudio is an *integrated development environment* (IDE)
* Most people use R in the RStudio environment to undertake analyses, write reports, process data etc.
* At the time of writing, the latest version of *R* is 3.4 and of RStudio is 1.0.143 - we want R users to have these installed
* R can be downloaded [here](https://cran.r-project.org/bin/windows/base/R-3.4.0-win.exe)
* RStudio can be downloaded [here](https://download1.rstudio.org/RStudio-1.0.143.exe)
## R Markdown
* *R Markdown* is a format for creating reports
* You can download your data, do your analysis, create your charts or visualisations, write narrative, and publish your document as HTML or Word (or pdf) all in `R Markdown`
* You can share markdown documents for others to work on and collaborate
* These slides are made in R Markdown and the code is available [here]()
## R packages
* When you download R you get the *base* version
* To make the most of R you need to install additional "packages", for example:
```{r, eval = FALSE, echo=TRUE}
install.packages("tidyverse")
library(tidyverse)
```
* A list of recommended packages and their uses is available as a [separate document]()
* We recommend for most purposes that packages should only be used if available from [CRAN](https://cran.r-project.org/)
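
When sharing code, it can help to install only what is missing rather than reinstalling every time. A minimal sketch, where `missing_packages()` is a hypothetical helper written for this slide (it is not part of R or any package):

```r
# Return the subset of package names that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Usage (not run here, as it would install from CRAN):
# to_install <- missing_packages(c("tidyverse", "ggmap"))
# if (length(to_install) > 0) install.packages(to_install)
```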
## `fingertipsR`
* Is a package available from CRAN, designed to make it easy to get and reuse data from [Fingertips](https://fingertips.phe.org.uk/api)
* Has 6 main functions:
    + `profiles()` - returns a list of profiles and domains in Fingertips
    + `indicators()` - returns a list of indicators in a profile or set of profiles
    + `fingertips_data()` - returns the data for a single indicator, set of indicators or set of profiles
    + `area_types()` - returns the area types for which data is available
    + `indicator_metadata()` - returns the metadata for indicators or profiles
    + `deprivation()` - returns the deprivation scores for local authorities
## `fingertipsR` (2)
* More information is available [here](https://github.com/PublicHealthEngland/fingertipsR)
* This code will download all the Health Profile data for counties, upper tier and lower tier local authorities. Replacing with `ProfileID = 19` downloads all the PHOF data
```{r, eval = FALSE, echo=TRUE}
library(fingertipsR)
data <- fingertips_data(ProfileID = 26, AreaTypeID = c(101, 102),
                        inequalities = FALSE)
```
## Like this:
```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, results='asis'}
data %>%
  sample_n(5) %>%
  select(IndicatorName, AreaName, Timeperiod, Value, LowerCIlimit, UpperCIlimit) %>%
  knitr::kable()
```
## `ggplot2`
* R enables production of a wide variety of publication quality graphics and maps
* There are 3 'frameworks' for charting:
    + Base plotting
    + Lattice
    + ggplot2
* ggplot2 is obtained by `install.packages("ggplot2")` or `install.packages("tidyverse")` and is the preferred framework
* There is a government ggplot2 theme and we are developing a PHE one
## `ggplot2` example
```{r, echo = FALSE}
library(tidyverse)
data %>%
  filter(stringr::str_detect(IndicatorName, "Life"), AreaType == "Region") %>%
  ggplot(aes(Timeperiod, Value)) +
  geom_boxplot(fill = "aliceblue") +
  facet_wrap(~AreaName) +
  labs(title = "Trends and variation in local authority life expectancy by region",
       subtitle = "Rate of increase slowing but variation reducing",
       y = "Years") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
```
## `ggplot2` example 2 - faceted maps
```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(ggmap)
library(geojsonio)
library(viridis)
## male life expectancy for districts and unitary authorities
ds <- data %>%
  filter(stringr::str_detect(IndicatorName, "Life"), AreaType == "District & UA", Sex == "Male")
## read local authority boundaries and keep English areas (codes starting "E")
shape <- geojson_read("https://opendata.arcgis.com/datasets/686603e943f948acaa13fb5d2b0f1275_3.geojson", what = "sp")
shape <- subset(shape, substr(lad16cd, 1, 1) == "E")
## convert the polygons to a data frame for plotting
shape1 <- fortify(shape, region = "lad16cd")
## join the indicator values to the boundaries and draw one map per time period
shape1 %>%
  left_join(ds, by = c("id" = "AreaCode")) %>%
  filter(!is.na(Timeperiod)) %>%
  ggplot() +
  geom_polygon(aes(long,
                   lat,
                   group = group,
                   fill = Value)) +
  facet_wrap(~Timeperiod, nrow = 3) +
  coord_map() +
  theme_minimal() +
  scale_fill_viridis(direction = -1, name = "Life expectancy") +
  labs(title = "Trends in life expectancy",
       y = "",
       x = "") +
  theme(axis.text = element_blank(),
        panel.grid = element_blank())
```
## R resources for Public Health Intelligence
* R Packages
    + `phutils` - a collection of useful tools developed by David Whiting at Medway Council
    + `epitools` - epidemiological analysis
    + `surveillance` - tools for surveillance including ones used by PHE communicable disease control
* Blogs and books
    + [R for public health blog](http://rforpublichealth.blogspot.co.uk/)
    + [Population health data science](https://bookdown.org/medepi/phds/)
    + [R 4 Data Science]()
## Learning R
* DataCamp
* Coursera Data science
* the aRt of the possible
* R 4 Data Science
* Ask
* Try
* Google
* Stack Overflow
* PHE questions
## References