---
title: "R markdown report using fingertipsR"
author: "Julian Flowers"
date: "3 April 2017"
output:
  html_document:
    number_sections: yes
    toc: yes
bibliography: md.bib
---
```{r setup, include=FALSE}
knitr::opts_knit$set(root.dir = "~/rmd")
```
```{r, include = FALSE}
knitr::opts_chunk$set(echo = FALSE, cache = TRUE) ## hides the code in the output and caches chunk results, which makes the document quicker to re-run
```
# Introduction
The Government Digital Service (GDS) is promoting a [new analytical workflow based on R Markdown](https://gdsdata.blog.gov.uk/2017/03/27/reproducible-analytical-pipeline/). R Markdown is a way of writing reports using the R statistical software and [RStudio](https://www.rstudio.com/). It combines analysis and reporting in a single document that can be automated, reproduced, and output in *html* format or as *word* or *pdf* documents.
The proposal is that the current flow for reporting and creation of output is simplified from something like this:
![](sMMBa2xfksovCZRW-cYJFNA.png)
<br>
to this:
![](spdVp_pexfJNpJjIxNp1rbQ.png)
<br>
The proposed data flow means that documents can be easily prepared in an appropriate format for publication on *gov.uk*. The GDS data science team have produced some graphical templates for use on the *gov.uk* platform. This approach cuts down the number of steps involved in creating reports, reduces the risk of error, and improves quality assurance. It can also be automated to produce multiple reports in one go, or adapted as a template to report on different topics or issues without too much effort.
The `knitr` package greatly facilitates the production of high quality reports in different formats - the schematic below shows the options.
![](knitr-workflow.png)
## Fingertips
[Fingertips](https://fingertips.phe.org.uk) is a major publication platform for Official Statistics in PHE. It currently supports a range of visualisations and graphical pdf reports, but does not support commentary, analysis and interpretation alongside the publication of the statistical data.
We have produced an R package - `fingertipsR` - to facilitate data extraction from the [Fingertips Application Programming Interface (API)](https://fingertips.phe.org.uk/api).
# Getting started
This report shows how to:
* extract data from the API using the `fingertipsR` package
* report using `rmarkdown`
## R and markdown basics
A good starting point for R Markdown is the [Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf). There are 3 parts to any markdown document:
1. The **header** - this contains important information about the title, author, date of the report, and controls the output format and style of the document.
2. **Markdown text** for commentary - plain text with simple formatting marks
3. **Code chunks** - these run the analytical R code to import and manipulate data, create analysis and produce visualisations like charts and maps
In addition, R code can be run inside the text to produce figures and tables.
R needs additional `packages`[^1] to perform some functions - these have to be loaded before they can be used. For this analysis we will use:
* `fingertipsR`
* `ggplot2`
* `dplyr`
* `readr`
* `govstyle`
The last of these, `govstyle`, is a ggplot2 theme which complies with gov.uk colours and layouts.
```{r libraries, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
##library(fingertipsR)
library(readxl)
library(readr)
if(!require('govstyle')) devtools::install_github(repo = "ivyleavedtoadflax/govstyle")
library(govstyle)
```
## Extracting data from Fingertips using the application programming interface (API)
To do this we will use the `fingertipsR` package, and extract data for teenage conceptions. There are 3 steps:
1. We need to identify an ID number in Fingertips for the teenage conceptions data using the `indicators` function
2. Identify area type codes - we'll use data for lower tier LAs, with regions as a 'parent' using the `area_types` function
3. Extract the data using the `fingertips_data` function
This returns all the relevant data in a 'tidy' data format. [@Wickham2014]
```{r}
library(stringr)
# which indicator ID is teenage pregnancy?
# ind <- indicators()
# ind <- ind[str_detect(ind$IndicatorName, "Rate of conceptions per"),] ## identify relevant indicator ID
#
# areas <- area_types("district") ## Identify area type code
#
# df <- fingertips_data(IndicatorID = 20401,
#                       AreaTypeID = 101,
#                       ParentAreaTypeID = 6) ## download the dataset
```
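Step 1 amounts to a text search over the indicator lookup table. The sketch below illustrates the idea with a small made-up table standing in for the result of `indicators()`. The column names follow the commented code above, but every row (including the indicator name attached to ID 20401) is invented for illustration, and base R `grepl()` is used in place of `str_detect()` so the example runs without extra packages.

```r
# Toy stand-in for the table returned by indicators(); all rows are
# illustrative, not real Fingertips metadata.
ind <- data.frame(
  IndicatorID   = c(20401, 11111, 22222),
  IndicatorName = c("Rate of conceptions per 1,000 females aged 15-17",
                    "Hypothetical indicator A",
                    "Hypothetical indicator B"),
  stringsAsFactors = FALSE
)

# Keep rows whose name contains the search phrase; fixed = TRUE treats the
# phrase as literal text rather than a regular expression.
hits <- ind[grepl("Rate of conceptions per", ind$IndicatorName, fixed = TRUE), ]

# The ID to pass on to the data download step.
teen_id <- hits$IndicatorID
print(hits)
```

If the search returns more than one row, the phrase needs narrowing before the ID is used in the download step.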
We can check that we have the correct indicator:
```{r message=FALSE, warning=FALSE}
df <- read_csv("~/Downloads/Teenage_pregnancy.zip")
df %>%
glimpse
```
We can then do some data exploration and filtering to understand the dataset and extract exactly what we need. We'll look at the `CategoryType` variable. This shows that there are 5 different assignments of LAs to deprivation deciles based on the level of disaggregation and the deprivation score.
```{r results = "asis"}
df %>%
  select(CategoryType, Category) %>%
  unique() %>%
  knitr::kable()
```
To plot trends in under 18 conception rates by deprivation decile we need to decide which deprivation classification to use. We can plot the different options. This shows that, for national data, the longest time series (1998 - 2014) is only available for IMD2010 scores; IMD2015 scores are only available for 2014 and 2015. To plot the time series we therefore need to use IMD2010 scores. The rates for categorisation based on county/UA or districts are similar. The sharp reduction in under 18 conception rates in the most deprived decile since 2007 is evident.
```{r}
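df %>%
  # no spaces around "|": a space inside the pattern would have to match literally
  filter(AreaName == "England" & stringr::str_detect(Category, "Most deprived decile|Least deprived decile")) %>%
  ggplot(aes(Timeperiod, Value)) +
  geom_point(aes(colour = Category)) +
  # group by both variables so each decile within each classification gets its own line
  geom_line(aes(lty = CategoryType, group = interaction(Category, CategoryType))) +
  labs(title = "Trend in under 18 conceptions in \nmost and least deprived deciles",
       y = "Under 18 conceptions per 1,000")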
```
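The coverage of each deprivation classification can also be checked numerically rather than read off the plot. The following is a minimal sketch using base R and invented toy data. The two `CategoryType` labels and the years are placeholders, not the real Fingertips values.

```r
# Toy data: two hypothetical classifications with different year coverage.
toy <- data.frame(
  CategoryType = c(rep("Deciles (IMD2010)", 4), rep("Deciles (IMD2015)", 2)),
  Timeperiod   = c(1998, 2005, 2010, 2014, 2014, 2015),
  stringsAsFactors = FALSE
)

# First year, last year and number of distinct years for each classification.
coverage <- do.call(rbind, lapply(split(toy, toy$CategoryType), function(d) {
  data.frame(CategoryType = d$CategoryType[1],
             first_year   = min(d$Timeperiod),
             last_year    = max(d$Timeperiod),
             n_years      = length(unique(d$Timeperiod)),
             stringsAsFactors = FALSE)
}))
rownames(coverage) <- NULL
print(coverage)
```

Applied to the real `df`, a table like this would show directly which classification offers the longest series.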
Next we can choose a single area and plot the trend - we'll use England as an example. We need to filter the data to choose an area, and in this case we'll use deprivation deciles.
## Plot the data using the `govstyle` theme
We can now plot the data with `ggplot2` and apply the `govstyle` theme.
```{r}
plot <- df %>%
  filter(AreaName == "England" & !is.na(Value) & CategoryType == "County & UA deprivation deciles in England (IMD2010)") %>%
  ggplot(aes(Timeperiod, Value, colour = Category)) +
  geom_line(aes(group = Category)) +
  theme_gov() +
  expand_limits(y = c(0, 70), x = c(1990, 2015)) +
  labs(y = "Teenage pregnancy rate",
       x = "Year",
       title = "Trends in teenage pregnancy rate by deprivation decile\n1998-2014")
plot +
  geom_text(data = df %>% filter(Timeperiod == "1998" & CategoryType == "County & UA deprivation deciles in England (IMD2010)"),
            size = 2,
            aes(label = Category,
                hjust = 1,
                vjust = 0,
                fontface = "bold"))
```
## Adding commentary
Commentary can easily be added, with the analysis or outputs coded into the text so that it stays consistent with the analysis and updates automatically.
> For example:
Under 18 conception rates have fallen substantially since 1998. In the most deprived tenth of areas the rate fell from `r round(df[df$AreaName == "England" & df$CategoryType == "District & UA deprivation deciles in England (IMD2010)" & df$Category == "Most deprived decile (IMD2010)" & df$Timeperiod == 2008,]$Value[2], 2)` conceptions per 1,000 in 2008 to `r round(df[df$AreaName == "England" & df$CategoryType == "District & UA deprivation deciles in England (IMD2010)" & df$Category == "Most deprived decile (IMD2010)" & df$Timeperiod == 2014,]$Value[2], 2)` in 2014.
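
Long inline expressions like those above are hard to read and check. One option is to compute the values in a chunk first and refer to them by name in the text. The sketch below uses invented toy data and two hypothetical helper functions, `decile_rate()` and `decile_gap()`. It assumes one row per decile and year and takes the first match, which may differ from how the real dataset is laid out.

```r
# Toy data standing in for the England rows of df; all values are invented.
toy <- data.frame(
  Category   = rep(c("Most deprived decile (IMD2010)",
                     "Least deprived decile (IMD2010)"), each = 2),
  Timeperiod = rep(c(2008, 2014), times = 2),
  Value      = c(60, 35, 20, 12),
  stringsAsFactors = FALSE
)

# Rate for one decile in one year (first matching row only).
decile_rate <- function(data, category, year) {
  rows <- data$Category == category & data$Timeperiod == year
  data$Value[rows][1]
}

# Gap between the most and least deprived deciles in a given year.
decile_gap <- function(data, year) {
  decile_rate(data, "Most deprived decile (IMD2010)", year) -
    decile_rate(data, "Least deprived decile (IMD2010)", year)
}

gap_2008 <- decile_gap(toy, 2008)
gap_2014 <- decile_gap(toy, 2014)
```

The commentary can then use short inline expressions such as `round(gap_2008, 1)` instead of repeating the full subsetting logic in every sentence.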
## Simple mapping of LA data
To enhance our report we can add maps.
```{r plot trends in LA rates as a series of maps, fig.width=8, cache=TRUE}
library(rgdal)
library(geojsonio)
library(ggmap)
library(ggfortify)
library(viridis)
shape_file <- geojson_read("http://geoportal.statistics.gov.uk/datasets/686603e943f948acaa13fb5d2b0f1275_3.geojson", what = "sp")
seng <- subset(shape_file, substr(lad16cd, 1, 1) == "E")
seng1 <- fortify(seng, region = "lad16cd")
df_la <- df %>%
  filter(AreaType == "District & UA" & Timeperiod != 2015)
s2 <- seng1 %>%
  left_join(df_la, by = c("id" = "AreaCode"))
g <- ggplot() +
  geom_polygon(data = s2,
               aes(long, lat,
                   group = group,
                   fill = Value)) +
  coord_map() +
  facet_wrap(~Timeperiod, nrow = 3)
g +
  theme_gov() +
  theme(axis.text = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(panel.grid = element_blank()) +
  labs(x = "", y = "", title = "Trend in under 18 conception rates by local authority") +
  theme(legend.position = "bottom") +
  viridis::scale_fill_viridis(direction = -1)
```
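The map depends on area codes in the boundary file matching the area codes in the indicator data, so it can be worth checking for mismatches before joining. The sketch below uses invented toy tables: `polygons` stands in for the fortified boundary data and `rates` for the filtered indicator data. All codes and values are made up.

```r
# Toy stand-ins: polygon vertex rows (as produced by fortify) and indicator data.
polygons <- data.frame(
  id   = c("E001", "E001", "E002", "E002", "E003", "E003"),
  long = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
  lat  = c(52.1, 52.2, 52.3, 52.4, 52.5, 52.6),
  stringsAsFactors = FALSE
)
rates <- data.frame(
  AreaCode = c("E001", "E002", "E999"),
  Value    = c(25.0, 31.5, 18.2),
  stringsAsFactors = FALSE
)

# Areas drawn on the map that have no data (their fill value would be NA).
no_data <- setdiff(unique(polygons$id), rates$AreaCode)

# Data rows that match no polygon (a left join from the polygons drops these).
no_polygon <- setdiff(rates$AreaCode, unique(polygons$id))

print(no_data)
print(no_polygon)
```

Non-empty results usually point to a boundary file from a different year than the data, or to codes that changed after local authority reorganisation.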
## Automation
Let us say we want to create the same plots for every area. This can be achieved with a **for** loop.
```{r}
# ## Single area
#
# df %>%
# filter(AreaName == "Cambridge" & !is.na(Value) & AreaType == "District & UA") %>%
# ggplot(aes(Timeperiod, Value)) +
# geom_line() +
# theme_gov() +
# expand_limits(y = c(0, 70), x = c(1996, 2015)) +
# labs(y = "Teenage pregnancy rate",
# x = "Year",
# title = paste0("Trends in teenage pregnancy rate\n1998-2015: ", "Cambridge")) +
# geom_text(data = df %>% filter( (Timeperiod == "1998"|Timeperiod == "2015") & AreaName == "Cambridge" ),
# size = 3,
# aes(
# label = round(Value,2),
# hjust = 0.5,
# vjust = 0,
# fontface = "bold"))
```
```{r}
## Example areas
areas <- c("Cambridge", "East Cambridgeshire", "Fenland", "Blackburn with Darwen")
for (area in areas) {
  print(df %>%
          filter(AreaName == area & !is.na(Value) & AreaType == "District & UA") %>%
          ggplot(aes(Timeperiod, Value)) +
          geom_line() +
          theme_gov() +
          expand_limits(y = c(0, 70), x = c(1997, 2015)) +
          labs(y = "Teenage pregnancy rate",
               x = "Year",
               title = paste0("Trends in teenage pregnancy rate\n1998-2015: ", area)) +
          geom_text(data = df %>% filter((Timeperiod == "1998" | Timeperiod == "2015") & AreaName == area),
                    size = 3,
                    aes(label = round(Value, 2),
                        hjust = 0.5,
                        vjust = 0,
                        fontface = "bold")) +
          geom_smooth(lwd = 0.5, lty = "dotted")
  )
}
```
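An alternative to writing the loop body inline is to wrap the per-area work in a function and apply it to each area, which returns the results as objects that can be reused later in the report. The sketch below shows the pattern with invented toy data and a hypothetical helper, `summarise_area()`, that builds a title and a start-to-end change for each area. The same structure would work with a function that returns a plot.

```r
# Toy data for two hypothetical areas; all values are invented.
toy <- data.frame(
  AreaName   = rep(c("Area A", "Area B"), each = 3),
  Timeperiod = rep(c(1998, 2006, 2014), times = 2),
  Value      = c(40, 35, 20, 55, 50, 30),
  stringsAsFactors = FALSE
)

# Summarise one area: chart title plus first and last values and their change.
summarise_area <- function(data, area) {
  d <- data[data$AreaName == area & !is.na(data$Value), ]
  d <- d[order(d$Timeperiod), ]
  data.frame(
    AreaName = area,
    title    = paste0("Trends in teenage pregnancy rate: ", area),
    first    = d$Value[1],
    last     = d$Value[nrow(d)],
    change   = d$Value[nrow(d)] - d$Value[1],
    stringsAsFactors = FALSE
  )
}

# Apply the same function to every area and stack the results into one table.
areas <- unique(toy$AreaName)
summary_table <- do.call(rbind, lapply(areas, summarise_area, data = toy))
print(summary_table)
```

Keeping the logic in one function means a change to the plot or summary only has to be made in one place, however many areas are reported.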
# References
[^1]: A package is a set of functions for a specific purpose.