rladies_paris_ggplot.Rmd

---
title: "RLadies ggplot2 - final"
author: "Sarah Hosking"
date: "March 8, 2017"
output:
  html_document: 
    toc: yes
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(lubridate)
library(tidyverse)
library(GGally)

library(reshape2) #plot missing vals

theme_set(theme_bw())

```

# Why ggplot?

Big advantages:

* consistent structure regardless of plot type
* easy to iterate
* *excellent grouping and faceting*
* requires tidy data
* plots look pretty good OOTB (albeit with room for improvement)


Some complaints:

* sometimes can be verbose
* slow?
* not as good for print publication - ?


# Paris Air Pollution

Because we all need a break from car, iris and most of all, election, data.

More seriously, it was because last December I found myself repeatedly wondering:

* So just how bad is Paris' air pollution, really?
and
* Is there a bad time to go running?

## Load Data

```{r}

airparif <- read_rds('airparif.rds')
dim(airparif)
```

```{r}
str(airparif)
```

```{r}
head(airparif)
```

```{r}
tail(airparif)
```

```{r}
summary(airparif)
```


## What are we looking at?


```{r}

names(airparif)

```

What the abbreviations mean:

* `PM25` = fine particulate < 2.5 mm
* `PM10` = fine particulate < 10 mm
* `03` = Ozone
* `NO2` = Nitrogen Dioxide (Azote)
* `CO` = Carbon monoxide


# Explore distributions

### DEMO: Histogram

```{r}

ggplot(data = airparif, aes(x = PM10)) +
  geom_histogram(binwidth = 5)

```


You can also use expressions for your data
```{r}
ggplot(data = airparif, aes(x = PM10 + PM25)) +
  geom_histogram(binwidth = 5)

```

### HANDS ON: Examine the other pollutants

Explore `PM25`, `NO2`, `O3`, `CO`. Anything stand out?

```{r explore distributions}

# basic ggplot template for univariate plots

# ggplot(df, aes(x)) +
#  geom_XXX()

p <- ggplot(airparif, aes(NO2)) + 
  geom_histogram(binwidth = 5)
p


```

```{r add vert line}

# "low" levels are > 25. Mark with a vertical line
p +
  geom_vline(xintercept = 25, colour = 'lightgreen')
```

### DEMO: How does the distribution change over the year?

To find out let's facet by month.

```{r}

p +
  geom_vline(xintercept = 25, colour = 'lightgreen') + 
  facet_wrap(~month)

```


### HANDS ON: What do the boxplots look like? 

```{r}
p <- ggplot(airparif, aes(month, NO2))

p +
  geom_boxplot()

```

#### Including scales

Say you want to look more closely at the boxes. 
While `scale_y_continuous` allows you to set limits on the y-axis, take a look at what it does.

Note the boxes and where the outliers start for June, July and August in the original boxplot above. July's outliers start lower than June's, and June's start lower than August's. 

Compare that with the following example. Also...

* compare the proportions of upper and lower halves of September's box.
* compare the overall box size of different months.

```{r avoid this in statistical plots}
 
p +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 60))

```

Notice how the proportions and positions of the boxes have changed. That's because we've removed all the points above 60.

Instead, use `coord_cartesian()`. This creates a true zoom that does not change the underlying data being used to calculate the stats.

```{r do this instead}

p + 
  geom_boxplot() + 
  coord_cartesian(ylim = c(0,60))

```


# Adding more variables to plots

To segment your plot into different groups, you can specify a variable for color, shape, linetype etc.

### DEMO: Add a grouping variable

What else happens when you include a "grouping" variable in the aesthetic layer?

```{r easy iteration + grouping}

# create base data layer with a single-variable aesthetic
p <- ggplot(data = airparif, aes(x = PM25))

# make a frequency polygon
p + geom_freqpoly()

# change the colour used by the freqpoly
p + geom_freqpoly(colour = 'blue')

# add a new variable to the aesthetic layer, 
# by assigning it to the color property
p + geom_freqpoly(aes(colour = year))


```

### HANDS ON: show frequency polygons for each year in the same plot.

Hint: use aesthetics
The aesthetics available for geom_freqpoly are:

* alpha

* colour

* group

* linetype

* size

```{r hands on}

# group by year
# what happens on the chart
p + geom_freqpoly(aes(linetype = year))

```

With this type of plot, facets would be better.

### HANDS ON: Facet the previous plot by month.

This will create a grid of plots, one for each month.

```{r facet previous plot by month}

p +
  geom_freqpoly(aes(linetype = month)) +
  facet_wrap(~ month)
```

Admittedly, using `linetype` to denote 12 different values is not the most effective visualization method! 

But the point of this demo was to show first,

* that you can display *multiple groups in a single plot* by mapping a *new variable to the `aes()` layer*.

and second, 

* that you can display *multiple plots from the same code*, by adding a *`facet_grip(~faceting_var)` layer*.

# Look at correlations

### DEMO: Plot a correlation matrix

```{r ggpairs, eval=FALSE, warning=FALSE}

# first, create a sample since ggpairs takes time.

# for reproducibility, set a seed to have the same rows
# in sample every time
set.seed(888)

# limit the number of columns
sample <- subset(airparif, select = c(PM10, PM25, NO2, O3, CO, month))
# limit the number of rows
sample <- sample[sample(1:nrow(sample),1000),]

# create simple correlation matrix plot
ggpairs(data=sample,
        columns=1:3)

# include all variables (incl. categorical)
# add a title
ggpairs(data=sample,
        #columns=1:3
        title="Pollution Correlation Matrix")


# continuous plots: add smoothing & modify symbol
# combo plots: adjust # of bins
ggpairs(data=sample,
        #columns=1:3
       lower = list(continuous = wrap("smooth", alpha=1/5, shape = I('.')),
                     combo = wrap("facethist", binwidth = 5)), 
      title="Pollution Correlation Matrix")
```

# Summarize and plot

How polluted is Paris? How many days were there of high-level pollution?

To do this, we'll need to calculate the daily mean, median and max for each pollutant. 

*This involves creating a tidy data frame using dplyr*


<explain tidy data>


```{r}

# Option 1: load the summarized data
#airparif.long <- read_rds("airparif_long.rds")


# Option 2: summarize data yourself

airparif.long <- airparif %>% 
   # replace these column names with 
   gather(PM10,PM25,NO2,O3,CO, 
         key = 'pollutant', # a single 'pollutant' column
         value = 'value', # and store their values in a single 'values' column
         na.rm = TRUE) %>% # remove NA values
   group_by(date, pollutant) %>% # group by date and pollutant type, then
   summarise('mean' = mean(value), # add summary columnss
             'median' = median(value), 
             'max' = max(value))

head(airparif.long)

```

Now plot this new df

```{r}

ggplot(aes(date, median, colour = pollutant), 
       data = airparif.long) +
         geom_line()

```

Wow, carbon monoxide (`CO`) looms over everything else.

For now, let's only examine fine particulate

```{r}

# only plot PM pollutants
particles <- airparif.long %>% 
  filter(pollutant %in% c('PM10', 'PM25')) 


p <- ggplot(particles, aes(date, median))

p + 
  geom_line(aes(y = median, colour = pollutant)) + 
  scale_x_date(date_breaks = '3 months', date_labels = "%b %d")
  
```

Need to add some more context to this plot. What are the levels for medium, high, and very high pollution?


## How many days at each pollution level?

** 

```{r}

# Option 1: load the summarized data
#levels <- read_rds("levels.rds")


# Option 2: summarize data yourself

# First, load function to tag pollution levels


# NOTE: in reality, these cut-offs ONLY APPLY TO PM10
# TO DO: add condition to check for pollutant type
pollution.level <- function(x) {
 
 labs <- c("very low", "low", "medium", "high", "very high")

 cut(x, 
     breaks = c(0,15,30,50,100,999), 
     right = TRUE,
     #include.lowest = TRUE, #new setting!
     labels = labs)
}

# Then complete this code
levels <-  airparif.long %>%
  group_by(date, pollutant) %>% 
  #mutate_all(funs('level' = pollution.level(.))) %>% 
  mutate_all(funs(pollution.level))

```


### DEMO: Plot days per level per pollutant

How many days at each level for each pollutant?

```{r}

# create a bar plot
p <- ggplot(levels, aes(median, fill = pollutant))
p + geom_bar()

```

Make this easier to read

```{r}
p + 
  geom_bar(position = 'dodge')

```

Still hard to compare the distribution of levels for each pollutant. 

### HANDS ON: How would you improve this?

We can facet by pollutant.

```{r}
p + 
  geom_bar(position = 'dodge') +
  facet_wrap(~pollutant)

```

### DEMO: Does this change much from year to year? 

We'll need to add a year column.
We can do this again with dplyr and include ggplot using the pipe

```{r add year var to levels df}

levels %>% 
  # separate(date, c("year", "month", "day"), sep="-") %>% 
  mutate('year' = year(date)) %>% 
  ggplot(aes(median, fill = pollutant)) + 
  geom_bar(position = 'dodge') +
  facet_grid(pollutant ~ year)

```

It doesn't look like it. But if we allow the y-scale to vary, it might give us more insight.

```{r free y-axis scale}

levels %>% 
  # separate(date, c("year", "month", "day"), sep="-") %>% 
  mutate('year' = year(date)) %>% 
  ggplot(aes(median, fill = pollutant)) + 
  geom_bar(position = 'dodge') +
  facet_grid(pollutant ~ year, scales = 'free_y')


```


### HANDS ON: What about by month?

You'll need to add month, similar to adding year above. Dplyr has a function to make this easy.

BONUS: Look up dplyr's `separate()` function in the help.

```{r }

levels %>% 
  separate(date, c("year", "month", "day"), sep="-") %>% 
  ggplot(aes(median, fill = pollutant)) + 
  geom_bar(position = 'dodge') +
  facet_grid(pollutant ~ month)

```

Hard to tell, let's free up the y scale for each pollutant

### HANDS ON: Make the y-axis scale flexible

HINT: Look up `facet_grid` in the help, and look for `scales`.

```{r}

levels %>% 
  separate(date, c("year", "month", "day"), sep="-") %>% 
  ggplot(aes(median, fill = pollutant)) + 
  geom_bar(position = 'dodge') +
  facet_grid(pollutant ~ month, scales = "free_y")

```

### DEMO: Make plot more readable

Lastly, let's fix those x-axis labels

```{r fix axis labels}

p <- levels %>% 
  separate(date, c("year", "month", "day"), sep="-") %>% 
  ggplot(aes(median, fill = pollutant)) + 
  geom_bar(position = 'dodge') +
  facet_grid(pollutant ~ month, scales = "free_y")

p +
  scale_y_continuous(breaks = seq(0,150, 50)) +
  scale_x_discrete(name = "Pollution Level") +
  theme(axis.text.x = element_text(size = 4, angle = -45, hjust = -0.1))
  

```


# BONUS STUFF (If we have time)

## qplot

`qplot` is part of `ggplot2`, and stands for "quick plots". Its syntax is similar to base plotting, and is less verbose than `ggplot()`.

```{r qplot hists}

qplot(PM10, data = airparif)

# increase bins
qplot(data = airparif, PM10, binwidth = 5)

```


It's meant, however, to work with vectors. Here's a demo.

```{r}

# qplot meant to work with vectors
a <- c('A', 'B', 'C')
b <- c(1, 2, 3)

# qplot
q <- qplot(a,b)
q

```


```{r}

# ggplot needs a dataframe
d <- data.frame(a,b)

# ggplot
p <- ggplot(data = d, aes(a,b)) +
  geom_point()
p


```

But compare these plots if you change a variable.

```{r}

# change var
a <- c('X','Y','Z')

# with qplot
q
```

```{r}

# with ggplot
p
```

## Visualize missing values with ggplot

```{r}
# credit to:
# http://www.njtierney.com/r/missing%20data/rbloggers/2015/12/01/ggplot-missing-data/
missing.data <- function(x){
  
  x %>% 
    is.na %>%
    melt %>%
    ggplot(data = .,
           aes(x = Var2,
               y = Var1)) +
    geom_raster(aes(fill = value)) +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme_minimal() + 
    theme(axis.text.x  = element_text(angle=45, vjust=0.5)) + 
    labs(x = "Variables in Dataset",
         y = "Rows / observations")
}

#missing.data(airparif[,1:7])
```

# Takeaways

Compare ggplot2 to R base plots 
https://flowingdata.com/2016/03/22/comparing-ggplot2-and-r-base-graphics/
http://simplystatistics.org/2016/02/11/why-i-dont-use-ggplot2/
http://varianceexplained.org/r/why-I-use-ggplot2/

qplot or ggplot?
http://stackoverflow.com/questions/5322836/choosing-between-qplot-and-ggplot-in-ggplot2

grouping vs aesthetics
http://stackoverflow.com/questions/10357768/plotting-lines-and-the-group-aesthetic-in-ggplot2

Ggplot cheatsheet: https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf