Jason Liu and Clayton Halim 9/5/2017
I'm going to try to keep the buzzword density low here.
Data science, oversimplified, can be thought of as two classes of work: algorithms and analytics. Analytics may not involve algorithm design or complex modeling, but those who implement algorithms on real data often find themselves analyzing data as well. They need to understand the biases their models have and confirm that the data is appropriate for the model.
From my own perspective...
Analytics: Studying the business's data and making recommendations, understanding experiments to improve operations and product. This type of data science is about transforming businesses using insights.
Algorithms: Using machine learning and statistics to build tools. Here, the service or model is the product.
Even within the realm of algorithms and machine learning, to build the best models we need to understand what the models require of the data and to validate the models once they are fit. For example, take Anscombe's Quartet.
It turns out that the means, variances, correlation, and R squared of a linear model are nearly identical across these four datasets, even though the datasets look completely different when plotted. This is a common pitfall that occurs when we try modeling without... practising safe statistics.
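The claim about Anscombe's Quartet can be checked directly, because the `anscombe` data frame ships with base R. The following is a minimal sketch; the name `quartet_stats` is our own and not part of the original material.

```r
# anscombe is a built-in data frame with columns x1..x4 and y1..y4
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)  # simple linear model for dataset i
  c(mean_x = mean(x),
    mean_y = mean(y),
    correlation = cor(x, y),
    r_squared = summary(fit)$r.squared)
})
colnames(quartet_stats) <- paste0("set", 1:4)

# Each column is one dataset; compare the rows across columns
print(round(quartet_stats, 2))
```

The summary numbers barely differ between the four columns, which is exactly why plotting the data before modeling matters.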
Happy families are all alike; every unhappy family is unhappy in its own way.
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset.
The five most common problems with messy datasets are:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
The text above came from http://www.jstatsoft.org/v59/i10/paper, the original tidy data paper.
The value of tidy data will become more and more apparent when doing transformations and visualizations on more complex datasets.
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
Notice that in this table the column names are years, which are values rather than variable names, and it is unclear which variable the cell values represent. We can use the tidyr::gather command to reshape and tidy the data.
table4b %>%
gather(year, population, 2:3)
## # A tibble: 6 x 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
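In tidyr 1.0.0 and later, gather() is considered superseded by pivot_longer(). As a hedged sketch, assuming a recent tidyr is installed, the same reshaping could be written as follows. The result has the same six rows, although they come out ordered by country rather than by year. The name `tidy_pop` is ours.

```r
library(tidyr)

# table4b ships with tidyr; its `1999` and `2000` columns hold population counts
tidy_pop <- pivot_longer(
  table4b,
  cols = c(`1999`, `2000`),   # the columns whose names are really values
  names_to = "year",          # old column names go here
  values_to = "population"    # old cell values go here
)
print(tidy_pop)
```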
The opposite of gather is spread, which is often helpful for human readability.
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
Here we just need to pass in the column whose values will become the new column names (type) and the column that holds the values (count).
table2 %>%
spread(type, count)
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
This would be useful if we want to compute a new value such as rate = cases / population.
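As a small sketch of that computation, the spread result can be piped straight into dplyr's mutate(). The variable name `rates` is our own choice.

```r
library(tidyr)
library(dplyr)

# Spread table2 so that cases and population become columns,
# then derive the rate from the two new columns
rates <- table2 %>%
  spread(type, count) %>%
  mutate(rate = cases / population)
print(rates)
```

With the long layout of table2 this division would require matching up pairs of rows first, which is why the wide form is more convenient here.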
There are also functions called unite and separate that I encourage you to learn on your own. Try ?separate in the R console. These will split one column into many columns and vice versa.
These functions are useful when one column contains many variables. For example, when a column has values such as male_treatmenta, we might want to separate this value on _ and transform it into a gender column and a treatment_type column.
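A minimal sketch of that split, using a made-up data frame: the names `trial`, `id`, and `group` and the values are hypothetical and only serve as illustration.

```r
library(tidyr)

# Hypothetical data: one column that mixes gender and treatment
trial <- data.frame(
  id = 1:3,
  group = c("male_treatmenta", "female_treatmentb", "male_treatmentb"),
  stringsAsFactors = FALSE
)

# Split `group` on the underscore into two new columns
split_trial <- separate(trial, group,
                        into = c("gender", "treatment_type"),
                        sep = "_")
print(split_trial)

# unite() reverses the operation and rebuilds the combined column
rejoined <- unite(split_trial, group, gender, treatment_type, sep = "_")
print(rejoined)
```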
Data transformation is a process useful for converting one data format to another. You will later see things like filtering data based on the required conditions, creating new columns, etc.
Source: dplyr tutorial
dplyr is a powerful R package for transforming and summarizing tabular data with rows and columns.
The package contains a set of functions (or “verbs”) that perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data.
In addition, dplyr contains a useful function to perform another common task which is the “split-apply-combine” concept. We will discuss that in a little bit.
- select(): select columns
- filter(): filter rows
- arrange(): re-order or arrange rows
- mutate(): create new columns
- summarise(): summarise values
- group_by(): allows for group operations in the “split-apply-combine” concept
Quick look at the first 6 entries in the mpg data:
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p
## 4 audi a4 2.0 2008 4 auto(av) f 21 30 p
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p
## # ... with 1 more variables: class <chr>
You can select certain columns by specifying the dataset and then the columns you want to include:
car_models <- select(mpg, manufacturer, model, year)
head(car_models)
## # A tibble: 6 x 3
## manufacturer model year
## <chr> <chr> <int>
## 1 audi a4 1999
## 2 audi a4 1999
## 3 audi a4 2008
## 4 audi a4 2008
## 5 audi a4 1999
## 6 audi a4 1999
You can select all but a certain column by using the - (subtraction) operator, also known as negative indexing.
head(select(car_models, -year))
## # A tibble: 6 x 2
## manufacturer model
## <chr> <chr>
## 1 audi a4
## 2 audi a4
## 3 audi a4
## 4 audi a4
## 5 audi a4
## 6 audi a4
You can also select a range of columns using :
head(select(mpg, manufacturer:year))
## # A tibble: 6 x 4
## manufacturer model displ year
## <chr> <chr> <dbl> <int>
## 1 audi a4 1.8 1999
## 2 audi a4 1.8 1999
## 3 audi a4 2.0 2008
## 4 audi a4 2.0 2008
## 5 audi a4 2.8 1999
## 6 audi a4 2.8 1999
Some additional options to select columns based on specific criteria include:
- ends_with() = Select columns that end with a character string
- contains() = Select columns that contain a character string
- matches() = Select columns that match a regular expression
- one_of() = Select column names that are from a group of names
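The helpers listed above can be tried on mpg. This is a small sketch; the search strings and the variable names are arbitrary choices made for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Columns whose names end in "y"
cols_end <- names(select(mpg, ends_with("y")))
print(cols_end)      # cty, hwy

# Columns whose names contain the letter "l"
cols_contain <- names(select(mpg, contains("l")))
print(cols_contain)  # model, displ, cyl, fl, class

# Columns whose names match a regular expression (here: start with "c")
cols_match <- names(select(mpg, matches("^c")))
print(cols_match)    # cyl, cty, class

# Columns taken from a character vector of names
cols_one_of <- names(select(mpg, one_of(c("model", "year"))))
print(cols_one_of)   # model, year
```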
filter() works by passing in the dataset followed by one or more conditions on its columns; only the rows that satisfy every condition are kept.
For example, all cars made by Audi in 1999:
filter(mpg, year == 1999, manufacturer == "audi")
## # A tibble: 9 x 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.8 1999 6 auto(l5) f 16 26
## 4 audi a4 2.8 1999 6 manual(m5) f 18 26
## 5 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 6 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 7 audi a4 quattro 2.8 1999 6 auto(l5) 4 15 25
## 8 audi a4 quattro 2.8 1999 6 manual(m5) 4 17 25
## 9 audi a6 quattro 2.8 1999 6 auto(l5) 4 15 24
## # ... with 2 more variables: fl <chr>, class <chr>
Before we go any further, let’s introduce the pipe operator: %>%. dplyr imports this operator from another package (magrittr). This operator allows you to pipe the output from one function to the input of another function. Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right.
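To make the equivalence concrete, here is a small sketch on a plain numeric vector (the values are arbitrary). Writing x %>% f(y) is just another way of writing f(x, y).

```r
library(dplyr)  # re-exports %>% from magrittr

values <- c(1, 4, 9)

# Nested form: read from the inside out
nested <- round(sqrt(sum(values)), 1)

# Piped form: read from left to right
piped <- values %>%
  sum() %>%
  sqrt() %>%
  round(1)

# Both forms produce the same number
print(identical(nested, piped))
```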
If I wanted to see the car models that were made before 2005 and get more than 20 city miles per gallon, and see only the first 3 entries, I could do it like this:
head(select(filter(mpg, year < 2005, cty > 20), model), 3)
## # A tibble: 3 x 1
## model
## <chr>
## 1 a4
## 2 civic
## 3 civic
As you can see, this is really hard to read, but with pipes we can get the same result in a cleaner fashion.
mpg %>%
  filter(year < 2005, cty > 20) %>%
  select(model) %>%
  head(3)
## # A tibble: 3 x 1
## model
## <chr>
## 1 a4
## 2 civic
## 3 civic
You can sort your rows in ascending order by any combination of columns using arrange.
mpg %>%
  arrange(displ) %>%
  head()
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 honda civic 1.6 1999 4 manual(m5) f 28 33 r
## 2 honda civic 1.6 1999 4 auto(l4) f 24 32 r
## 3 honda civic 1.6 1999 4 manual(m5) f 25 32 r
## 4 honda civic 1.6 1999 4 manual(m5) f 23 29 p
## 5 honda civic 1.6 1999 4 auto(l4) f 24 32 r
## 6 audi a4 1.8 1999 4 auto(l5) f 18 29 p
## # ... with 1 more variables: class <chr>
Use desc() to get descending order.
mpg %>%
  arrange(desc(displ), year) %>%
  head()
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int>
## 1 chevrolet corvette 7.0 2008 8 manual(m6) r 15
## 2 chevrolet k1500 tahoe 4wd 6.5 1999 8 auto(l4) 4 14
## 3 chevrolet corvette 6.2 2008 8 manual(m6) r 16
## 4 chevrolet corvette 6.2 2008 8 auto(s6) r 15
## 5 jeep grand cherokee 4wd 6.1 2008 8 auto(l5) 4 11
## 6 chevrolet c1500 suburban 2wd 6.0 2008 8 auto(l4) r 12
## # ... with 3 more variables: hwy <int>, fl <chr>, class <chr>
The mutate() function will add new columns to the data frame. We can create a new column, apg, that is the average of the city and highway miles per gallon.
mpg %>%
  mutate(apg = (cty + hwy) / 2) %>%
  select(manufacturer, model, year, cty, hwy, apg) %>%
  head()
## # A tibble: 6 x 6
## manufacturer model year cty hwy apg
## <chr> <chr> <int> <int> <int> <dbl>
## 1 audi a4 1999 18 29 23.5
## 2 audi a4 1999 21 29 25.0
## 3 audi a4 2008 20 31 25.5
## 4 audi a4 2008 21 30 25.5
## 5 audi a4 1999 16 26 21.0
## 6 audi a4 1999 18 26 22.0
You can add more than one column by separating the variables by comma in the function parameters.
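For instance, a sketch that adds two columns in a single mutate() call. The second column name, `hwy_gain`, is a hypothetical name chosen for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

mpg_extra <- mpg %>%
  mutate(apg = (cty + hwy) / 2,   # average of city and highway mpg
         hwy_gain = hwy - cty) %>% # how much better highway mpg is than city mpg
  select(model, cty, hwy, apg, hwy_gain)

head(mpg_extra, 3)
```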
The summarise() function will create summary statistics for a given column in the data frame, such as finding the mean. For example, to compute the average city miles per gallon and average highway miles per gallon, we apply the mean() function to these columns.
mpg %>%
summarise(cty_avg = mean(cty), hwy_avg = mean(hwy))
## # A tibble: 1 x 2
## cty_avg hwy_avg
## <dbl> <dbl>
## 1 16.85897 23.44017
Some other statistics you may want to apply are:
- sd(): standard deviation of column
- min(): min value in column
- max(): max value in column
- median(): median value in column
- sum(): sum of all values in column
- n(): number of entries (rows); takes no arguments
- n_distinct(): number of unique entries in column
- first(): returns first value in column
- last(): returns last value in column
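Several of these can be combined in one summarise() call. A minimal sketch on the city mileage column, with output column names of our own choosing:

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

cty_summary <- mpg %>%
  summarise(cty_sd = sd(cty),
            cty_min = min(cty),
            cty_max = max(cty),
            cty_median = median(cty),
            n_cars = n(),                  # number of rows
            n_models = n_distinct(model))  # number of unique model names
print(cty_summary)
```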
The group_by() verb is an important function in dplyr. As we mentioned before, it’s related to the concept of “split-apply-combine”. We literally want to split the data frame by some variable (e.g. manufacturer), apply a function to the individual data frames and then combine the output.
Let’s do that: split the mpg data frame by manufacturer, then ask for summary statistics like those above. We expect one set of summary statistics for each manufacturer.
mpg %>%
group_by(manufacturer) %>%
summarise(avg_year = mean(year), avg_cyl = mean(cyl),
avg_cty = mean(cty), avg_hwy = mean(hwy))
## # A tibble: 15 x 5
## manufacturer avg_year avg_cyl avg_cty avg_hwy
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 audi 2003.500 5.222222 17.61111 26.44444
## 2 chevrolet 2004.684 7.263158 15.00000 21.89474
## 3 dodge 2004.108 7.081081 13.13514 17.94595
## 4 ford 2002.600 7.200000 14.00000 19.36000
## 5 honda 2003.000 4.000000 24.44444 32.55556
## 6 hyundai 2004.143 4.857143 18.64286 26.85714
## 7 jeep 2005.750 7.250000 13.50000 17.62500
## 8 land rover 2003.500 8.000000 11.50000 16.50000
## 9 lincoln 2002.000 8.000000 11.33333 17.00000
## 10 mercury 2003.500 7.000000 13.25000 18.00000
## 11 nissan 2003.846 5.538462 18.07692 24.61538
## 12 pontiac 2002.600 6.400000 17.00000 26.40000
## 13 subaru 2004.143 4.000000 19.28571 25.57143
## 14 toyota 2002.706 5.117647 18.52941 24.91176
## 15 volkswagen 2002.667 4.592593 20.92593 29.22222
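Grouped summaries combine naturally with the other verbs. As a sketch, the following counts the cars per manufacturer with n() and sorts the manufacturers by average city mileage; the name `efficiency` is ours.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

efficiency <- mpg %>%
  group_by(manufacturer) %>%             # split
  summarise(n_cars = n(),                # apply: rows per manufacturer
            avg_cty = mean(cty)) %>%     # apply: mean city mpg
  arrange(desc(avg_cty))                 # order the combined result

# Honda comes first, matching the avg_cty column in the table above
head(efficiency, 3)
```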
R has several libraries for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.
If you’d like to learn more about the theoretical underpinnings of ggplot2 before you start, I’d recommend reading “The Layered Grammar of Graphics”, http://vita.had.co.nz/papers/layered-grammar.pdf.
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
Among the variables in mpg are:
- displ, a car’s engine size, in litres.
- hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Here we see a negative correlation between engine size and fuel efficiency.
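The visual impression can be backed up with a number. A one-line sketch using base R's cor(), which computes the Pearson correlation by default:

```r
library(ggplot2)  # provides the mpg dataset

# Pearson correlation between engine size and highway mileage
displ_hwy_cor <- cor(mpg$displ, mpg$hwy)
print(displ_hwy_cor)
```

A clearly negative value agrees with the downward trend in the scatterplot, though a single coefficient says nothing about the shape of the relationship.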
Looking at our dataset again:
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p
## 4 audi a4 2.0 2008 4 auto(av) f 21 30 p
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p
## # ... with 1 more variables: class <chr>
We see many other variables that we may want to encode in our visualization.
For example, the class of the car:
ggplot(data = mpg,
       mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()
The colors reveal that many of the unusual points are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage.
ggplot(data = mpg,
       mapping = aes(x = displ, y = hwy, shape = class)) +
  geom_point()
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
Check out the docs to see what other aesthetics we can encode using mapping = aes(...).
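As one example of other aesthetics, here is a sketch that maps two more variables to point size and transparency. The choice of cyl and cty is arbitrary, and `p_aes` is our own name.

```r
library(ggplot2)

# size and alpha are two more aesthetics that geom_point() understands
p_aes <- ggplot(data = mpg,
                mapping = aes(x = displ, y = hwy, size = cyl, alpha = cty)) +
  geom_point()
print(p_aes)
```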
One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data. Note that this only works if you have tidy data.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
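To facet on the combination of two variables, ggplot2 also offers facet_grid(). A short sketch, where choosing drv for the rows and cyl for the columns is just one possible layout:

```r
library(ggplot2)

# Rows of panels are split by drive type, columns by number of cylinders
p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
print(p_grid)
```

Some panels may come out empty, because not every combination of the two variables occurs in the data.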
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
We can do the same for plots of categorical variables.
mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy)) %>%
  ggplot(aes(x = class, y = mean_hwy)) +
  geom_col()
ggplot2 is an extremely expressive and powerful plotting library. If you want a deeper dive into this library, you can look into this R Tutorial from Harvard. You can also look at Top 50 ggplot2 visualizations for some inspiration for new plot ideas!