RNotebooks.Rmd

---
title: "R Notebook"
author: "Derek Slone-Zhen"
date: "Wednesday, 12th April, 2017"
output:
  github_document: default
  html_document:
    df_print: kable
  html_notebook: default
  ioslides_presentation: default
  slidy_presentation: default
  word_document:
    reference_docx: Template.docx
always_allow_html: yes
---

# Setup

## Favourite Libraries

We'll load up some of my standard  R packages for later use.

```{r}
library (pacman)
p_load (magrittr)
p_load (ggplot2)
p_load (data.table)

```

## Language Engines for knitr

```{r}
is.windows <- .Platform$OS.type == "windows"
has.postgres <- !is.windows

if (is.windows) {
  knitr::opts_chunk$set(engine.path = list(
    bash = 'C:/Users/Derek Slone-Zhen/.babun/cygwin/bin/bash.exe',
    perl = "C:/Strawberry/perl/bin/perl.exe"
  ))
}
```

## And a windows cmd processor

```{r echo=FALSE}
source('win_cmd.R')
```


# Welcome to an RNotebooks

RNotebooks allow the use of multiple, interwoven languages.

We'll demonstrate the getting, ingestion, and analysis of a Fuel data set.

## Fetch 'n' Sniff

Fetch : I can do this in `R`, but the command prompt is my home.  Less friction for me here.

```{bash}
wget -c 'https://data.nsw.gov.au/data/dataset/a97a46fc-2bdd-4b90-ac7f-0cb1e8d7ac3b/resource/5ad2ad7d-ccb9-4bc3-819b-131852925ede/download/Service-Station-and-Price-History-March-2017.xlsx'

```

I'll take a quick look at the file, sometimes it's really a CSV file with an Excel extension.

```{bash}
hexdump -C Service-Station-and-Price-History-March-2017.xlsx | head -n20
```

OK, looks like a real Excel file.  The `PK` at the beginning is the give-away of a zipped file, which is what Excels
newer file formats are.  (Zipped XML files + some othe assets.)

## `readxl`

No external dependancies with this library, and installes with C / C++ native libraries for reading both
old and new Excel file formats.  Thanks [Hadley](http://hadley.nz/)!

```{r}
p_load(readxl)
DATA <- read_excel("Service-Station-and-Price-History-March-2017.xlsx")
p_load(data.table)
DATA <- data.table(DATA)

```

and take a peek:

```{r}
DATA[1:(if (interactive()) 1000 else 10),]

```

## Sniffing Deeply

Not the most friendly.  Lets try some extra packages:

```{r eval=interactive()}
# Only in the RNotebook
p_load(DT)
datatable(DATA[Suburb %in% c('Chatswood', 'Lane Cove', 'Artarmon', 'Lane Cove West')], filter="top")

```

## Summarising Data

```{r}
summary(DATA)

```

That's a lot of charaters that we're not getting summaries on.  Lets convert all characters to factors, and the postcodes too.

```{r}
for (j in which(sapply(DATA,is.character))) {
  set(DATA, j=j, value=factor(DATA[[j]], ordered = FALSE))
}

# Ask me why...
DATA <- DATA[,Postcode := factor(as.character(Postcode), ordered = FALSE)]
```

and try again:

```{r}
summary(DATA, maxsum = 8)

```

Lets focus in on our top four fuels.

```{r}
DATA[,.N,by=FuelCode][order(-N)] %>%
  head(n=4) ->
  top4

DATA4 <- DATA[FuelCode %in% top4$FuelCode]

```

## Visualising Data

```{r, fig.height=6, fig.width=10}
p_load(ggplot2)
ggplot(data=DATA4) +
  scale_y_continuous(limits=c(75,200)) +
  geom_violin(aes(y=Price, x=Brand)) +
  facet_grid(FuelCode ~ ., scales='free_y') +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

```

```{r, fig.height=6, fig.width=12}
g <- ggplot(data=DATA4[FuelCode == "U91"]) +
  geom_point(aes(y=Price, x=PriceUpdatedDate, colour=Brand), alpha=0.6, position='jitter') +
  scale_y_continuous(limits = c(75,175))
g

```

But what are those _really_ cheap petrol prices...

Let's get a more interactive visualisation.

```{r, fig.height=10, fig.width=18}
p_load(plotly)
g <- ggplot(data=DATA4[FuelCode == "U91"]) +
  geom_violin(aes(y=Price, x=Brand), colour="red", fill='red', alpha=0.25) +
  geom_boxplot(aes(y=Price, x=Brand), fill='transparent') +
  scale_y_continuous(limits = c(75,175)) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))
```


```{r, fig.height=10, fig.width=18}
print(g)
```

```{r, fig.height=10, fig.width=18, eval=interactive()}
# Only in the RNotebook
ggplotly(g)
```

# Copying Data To SQL Server

## Save as CSV (or better!)

```{r}
write.csv(DATA, 'Service-Station-and-Price-History-March-2017.csv', row.names = FALSE)

```

A couple of quick file tests - do I have a nice CSV I can upload?

Short of writing significant chunks of code, `BCP` is the only way to upload data quickly into
SQL Server, and it's _very_ picky over its file formats;
* doesn't tollerate quotes very well
* can tollearate 'embeded' field separators (i.e. the quotes don't help)
* can't tollerate embedded row separators (i.e. a new line within a quoted string)

```{bash}
< Service-Station-and-Price-History-March-2017.csv \
  tr -d -c ',\n' | 
  awk -e '1 {print length($0)}' | 
  sort | 
  uniq -c |
  sort -r -n
```

```{r}
ncol(DATA)
```

```{bash}
awk -F, -e 'NF != 9 {print}' Service-Station-and-Price-History-March-2017.csv | head

```

Blah!  Commas in the addresses (and quotes that BCP won't like either).

Re-export using [ASCII Delimiters](https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/)
0x1F (Unit Separator) and 0x1E (Record Separator), and supress the quotes.

```{r}
write.table(
  DATA, 
  'Service-Station-and-Price-History-March-2017.1F1E', 
  row.names = FALSE,
  quote = FALSE,
  sep = "\x1F",
  eol = "\x1E")

```

And re-test:


```{bash}
< Service-Station-and-Price-History-March-2017.1F1E \
  tr -d -c $'\x1E\x1F' | 
  tr $'\x1E' '\n' |
  awk -e '1 {print length($0)}' | 
  sort | 
  uniq -c

```
  
## Upload the 1E1F

```{r echo=FALSE}
# For non-windows, we won't to the sql parts
db <- NULL
```

And load up the odbc driver and connection to local Microsoft SQL Server Database.

```{r eval=is.windows}
p_load(DBI)
p_load(odbc)
# drv <- dbDriver("ODBC")
con_template <- 'driver={SQL Server Native Client 11.0};Server=%s;Database=%s;Trusted_Connection=yes'
# db <- dbConnect(drv, connection = sprintf(con_template, server=".", database= "test")) 
drv <- odbc::odbc()
db <- DBI::dbConnect(drv,.connection_string = sprintf(con_template, server=".", database= "test"))

```

```{r eval=has.postgres}
p_load(DBI)
p_load(RPostgreSQL) 
drv <- PostgreSQL()
db <- dbConnect(drv, dbname="test")
```

Check that the DB is good

```{r, eval=has.postgres, echo=has.postgres}
isPostgresqlIdCurrent(db)
```

Drop the table if it already exists

```{sql connection=db}
DROP TABLE IF EXISTS "Service-Station-and-Price-History-March-2017";
-- And return a result set to keep the RNotebook happy
SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'Service-Station-and-Price-History-March-2017';

```

Use R to sketch out the body of an SQL `CREATE TABLE`.

```{r}
sprintf("%-20s %s not null,",
        colnames(DATA), 
        DATA %>%
          lapply(class) %>%
          sapply(head,1) %>%
          sapply(switch, 
               character = 'varchar(255)',
               POSIXct = 'datetime2(0)',
               numeric = 'numeric(4,1)')
  ) %>%
  paste0(collapse="\n") %>%
  cat

```


```{sql connection=db, eval=is.windows}
CREATE TABLE "Service-Station-and-Price-History-March-2017"
(
	ServiceStationName 		varchar(255) not null,
	Address 		          varchar(255) not null,
	Suburb 		            varchar(255) not null,
	Postcode 		          char(4) not null,
	Brand 		            varchar(255) not null,
	FuelCode 		          char(3) not null,
	PriceUpdatedDate 		  datetime2(0) not null,
	Price 		            numeric(4,1) not null
);
-- And return a result set to keep the RNotebook happy
SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'Service-Station-and-Price-History-March-2017';

```

```{sql connection=db, eval=has.postgres}
CREATE TABLE "Service-Station-and-Price-History-March-2017"
(
	ServiceStationName 		varchar(255) not null,
	Address 		          varchar(255) not null,
	Suburb 		            varchar(255) not null,
	Postcode 		          char(4) not null,
	Brand 		            varchar(255) not null,
	FuelCode 		          char(3) not null,
	PriceUpdatedDate 		  timestamp not null,
	Price 		            numeric(4,1) not null
);
-- And return a result set to keep the RNotebook happy
SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'Service-Station-and-Price-History-March-2017';

```

### Upload for SQL Server

I can never remember the syntax for `bcp` fully, so lets get a copy here for reference.

```{bash, error=TRUE, eval=is.windows}
bcp

```

Now I can craft the `bcp` for upload.


```{bash, eval=is.windows}
bcp \
  "dbo.[Service-Station-and-Price-History-March-2017]" \
  in \
  'Service-Station-and-Price-History-March-2017.1F1E' \
  -T -S . -d test \
  -c -t $'\x1F' -r $'\x1E' -C UTF-8 \
  -F 2 \
  -h TABLOCK -b 100000 \
  -e errors

```


Print out (the start of) any errors

```{bash, eval=is.windows}
head errors

```

### Upload for Postgres

```{bash eval=has.postgres}
< 'Service-Station-and-Price-History-March-2017.1F1E' \
  bbe -b ':/\x1E/' -e 'D 1;s/\\/\\\\/;s/\r/\\r/;s/\n/\\n/;s/\x1E/\n/' |
  psql test -c "COPY \"Service-Station-and-Price-History-March-2017\"
      FROM STDIN WITH DELIMITER AS E'\x1F'"
    
```

# Querying from Database

Now we can query from the database

```{r}
fuel <- 'U91'
```

```{sql connection=db, output.var='DBDATA'}
SELECT ServiceStationName, Suburb, Brand, PriceUpdatedDate, Price
FROM "Service-Station-and-Price-History-March-2017"
WHERE FuelCode = ?fuel
ORDER BY Price ASC
```

```{r}
DBDATA <- if (is.windows || has.postgres) data.table(DBDATA) else DATA
DBDATA[1:10]
```

# Save data and read it back in many Languages

```{r}
p_load(feather)
write_feather(DBDATA,"Service-Station-and-Price-History-March-2017.feather")
Sys.setenv(file_in="Service-Station-and-Price-History-March-2017")

```

```{python, error=is.windows}
import os
import pandas
import feather

file_in = os.environ["file_in"] + ".feather"
df = feather.read_dataframe(file_in)
print(df.head(10))

```


```{python}
import os
import pandas as pd

file_in = os.environ["file_in"] + ".csv"
df = pd.read_csv(file_in)
print(df.head(10))

```

```{perl}
use Parse::CSV;
use Data::Dumper;
 
my $objects = Parse::CSV->new(
    file => $ENV{file_in} . '.csv',
    names      => 1,
);

my $max_rows = 3;
while ( my $row = $objects->fetch ) {
  print Dumper($row);
  if (--$max_rows <= 0) { last; }
}
```

```{ruby}
require 'csv'
require 'pp'
file_in = ENV["file_in"] + ".csv"
customers = CSV.read(file_in)
pp(customers[1..3])
```


# Sillyness digression - what else can we do here?

## LaTeX fragments!

$$
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
$$

Which, of course, also means we can use set algebra notation:

$$
Query = \{ \forall p \in [\text{Service-Station-and-Price-History-March-2017}] | p_{FuelCode} = \text{U91} \}

$$


# Tidy up after ourselves

```{r TakeDown}
if (!interactive()) {
  invisible({
    dbDisconnect(db)
    dbUnloadDriver(drv)
  })
}
```

# Sneaky Stuff

## I've a local bash script

The RNotebook mechanisms use a different strategy for executing code blocks (at lease bash one): 
namely that they write the text of the block to a temp file and then invoke the file along as:

`bash` _`file_name`_

Whereas the `knitr` engine invokes bash as `bash -c ` _`code_block`_.


```{r code=readLines('bash.bat'), eval=FALSE}
```

# Bulid Info & Version Control 

## sessionInfo

```{r}
sessionInfo()
```

## Version Control

This code ensure that when we `knit` the document, all changes get committed to
`git` and the SHA1 checksum of that commit is embedded in the document for 
reproducability.


```{bash}
git add -A .
git commit -m "Knitting..."
git rev-parse HEAD
```