# Working With Custom Delimited Format Files
```{r message=FALSE, include=FALSE, warning=FALSE}
library(sergeant)
library(tidyverse)
```
## Problem
You have files in a custom delimited format that you want to work with from Drill and R.
## Solution
Add new entries to the Drill `dfs` storage plugin's `workspaces` and `formats` sections and use custom queries to get sane data types.
## Discussion
A helpful chap---[Ben Marwick](https://github.com/benmarwick)---provided an example of a truly evil set of delimited files from the [Wikimedia Commons](https://github.com/hrbrmstr/drill-sergeant-rstats/issues/1#issuecomment-376713894) that have pagecount data for all Wikimedia properties. The files are _huge_ so we'll be working with an extract. The link to the sample extract file will be posted soon, but it was created with:
```
$ bzcat pagecounts-2012-12-01.bz2 | head -1000024 > pagecounts-2012-02-01-excerpt.csvw
$ bzip2 --compress pagecounts-2012-02-01-excerpt.csvw
```
You'll find out the reason for the `csvw` in a bit.
The files all look like this:
```
# Wikimedia page request counts for 01/12/2012 (dd/mm/yyyy)
#
# Each line shows 'project page daily-total hourly-counts'
#
# Project is 'language-code project-code'
#
# Project-code is
#
# b:wikibooks,
# k:wiktionary,
# n:wikinews,
# q:wikiquote,
# s:wikisource,
# v:wikiversity,
# z:wikipedia (z added by merge script: wikipedia happens to be sorted last in dammit.lt files, but without suffix)
#
# Counts format: only hours with page view count > 0 (or data missing) are represented,
#
# Hour 0..23 shown as A..X (saves up to 22 bytes per line compared to comma separated values), followed by view count.
# If data are missing for some hour (file missing or corrupt) a question mark (?) is shown,
# and a adjusted daily total is extrapolated as follows: for each missing hour the total is incremented with hourly average
#
# Page titles are shown unmodified (preserves sort sequence)
#
Ar.mw Ar 5 M3R2
De.mw De 3 T3
En.mw En 3 E1F2
aa.b ?banner=B12_5C_113020_hover_nohover 11 A11
aa.b File:Broom_icon.svg 2 D1X1
aa.b File:Commons-logo.svg 1 X1
```
Just with another 65 _million_ lines (or so) per file.
They come [bzipped](http://bzip.org/) and Drill can definitely handle files with the `bz2` extension, but we are going to rename these files to help ensure we recognize the delimited encoding type when we browse the file system and when constructing queries.
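If you have a directory full of the original files, that renaming can be done right from R. A quick housekeeping sketch (the directory and filename pattern are illustrative):

```{r rename-files, eval=FALSE}
# Hypothetical housekeeping: give the original `pagecounts-*.bz2` files a
# `.csvw.bz2` extension so Drill will pick the `wikicsv` format for them.
f <- list.files("~/Data/wikimedia", pattern = "^pagecounts-.*[0-9]\\.bz2$", full.names = TRUE)
file.rename(f, sub("\\.bz2$", ".csvw.bz2", f))
```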
The comments in the data files are helpful since they tell us the format:
- comment character is `#`
- space `' '` is the delimiter
- `project` is really `language-code` and `project-code` separated by a `.` (which we'll deal with here)
- there's a wonky bit of encoding in the `hourly-counts` (sketched just below; the proper fix comes in another recipe)
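To make that last point concrete, here's a minimal base R sketch of how the `A..X` hourly encoding described in the file header could be unpacked (the proper Drill-side treatment, a UDF, comes in the later recipe):

```{r decode-hours}
# Decode one `hourly-counts` string: each hour 0..23 is encoded as a letter
# A..X followed by its view count; a "?" marks a missing hour and is skipped here.
decode_hours <- function(x) {
  parts <- regmatches(x, gregexpr("[A-X][0-9]+", x))[[1]]
  hours <- match(substr(parts, 1, 1), LETTERS[1:24]) - 1  # A..X -> 0..23
  counts <- as.integer(substring(parts, 2))
  setNames(counts, hours)
}

decode_hours("M3R2")  # hour 12 had 3 views, hour 17 had 2
```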
We're going to store all these pagecount files (we only have one due to size) in a `~/Data/wikimedia` directory so we can apply an input storage format directory-wide (this isn't 100% necessary but an organized filesystem is a happy filesystem) and create a special storage format for these files.
You'll need to go to <http://localhost:8047/storage> and "update" the `dfs` storage plugin so we can add these new entries. While the editor has syntax checking, it is recommended that you either use a browser that lets you change the size of the input text area box _or_ use an external text editor to do this work.
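If you'd rather not fiddle with the tiny text area at all, the current `dfs` plugin configuration can also be pulled down as JSON over Drill's REST interface, edited locally, and POSTed back to the same endpoint. A minimal sketch of the fetch side (assuming a default single-node install listening on port 8047):

```{r fetch-dfs-config, eval=FALSE}
library(httr)

# Grab the current `dfs` storage plugin configuration as JSON so it can be
# edited in a real text editor; POST the edited JSON back to update the plugin.
res <- GET("http://localhost:8047/storage/dfs.json")
cat(content(res, as = "text", encoding = "UTF-8"))
```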
Find the line that has:
"workspaces": {
and add a new workspace, which is (for this filesystem example) just a `dfs` shortcut name to a location on the filesystem:
"wikimedia": {
"location": "/Users/bob/Data/wikimedia",
"writable": true,
"defaultInputFormat": "wikicsv",
"allowAccessOutsideWorkspace": false
},
This is JSON, so the trailing `,` is necessary unless it's the last entry in `workspaces`. You'll (again) need to change the path to wherever you've downloaded the file. We've taken a chance and made this workspace writable, and we've told Drill that the default input format will be "`wikicsv`", which we'll define next.
Now, find the line that starts with:
"formats": {
and add an entry for our new "`wikicsv`" format.
"wikicsv": {
"type": "text",
"extensions": [
"csvw"
],
"delimiter": " "
},
The default comment character for delimited files is "`#`" so there's no need to specify that.
Save the configuration update and head over to <http://localhost:8047/query> (or use `drill-conf` on the command-line) to test out the new setup.
The example pagecount file is named `pagecounts-2012-02-01-excerpt.csvw.bz2`, which lets Drill know it's a `wikicsv` delimited file and also that it's bzip compressed.
Let's try a query:
```
0: jdbc:drill:> SELECT * FROM dfs.wikimedia.`/*.csvw.bz2` LIMIT 10;
+--------------------------------------------------------------+
|                          columns                              |
+--------------------------------------------------------------+
| ["Ar.mw","Ar","5","M3R2"]                                     |
| ["De.mw","De","3","T3"]                                       |
| ["En.mw","En","3","E1F2"]                                     |
| ["aa.b","?banner=B12_5C_113020_hover_nohover","11","A11"]     |
| ["aa.b","File:Broom_icon.svg","2","D1X1"]                     |
| ["aa.b","File:Commons-logo.svg","1","X1"]                     |
| ["aa.b","File:Incubator-notext.svg","5","C1G1V1X2"]           |
| ["aa.b","File:Wikibooks-logo.svg","1","X1"]                   |
| ["aa.b","File:Wikimania.svg","1","X1"]                        |
| ["aa.b","File:Wikimedia-logo.svg","1","X1"]                   |
+--------------------------------------------------------------+
10 rows selected (0.268 seconds)
```
While it's not _exactly_ what we want, the good news is that the headers were skipped and the file decompressed perfectly. However, we'll need to use `columns[#]` to help get to the data in each column. Let's try that again:
```
0: jdbc:drill:> SELECT columns[0], columns[1], columns[2], columns[3] FROM dfs.wikimedia.`/*.csvw.bz2` LIMIT 10;
+---------+--------------------------------------+---------+-----------+
| EXPR$0  | EXPR$1                               | EXPR$2  | EXPR$3    |
+---------+--------------------------------------+---------+-----------+
| Ar.mw   | Ar                                   | 5       | M3R2      |
| De.mw   | De                                   | 3       | T3        |
| En.mw   | En                                   | 3       | E1F2      |
| aa.b    | ?banner=B12_5C_113020_hover_nohover  | 11      | A11       |
| aa.b    | File:Broom_icon.svg                  | 2       | D1X1      |
| aa.b    | File:Commons-logo.svg                | 1       | X1        |
| aa.b    | File:Incubator-notext.svg            | 5       | C1G1V1X2  |
| aa.b    | File:Wikibooks-logo.svg              | 1       | X1        |
| aa.b    | File:Wikimania.svg                   | 1       | X1        |
| aa.b    | File:Wikimedia-logo.svg              | 1       | X1        |
+---------+--------------------------------------+---------+-----------+
10 rows selected (0.18 seconds)
```
Better! However, those computed column names will have to go, and we also need to do something about the first column since it's really two columns mashed together. Renaming is straightforward (the `sqlline` prefixes will be dropped from now on to make it easier to copy/paste the SQL queries; just assume they're being run in the Drill web query box or at a `sqlline` prompt):
```
SELECT
  columns[0] AS project,
  columns[1] AS page,
  CAST(columns[2] AS DOUBLE) AS daily_total,
  columns[3] AS hr_ct
FROM dfs.wikimedia.`/*.csvw.bz2` LIMIT 10

+----------+--------------------------------------+--------------+-----------+
| project  | page                                 | daily_total  | hr_ct     |
+----------+--------------------------------------+--------------+-----------+
| Ar.mw    | Ar                                   | 5.0          | M3R2      |
| De.mw    | De                                   | 3.0          | T3        |
| En.mw    | En                                   | 3.0          | E1F2      |
| aa.b     | ?banner=B12_5C_113020_hover_nohover  | 11.0         | A11       |
| aa.b     | File:Broom_icon.svg                  | 2.0          | D1X1      |
| aa.b     | File:Commons-logo.svg                | 1.0          | X1        |
| aa.b     | File:Incubator-notext.svg            | 5.0          | C1G1V1X2  |
| aa.b     | File:Wikibooks-logo.svg              | 1.0          | X1        |
| aa.b     | File:Wikimania.svg                   | 1.0          | X1        |
| aa.b     | File:Wikimedia-logo.svg              | 1.0          | X1        |
+----------+--------------------------------------+--------------+-----------+
10 rows selected (1.039 seconds)
```
Along with having better column names, we also took the opportunity to ensure `daily_total` is a numeric type rather than a character. The `hr_ct` column conversion will be covered in another recipe (we'll need to make a Drill User Defined Function [UDF] for it), but we can handle separating `project` out here:
```
SELECT
  SUBSTR(columns[0], 1, STRPOS(columns[0], '.')-1) AS language_code,
  SUBSTR(columns[0], STRPOS(columns[0], '.')+1) AS project_code,
  columns[1] AS page,
  CAST(columns[2] AS DOUBLE) AS daily_total,
  columns[3] AS hr_ct
FROM dfs.wikimedia.`/*.csvw.bz2` LIMIT 10

+----------------+---------------+--------------------------------------+--------------+-----------+
| language_code  | project_code  | page                                 | daily_total  | hr_ct     |
+----------------+---------------+--------------------------------------+--------------+-----------+
| Ar             | mw            | Ar                                   | 5.0          | M3R2      |
| De             | mw            | De                                   | 3.0          | T3        |
| En             | mw            | En                                   | 3.0          | E1F2      |
| aa             | b             | ?banner=B12_5C_113020_hover_nohover  | 11.0         | A11       |
| aa             | b             | File:Broom_icon.svg                  | 2.0          | D1X1      |
| aa             | b             | File:Commons-logo.svg                | 1.0          | X1        |
| aa             | b             | File:Incubator-notext.svg            | 5.0          | C1G1V1X2  |
| aa             | b             | File:Wikibooks-logo.svg              | 1.0          | X1        |
| aa             | b             | File:Wikimania.svg                   | 1.0          | X1        |
| aa             | b             | File:Wikimedia-logo.svg              | 1.0          | X1        |
+----------------+---------------+--------------------------------------+--------------+-----------+
10 rows selected (0.188 seconds)
```
Now, we can try that from R with `sergeant`:
```{r 08-sql-01, message=FALSE, warning=FALSE, cache=TRUE}
library(sergeant)
library(tidyverse)
db <- src_drill("localhost")
# remember we need to wrap this custom query in `()`
tbl(db, "(
SELECT
SUBSTR(columns[0], 1, STRPOS(columns[0], '.')-1) AS language_code,
SUBSTR(columns[0], STRPOS(columns[0], '.')+1) AS project_code,
columns[1] AS page,
CAST(columns[2] AS DOUBLE) AS daily_total,
columns[3] AS hr_ct
FROM dfs.wikimedia.`/*.csvw.bz2`
)") -> pagecount
pagecount
count(pagecount, language_code, sort=TRUE) %>%
collect() %>%
print(n=23)
```
As noted, converting the `hr_ct` column is complex enough that it needs its own recipe. Also take note of Recipe `TBD`, which shows how to convert these delimited files to parquet, a format that can really speed up processing and reduce memory requirements.
NOTE: These Wikimedia Commons pagecount files would work best converted to parquet files and stored in a small Drill cluster, but they can be handled on a single-node system with the right configuration.
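As a small taster of that recipe, Drill's `CREATE TABLE AS` (CTAS) can write the cleaned-up result set straight to parquet in any writable workspace (parquet is Drill's default CTAS output format). A minimal sketch using `sergeant`'s REST helper; the output table name here is purely illustrative:

```{r ctas-parquet, eval=FALSE}
library(sergeant)

dc <- drill_connection("localhost")

# Write the cleaned-up pagecount data out as parquet into the writable
# `wikimedia` workspace (the table name is illustrative).
drill_query(dc, "
CREATE TABLE dfs.wikimedia.`pagecounts_parquet` AS
SELECT
  columns[0] AS project,
  columns[1] AS page,
  CAST(columns[2] AS DOUBLE) AS daily_total,
  columns[3] AS hr_ct
FROM dfs.wikimedia.`/*.csvw.bz2`
")
```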
## See Also
- [Configure Drill Introduction](https://drill.apache.org/docs/configure-drill-introduction/)
- [Text Files: CSV, TSV, PSV](https://drill.apache.org/docs/text-files-csv-tsv-psv/)