vignette render with markdown rather than rmarkdown (#5773)
* vignette render with markdown rather than rmarkdown

* tune TOC
jangorecki authored Dec 2, 2023
1 parent 8d7e40d commit 860b22e
Showing 12 changed files with 61 additions and 40 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -3,7 +3,7 @@ Version: 1.14.9
Title: Extension of `data.frame`
Depends: R (>= 3.1.0)
Imports: methods
-Suggests: bit64 (>= 4.0.0), bit (>= 4.0.4), curl, R.utils, xts, zoo (>= 1.8-1), yaml, knitr, rmarkdown, markdown
+Suggests: bit64 (>= 4.0.0), bit (>= 4.0.4), curl, R.utils, xts, zoo (>= 1.8-1), yaml, knitr, markdown
Description: Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, friendly and fast character-separated-value read/write. Offers a natural and flexible syntax, for faster development.
License: MPL-2.0 | file LICENSE
URL: https://r-datatable.com, https://Rdatatable.gitlab.io/data.table, https://github.com/Rdatatable/data.table
6 changes: 6 additions & 0 deletions vignettes/css/toc.css
@@ -0,0 +1,6 @@
+#TOC {
+border: 1px solid #ccc;
+border-radius: 5px;
+padding-left: 1em;
+background: #f6f6f6;
+}
17 changes: 13 additions & 4 deletions vignettes/datatable-benchmarking.Rmd
@@ -2,15 +2,24 @@
 title: "Benchmarking data.table"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette:
-    toc: true
-    number_sections: true
+  markdown::html_format:
+    options:
+      toc: true
+      number_sections: true
+    meta:
+      css: [default, css/toc.css]
 vignette: >
   %\VignetteIndexEntry{Benchmarking data.table}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---

+<style>
+h2 {
+font-size: 20px;
+}
+</style>
+
This document is meant to guide on measuring performance of `data.table`. Single place to document best practices and traps to avoid.

# fread: clear caches
37 changes: 20 additions & 17 deletions vignettes/datatable-faq.Rmd
@@ -2,12 +2,15 @@
 title: "Frequently Asked Questions about data.table"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette:
-    toc: true
-    number_sections: true
+  markdown::html_format:
+    options:
+      toc: true
+      number_sections: true
+    meta:
+      css: [default, css/toc.css]
 vignette: >
   %\VignetteIndexEntry{Frequently Asked Questions about data.table}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---

@@ -94,13 +97,13 @@ As [highlighted above](#j-num), `j` in `[.data.table` is fundamentally different

Furthermore, data.table _inherits_ from `data.frame`. It _is_ a `data.frame`, too. A data.table can be passed to any package that only accepts `data.frame` and that package can use `[.data.frame` syntax on the data.table. See [this answer](https://stackoverflow.com/a/10529888/403310) for how that is achieved.

-We _have_ proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0 :
+We _have_ proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:

> `unique()` and `match()` are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matt Dowle for suggesting improvements to the way the hash code is generated in unique.c.
A second proposal was to use `memcpy` in duplicate.c, which is much faster than a for loop in C. This would improve the _way_ that R copies data internally (on some measures by 13 times). The thread on r-devel is [here](https://stat.ethz.ch/pipermail/r-devel/2010-April/057249.html).

-A third more significant proposal that was accepted is that R now uses data.table's radix sort code as from R 3.3.0 :
+A third more significant proposal that was accepted is that R now uses data.table's radix sort code as from R 3.3.0:

> The radix sort algorithm and implementation from data.table (forder) replaces the previous radix (counting) sort and adds a new method for order(). Contributed by Matt Dowle and Arun Srinivasan, the new algorithm supports logical, integer (even with large values), real, and character vectors. It outperforms all other methods, but there are some caveats (see ?sort).
@@ -236,7 +239,7 @@ Then you are using a version prior to 1.5.3. Prior to 1.5.3 `[.data.table` detec

## What are the scoping rules for `j` expressions?

-Think of the subset as an environment where all the column names are variables. When a variable `foo` is used in the `j` of a query such as `X[Y, sum(foo)]`, `foo` is looked for in the following order :
+Think of the subset as an environment where all the column names are variables. When a variable `foo` is used in the `j` of a query such as `X[Y, sum(foo)]`, `foo` is looked for in the following order:

1. The scope of `X`'s subset; _i.e._, `X`'s column names.
2. The scope of each row of `Y`; _i.e._, `Y`'s column names (_join inherited scope_)
@@ -295,18 +298,18 @@ The `Z[Y]` part is not a single name so that is evaluated within the frame of `X

## Can you explain further why data.table is inspired by `A[B]` syntax in `base`?

-Consider `A[B]` syntax using an example matrix `A` :
+Consider `A[B]` syntax using an example matrix `A`:
```{r}
A = matrix(1:12, nrow = 4)
A
```

-To obtain cells `(1, 2) = 5` and `(3, 3) = 11` many users (we believe) may try this first :
+To obtain cells `(1, 2) = 5` and `(3, 3) = 11` many users (we believe) may try this first:
```{r}
A[c(1, 3), c(2, 3)]
```

-However, this returns the union of those rows and columns. To reference the cells, a 2-column matrix is required. `?Extract` says :
+However, this returns the union of those rows and columns. To reference the cells, a 2-column matrix is required. `?Extract` says:

> When indexing arrays by `[` a single argument `i` can be a matrix with as many columns as there are dimensions of `x`; the result is then a vector with elements corresponding to the sets of indices in each row of `i`.
@@ -354,7 +357,7 @@ Furthermore, matrices, especially sparse matrices, are often stored in a 3-colum
data.table _inherits_ from `data.frame`. It _is_ a `data.frame`, too. A data.table _can_ be passed to any package that _only_ accepts `data.frame`. When that package uses `[.data.frame` syntax on the data.table, it works. It works because `[.data.table` looks to see where it was called from. If it was called from such a package, `[.data.table` diverts to `[.data.frame`.

## I've heard that data.table syntax is analogous to SQL.
-Yes :
+Yes:

- `i` $\Leftrightarrow$ where
- `j` $\Leftrightarrow$ select
@@ -367,7 +370,7 @@ Yes :
- `mult = "first"|"last"` $\Leftrightarrow$ N/A because SQL is inherently unordered
- `roll = TRUE` $\Leftrightarrow$ N/A because SQL is inherently unordered

-The general form is :
+The general form is:

```{r, eval = FALSE}
DT[where, select|update, group by][order by][...] ... [...]
@@ -447,7 +450,7 @@ Many thanks to the R core team for fixing the issue in Sep 2019. data.table v1.1

This comes up quite a lot but it's really earth-shatteringly simple. A function such as `merge` is _generic_ if it consists of a call to `UseMethod`. When you see people talking about whether or not functions are _generic_ functions they are merely typing the function without `()` afterwards, looking at the program code inside it and if they see a call to `UseMethod` then it is _generic_. What does `UseMethod` do? It literally slaps the function name together with the class of the first argument, separated by period (`.`) and then calls that function, passing along the same arguments. It's that simple. For example, `merge(X, Y)` contains a `UseMethod` call which means it then _dispatches_ (i.e. calls) `paste("merge", class(X), sep = ".")`. Functions with dots in their name may or may not be methods. The dot is irrelevant really, other than dot being the separator that `UseMethod` uses. Knowing this background should now highlight why, for example, it is obvious to R folk that `as.data.table.data.frame` is the `data.frame` method for the `as.data.table` generic function. Further, it may help to elucidate that, yes, you are correct, it is not obvious from its name alone that `ls.fit` is not the fit method of the `ls` generic function. You only know that by typing `ls` (not `ls()`) and observing it isn't a single call to `UseMethod`.

-You might now ask: where is this documented in R? Answer: it's quite clear, but, you need to first know to look in `?UseMethod` and _that_ help file contains :
+You might now ask: where is this documented in R? Answer: it's quite clear, but, you need to first know to look in `?UseMethod` and _that_ help file contains:

> When a function calling `UseMethod('fun')` is applied to an object with class attribute `c('first', 'second')`, the system searches for a function called `fun.first` and, if it finds it, applies it to the object. If no such function is found a function called `fun.second` is tried. If no class name produces a suitable function, the function `fun.default` is used, if it exists, or an error results.
@@ -481,7 +484,7 @@ copied in bulk (`memcpy` in C) rather than looping in C.
## What are primary and secondary indexes in data.table?

Manual: [`?setkey`](https://www.rdocumentation.org/packages/data.table/functions/setkey)
-S.O. : [What is the purpose of setting a key in data.table?](https://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table/20057411#20057411)
+S.O.: [What is the purpose of setting a key in data.table?](https://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table/20057411#20057411)

`setkey(DT, col1, col2)` orders the rows by column `col1` then within each group of `col1` it orders by `col2`. This is a _primary index_. The row order is changed _by reference_ in RAM. Subsequent joins and groups on those key columns then take advantage of the sort order for efficiency. (Imagine how difficult looking for a phone number in a printed telephone directory would be if it wasn't sorted by surname then forename. That's literally all `setkey` does. It sorts the rows by the columns you specify.) The index doesn't use any RAM. It simply changes the row order in RAM and marks the key columns. Analogous to a _clustered index_ in SQL.

@@ -521,7 +524,7 @@ DT[ , { mySD = copy(.SD)

Please upgrade to v1.8.1 or later. From this version, if `.N` is returned by `j` it is renamed to `N` to avoid any ambiguity in any subsequent grouping between the `.N` special variable and a column called `".N"`.

-The old behaviour can be reproduced by forcing `.N` to be called `.N`, like this :
+The old behaviour can be reproduced by forcing `.N` to be called `.N`, like this:
```{r}
DT = data.table(a = c(1,1,2,2,2), b = c(1,2,2,2,1))
DT
@@ -533,7 +536,7 @@ cat(try(

If you are already running v1.8.1 or later then the error message is now more helpful than the "cannot change value of locked binding" error, as you can see above, since this vignette was produced using v1.8.1 or later.

-The more natural syntax now works :
+The more natural syntax now works:
```{r}
if (packageVersion("data.table") >= "1.8.1") {
DT[ , .N, by = list(a, b)][ , unique(N), by = a]
Expand All @@ -555,7 +558,7 @@ Hopefully, this is self explanatory. The full message is:
Coerced numeric RHS to integer to match the column's type; may have truncated precision. Either change the column to numeric first by creating a new numeric vector length 5 (nrows of entire table) yourself and assigning that (i.e. 'replace' column), or coerce RHS to integer yourself (e.g. 1L or as.integer) to make your intent clear (and for speed). Or, set the column type correctly up front when you create the table and stick to it, please.


-To generate it, try :
+To generate it, try:

```{r}
DT = data.table(a = 1:5, b = 1:5)
4 changes: 2 additions & 2 deletions vignettes/datatable-importing.Rmd
@@ -2,10 +2,10 @@
 title: "Importing data.table"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Importing data.table}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---

4 changes: 2 additions & 2 deletions vignettes/datatable-intro.Rmd
@@ -2,10 +2,10 @@
 title: "Introduction to data.table"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Introduction to data.table}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---

4 changes: 2 additions & 2 deletions vignettes/datatable-keys-fast-subset.Rmd
@@ -2,10 +2,10 @@
 title: "Keys and fast binary search based subset"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Keys and fast binary search based subset}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---

4 changes: 2 additions & 2 deletions vignettes/datatable-programming.Rmd
@@ -2,10 +2,10 @@
 title: "Programming on data.table"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Programming on data.table}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---

4 changes: 2 additions & 2 deletions vignettes/datatable-reference-semantics.Rmd
@@ -2,10 +2,10 @@
 title: "Reference semantics"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Reference semantics}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---

4 changes: 2 additions & 2 deletions vignettes/datatable-reshape.Rmd
@@ -2,10 +2,10 @@
 title: "Efficient reshaping using data.tables"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Efficient reshaping using data.tables}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---

11 changes: 7 additions & 4 deletions vignettes/datatable-sd-usage.Rmd
@@ -2,12 +2,15 @@
 title: "Using .SD for Data Analysis"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette:
-    toc: true
-    number_sections: true
+  markdown::html_format:
+    options:
+      toc: true
+      number_sections: true
+    meta:
+      css: [default, css/toc.css]
 vignette: >
   %\VignetteIndexEntry{Using .SD for Data Analysis}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---

4 changes: 2 additions & 2 deletions vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
@@ -2,10 +2,10 @@
 title: "Secondary indices and auto indexing"
 date: "`r Sys.Date()`"
 output:
-  rmarkdown::html_vignette
+  markdown::html_format
 vignette: >
   %\VignetteIndexEntry{Secondary indices and auto indexing}
-  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEngine{knitr::knitr}
   \usepackage[utf8]{inputenc}
 ---
