Skip to content

Commit

Permalink
Add an illustrative example to ?GForce when sorting locale matters (#…
Browse files Browse the repository at this point in the history
…5342)

* improve documentation for GForce where sorting affects the result

* link issue

* tests

* typo

* mention Sys.setlocale

* obsolete comment

* 1.15.0 on CRAN. Bump to 1.15.99

* Fix transform slowness (#5493)

* Fix 5492 by limiting the costly deparse to `nlines=1`

* Implementing PR feedbacks

* Added  inside

* Fix typo in name

* Idiomatic use of  inside

* Separating the deparse line limit to a different PR

---------

Co-authored-by: Michael Chirico <chiricom@google.com>

* Improvements to the introductory vignette (#5836)

* Added my improvements to the intro vignette

* Removed two lines I added extra as a mistake earlier

* Requested changes

* Vignette typo patch (#5402)

* fix typos and grammatical mistakes

* fix typos and punctuation

* remove double spaces where it wasn't necessary

* fix typos and adhere to British English spelling

* fix typos

* fix typos

* add missing closing bracket

* fix typos

* review fixes

* Update vignettes/datatable-benchmarking.Rmd

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* Update vignettes/datatable-benchmarking.Rmd

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* Apply suggestions from code review benchmarking

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* remove unnecessary [ ] from datatable-keys-fast-subset.Rmd

* Update vignettes/datatable-programming.Rmd

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* Update vignettes/datatable-reshape.Rmd

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* One last batch of fine-tuning

---------

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Michael Chirico <chiricom@google.com>

* fix bad merge

* Improved handling of list columns with NULL entries (#4250)

* Updated documentation for rbindlist(fill=TRUE)

* Print NULL entries of list as NULL

* Added news item

* edit NEWS, use '[NULL]' not 'NULL'

* fix test

* split NEWS item

* add example

---------

Co-authored-by: Michael Chirico <chiricom@google.com>
Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Benjamin Schwendinger <benjamin.schwendinger@tuwien.ac.at>

* clarify that list input->unnamed list output (#5383)

* clarify that list input->unnamed list output

* Add example where make.names is used

* mention role of make.names

* revert from next release branch

* manual merge NEWS

* manual rebase tests

* manual rebase data.table.R

* clarify 0 turns off everything

---------

Co-authored-by: Ofek <ofekshilon@gmail.com>
Co-authored-by: Ani <bloodraven166@gmail.com>
Co-authored-by: David Budzynski <56514985+davidbudzynski@users.noreply.github.com>
Co-authored-by: Scott Ritchie <sritchie73@gmail.com>
Co-authored-by: Benjamin Schwendinger <benjamin.schwendinger@tuwien.ac.at>
  • Loading branch information
6 people authored Jan 12, 2024
1 parent 64e2041 commit cde7333
Show file tree
Hide file tree
Showing 4 changed files with 30 additions and 4 deletions.
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -617,4 +617,6 @@

19. In the NEWS for v1.11.0 (May 2018), section "NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES" item 2, we stated the intention to eventually change `logical01` to be `TRUE` by default. After some consideration, reflection, and community input, we have decided not to move forward with this plan, i.e., `logical01` will remain `FALSE` by default in both `fread()` and `fwrite()`. See discussion in #5856; most importantly, we think changing the default would be a major disruption to reading "sharded" CSVs where data with the same schema is split into many files, some of which could be converted to `logical` while others remain `integer`. We will retain the option `datatable.logical01` for users who wish to use a different default -- for example, if you are doing input/output on tables with a large number of logical columns, where writing '0'/'1' to the CSV many millions of times is preferable to writing 'TRUE'/'FALSE'.

20. Some clarity is added to `?GForce` for the case when subtle changes to `j` produce different results because of differences in locale. Because `data.table` _always_ uses the "C" locale, small changes to queries which activate/deactivate GForce might cause confusingly different results when sorting is involved, [#5331](https://github.com/Rdatatable/data.table/issues/5331). The inspirational example compared `DT[, .(max(a), max(b)), by=grp]` and `DT[, .(max(a), max(tolower(b))), by=grp]` -- in the latter case, GForce is deactivated owing to the _ad-hoc_ column, so the result for `max(a)` might differ for the two queries. An example is added to `?GForce`. As always, there are several options to guarantee consistency, for example (1) use namespace qualification to deactivate GForce: `DT[, .(base::max(a), base::max(b)), by=grp]`; (2) turn off all optimizations with `options(datatable.optimize = 0)`; or (3) set your R session to always sort in C locale with `Sys.setlocale("LC_COLLATE", "C")` (or temporarily with e.g. `withr::with_locale()`). Thanks @markseeto for the example and @michaelchirico for the improved documentation.

# data.table v1.14.10 (Dec 2023) back to v1.10.0 (Dec 2016) has been moved to [NEWS.1.md](https://github.com/Rdatatable/data.table/blob/master/NEWS.1.md)
6 changes: 3 additions & 3 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -1733,7 +1733,7 @@ replace_dot_alias = function(e) {
GForce = FALSE
if ( ((is.name(jsub) && jsub==".N") || (jsub %iscall% 'list' && length(jsub)==2L && jsub[[2L]]==".N")) && !length(lhs) ) {
GForce = TRUE
if (verbose) catf("GForce optimized j to '%s'\n",deparse(jsub, width.cutoff=200L, nlines=1L))
if (verbose) catf("GForce optimized j to '%s' (see ?GForce)\n",deparse(jsub, width.cutoff=200L, nlines=1L))
}
} else if (length(lhs) && is.symbol(jsub)) { # turn off GForce for the combination of := and .N
GForce = FALSE
Expand Down Expand Up @@ -1818,8 +1818,8 @@ replace_dot_alias = function(e) {
if (length(jsub) == 2L && jsub[[1L]] %chin% c("head", "tail")) jsub[["n"]] = 6L
jsub = gforce_jsub(jsub, names_x)
}
if (verbose) catf("GForce optimized j to '%s'\n", deparse(jsub, width.cutoff=200L, nlines=1L))
} else if (verbose) catf("GForce is on, left j unchanged\n");
if (verbose) catf("GForce optimized j to '%s' (see ?GForce)\n", deparse(jsub, width.cutoff=200L, nlines=1L))
} else if (verbose) catf("GForce is on, but not activated for this query; left j unchanged (see ?GForce)\n");
}
}
if (!GForce && !is.name(jsub)) {
Expand Down
2 changes: 1 addition & 1 deletion inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -14918,7 +14918,7 @@ test(2041.2, DT[, median(time), by=g], DT[c(2,5),.(g=g, V1=time)])
# They could run in level 1 with level 2 off, but output= would need to be changed and there's no need.
test(2042.1, DT[ , as.character(mean(date)), by=g, verbose=TRUE ],
data.table(g=c("a","b"), V1=c("2018-01-04","2018-01-21")),
output=msg<-"GForce is on, left j unchanged.*Old mean optimization is on, left j unchanged")
output=msg<-"GForce is on, but not activated.*Old mean optimization is on, left j unchanged")
# Since %b is e.g. "janv." in LC_TIME=fr_FR.UTF-8 locale, we need to
# have the target/y value in these tests depend on the locale as well, #3450.
Jan.2018 = format(strptime("2018-01-01", "%Y-%m-%d"), "%b-%Y")
Expand Down
24 changes: 24 additions & 0 deletions man/datatable-optimize.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,10 @@ of these optimisations. They happen automatically.
Run the code under the \emph{example} section to get a feel for the performance
benefits from these optimisations.
Note that for all optimizations involving efficient sorts, the caveat mentioned
in \code{\link{setorder}} applies -- whenever data.table does the sorting,
it does so in "C-locale". This has some subtle implications; see Examples.
}
\details{
\code{data.table} reads the global option \code{datatable.optimize} to figure
Expand Down Expand Up @@ -101,6 +105,8 @@ indices set global option \code{options(datatable.use.index = FALSE)}.
\seealso{ \code{\link{setNumericRounding}}, \code{\link{getNumericRounding}} }
\examples{
\dontrun{
old = options(datatable.optimize = Inf)
# Generate a big data.table with a relatively many columns
set.seed(1L)
DT = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
Expand Down Expand Up @@ -151,6 +157,24 @@ system.time(ans1 <- DT[id == 100L]) # index + binary search subset
system.time(ans2 <- DT[id == 100L]) # only binary search subset
system.time(DT[id \%in\% 100:500]) # only binary search subset again
# sensitivity to collate order
old_lc_collate = Sys.getlocale("LC_COLLATE")
if (old_lc_collate == "C") {
Sys.setlocale("LC_COLLATE", "")
}
DT = data.table(
grp = rep(1:2, each = 4L),
var = c("A", "a", "0", "1", "B", "b", "0", "1")
)
options(datatable.optimize = Inf)
DT[, .(max(var), min(var)), by=grp]
# GForce is deactivated because of the ad-hoc column 'tolower(var)',
# through which the result for 'max(var)' may also change
DT[, .(max(var), min(tolower(var))), by=grp]
Sys.setlocale("LC_COLLATE", old_lc_collate)
options(old)
}}
\keyword{ data }

0 comments on commit cde7333

Please sign in to comment.