Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386

Merged
merged 62 commits into from
Jul 28, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
b0efcf5
lazy forder
jangorecki Apr 18, 2020
aaead65
fix tests
jangorecki Apr 18, 2020
b0858df
fix tests
jangorecki Apr 18, 2020
9df2668
respect use.index option
jangorecki Apr 18, 2020
62982ed
bmerge timings
jangorecki Apr 18, 2020
eec4971
codecov
jangorecki Apr 18, 2020
9affbfa
helper function
jangorecki Apr 19, 2020
b36136a
reduce diff to master
jangorecki Apr 20, 2020
3442b78
rename fix
jangorecki Apr 20, 2020
dfe883e
setindex writes groups (retGrp=TRUE)
jangorecki Apr 20, 2020
efb3319
calc order, not groups
jangorecki Apr 20, 2020
68b378d
expect to reach optimization
jangorecki Apr 20, 2020
0bd0e1e
skip opt for list
jangorecki Apr 20, 2020
71f5aeb
override retGrp=F by retGrp=T is legit use
jangorecki Apr 20, 2020
de916aa
more backward compatiblility, no retGrp from getindex
jangorecki Apr 20, 2020
0a3da1d
more verbose messages during opts in forderLazy
jangorecki Apr 20, 2020
068c1a9
recycle order 1/-1 argument in one place
jangorecki Apr 20, 2020
326695d
precise verbose messages
jangorecki Apr 20, 2020
8be2d53
recycle arg once _by_ arg is known
jangorecki Apr 20, 2020
3383bbf
copy paste typo fix
jangorecki Apr 21, 2020
cc7a2b6
forder writing index disabled
jangorecki Apr 21, 2020
514913a
fix tests
jangorecki Apr 21, 2020
ca2823d
code coverage
jangorecki Apr 21, 2020
4e9bbb3
minor update for safer use of internal option
jangorecki Apr 21, 2020
4d0dec3
fix bad name in unit test
jangorecki Apr 21, 2020
1606046
retGrp=F requires downgrade idx and it seems to be costly
jangorecki May 8, 2020
a9b01ff
NA stats from forder
jangorecki May 20, 2020
ded1e1a
keyOpt fix, and existing tests
jangorecki May 20, 2020
53cce98
fixes for na.last in key and setting idx
jangorecki May 20, 2020
1e92366
filling tests for na.last=T and possible fixes
jangorecki May 20, 2020
0f95e6f
more stats, any non ascii utf8
jangorecki May 21, 2020
bbe1d03
better naming of new stats attributes
jangorecki May 21, 2020
91437de
add extra escape to escape IS_ASCII checks
jangorecki Jun 29, 2020
5239c3f
Merge branch 'master' into lazy-forder
jangorecki Jun 29, 2020
0f6b15a
Merge branch 'master' into lazy-forder
mattdowle Jan 5, 2021
14e9168
Merge branch 'master' into lazy-forder
jangorecki Dec 9, 2023
8bf5b5c
Merge branch 'master' into lazy-forder
MichaelChirico Mar 15, 2024
0f15775
update test number after merge
MichaelChirico Mar 15, 2024
e2710b8
Merge branch 'master' into lazy-forder
MichaelChirico Jul 11, 2024
c8f5b7e
apply minor review feedback
MichaelChirico Jul 11, 2024
92a97d1
More minor review feedback
MichaelChirico Jul 11, 2024
c2ae34a
use options= to set options
MichaelChirico Jul 11, 2024
4184fe9
more feedback
MichaelChirico Jul 11, 2024
0588d4d
Rename forderLazy->forderMaybePresorted
MichaelChirico Jul 12, 2024
8b3a80a
UNPROTECT() more aggressively
MichaelChirico Jul 12, 2024
497a08f
maybe_reset_index() helper
MichaelChirico Jul 12, 2024
b948345
Merge remote-tracking branch 'origin/master' into lazy-forder
MichaelChirico Jul 15, 2024
3e898e9
Strict prototyping (-Wstrict-prototypes)
MichaelChirico Jul 15, 2024
d8e0ea7
Merge remote-tracking branch 'origin/lazy-forder' into lazy-forder
MichaelChirico Jul 15, 2024
d8adf3c
fix sloppy refactor for maybe_reset_index()
MichaelChirico Jul 15, 2024
1d398aa
Fix implicit reliance on datatable.optimize
MichaelChirico Jul 15, 2024
ac19b83
Fix elsewhere, and encapsulate the logic inside a test()
MichaelChirico Jul 15, 2024
9675f06
style
MichaelChirico Jul 15, 2024
e8b9dcd
spurious whitespace change
MichaelChirico Jul 15, 2024
a48ff5e
NEWS entry for lazy forder
jangorecki Jul 21, 2024
59f5f21
tidy up NEWS
MichaelChirico Jul 21, 2024
32c6305
PROTECT() on key attribute
MichaelChirico Jul 21, 2024
050e048
Merge branch 'master' into lazy-forder
MichaelChirico Jul 27, 2024
02ff718
rename arg/option 'lazy' -> 'reuseSorting'
MichaelChirico Jul 27, 2024
320f496
MaybeSorted->ReuseSorting
MichaelChirico Jul 27, 2024
799b4a9
other lazy= usage
MichaelChirico Jul 27, 2024
ffe431f
Merge branch 'master' into lazy-forder
MichaelChirico Jul 28, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 40 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -126,6 +126,46 @@

22. C code was unified more in how failures to allocate memory (`malloc()`/`calloc()`) are handled, (#1115)[https://github.com/Rdatatable/data.table/issues/1115]. No OOM issues were reported, as these regions of code typically request relatively small blocks of memory, but it is good to handle memory pressure consistently. Thanks @elfring for the report and @MichaelChirico for the clean-up effort and future-proofing linter.

22. Internal routine for finding sort order will now re-use any existing index. A similar optimization was already present in R code, but this has now been pushed to C and covers a wider range of use cases and collects more statistics about its input (e.g. whether any infinite entries were found), opening the possibility for more optimizations in other functions.

Functions `setindex` (and `setindexv`) will now compute groups' positions as well. `setindex()` also collects the extra statistics alluded to above.

Finding sort order in other routines (for example subset `d2[id==1L]`) does not include those extra statistics so as not to impose a slowdown.

```r
d2 = data.table(id=2:1, v2=1:2)
setindexv(d2, "id")
str(attr(attr(d2, "index"), "__id"))
# int [1:2] 2 1
# - attr(*, "starts")= int [1:2] 1 2
# - attr(*, "maxgrpn")= int 1
# - attr(*, "anyna")= int 0
# - attr(*, "anyinfnan")= int 0
# - attr(*, "anynotascii")= int 0
# - attr(*, "anynotutf8")= int 0

d2 = data.table(id=2:1, v2=1:2)
invisible(d2[id==1L])
str(attr(attr(d2, "index"), "__id"))
# int [1:2] 2 1
```

This feature also enables re-use of sort index during joins, in cases where one of the calls to find sort order is made from C code.

```r
d1 = data.table(id=1:2, v1=1:2)
d2 = data.table(id=2:1, v2=1:2)
setindexv(d2, "id")
d1[d2, on="id", verbose=TRUE]
#...
#Starting bmerge ...
#forderReuseSorting: using existing index: __id
#forderReuseSorting: opt=2, took 0.000s
#...
```

This feature resolves [#4387](https://github.com/Rdatatable/data.table/issues/4387), [#2947](https://github.com/Rdatatable/data.table/issues/2947), [#4380](https://github.com/Rdatatable/data.table/issues/4380), and [#1321](https://github.com/Rdatatable/data.table/issues/1321). Thanks to @jangorecki, @jan-glx, and @MichaelChirico for the reports and @jangorecki for implementing.

## TRANSLATIONS

1. Fix a typo in a Mandarin translation of an error message that was hiding the actual error message, [#6172](https://github.com/Rdatatable/data.table/issues/6172). Thanks @trafficfan for the report and @MichaelChirico for the fix.
Expand Down
5 changes: 1 addition & 4 deletions R/bmerge.R
Original file line number Diff line number Diff line change
Expand Up @@ -118,9 +118,6 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
}
}

## after all modifications of i, check if i has a proper key on all icols
io = identical(icols, head(chmatch(key(i), names(i)), length(icols)))

## after all modifications of x, check if x has a proper key on all xcols.
## If not, calculate the order. Also for non-equi joins, the order must be calculated.
non_equi = which.first(ops != 1L) # 1 is "==" operator
Expand Down Expand Up @@ -180,7 +177,7 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
}

if (verbose) {last.started.at=proc.time();catf("Starting bmerge ...\n");flush.console()}
ans = .Call(Cbmerge, i, x, as.integer(icols), as.integer(xcols), io, xo, roll, rollends, nomatch, mult, ops, nqgrp, nqmaxgrp)
ans = .Call(Cbmerge, i, x, as.integer(icols), as.integer(xcols), xo, roll, rollends, nomatch, mult, ops, nqgrp, nqmaxgrp)
if (verbose) {catf("bmerge done in %s\n",timetaken(last.started.at)); flush.console()}
# TO DO: xo could be moved inside Cbmerge

Expand Down
12 changes: 6 additions & 6 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -3241,13 +3241,13 @@ is_constantish = function(q, check_singleton=FALSE) {
if (is.null(idx)){
## if nothing else helped, auto create a new index that can be used
if (!getOption("datatable.auto.index")) return(NULL)
if (verbose) {catf("Creating new index '%s'\n", paste(names(i), collapse = "__"));flush.console()}
if (verbose) {last.started.at=proc.time();catf("Creating index %s done in ...", paste(names(i), collapse = "__"));flush.console()}
setindexv(x, names(i))
if (verbose) {cat(timetaken(last.started.at),"\n");flush.console()}
if (verbose) {catf("Optimized subsetting with index '%s'\n", paste(names(i), collapse = "__"));flush.console()}
idx = attr(attr(x, "index", exact=TRUE), paste("__", names(i), collapse = ""), exact=TRUE)
idxCols = names(i)
if (verbose) {catf("Creating new index '%s'\n", paste(idxCols, collapse = "__"));flush.console()}
if (verbose) {last.started.at=proc.time();catf("Creating index %s done in ...", paste(idxCols, collapse = "__"));flush.console()}
idx = forderv(x, idxCols, sort=TRUE, retGrp=FALSE, reuseSorting=TRUE)
maybe_reset_index(x, idx, idxCols) ## forder can write index, but disabled for now, see #4386
if (verbose) {cat(timetaken(last.started.at),"\n");flush.console()}
if (verbose) {catf("Optimized subsetting with index '%s'\n", paste(idxCols, collapse = "__"));flush.console()}
}
if(!is.null(idxCols)){
setkeyv(i, idxCols)
Expand Down
61 changes: 23 additions & 38 deletions R/setkey.R
Original file line number Diff line number Diff line change
Expand Up @@ -50,23 +50,9 @@ setkeyv = function(x, cols, verbose=getOption("datatable.verbose"), physical=TRU
miss = !(cols %chin% colnames(x))
if (any(miss)) stopf("some columns are not in the data.table: %s", brackify(cols[miss]))

## determine, whether key is already present:
if (identical(key(x),cols)) {
if (!physical) {
## create index as integer() because already sorted by those columns
if (is.null(attr(x, "index", exact=TRUE))) setattr(x, "index", integer())
setattr(attr(x, "index", exact=TRUE), paste0("__", cols, collapse=""), integer())
}
return(invisible(x))
} else if(identical(head(key(x), length(cols)), cols)){
if (!physical) {
## create index as integer() because already sorted by those columns
if (is.null(attr(x, "index", exact=TRUE))) setattr(x, "index", integer())
setattr(attr(x, "index", exact=TRUE), paste0("__", cols, collapse=""), integer())
} else {
## key is present but x has a longer key. No sorting needed, only attribute is changed to shorter key.
setattr(x,"sorted",cols)
}
if (physical && identical(head(key(x), length(cols)), cols)){ ## for !physical we need to compute groups as well #4387
## key is present but x has a longer key. No sorting needed, only attribute is changed to shorter key.
setattr(x,"sorted",cols)
return(invisible(x))
}

Expand All @@ -77,26 +63,20 @@ setkeyv = function(x, cols, verbose=getOption("datatable.verbose"), physical=TRU
}
if (!is.character(cols) || length(cols)<1L) stopf("Internal error. 'cols' should be character at this point in setkey; please report.") # nocov

newkey = paste(cols, collapse="__")
if (!any(indices(x) == newkey)) {
if (verbose) {
tt = suppressMessages(system.time(o <- forderv(x, cols, sort=TRUE, retGrp=FALSE))) # system.time does a gc, so we don't want this always on, until refcnt is on by default in R
# suppress needed for tests 644 and 645 in verbose mode
catf("forder took %.03f sec\n", tt["user.self"]+tt["sys.self"])
} else {
o = forderv(x, cols, sort=TRUE, retGrp=FALSE)
}
if (verbose) {
# we now also retGrp=TRUE #4387 for !physical
tt = suppressMessages(system.time(o <- forderv(x, cols, sort=TRUE, retGrp=!physical, reuseSorting=TRUE))) # system.time does a gc, so we don't want this always on, until refcnt is on by default in R
# suppress needed for tests 644 and 645 in verbose mode
MichaelChirico marked this conversation as resolved.
Show resolved Hide resolved
catf("forder took %.03f sec\n", tt["user.self"]+tt["sys.self"])
} else {
if (verbose) catf("setkey on columns %s using existing index '%s'\n", brackify(cols), newkey)
o = getindex(x, newkey)
o = forderv(x, cols, sort=TRUE, retGrp=!physical, reuseSorting=TRUE)
}
if (!physical) {
if (is.null(attr(x, "index", exact=TRUE))) setattr(x, "index", integer())
setattr(attr(x, "index", exact=TRUE), paste0("__", cols, collapse=""), o)
if (!physical) { # index COULD BE saved from C forderReuseSorting already, but disabled for now
maybe_reset_index(x, o, cols)
return(invisible(x))
}
setattr(x,"index",NULL) # TO DO: reorder existing indexes likely faster than rebuilding again. Allow optionally. Simpler for now to clear.
if (length(o)) {
setattr(x,"index",NULL) # TO DO: reorder existing indexes likely faster than rebuilding again. Allow optionally. Simpler for now to clear. Only when order changes.
Copy link
Member Author

@jangorecki jangorecki Apr 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a related comment next to test 1419.62

# checks also that the prior index a is dropped (because y is keyed with no index)

but we should not drop that index, as long as order does not change then index is still valid, and it can even be useful now, because it can store groups info.

if (verbose) { last.started.at = proc.time() }
.Call(Creorder,x,o)
if (verbose) { catf("reorder took %s\n", timetaken(last.started.at)); flush.console() }
Expand Down Expand Up @@ -124,7 +104,7 @@ getindex = function(x, name) {
if (!is.null(ans) && (!is.integer(ans) || (length(ans)!=nrow(x) && length(ans)!=0L))) {
stopf("Internal error: index '%s' exists but is invalid", name) # nocov
}
ans
c(ans) ## drop starts and maxgrpn attributes
}

haskey = function(x) !is.null(key(x))
Expand Down Expand Up @@ -160,19 +140,24 @@ is.sorted = function(x, by=NULL) {
# Return value of TRUE/FALSE is relied on in [.data.table quite a bit on vectors. Simple. Stick with that (rather than -1/0/+1)
}

maybe_reset_index = function(x, idx, cols) {
if (isTRUE(getOption("datatable.forder.auto.index"))) return(invisible())
if (is.null(attr(x, "index", exact=TRUE))) setattr(x, "index", integer())
setattr(attr(x, "index", exact=TRUE), paste0("__", cols, collapse=""), idx)
invisible(x)
}

ORDERING_TYPES = c('logical', 'integer', 'double', 'complex', 'character')
forderv = function(x, by=seq_along(x), retGrp=FALSE, sort=TRUE, order=1L, na.last=FALSE)
{
forderv = function(x, by=seq_along(x), retGrp=FALSE, retStats=retGrp, sort=TRUE, order=1L, na.last=FALSE, reuseSorting=getOption("datatable.reuse.sorting", NA)) {
if (is.atomic(x) || is.null(x)) { # including forderv(NULL) which returns error consistent with base::order(NULL),
if (!missing(by) && !is.null(by)) stopf("x is a single vector, non-NULL 'by' doesn't make sense")
by = NULL
} else {
if (!length(x)) return(integer(0L)) # e.g. forderv(data.table(NULL)) and forderv(list()) return integer(0L))
by = colnamesInt(x, by, check_dups=FALSE)
if (length(order) == 1L) order = rep(order, length(by))
MichaelChirico marked this conversation as resolved.
Show resolved Hide resolved
}
order = as.integer(order) # length and contents of order being +1/-1 is checked at C level
.Call(Cforder, x, by, retGrp, sort, order, na.last) # returns integer() if already sorted, regardless of sort=TRUE|FALSE
.Call(CforderReuseSorting, x, by, retGrp, retStats, sort, order, na.last, reuseSorting) # returns integer() if already sorted, regardless of sort=TRUE|FALSE
}

forder = function(..., na.last=TRUE, decreasing=FALSE)
Expand Down Expand Up @@ -209,7 +194,7 @@ forder = function(..., na.last=TRUE, decreasing=FALSE)
data = eval(sub, parent.frame(), parent.frame())
}
stopifnot(isTRUEorFALSE(decreasing))
o = forderv(data, seq_along(data), sort=TRUE, retGrp=FALSE, order= if (decreasing) -asc else asc, na.last)
o = forderv(data, seq_along(data), retGrp=FALSE, retStats=FALSE, sort=TRUE, order=if (decreasing) -asc else asc, na.last=na.last)
if (!length(o) && length(data)>=1L) o = seq_along(data[[1L]]) else o
o
}
Expand Down
Loading
Loading