Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386

Merged
merged 62 commits into from
Jul 28, 2024
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
b0efcf5
lazy forder
jangorecki Apr 18, 2020
aaead65
fix tests
jangorecki Apr 18, 2020
b0858df
fix tests
jangorecki Apr 18, 2020
9df2668
respect use.index option
jangorecki Apr 18, 2020
62982ed
bmerge timings
jangorecki Apr 18, 2020
eec4971
codecov
jangorecki Apr 18, 2020
9affbfa
helper function
jangorecki Apr 19, 2020
b36136a
reduce diff to master
jangorecki Apr 20, 2020
3442b78
rename fix
jangorecki Apr 20, 2020
dfe883e
setindex writes groups (retGrp=TRUE)
jangorecki Apr 20, 2020
efb3319
calc order, not groups
jangorecki Apr 20, 2020
68b378d
expect to reach optimization
jangorecki Apr 20, 2020
0bd0e1e
skip opt for list
jangorecki Apr 20, 2020
71f5aeb
override retGrp=F by retGrp=T is legit use
jangorecki Apr 20, 2020
de916aa
more backward compatiblility, no retGrp from getindex
jangorecki Apr 20, 2020
0a3da1d
more verbose messages during opts in forderLazy
jangorecki Apr 20, 2020
068c1a9
recycle order 1/-1 argument in one place
jangorecki Apr 20, 2020
326695d
precise verbose messages
jangorecki Apr 20, 2020
8be2d53
recycle arg once _by_ arg is known
jangorecki Apr 20, 2020
3383bbf
copy paste typo fix
jangorecki Apr 21, 2020
cc7a2b6
forder writing index disabled
jangorecki Apr 21, 2020
514913a
fix tests
jangorecki Apr 21, 2020
ca2823d
code coverage
jangorecki Apr 21, 2020
4e9bbb3
minor update for safer use of internal option
jangorecki Apr 21, 2020
4d0dec3
fix bad name in unit test
jangorecki Apr 21, 2020
1606046
retGrp=F requires downgrade idx and it seems to be costly
jangorecki May 8, 2020
a9b01ff
NA stats from forder
jangorecki May 20, 2020
ded1e1a
keyOpt fix, and existing tests
jangorecki May 20, 2020
53cce98
fixes for na.last in key and setting idx
jangorecki May 20, 2020
1e92366
filling tests for na.last=T and possible fixes
jangorecki May 20, 2020
0f95e6f
more stats, any non ascii utf8
jangorecki May 21, 2020
bbe1d03
better naming of new stats attributes
jangorecki May 21, 2020
91437de
add extra escape to escape IS_ASCII checks
jangorecki Jun 29, 2020
5239c3f
Merge branch 'master' into lazy-forder
jangorecki Jun 29, 2020
0f6b15a
Merge branch 'master' into lazy-forder
mattdowle Jan 5, 2021
14e9168
Merge branch 'master' into lazy-forder
jangorecki Dec 9, 2023
8bf5b5c
Merge branch 'master' into lazy-forder
MichaelChirico Mar 15, 2024
0f15775
update test number after merge
MichaelChirico Mar 15, 2024
e2710b8
Merge branch 'master' into lazy-forder
MichaelChirico Jul 11, 2024
c8f5b7e
apply minor review feedback
MichaelChirico Jul 11, 2024
92a97d1
More minor review feedback
MichaelChirico Jul 11, 2024
c2ae34a
use options= to set options
MichaelChirico Jul 11, 2024
4184fe9
more feedback
MichaelChirico Jul 11, 2024
0588d4d
Rename forderLazy->forderMaybePresorted
MichaelChirico Jul 12, 2024
8b3a80a
UNPROTECT() more aggressively
MichaelChirico Jul 12, 2024
497a08f
maybe_reset_index() helper
MichaelChirico Jul 12, 2024
b948345
Merge remote-tracking branch 'origin/master' into lazy-forder
MichaelChirico Jul 15, 2024
3e898e9
Strict prototyping (-Wstrict-prototypes)
MichaelChirico Jul 15, 2024
d8e0ea7
Merge remote-tracking branch 'origin/lazy-forder' into lazy-forder
MichaelChirico Jul 15, 2024
d8adf3c
fix sloppy refactor for maybe_reset_index()
MichaelChirico Jul 15, 2024
1d398aa
Fix implicit reliance on datatable.optimize
MichaelChirico Jul 15, 2024
ac19b83
Fix elsewhere, and encapsulate the logic inside a test()
MichaelChirico Jul 15, 2024
9675f06
style
MichaelChirico Jul 15, 2024
e8b9dcd
spurious whitespace change
MichaelChirico Jul 15, 2024
a48ff5e
NEWS entry for lazy forder
jangorecki Jul 21, 2024
59f5f21
tidy up NEWS
MichaelChirico Jul 21, 2024
32c6305
PROTECT() on key attribute
MichaelChirico Jul 21, 2024
050e048
Merge branch 'master' into lazy-forder
MichaelChirico Jul 27, 2024
02ff718
rename arg/option 'lazy' -> 'reuseSorting'
MichaelChirico Jul 27, 2024
320f496
MaybeSorted->ReuseSorting
MichaelChirico Jul 27, 2024
799b4a9
other lazy= usage
MichaelChirico Jul 27, 2024
ffe431f
Merge branch 'master' into lazy-forder
MichaelChirico Jul 28, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 1 addition & 4 deletions R/bmerge.R
Original file line number Diff line number Diff line change
Expand Up @@ -116,9 +116,6 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
}
}

## after all modifications of i, check if i has a proper key on all icols
io = identical(icols, head(chmatch(key(i), names(i)), length(icols)))

## after all modifications of x, check if x has a proper key on all xcols.
## If not, calculate the order. Also for non-equi joins, the order must be calculated.
non_equi = which.first(ops != 1L) # 1 is "==" operator
Expand Down Expand Up @@ -178,7 +175,7 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
}

if (verbose) {last.started.at=proc.time();cat("Starting bmerge ...\n");flush.console()}
ans = .Call(Cbmerge, i, x, as.integer(icols), as.integer(xcols), io, xo, roll, rollends, nomatch, mult, ops, nqgrp, nqmaxgrp)
ans = .Call(Cbmerge, i, x, as.integer(icols), as.integer(xcols), xo, roll, rollends, nomatch, mult, ops, nqgrp, nqmaxgrp)
if (verbose) {cat("bmerge done in",timetaken(last.started.at),"\n"); flush.console()}
# TO DO: xo could be moved inside Cbmerge

Expand Down
5 changes: 2 additions & 3 deletions R/setkey.R
Original file line number Diff line number Diff line change
Expand Up @@ -172,8 +172,7 @@ is.sorted = function(x, by=seq_along(x)) {
}

ORDERING_TYPES = c('logical', 'integer', 'double', 'complex', 'character')
forderv = function(x, by=seq_along(x), retGrp=FALSE, sort=TRUE, order=1L, na.last=FALSE)
{
forderv = function(x, by=seq_along(x), retGrp=FALSE, sort=TRUE, order=1L, na.last=FALSE, lazy=NA) {
if (is.atomic(x)) { # including forderv(NULL) which returns error consistent with base::order(NULL),
if (!missing(by) && !is.null(by)) stop("x is a single vector, non-NULL 'by' doesn't make sense")
by = NULL
Expand All @@ -183,7 +182,7 @@ forderv = function(x, by=seq_along(x), retGrp=FALSE, sort=TRUE, order=1L, na.las
if (length(order) == 1L) order = rep(order, length(by))
MichaelChirico marked this conversation as resolved.
Show resolved Hide resolved
}
order = as.integer(order) # length and contents of order being +1/-1 is checked at C level
.Call(Cforder, x, by, retGrp, sort, order, na.last) # returns integer() if already sorted, regardless of sort=TRUE|FALSE
.Call(Cforder, x, by, retGrp, sort, order, na.last, lazy) # returns integer() if already sorted, regardless of sort=TRUE|FALSE
}

forder = function(..., na.last=TRUE, decreasing=FALSE)
Expand Down
122 changes: 121 additions & 1 deletion inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -13364,7 +13364,7 @@ test(1962.0461, forderv(DT, order = c(1L, -1L)), error="Either order= is n
test(1962.0462, forderv(DT, order = 2L), error='Item 1 of order (ascending/descending) is 2. Must be +1 or -1')
test(1962.0471, forderv(mean), error="'x' argument must be data.table compatible")
test(1962.0472, forderv(DT, by=mean), error="argument specifying columns must be character or numeric")
test(1962.0473, forderv(NULL), error="DT is an empty list() of 0 columns")
test(1962.0473, forderv(NULL), error="DT is NULL")

setDF(DT)
test(1962.0481, forder(DT), 3:1)
Expand Down Expand Up @@ -16846,3 +16846,123 @@ A = data.table(A=c(complex(real = 1:3, imaginary=c(0, -1, 1)), NaN))
test(2138.3, rbind(A,B), data.table(A=c(as.character(A$A), B$A)))
A = data.table(A=as.complex(rep(NA, 5)))
test(2138.4, rbind(A,B), data.table(A=c(as.character(A$A), B$A)))

# lazy forder
dd = data.table(a=1:2, b=2:1)
d = copy(dd)
op = options(datatable.verbose=TRUE)
test(2139.01, forderv(d, "b"), 2:1, output="forder.*opt=-1.*took")
test(2139.02, forderv(d, "b", lazy=FALSE), 2:1, output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
setkeyv(d, "b")
options(datatable.verbose=TRUE)
test(2139.03, forderv(d, "b"), integer(), output="forder.*opt=1.*took")
test(2139.04, forderv(d, "b", lazy=FALSE), integer(), output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
d = copy(dd)
setindexv(d, "b")
options(datatable.verbose=TRUE)
test(2139.05, forderv(d, "b"), 2:1, output="forder.*opt=2.*took")
test(2139.06, forderv(d, "b", lazy=FALSE), 2:1, output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
d = copy(dd)
options(datatable.verbose=TRUE)
test(2139.11, forderv(d, c("a","b")), integer(), output="forder.*opt=-1.*took")
test(2139.12, forderv(d, c("a","b"), lazy=FALSE), integer(), output="forder.*opt=0.*took")
test(2139.13, forderv(d, c("b","a")), 2:1, output="forder.*opt=-1.*took")
test(2139.14, forderv(d, c("b","a"), lazy=FALSE), 2:1, output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
setkeyv(d, c("a","b"))
options(datatable.verbose=TRUE)
test(2139.21, forderv(d, c("a","b")), integer(), output="forder.*opt=1.*took")
test(2139.22, forderv(d, c("a","b"), lazy=FALSE), integer(), output="forder.*opt=0.*took")
test(2139.23, forderv(d, c("b","a")), 2:1, output="forder.*opt=-1.*took")
test(2139.24, forderv(d, c("b","a"), lazy=FALSE), 2:1, output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
setkeyv(d, c("b","a"))
options(datatable.verbose=TRUE)
test(2139.25, forderv(d, c("a","b")), 2:1, output="forder.*opt=-1.*took")
test(2139.26, forderv(d, c("a","b"), lazy=FALSE), 2:1, output="forder.*opt=0.*took")
test(2139.27, forderv(d, c("b","a")), integer(), output="forder.*opt=1.*took")
test(2139.28, forderv(d, c("b","a"), lazy=FALSE), integer(), output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
d = copy(dd)
setindexv(d, c("a","b"))
options(datatable.verbose=TRUE)
test(2139.31, forderv(d, c("a","b")), integer(), output="forder.*opt=2.*took")
test(2139.32, forderv(d, c("a","b"), lazy=FALSE), integer(), output="forder.*opt=0.*took")
test(2139.33, forderv(d, c("b","a")), 2:1, output="forder.*opt=-1.*took")
test(2139.34, forderv(d, c("b","a"), lazy=FALSE), 2:1, output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
setindexv(d, NULL)
setindexv(d, c("b","a"))
options(datatable.verbose=TRUE)
test(2139.35, forderv(d, c("a","b")), integer(), output="forder.*opt=-1.*took")
test(2139.36, forderv(d, c("a","b"), lazy=FALSE), integer(), output="forder.*opt=0.*took")
test(2139.37, forderv(d, c("b","a")), 2:1, output="forder.*opt=2.*took")
test(2139.38, forderv(d, c("b","a"), lazy=FALSE), 2:1, output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
setindexv(d, NULL)
setindexv(d, list(c("a","b"), c("b","a")))
options(datatable.verbose=TRUE)
test(2139.41, forderv(d, c("a","b")), integer(), output="forder.*opt=2.*took")
test(2139.42, forderv(d, c("a","b"), lazy=FALSE), integer(), output="forder.*opt=0.*took")
test(2139.43, forderv(d, c("b","a")), 2:1, output="forder.*opt=2.*took")
test(2139.44, forderv(d, c("b","a"), lazy=FALSE), 2:1, output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
d = copy(dd)
setkeyv(d, c("a","b"))
setindexv(d, list(c("a","b"), c("b","a")))
options(datatable.verbose=TRUE)
test(2139.51, forderv(d, c("a","b")), integer(), output="forder.*opt=1.*took", notOutput="forder.*opt=2.*took") # idxOpt is not reached
test(2139.52, forderv(d, c("a","b"), lazy=FALSE), integer(), output="forder.*opt=0.*took")
test(2139.53, forderv(d, c("b","a")), 2:1, output="forder.*opt=2.*took")
test(2139.54, forderv(d, c("b","a"), lazy=FALSE), 2:1, output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
setkeyv(d, NULL)
setindexv(d, list(c("a","b"), c("b","a")))
options(datatable.verbose=TRUE)
test(2139.55, forderv(d, c("a","b")), integer(), output="forder.*opt=2.*took", notOutput="forder.*opt=1.*took")
test(2139.56, forderv(d, c("a","b"), lazy=FALSE), integer(), output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
d = copy(dd)
setkeyv(d, c("a","b"))
setindexv(d, list(c("a","a"), c("b","a")))
options(datatable.verbose=TRUE)
ab = structure(integer(), starts=1:2, maxgrpn=1L)
ba = structure(2:1, starts=1:2, maxgrpn=1L)
test(2139.61, forderv(d, c("a","b"), retGrp=TRUE), ab, output="forder.*opt=0.*took")
test(2139.62, forderv(d, c("a","b"), retGrp=TRUE, lazy=FALSE), ab, output="forder.*opt=0.*took")
test(2139.63, forderv(d, c("b","a"), retGrp=TRUE), ba, output="forder.*opt=0.*took")
test(2139.64, forderv(d, c("b","a"), retGrp=TRUE, lazy=FALSE), ba, output="forder.*opt=0.*took")
test(2139.65, forderv(d, c("a","b"), na.last=TRUE), integer(), output="forder.*opt=0.*took")
test(2139.66, forderv(d, c("a","b"), na.last=TRUE, lazy=FALSE), integer(), output="forder.*opt=0.*took")
test(2139.67, forderv(d, c("b","a"), na.last=TRUE), 2:1, output="forder.*opt=0.*took")
test(2139.68, forderv(d, c("b","a"), na.last=TRUE, lazy=FALSE), 2:1, output="forder.*opt=0.*took")
test(2139.69, forderv(d, c("a","b"), sort=FALSE, retGrp=TRUE), ab, output="forder.*opt=0.*took")
test(2139.70, forderv(d, c("a","b"), sort=FALSE, retGrp=TRUE, lazy=FALSE), ab, output="forder.*opt=0.*took")
test(2139.71, forderv(d, c("b","a"), sort=FALSE, retGrp=TRUE), ab, output="forder.*opt=0.*took")
test(2139.72, forderv(d, c("b","a"), sort=FALSE, retGrp=TRUE, lazy=FALSE), ab, output="forder.*opt=0.*took")
test(2139.73, forderv(d, c("a","b"), order=-1L), 2:1, output="forder.*opt=0.*took")
test(2139.74, forderv(d, c("a","b"), order=-1L, lazy=FALSE), 2:1, output="forder.*opt=0.*took")
test(2139.75, forderv(d, c("b","a"), order=-1L), integer(), output="forder.*opt=0.*took")
test(2139.76, forderv(d, c("b","a"), order=-1L, lazy=FALSE), integer(), output="forder.*opt=0.*took")
test(2139.77, forderv(d, c("a","b"), order=c(1L,-1L)), integer(), output="forder.*opt=0.*took")
test(2139.78, forderv(d, c("a","b"), order=c(1L,-1L), lazy=FALSE), integer(), output="forder.*opt=0.*took")
test(2139.79, forderv(d, c("b","a"), order=c(1L,-1L)), 2:1, output="forder.*opt=0.*took")
test(2139.80, forderv(d, c("b","a"), order=c(1L,-1L), lazy=FALSE), 2:1, output="forder.*opt=0.*took")
test(2139.81, forderv(1:2), integer(), output="forder.*opt=0.*took")
test(2139.82, forderv(1:2, lazy=FALSE), integer(), output="forder.*opt=0.*took")
test(2139.83, forderv(2:1), 2:1, output="forder.*opt=0.*took")
test(2139.84, forderv(2:1, lazy=FALSE), 2:1, output="forder.*opt=0.*took")
options(datatable.verbose=FALSE)
d = copy(dd)
setindexv(d, "b")
options(datatable.verbose=TRUE)
op2 = options(datatable.use.index=FALSE)
test(2139.91, forderv(d, "b"), 2:1, output="forder.*opt=-1.*took")
test(2139.92, forderv(d, "b", lazy=FALSE), 2:1, output="forder.*opt=0.*took")
options(op2)
options(op)
test(2139.99, forderv(data.table(a=1), lazy=c(TRUE, TRUE)), error="lazy must be")

32 changes: 20 additions & 12 deletions src/bmerge.c
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,11 @@ static Rboolean rollToNearest=FALSE;

void bmerge_r(int xlowIn, int xuppIn, int ilowIn, int iuppIn, int col, int thisgrp, int lowmax, int uppmax);

SEXP bmerge(SEXP iArg, SEXP xArg, SEXP icolsArg, SEXP xcolsArg, SEXP isorted, SEXP xoArg, SEXP rollarg, SEXP rollendsArg, SEXP nomatchArg, SEXP multArg, SEXP opArg, SEXP nqgrpArg, SEXP nqmaxgrpArg) {
SEXP bmerge(SEXP iArg, SEXP xArg, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP rollarg, SEXP rollendsArg, SEXP nomatchArg, SEXP multArg, SEXP opArg, SEXP nqgrpArg, SEXP nqmaxgrpArg) {
const bool verbose = GetVerbose();
double tic=0.0, tic0=0.0;
if (verbose)
tic = omp_get_wtime();
int xN, iN, protecti=0;
ctr=0; // needed for non-equi join case
SEXP retFirstArg, retLengthArg, retIndexArg, allLen1Arg, allGrp1Arg;
Expand Down Expand Up @@ -138,17 +142,15 @@ SEXP bmerge(SEXP iArg, SEXP xArg, SEXP icolsArg, SEXP xcolsArg, SEXP isorted, SE
allGrp1[0] = TRUE;
protecti += 2;

// isorted arg
o = NULL;
if (!LOGICAL(isorted)[0]) {
SEXP order = PROTECT(allocVector(INTSXP, length(icolsArg)));
protecti++;
for (int j=0; j<LENGTH(order); j++) INTEGER(order)[j]=1; // rep(1L, length(icolsArg))
SEXP oSxp = PROTECT(forder(i, icolsArg, ScalarLogical(FALSE), ScalarLogical(TRUE), order, ScalarLogical(FALSE)));
protecti++;
// TODO - split head of forder into C-level callable
if (!LENGTH(oSxp)) o = NULL; else o = INTEGER(oSxp);
}
SEXP order = PROTECT(allocVector(INTSXP, length(icolsArg)));
protecti++;
for (int j=0; j<LENGTH(order); j++)
INTEGER(order)[j]=1; // rep(1L, length(icolsArg))
SEXP oSxp = PROTECT(forder(i, icolsArg, ScalarLogical(FALSE), ScalarLogical(TRUE), order, ScalarLogical(FALSE), ScalarLogical(TRUE))); protecti++;
if (!LENGTH(oSxp))
o = NULL;
else
o = INTEGER(oSxp);

// xo arg
xo = NULL;
Expand All @@ -159,10 +161,14 @@ SEXP bmerge(SEXP iArg, SEXP xArg, SEXP icolsArg, SEXP xcolsArg, SEXP isorted, SE

// start bmerge
if (iN) {
if (verbose)
tic0 = omp_get_wtime();
// embarassingly parallel if we've storage space for nqmaxgrp*iN
for (int kk=0; kk<nqmaxgrp; kk++) {
bmerge_r(-1,xN,-1,iN,scols,kk+1,1,1);
}
if (verbose)
Rprintf("bmerge: looping bmerge_r took %.3fs\n", omp_get_wtime()-tic0);
}
ctr += iN;
if (nqmaxgrp > 1 && mult == ALL) {
Expand Down Expand Up @@ -193,6 +199,8 @@ SEXP bmerge(SEXP iArg, SEXP xArg, SEXP icolsArg, SEXP xcolsArg, SEXP isorted, SE
Free(retLength);
Free(retIndex);
}
if (verbose)
Rprintf("bmerge: took %.3fs\n", omp_get_wtime()-tic);
UNPROTECT(protecti);
return (ans);
}
Expand Down
5 changes: 3 additions & 2 deletions src/data.table.h
Original file line number Diff line number Diff line change
Expand Up @@ -123,7 +123,8 @@ int checkOverAlloc(SEXP x);
// forder.c
int StrCmp(SEXP x, SEXP y);
uint64_t dtwiddle(void *p, int i);
SEXP forder(SEXP DT, SEXP by, SEXP retGrp, SEXP sortStrArg, SEXP orderArg, SEXP naArg);
SEXP forderDo(SEXP DT, SEXP by, SEXP retGrp, SEXP sortStrArg, SEXP orderArg, SEXP naArg);
SEXP forder(SEXP DT, SEXP by, SEXP retGrp, SEXP sortStrArg, SEXP orderArg, SEXP naArg, SEXP lazyArg); // lazy wrapper to forderDo
int getNumericRounding_C();

// reorder.c
Expand Down Expand Up @@ -170,7 +171,7 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols,
SEXP on, SEXP verbose);

// bmerge.c
SEXP bmerge(SEXP iArg, SEXP xArg, SEXP icolsArg, SEXP xcolsArg, SEXP isorted,
SEXP bmerge(SEXP iArg, SEXP xArg, SEXP icolsArg, SEXP xcolsArg,
SEXP xoArg, SEXP rollarg, SEXP rollendsArg, SEXP nomatchArg,
SEXP multArg, SEXP opArg, SEXP nqgrpArg, SEXP nqmaxgrpArg);

Expand Down
Loading