From 4684f67a2fe7563ddc0682555a564ed02c37e9ee Mon Sep 17 00:00:00 2001
From: Michael Chirico <chiricom@google.com>
Date: Thu, 29 Aug 2024 16:48:43 +0000
Subject: [PATCH] cbindlist

add cbind by reference, timing

R prototype of mergelist

wording

use lower overhead funs

stick to int32 for now, correct R_alloc

bmerge C refactor for codecov and one loop for speed

address revealed codecov gaps

refactor vecseq for codecov

seqexp helper, some alloccol export on C

bmerge codecov, types handled in R bmerge already

better comment seqexp

bmerge mult=error #655

multiple new C utils

swap if branches

explain new C utils

comments mostly

reduce conflicts to PR #4386

comment C code

address multiple matches during update-on-join #3747

Revert "address multiple matches during update-on-join #3747"

This reverts commit b64c0c3480fe9415bbda6729c361621e60da6e01.

merge.dt has temporarily mult arg, for testing

minor changes to cbindlist c

dev mergelist, for single pair now

add quiet option to cc()

mergelist tests

add check for names to perhaps.dt

rm mult from merge.dt method

rework, clean, polish multer, fix righ and full joins

make full join symmetric

mergepair inner function to loop on

extra check for symmetric

mergelist manual

ensure no df-dt passed where list expected

comments and manual

handle 0 cols tables

more tests

more tests and debugging

move more logic closer to bmerge, simplify mergepair

more tests

revert not used changes

reduce not needed checks, cleanup

copy arg behavior, manual, no tests yet

cbindlist manual, export both

cleanup processing bmerge to dtmatch

test function match order for easier preview

vecseq gets short-circuit

batch test allow browser

big cleanup

remmove unneeded stuff, reduce diff

more cleanup, minor manual fixes

add proper test scripts

Merge branch 'master' into cbind-merge-list

comment out not used code for coverage

more tests, some nocopy opts

rename sql test script, should fix codecov

simplify dtmatch inner branch

more precise copy, now copy only T or F

unused arg not yet in api, wording

comments and refer issues

codecov

hasindex coverage

codecov gap

tests for join using key, cols argument

fix missing import forderv

more tests, improve missing on handling

more tests for order of inner and full join for long keys

new allow.cartesian option, #4383, #914

reduce diff, improve codecov

reduce diff, comments

need more DT, not lists, mergelist 3+ tbls

proper escape heavy check

unit tests

more tests, address overalloc failure

mergelist and cbindlist retain index

manual, examples

fix manual

minor clarify in manual

retain keys, right outer join for snowflake schema joins

duplicates in cbindlist

recycling in cbindlist

escape 0 input in copyCols

empty input handling

closing cbindlist

vectorized _on_ and _join.many_ arg

rename dtmatch to dtmerge

vectorized args: how, mult
push down input validation
add support for cross join, semi join, anti join

full join, reduce overhead for mult=error

mult default value dynamic

fix manual

add "see details" to Rd

mention shared on in arg description

amend feedback from Michael

semi and anti joins will not reorder x columns

Merge branch 'master' into cbind-merge-list

spelling, thx to @jan-glx

check all new funs used and add comments

bugfix, sort=T needed for now

Merge branch 'master' into cbind-merge-list

Update NEWS.md

Merge branch 'master' into cbind-merge-list

Merge branch 'master' into cbind-merge-list

NEWS placement

numbering

ascArg->order

Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list

attempt to restore from master

Update to stopf() error style

Need isFrame for now

More quality checks: any(!x)->!all(x); use vapply_1{b,c,i}

really restore from master

try to PROTECT() before duplicate()

update error message in test

appease the rchk gods

extraneous space

missing ';'

use catf

simplify perhapsDataTableR

move sqlite.Rraw.manual into other.Rraw

simplify for loop

Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list
---
 R/data.table.R |  2 +-
 src/bmerge.c   | 60 +++++++++++++++++++++++++++++++++-----------------
 src/vecseq.c   | 18 ++++++++++-----
 3 files changed, 53 insertions(+), 27 deletions(-)

diff --git a/R/data.table.R b/R/data.table.R
index 7b48704a1..fe0f42a56 100644
--- a/R/data.table.R
+++ b/R/data.table.R
@@ -199,7 +199,7 @@ replace_dot_alias = function(e) {
     }
     return(x)
   }
-  if (!mult %chin% c("first","last","all")) stopf("mult argument can only be 'first', 'last' or 'all'")
+  if (!mult %chin% c("first", "last", "all")) stopf("mult argument can only be 'first', 'last' or 'all'")
   missingroll = missing(roll)
   if (length(roll)!=1L || is.na(roll)) stopf("roll must be a single TRUE, FALSE, positive/negative integer/double including +Inf and -Inf or 'nearest'")
   if (is.character(roll)) {
diff --git a/src/bmerge.c b/src/bmerge.c
index f6f640e71..47d6a1456 100644
--- a/src/bmerge.c
+++ b/src/bmerge.c
@@ -49,8 +49,10 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
   // iArg, xArg, icolsArg and xcolsArg
   idtVec = SEXPPTR_RO(idt);  // set globals so bmerge_r can see them.
   xdtVec = SEXPPTR_RO(xdt);
-  if (!isInteger(icolsArg)) internal_error(__func__, "icols is not integer vector"); // # nocov
-  if (!isInteger(xcolsArg)) internal_error(__func__, "xcols is not integer vector"); // # nocov
+  if (!isInteger(icolsArg))
+    internal_error(__func__, "icols is not integer vector"); // # nocov
+  if (!isInteger(xcolsArg))
+    internal_error(__func__, "xcols is not integer vector"); // # nocov
   if ((LENGTH(icolsArg)==0 || LENGTH(xcolsArg)==0) && LENGTH(idt)>0) // We let through LENGTH(i) == 0 for tests 2126.*
     internal_error(__func__, "icols and xcols must be non-empty integer vectors");
   if (LENGTH(icolsArg) > LENGTH(xcolsArg)) internal_error(__func__, "length(icols) [%d] > length(xcols) [%d]", LENGTH(icolsArg), LENGTH(xcolsArg)); // # nocov
@@ -60,10 +62,14 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
   iN = ilen = anslen = LENGTH(idt) ? LENGTH(VECTOR_ELT(idt,0)) : 0;
   ncol = LENGTH(icolsArg);    // there may be more sorted columns in x than involved in the join
   for(int col=0; col<ncol; col++) {
-    if (icols[col]==NA_INTEGER) internal_error(__func__, "icols[%d] is NA", col); // # nocov
-    if (xcols[col]==NA_INTEGER) internal_error(__func__, "xcols[%d] is NA", col); // # nocov
-    if (icols[col]>LENGTH(idt) || icols[col]<1) error(_("icols[%d]=%d outside range [1,length(i)=%d]"), col, icols[col], LENGTH(idt));
-    if (xcols[col]>LENGTH(xdt) || xcols[col]<1) error(_("xcols[%d]=%d outside range [1,length(x)=%d]"), col, xcols[col], LENGTH(xdt));
+    if (icols[col]==NA_INTEGER)
+      internal_error(__func__, "icols[%d] is NA", col); // # nocov
+    if (xcols[col]==NA_INTEGER)
+      internal_error(__func__, "xcols[%d] is NA", col); // # nocov
+    if (icols[col]>LENGTH(idt) || icols[col]<1)
+      error(_("icols[%d]=%d outside range [1,length(i)=%d]"), col, icols[col], LENGTH(idt));
+    if (xcols[col]>LENGTH(xdt) || xcols[col]<1)
+      error(_("xcols[%d]=%d outside range [1,length(x)=%d]"), col, xcols[col], LENGTH(xdt));
     int it = TYPEOF(VECTOR_ELT(idt, icols[col]-1));
     int xt = TYPEOF(VECTOR_ELT(xdt, xcols[col]-1));
     if (iN && it!=xt)
@@ -75,11 +81,14 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
   // rollArg, rollendsArg
   roll = 0.0; rollToNearest = FALSE;
   if (isString(rollarg)) {
-    if (strcmp(CHAR(STRING_ELT(rollarg,0)),"nearest") != 0) error(_("roll is character but not 'nearest'"));
-    if (ncol>0 && TYPEOF(VECTOR_ELT(idt, icols[ncol-1]-1))==STRSXP) error(_("roll='nearest' can't be applied to a character column, yet."));
+    if (strcmp(CHAR(STRING_ELT(rollarg,0)),"nearest") != 0)
+      error(_("roll is character but not 'nearest'"));
+    if (ncol>0 && TYPEOF(VECTOR_ELT(idt, icols[ncol-1]-1))==STRSXP)
+      error(_("roll='nearest' can't be applied to a character column, yet."));
     roll=1.0; rollToNearest=TRUE;       // the 1.0 here is just any non-0.0, so roll!=0.0 can be used later
   } else {
-    if (!isReal(rollarg)) internal_error(__func__, "roll is not character or double"); // # nocov
+    if (!isReal(rollarg))
+      internal_error(__func__, "roll is not character or double"); // # nocov
     roll = REAL(rollarg)[0];   // more common case (rolling forwards or backwards) or no roll when 0.0
   }
   rollabs = fabs(roll);
@@ -98,10 +107,14 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
   }
 
   // mult arg
-  if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "all")) mult = ALL;
-  else if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "first")) mult = FIRST;
-  else if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "last")) mult = LAST;
-  else internal_error(__func__, "invalid value for 'mult'"); // # nocov
+  if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "all"))
+    mult = ALL;
+  else if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "first"))
+    mult = FIRST;
+  else if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "last"))
+    mult = LAST;
+  else
+    internal_error(__func__, "invalid value for 'mult'"); // # nocov
 
   // opArg
   if (!isInteger(opArg) || length(opArg)!=ncol)
@@ -132,7 +145,8 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
     retLength = R_Calloc(anslen, int);
     retIndex = R_Calloc(anslen, int);
     // initialise retIndex here directly, as next loop is meant for both equi and non-equi joins
-    for (int j=0; j<anslen; j++) retIndex[j] = j+1;
+    for (int j=0; j<anslen; j++)
+      retIndex[j] = j+1;
   } else { // equi joins (or) non-equi join but no multiple matches
     retFirstArg = PROTECT(allocVector(INTSXP, anslen));
     retFirst = INTEGER(retFirstArg);
@@ -145,9 +159,11 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
   for (int j=0; j<anslen; j++) {
     // defaults need to populated here as bmerge_r may well not touch many locations, say if the last row of i is before the first row of x.
     retFirst[j] = nomatch;   // default to no match for NA goto below
-    // retLength[j] = 0;   // TO DO: do this to save the branch below and later branches at R level to set .N to 0
-    retLength[j] = nomatch==0 ? 0 : 1;
   }
+  // retLength[j] = 0;   // TO DO: do this to save the branch below and later branches at R level to set .N to 0
+  int retLengthVal = (int)(nomatch != 0);
+  for (int j=0; j<anslen; j++)
+    retLength[j] = retLengthVal;
 
   // allLen1Arg
   allLen1Arg = PROTECT(allocVector(LGLSXP, 1));
@@ -174,7 +190,8 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
   // xo arg
   xo = NULL;
   if (length(xoArg)) {
-    if (!isInteger(xoArg)) internal_error(__func__, "xoArg is not an integer vector"); // # nocov
+    if (!isInteger(xoArg))
+      internal_error(__func__, "xoArg is not an integer vector"); // # nocov
     xo = INTEGER(xoArg);
   }
 
@@ -391,10 +408,13 @@ void bmerge_r(int xlowIn, int xuppIn, int ilowIn, int iuppIn, int col, int thisg
       // final two 1's are lowmax and uppmax
     } else {
       int len = xupp-xlow-1+rollLow+rollUpp; // rollLow and rollUpp cannot both be true
-      if (mult==ALL && len>1) allLen1[0] = FALSE;
+      if (len>1) {
+        if (mult==ALL)
+          allLen1[0] = FALSE;                           // bmerge()$allLen1
+      }
       if (nqmaxgrp == 1) {
-        const int rf = (mult!=LAST) ? xlow+2-rollLow : xupp+rollUpp; // extra +1 for 1-based indexing at R level
-        const int rl = (mult==ALL) ? len : 1;
+        const int rf = (mult!=LAST) ? xlow+2-rollLow : xupp+rollUpp; // bmerge()$starts thus extra +1 for 1-based indexing at R level
+        const int rl = (mult==ALL) ? len : 1;                        // bmerge()$lens
         for (int j=ilow+1; j<iupp; j++) {   // usually iterates once only for j=ir
           const int k = o ? o[j]-1 : j;
           retFirst[k] = rf;
diff --git a/src/vecseq.c b/src/vecseq.c
index 9466c615b..707439ed4 100644
--- a/src/vecseq.c
+++ b/src/vecseq.c
@@ -10,9 +10,12 @@ SEXP vecseq(SEXP x, SEXP len, SEXP clamp)
   // Specially for use by [.data.table after binary search. Now so specialized that for general use
   // bit::vecseq is recommended (Jens has coded it in C now).
 
-  if (!isInteger(x)) error(_("x must be an integer vector"));
-  if (!isInteger(len)) error(_("len must be an integer vector"));
-  if (LENGTH(x) != LENGTH(len)) error(_("x and len must be the same length"));
+  if (!isInteger(x))
+    error(_("x must be an integer vector")); // # nocov
+  if (!isInteger(len))
+    error(_("len must be an integer vector")); // # nocov
+  if (LENGTH(x) != LENGTH(len))
+    error(_("x and len must be the same length")); // # nocov
   const int *ix = INTEGER(x);
   const int *ilen = INTEGER(len), nlen=LENGTH(len);
   int reslen = 0;
@@ -22,10 +25,13 @@ SEXP vecseq(SEXP x, SEXP len, SEXP clamp)
     reslen += ilen[i];
   }
   if (!isNull(clamp)) {
-    if (!isNumeric(clamp) || LENGTH(clamp)!=1) error(_("clamp must be a double vector length 1"));
+    if (!isNumeric(clamp) || LENGTH(clamp)!=1)
+      error(_("clamp must be a double vector length 1")); // # nocov
     double limit = REAL(clamp)[0];
-    if (limit<0) error(_("clamp must be positive"));
-    if (reslen>limit) error(_("Join results in %d rows; more than %d = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice."), reslen, (int)limit);
+    if (limit<0)
+      error(_("clamp must be positive")); // # nocov
+    if (reslen>limit)
+      error(_("Join results in %d rows; more than %d = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice."), reslen, (int)limit);
   }
   SEXP ans = PROTECT(allocVector(INTSXP, reslen));
   int *ians = INTEGER(ans);