Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rbindlist protection stack overflow #5448

Merged
merged 13 commits into from
Jul 13, 2024
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,8 @@

7. `melt` returns an integer column for `variable` when `measure.vars` is a list of length=1, consistent with the documented behavior, [#5209](https://github.com/Rdatatable/data.table/issues/5209). Thanks to @tdhock for reporting and fixing. Any users who were relying on this behavior can change `measure.vars=list("col_name")` (output `variable` was column name, now is column index/integer) to `measure.vars="col_name"` (`variable` still is column name).

8. `rbindlist` could lead to a protection stack overflow when applied to a list containing many nested lists exceeding the pointer protection stack size, [#4536](https://github.com/Rdatatable/data.table/issues/4536). Thanks to @ProfFancyPants for reporting, and Benjamin Schwendinger for the fix.

## NOTES

1. `transform` method for data.table sped up substantially when creating new columns on large tables. Thanks to @OfekShilon for the report and PR. The implemented solution was proposed by @ColeMiller1.
Expand Down
12 changes: 12 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -18566,3 +18566,15 @@ test(2261.04, setNumericRounding(2L), 1L)
# or not an object is an invisible copy or not, and prints it anyways.
test(2261.05, capture.output(setNumericRounding(2L)), character(0))
setNumericRounding(old)

# rbindlist with many nested lists leads to protection stack overflow #4536
x = list(
data.table(V1="a", V4=1L, V5="", V14=TRUE),
data.table(V1="", V4=0L, V5=""),
data.table(V2=list(FALSE), V6="", V7=FALSE, V8=FALSE, V9="", V10="", V11=TRUE),
data.table(V2=list(list(V2A=FALSE, V2B="", V2C="" )), V6="", V7=TRUE, V8=FALSE, V12="", V9="", V15="", V10="", V16=FALSE, V11=FALSE, V17=FALSE),
data.table(V18="", V19="", V1=list(list(V1A="", V1B="", V1C="", V1D="", V1E="")), V20="", V21="", V3 = list(list(V3A=TRUE, V3B="", V3C=1L, V3D=0L, V3F=1L)), V13="", V14=TRUE)
)
N = 5e4
MichaelChirico marked this conversation as resolved.
Show resolved Hide resolved
y = x[rep(1:5, N)]
test(2262, rbindlist(y, fill=TRUE), rbindlist(x, fill=TRUE)[rep(1:5, N)])
4 changes: 3 additions & 1 deletion src/rbindlist.c
Original file line number Diff line number Diff line change
Expand Up @@ -515,13 +515,15 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg)
if (w==-1 || !length(thisCol=VECTOR_ELT(li, w))) { // !length for zeroCol warning above; #1871
writeNA(target, ansloc, thisnrow, false); // writeNA is integer64 aware and writes INT64_MIN
} else {
bool listprotect = false;
if ((TYPEOF(target)==VECSXP || TYPEOF(target)==EXPRSXP) && TYPEOF(thisCol)!=TYPEOF(target)) {
// do an as.list() on the atomic column; #3528
thisCol = PROTECT(coerceVector(thisCol, TYPEOF(target))); nprotect++;
thisCol = PROTECT(coerceVector(thisCol, TYPEOF(target))); listprotect = true;
}
// else coerces if needed within memrecycle; with a no-alloc direct coerce from 1.12.4 (PR #3909)
const char *ret = memrecycle(target, R_NilValue, ansloc, thisnrow, thisCol, 0, -1, idcol+j+1, foundName);
if (ret) warning(_("Column %d of item %d: %s"), w+1, i+1, ret);
if (listprotect) UNPROTECT(1); // earlier unprotect rbindlist calls with lots of lists #4536
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the UNPROTECT() precede the warning() for the case when options(warn=2)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but I guess we have a lot of cases where we either error or warn before calling UNPROTECT. Should we check with R-devel for the recommended handling of these cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we're meant to always UNPROTECT() before signalling error() or warning(), e.g. from WRE:

#include <R.h>
#include <Rinternals.h>
#include <R_ext/Parse.h>

SEXP menu_ttest3()
{
    char cmd[256];
    SEXP cmdSexp, cmdexpr, ans = R_NilValue;
    ParseStatus status;
   ...
    if(done == 1) {
        cmdSexp = PROTECT(allocVector(STRSXP, 1));
        SET_STRING_ELT(cmdSexp, 0, mkChar(cmd));
        cmdexpr = PROTECT(R_ParseVector(cmdSexp, -1, &status, R_NilValue));
        if (status != PARSE_OK) {
            UNPROTECT(2);
            error("invalid call %s", cmd);
        }
        /* Loop is needed here as EXPSEXP will be of length > 1 */
        for(int i = 0; i < length(cmdexpr); i++)
            ans = eval(VECTOR_ELT(cmdexpr, i), R_GlobalEnv);
        UNPROTECT(2);
    }
    return ans;
}

IINM {rchk} is supposed to help ferret those out, so I'm not sure we have many violations for the case of error(), can you easily find any?

We might want to report to {rchk} maintainer T. Kalibera about possibly catching usages like warning(); UNPROTECT(); as well by that package.

cc @jangorecki as well who has worked more extensively here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just took a glance at assign.c and between.c but both contain those "violations".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok let's file a follow-up

// e.g. when precision is lost like assigning 3.4 to integer64; test 2007.2
// TODO: but maxType should handle that and this should never warn
}
Expand Down
Loading