Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

last/first get argument na.rm #5168

Open
wants to merge 49 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
49 commits
Select commit Hold shift + click to select a range
f976102
new syntax
ben-schwen Sep 18, 2021
a2c067a
add coverage
ben-schwen Sep 18, 2021
8968895
added consistency
ben-schwen Sep 18, 2021
69d90a7
add coverage
ben-schwen Sep 18, 2021
6355d63
update .SD optimization
ben-schwen Oct 17, 2021
16b16f6
merge master
ben-schwen Oct 17, 2021
dcf49ba
coverage
ben-schwen Oct 17, 2021
7966d57
update docs
ben-schwen Oct 17, 2021
c5d2bc0
fix NEWS
ben-schwen Oct 17, 2021
be5a1f1
update tests
ben-schwen Oct 17, 2021
c6c1d42
turn on lapply optimization for head/tail
ben-schwen Oct 17, 2021
b04b51f
merge master
mattdowle Dec 13, 2021
91bae5e
Merge branch 'master' into last_narm
mattdowle Dec 16, 2021
3a45476
Merge branch 'master' into last_narm
mattdowle Dec 16, 2021
6d52148
Merge branch 'master' into last_narm
mattdowle Dec 21, 2021
384da9b
creating 'c' variable in new tests caused cc() to fail on test 1035.0…
mattdowle Dec 21, 2021
dad0537
remove && \!narm_arg(first, jsub) temporarily to confirm it wasn't te…
mattdowle Dec 31, 2021
58195f4
rework tests & last.R, add na.rm='row'
mattdowle Jan 20, 2022
d8e76e4
add TRUE/'row' to news item, add last(.SD) tests by group
mattdowle Jan 24, 2022
75676b7
.(last(one col), first(another col)) by group optimized
mattdowle Jan 24, 2022
e520cab
n>1 with na.rm=TRUE too added to gfirst and glast, reworked gforce_ok…
mattdowle Feb 8, 2022
0a07595
added 18.7s down to 0.1s example to news item
mattdowle Feb 8, 2022
792d948
test first/last of .SD with na.rm=TRUE and na.rm='row' by group with …
mattdowle Feb 12, 2022
a2292ff
test error if n<0
mattdowle Feb 12, 2022
e573daa
add to news item prior syntax for na.rm='row'
mattdowle Feb 12, 2022
4ca0838
cover error if na.rm is not FALSE, TRUE or 'row' when optimized and n…
mattdowle Feb 12, 2022
b3b102b
include all-NA list group and some NULL
mattdowle Feb 14, 2022
a37e734
final todo: NULL->NA in list columns when na.rm='row'
mattdowle Feb 14, 2022
3efdcb4
coverage
mattdowle Feb 14, 2022
498d168
add na.rm=row to last.Rd
mattdowle Feb 14, 2022
a9e14b9
address Michael's feedback
mattdowle Feb 16, 2022
31accc5
add details to ?last explaining why not na.rm='col' and asking for fe…
mattdowle Feb 16, 2022
cb8f201
fix edgy case n=0 thanks to Ben
mattdowle Feb 17, 2022
29c6825
coverage
mattdowle Feb 18, 2022
08d6414
return number of non-NA by group as attribute
mattdowle Feb 20, 2022
783d9d9
interim
mattdowle Mar 12, 2022
2aed4cc
merge master
mattdowle Mar 16, 2022
c28786a
Merge branch 'master' into last_narm
mattdowle Mar 17, 2022
94eb6ed
Added gforce_dynamic attrib returned by gfirstlast and gshift to gfor…
mattdowle Aug 2, 2022
394556a
merge master
mattdowle Aug 2, 2022
4964908
repeat tests with optimization off and on
mattdowle Aug 4, 2022
c9f5507
top align
mattdowle Aug 4, 2022
b0e6ba1
first/last return 'true vectors' so dogroups.c knows not to recycle l…
mattdowle Aug 7, 2022
9256ad7
comments only
mattdowle Aug 8, 2022
beded95
catch first/last n>1 := by group with helpful error
mattdowle Aug 11, 2022
966f00b
wip shift multiple n return data.table rather than list
mattdowle Sep 12, 2022
324ce38
Merge branch 'master' into last_narm
MichaelChirico Aug 2, 2024
0fb58b5
one more resolution
MichaelChirico Aug 3, 2024
48281d5
more resolution
MichaelChirico Aug 3, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
72 changes: 72 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,78 @@

18. New `setdroplevels()` as a by-reference version of the `droplevels()` method, which returns a copy of its input, [#6014](https://github.com/Rdatatable/data.table/issues/6014). Thanks @MichaelChirico for the suggestion and implementation.

19. `first()` and `last()` gain `na.rm` taking values `FALSE` (default), `TRUE` or `"row"`, [#4239](https://github.com/Rdatatable/data.table/issues/4239). For vector input, `TRUE` and `"row"` are the same. For `data.table|frame` input, `TRUE` returns the first/last non-NA observation in each column, while `"row"` returns the first/last row where all columns are non-NA. `TRUE` is optimized by group and `"row"` may be optimized by group in future. `n>1` with `na.rm=TRUE` is also optimized by group. Thanks to Nicolas Bennett and Michael Chirico for the requests, and Benjamin Schwendinger for the PR.

```R
x
# [1] NA 1 2 NA

first(x)
# NA

first(x, na.rm=TRUE)
# 1

last(x, na.rm=TRUE)
# 2

DT
# grp A B
# <int> <int> <int>
#1: 1 3 7
#2: 1 4 NA
#3: 2 5 NA
#4: 2 6 NA

last(DT, na.rm=TRUE)
# grp A B
# <int> <int> <int>
#1: 2 6 7

last(DT, na.rm="row")
# grp A B
# <int> <int> <int>
#1: 1 3 7

DT[, last(.SD, na.rm=TRUE), by=grp]
# grp A B
# <int> <int> <int>
#1: 1 4 7
#2: 2 6 NA

DT[, last(.SD, na.rm="row"), by=grp]
# grp A B
# <int> <int> <int>
#1: 1 3 7
#2: 2 NA NA

DT[, last(na.omit(.SD)), by=grp] # same as na.rm='row' but drops all-NA groups
# grp A B
# <int> <int> <int>
#1: 1 3 7

set.seed(1)
DT = data.table(id=rep(1:1e6, each=10),
v=sample(c(1:5,NA), 10e6, replace=TRUE))
DT
# id v
# <int> <int>
# 1: 1 2
# 2: 1 3
# 3: 1 4
# 4: 1 NA
# 5: 1 2
# ---
# 9999996: 1000000 3
# 9999997: 1000000 NA
# 9999998: 1000000 NA
# 9999999: 1000000 1
# 10000000: 1000000 4
ans1 = DT[, last(na.omit(v)), by=id] # 18.7 sec
ans2 = DT[, last(v, na.rm=TRUE), by=id] # 0.1 sec
identical(ans1, ans2) # TRUE
```

## BUG FIXES

1. `unique()` returns a copy the case when `nrows(x) <= 1` instead of a mutable alias, [#5932](https://github.com/Rdatatable/data.table/pull/5932). This is consistent with existing `unique()` behavior when the input has no duplicates but more than one row. Thanks to @brookslogan for the report and @dshemetov for the fix.
Expand Down
19 changes: 11 additions & 8 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -1653,7 +1653,8 @@ replace_dot_alias = function(e) {
(jsub %iscall% "[[" && is.name(jsub[[2L]]) && eval(call('is.atomic', jsub[[2L]]), x, parent.frame()))) &&
(is.numeric(jsub[[3L]]) || jsub[[3L]] == ".N")
headopt = jsub %iscall% c("head", "tail")
firstopt = jsub %iscall% c("first", "last") # fix for #2030
firstopt = jsub %iscall% c("first", "last") && # 2030, 4239
!identical(match.call(first, jsub)[["na.rm"]], "row") # first's signature same as last's
if ((length(jsub) >= 2L && jsub[[2L]] == ".SD") &&
(subopt || headopt || firstopt)) {
if (headopt && length(jsub)==2L) jsub[["n"]] = 6L # head-tail n=6 when missing #3462
Expand Down Expand Up @@ -1860,7 +1861,9 @@ replace_dot_alias = function(e) {
assign(".N", len__, thisEnv) # For #334
#fix for #1683
if (use.I) assign(".I", seq_len(nrow(x)), thisEnv)
ans = gforce(thisEnv, jsub, o__, f__, len__, irows) # irows needed for #971.
ans = gforce(thisEnv, jsub, o__, f__, len__, irows, # irows needed for #971
.Call(CsubsetVector, groups, grpcols), # just a list() subset to make C level neater; doesn't copy column contents
lhs) # for now this just prevents := with new feature first/last n>1; in future see TODO below
gi = if (length(o__)) o__[f__] else f__
g = lapply(grpcols, function(i) .Call(CsubsetVector, groups[[i]], gi)) # use CsubsetVector instead of [ to preserve attributes #5567

Expand Down Expand Up @@ -1922,10 +1925,10 @@ replace_dot_alias = function(e) {
# Grouping by by: i is by val, icols NULL, o__ may be subset of x, f__ points to o__ (or x if !length o__)
# TO DO: setkey could mark the key whether it is unique or not.
if (!is.null(lhs)) {
if (GForce) { # GForce should work with := #1414
vlen = length(ans[[1L]])
if (GForce) { # GForce should work with := #1414. TODO: move down into gforce at C level to save creating/rep'ing ans and grpcols wastefully
vlen = length(ans[[1L]]) # TODO: this might be ngrp when na.rm=TRUE and one group has 2 and another 0, so needs enhancing here (by passing all-1 back from gans?)
# replicate vals if GForce returns 1 value per group
jvals = if (vlen==length(len__)) lapply(tail(ans, -length(g)), rep, times=len__) else tail(ans, -length(g)) # see comment in #4245 for why rep instead of rep.int
jvals = if (vlen==length(len__)) lapply(tail(ans, -length(grpcols)), rep, times=len__) else tail(ans, -length(grpcols)) # see comment in #4245 for why rep instead of rep.int
jrows = vecseq(f__,len__,NULL)
if (length(o__)) jrows = o__[jrows]
if (length(irows)) jrows = irows[jrows]
Expand Down Expand Up @@ -3016,8 +3019,8 @@ gfuns = c(gdtfuns,
`g[` = `g[[` = function(x, n) .Call(Cgnthvalue, x, as.integer(n)) # n is of length=1 here.
ghead = function(x, n) .Call(Cghead, x, as.integer(n))
gtail = function(x, n) .Call(Cgtail, x, as.integer(n))
gfirst = function(x) .Call(Cgfirst, x)
glast = function(x) .Call(Cglast, x)
gfirst = function(x, n=1L, na.rm=FALSE) .Call(Cgfirst, x, as.integer(n), na.rm)
glast = function(x, n=1L, na.rm=FALSE) .Call(Cglast, x, as.integer(n), na.rm)
gsum = function(x, na.rm=FALSE) .Call(Cgsum, x, na.rm)
gmean = function(x, na.rm=FALSE) .Call(Cgmean, x, na.rm)
gweighted.mean = function(x, w, ..., na.rm=FALSE) {
Expand All @@ -3042,7 +3045,7 @@ gshift = function(x, n=1L, fill=NA, type=c("lag", "lead", "shift", "cyclic")) {
stopifnot(is.numeric(n))
.Call(Cgshift, x, as.integer(n), fill, type)
}
gforce = function(env, jsub, o, f, l, rows) .Call(Cgforce, env, jsub, o, f, l, rows)
gforce = function(env, jsub, o, f, l, rows, grpcols, lhs) .Call(Cgforce, env, jsub, o, f, l, rows, grpcols, lhs)

# GForce needs to evaluate all arguments not present in the data.table before calling C part #5547
# Safe cases: variables [i], calls without variables [c(0,1), list(1)] # TODO extend this list
Expand Down
148 changes: 73 additions & 75 deletions R/last.R
Original file line number Diff line number Diff line change
@@ -1,84 +1,82 @@
# data.table defined last(x) with no arguments, just for last. If you need the last 10 then use tail(x,10).
# for xts class objects it will dispatch to xts::last
# reworked to avoid loading xts namespace (#3857) then again to fix dispatching of xts class (#4053)
last = function(x, n=1L, ...) {
verbose = isTRUE(getOption("datatable.verbose", FALSE))
if (!inherits(x, "xts")) {
if (nargs()>1L) {
if ("package:xts" %chin% search()) {
if (verbose)
catf("%s: using %s: %s\n", "last", "xts::last", "!is.xts(x) & nargs>1 & 'package:xts'%in%search()")
xts::last(x, n=n, ...)
} else {
# nocov start
if (verbose)
catf("%s: using %s: %s\n", "last", "utils::tail", "!is.xts(x) & nargs>1 & !'package:xts'%in%search()")
utils::tail(x, n=n, ...)
# nocov end
}
# data.table originally defined first(x) and last(x) with no arguments just for the single
# first/last observation. Over time n= has been added since xts::last has n so now it makes
# sense to support n. The difference to head/tail is the default n=1 vs n=6, and
# that first/last are not generic for speed by group.

first = function(x, n=1L, na.rm=FALSE, ...) {
.firstlast(x, n=n, na.rm=na.rm, first=TRUE, ...)
}

last = function(x, n=1L, na.rm=FALSE, ...) {
.firstlast(x, n=n, na.rm=na.rm, first=FALSE, ...)
}

.firstlast = function(x, n=1L, na.rm=FALSE, first=TRUE, ...) {
if (inherits(x, "xts")) {
if (isTRUE(getOption("datatable.verbose", FALSE)))
catf("using %s\n", if (first) "xts::first" else "xts::last")
return((if (first) xts::first else xts::last)(x, n=n, na.rm=na.rm, ...))
}
stopifnot(isTRUEorFALSE(na.rm) || identical(na.rm,"row"))
stopifnot(is.numeric(n), length(n)==1L, n>=0L)
n = as.integer(n)
if (is.data.frame(x)) {
if (!nrow(x)) return(x)
if (identical(na.rm, "row")) { # any NA on the row removes that row
nna = which_(.Call(Cdt_na, x, seq_along(x)), bool=FALSE)
# very similar to na.omit.data.table
# TODO: n and first/last could be passed to Cdt_na and it could stop after finding n (it already does that in gsumm.c when gforce optimized)
nna = .firstlastVector(nna, n=n, first=first, na.rm=FALSE)
ans = .Call(CsubsetDT, x, nna, seq_along(x)) # works on DF too
} else {
dx = dim(x)
if (is.null(dx)) {
if (verbose)
catf("%s: using %s: %s\n", "last", "'x[[length(x)]]'", "!is.xts(x) & !nargs>1 & is.null(dim(x))")
lx = length(x)
if (!lx) x else x[[lx]]
} else if (is.data.frame(x)) {
if (verbose)
catf("%s: using %s: %s\n", "last", "'x[nrow(x),]'", "!is.xts(x) & !nargs>1 & is.data.frame(x)")
x[dx[1L], , drop=FALSE]
} else {
if (verbose)
catf("%s: using %s: %s\n", "last", "utils::tail", "!is.xts(x) & !nargs>1 & !is.null(dim(x)) & !is.data.frame(x)")
utils::tail(x, n=n, ...)
ans = lapply(x, .firstlastVector, n=n, first=first, na.rm=na.rm)
if (na.rm) {
l = vapply_1i(ans, length)
m = max(l)
for (i in which(l<m)) {
ans[[i]] = c(ans[[i]], rep(NA, m-l[i]))
}
# any row.names won't align to the values now in the result so don't retain them
}
}
} else {
if (!requireNamespace("xts", quietly=TRUE))
stopf("'xts' class passed to %s function but 'xts' is not available, you should have 'xts' installed already", "data.table::last") # nocov
if (verbose)
catf("%s: using %s: %s\n", "last", "xts::last", "is.xts(x)")
xts::last(x, n=n, ...)
if (is.data.table(x)) setDT(ans) else setDF(ans)
setattr(ans, "class", class(x))
if (!isTRUE(na.rm) && length(rn<-attr(x,"row.names")))
setattr(ans, "row.names", if (isFALSE(na.rm)) .firstlastVector(rn, n=n, first=first, na.rm=FALSE)
else rn[nna])
return(ans)
}
if (!length(x))
return(x)
if (!is.vector(x)) {
if (!isFALSE(na.rm))
stopf("na.rm=TRUE|'row' is not currently supported for '%s'", class(x)[1L])
return((if (first) utils::head else utils::tail)(x, n=n, ...)) # e.g. matrix
}
return(.firstlastVector(x, n=n, first=first, na.rm=!isFALSE(na.rm))) # !isFALSE to convert 'row' to TRUE
}

first = function(x, n=1L, ...) {
verbose = isTRUE(getOption("datatable.verbose", FALSE))
if (!inherits(x, "xts")) {
if (nargs()>1L) {
if ("package:xts" %chin% search()) {
if (verbose)
catf("%s: using %s: %s\n", "first", "xts::first", "!is.xts(x) & nargs>1 & 'package:xts'%in%search()")
xts::first(x, n=n, ...)
} else {
# nocov start
if (verbose)
catf("%s: using %s: %s\n", "first", "utils::head", "!is.xts(x) & nargs>1 & !'package:xts'%in%search()")
utils::head(x, n=n, ...)
# nocov end
}
} else {
dx = dim(x)
if (is.null(dx)) {
if (verbose)
catf("%s: using %s: %s\n", "first", "'x[[1L]]'", "!is.xts(x) & !nargs>1 & is.null(dim(x))")
lx = length(x)
if (!lx) x else x[[1L]]
} else if (is.data.frame(x)) {
if (verbose)
catf("%s: using %s: %s\n", "first", "'x[1L,]'", "!is.xts(x) & !nargs>1 & is.data.frame(x)")
if (!dx[1L]) x else x[1L, , drop=FALSE]
} else {
if (verbose)
catf("%s: using %s: %s\n", "first", "utils::head", "!is.xts(x) & !nargs>1 & !is.null(dim(x)) & !is.data.frame(x)")
utils::head(x, n=n, ...)
}
}
.firstlastVector = function(x, n, first, na.rm) {
if (!length(x)) return(x)
if (n==0L) return(x[0L])
ans = if (na.rm) {
nna = which_(if (is.list(x)) vapply_1b(x,function(y){is.null(y)||(length(y)==1L&&is.na(y))})
else is.na(x), bool=FALSE) # TODO: again, n and first/last could be passed to C here
if (!length(nna)) x[0L]
else {y=min(n,length(nna)); x[nna[if (first) seq.int(1L,y) else seq.int(length(nna)-y+1L,length(nna))]]}
} else {
if (!requireNamespace("xts", quietly=TRUE))
stopf("'xts' class passed to %s function but 'xts' is not available, you should have 'xts' installed already", "data.table::first") # nocov
if (verbose)
catf("%s: using %s: %s\n", "first", "xts::first", "is.xts(x)")
xts::first(x, n=n, ...)
y=min(n,length(x)); x[if (first) seq.int(1L,y) else seq.int(length(x)-y+1L,length(x))]
}
if (n>1L || na.rm) # n!=length(ans)
.Call("Csettruelength", ans, length(ans))
# for dogroups.c to know that shorter results (including when na.rm results in a length-1) should be padded with NA to match the length of longer items
# head and tail with na.rm=TRUE are by their nature returning a vector and therefore shouldn't be recycled when length-1; test 2240.81
# TODO: new function pad() could be provided so user can do things like DT[, .(pad(na.omit(B)), pad(na.omit(C))), by=grp]
# to avoid the error 'Supplied 2 items for column 1 of group 1 which has 3 rows ...'
# and/or pad= could be added to [.data.table to allow padding all results
# Since gforce_dynamic optimizes head/tail it knows to pad and that's optimized. However, default last(x) and first(x) (i.e. n=1 na.rm=FALSE) are
# single-valued like mean,median etc and are recycled in the same way. This is consistent with n=1 na.rm=FALSE already not being treated as
# gforce_dynamic in gsumm.c either. n=1 na.rm=TRUE returns empty when all-NA so is still a vector result not recycled when length-1.
ans
}

1 change: 1 addition & 0 deletions R/shift.R
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ shift = function(x, n=1L, fill, type=c("lag", "lead", "shift", "cyclic"), give.n
}
setattr(ans, "names", paste(rep(nx,each=length(n)), type, n, sep="_"))
}
if (length(n)>1L) setDT(ans)
ans
}

Expand Down
34 changes: 34 additions & 0 deletions R/test.data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -496,6 +496,40 @@ test = function(num,x,y=TRUE,error=NULL,warning=NULL,message=NULL,output=NULL,no
# nocov end
}
}
if (!fail) for (type in c("warning","error","message")) {
observed = actual[[type]]
expected = get(type)
if (type=="warning" && length(observed) && !is.null(ignore.warning)) {
# if a warning containing this string occurs, ignore it. First need for #4182 where warning about 'timedatectl' only
# occurs in R 3.4, and maybe only on docker too not for users running test.data.table().
stopifnot(length(ignore.warning)==1L, is.character(ignore.warning), !is.na(ignore.warning), nchar(ignore.warning)>=1L)
observed = grep(ignore.warning, observed, value=TRUE, invert=TRUE)
}
if (length(expected) != length(observed)) {
# nocov start
catf("Test %s produced %d %ss but expected %d\n%s\n%s\n", numStr, length(observed), type, length(expected),
paste("Expected:", expected, collapse="\n"),
paste("Observed:", observed, collapse="\n"))
fail = TRUE
# nocov end
} else {
# the expected type occurred and, if more than 1 of that type, in the expected order
for (i in seq_along(expected)) {
if (!foreign && !string_match(expected[i], observed[i])) {
# nocov start
catf("Test %s didn't produce the correct %s:\nExpected: %s\nObserved: %s\n", numStr, type, expected[i], observed[i])
fail = TRUE
# nocov end
}
}
}
}
if (fail && exists("out",inherits=FALSE)) {
# nocov start
catf("Output captured before unexpected warning/error/message:\n")
writeLines(out)
# nocov end
}
if (!fail && !length(error) && (!length(output) || !missing(y))) { # TODO test y when output=, too
capture.output(y <- try(y, silent=TRUE)) # y might produce verbose output, just toss it
if (identical(x,y)) return(invisible(TRUE))
Expand Down
Loading
Loading