Rdatatable · ben-schwen · Sep 18, 2021 · Sep 18, 2021 · Sep 18, 2021 · Sep 18, 2021
@@ -292,6 +292,78 @@
     # 2:     2    10
     ```
 
+40. `first()` and `last()` gain `na.rm` taking values `FALSE` (default), `TRUE` or `"row"`, [#4239](https://github.com/Rdatatable/data.table/issues/4239). For vector input, `TRUE` and `"row"` are the same. For `data.table|frame` input, `TRUE` returns the first/last non-NA observation in each column, while `"row"` returns the first/last row where all columns are non-NA. `TRUE` is optimized by group and `"row"` may be optimized by group in future. `n>1` with `na.rm=TRUE` is also optimized by group. Thanks to Nicolas Bennett and Michael Chirico for the requests, and Benjamin Schwendinger for the PR.
+
+    ```R
+    x
+    # [1] NA  1  2 NA
+
+    first(x)
+    # NA
+
+    first(x, na.rm=TRUE)
+    # 1
+
+    last(x, na.rm=TRUE)
+    # 2
+
+    DT
+    #     grp     A     B
+    #   <int> <int> <int>
+    #1:     1     3     7
+    #2:     1     4    NA
+    #3:     2     5    NA
+    #4:     2     6    NA
+
+    last(DT, na.rm=TRUE)
+    #     grp     A     B
+    #   <int> <int> <int>
+    #1:     2     6     7
+
+    last(DT, na.rm="row")
+    #     grp     A     B
+    #   <int> <int> <int>
+    #1:     1     3     7
+
+    DT[, last(.SD, na.rm=TRUE), by=grp]
+    #     grp     A     B
+    #   <int> <int> <int>
+    #1:     1     4     7
+    #2:     2     6    NA
+
+    DT[, last(.SD, na.rm="row"), by=grp]
+    #     grp     A     B
+    #   <int> <int> <int>
+    #1:     1     3     7
+    #2:     2    NA    NA
+
+    DT[, last(na.omit(.SD)), by=grp]  # same as na.rm='row' but drops all-NA groups
+    #     grp     A     B
+    #   <int> <int> <int>
+    #1:     1     3     7
+
+    set.seed(1)
+    DT = data.table(id=rep(1:1e6, each=10),
+                     v=sample(c(1:5,NA), 10e6, replace=TRUE))
+    DT
+    #                id     v
+    #             <int> <int>
+    #        1:       1     2
+    #        2:       1     3
+    #        3:       1     4
+    #        4:       1    NA
+    #        5:       1     2
+    #       ---              
+    #  9999996: 1000000     3
+    #  9999997: 1000000    NA
+    #  9999998: 1000000    NA
+    #  9999999: 1000000     1
+    # 10000000: 1000000     4
+    ans1 = DT[, last(na.omit(v)), by=id]       # 18.7 sec
+    ans2 = DT[, last(v, na.rm=TRUE), by=id]    #  0.1 sec
+    identical(ans1, ans2)                      # TRUE
+    ```
+
 ## BUG FIXES
 
 1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.
@@ -659,7 +731,7 @@
 
 1. Continuous daily testing by CRAN using latest daily R-devel revealed, within one day of the change to R-devel, that a future version of R would break one of our tests, [#4769](https://github.com/Rdatatable/data.table/issues/4769). The characters "-alike" were added into one of R's error messages, so our too-strict test which expected the error `only defined on a data frame with all numeric variables` will fail when it sees the new error message `only defined on a data frame with all numeric-alike variables`. We have relaxed the pattern the test looks for to `data.*frame.*numeric` well in advance of the future version of R being released. Readers are reminded that CRAN is not just a host for packages. It is also a giant test suite for R-devel. For more information, [behind the scenes of cran, 2016](https://www.h2o.ai/blog/behind-the-scenes-of-cran/).
 
-2. `as.Date.IDate` is no longer exported as a function to solve a new error in R-devel `S3 method lookup found 'as.Date.IDate' on search path`, [#4777](https://github.com/Rdatatable/data.table/issues/4777). The S3 method is still exported; i.e. `as.Date(x)` will still invoke the `as.Date.IDate` method when `x` is class `IDate`. The function had been exported, in addition to exporting the method, to solve a compatibility issue with `zoo` (and `xts` which uses `zoo`) because `zoo` exports `as.Date` which masks `base::as.Date`. Happily, since zoo 1.8-1 (Jan 2018) made a change to its `as.IDate`, the workaround is no longer needed.
+2. `as.Date.IDate` is no longer exported as a function to solve a new error in R-devel `S3 method lookup found 'as.Date.IDate' on search path`, [#4777](https://github.com/Rdatatable/data.table/issues/4777). The S3 method is still exported; i.e. `as.Date(x)` will still invoke the `as.Date.IDate` method when `x` is class `IDate`. The function had been exported, in addition to exporting the method, to solve a compatibility issue with `zoo` (and `xts` which uses `zoo`) because `zoo` exports `as.Date` which masks `base::as.Date`. Happily, since zoo 1.8-1 (Jan 2018) made a change to its `as.Date`, the workaround is no longer needed.
 
 3. Thanks to @fredguinog for testing `fcase` in development before 1.13.0 was released and finding a segfault, [#4378](https://github.com/Rdatatable/data.table/issues/4378). It was found separately by the `rchk` tool (which uses static code analysis) in release procedures and fixed before `fcase` was released, but the reproducible example has now been added to the test suite for completeness. Thanks also to @shrektan for investigating, proposing a very similar fix at C level, and a different reproducible example which has also been added to the test suite.
 

@@ -1620,8 +1620,9 @@ replace_dot_alias = function(e) {
         (jsub[[1L]] == "[" ||
            (jsub[[1L]] == "[[" && is.name(jsub[[2L]]) && eval(call('is.atomic', jsub[[2L]]), x, parent.frame()))) &&
         (is.numeric(jsub[[3L]]) || jsub[[3L]] == ".N")
-      headopt = jsub[[1L]] == "head" || jsub[[1L]] == "tail"
-      firstopt = jsub[[1L]] == "first" || jsub[[1L]] == "last" # fix for #2030
+      headopt =   jsub[[1L]]=="head"  || jsub[[1L]]=="tail"
+      firstopt = (jsub[[1L]]=="first" || jsub[[1L]]=="last") &&         # 2030, 4239
+                 !identical(match.call(first, jsub)[["na.rm"]], "row")  # first's signature same as last's
       if ((length(jsub) >= 2L && jsub[[2L]] == ".SD") &&
           (subopt || headopt || firstopt)) {
         if (headopt && length(jsub)==2L) jsub[["n"]] = 6L # head-tail n=6 when missing #3462
@@ -1738,22 +1739,37 @@ replace_dot_alias = function(e) {
         GForce = FALSE
       } else {
         # Apply GForce
-        .gforce_ok = function(q) {
-          if (dotN(q)) return(TRUE) # For #334
-          # run GForce for simple f(x) calls and f(x, na.rm = TRUE)-like calls where x is a column of .SD
+        .gforce_ok = function(q) {  # TODO: move outside and test directly
+          if (dotN(q)) return(TRUE) # #334
+          # run GForce for gfuns(x, ...) where x is a column of .SD 
           # is.symbol() is for #1369, #1974 and #2949
-          if (!(is.call(q) && is.symbol(q[[1L]]) && is.symbol(q[[2L]]) && (q1 <- q[[1L]]) %chin% gfuns)) return(FALSE)
-          if (!(q2 <- q[[2L]]) %chin% names(SDenv$.SDall) && q2 != ".I") return(FALSE)  # 875
-          if ((length(q)==2L || (!is.null(names(q)) && startsWith(names(q)[3L], "na")))) return(TRUE)
-          #                       ^^ base::startWith errors on NULL unfortunately
-          if (length(q)>=2L && q[[1L]] == "shift") {
-            q_named = match.call(shift, q)
-            if (!is.call(q_named[["fill"]]) && is.null(q_named[["give.names"]])) return(TRUE)
+          if (!(is.call(q) && is.symbol(q1<-q[[1L]]) && is.symbol(q2<-q[[2L]]) && q1 %chin% gfuns)) return(FALSE)
+          if (!q2 %chin% names(SDenv$.SDall) && q2!=".I") return(FALSE)  # 875
+          if (q1=="weighted.mean") return(TRUE)  #3977; the weight argument can be a symbol
+          if (length(q)==2L) return(TRUE)  # e.g. sum(colA) (i.e. no arguments)
+          # establish named arguments; e.g. if both na.rm and n have been passed to first/last, either one could appear first
+          f = get(q1)  # maybe a lookup table would be faster than get() but speed should be insignificant here up-front one-time
+          if (!is.primitive(f)) q = match.call(f, q)
+          # else the gfuns which are primitive functions (which match.call doesn't support and errors) are
+          # all of the form fun(..., na.rm=FALSE) so the na.rm arg has to be named in full by the user and
+          # so will be named in q already
+          for (argnum in seq.int(3L, length(q))) {
+            arg = if (is.null(names(q))) "" else names(q)[argnum]
+            if (arg=="na.rm") next
+            if (q1=="shift") {
+              if (arg=="fill" && (is.symbol(q$fill) || is.atomic(q$fill))) next
+              if (arg=="type" && (is.symbol(q$type) || is.atomic(q$type))) next # test 2224.01
+              if (arg=="n") next # negative n is supported by optimized shift, and in test 2224.01 n=-1 appears as a call to '-'()
+            }
+            if (arg=="n") {
+              if (!is.atomic(q$n) || !isTRUE(q$n>0L)) return(FALSE)  # n<=0 not optimized for first/last, [, [[
+              next
+            }
+            if (arg!="") return(FALSE)  # e.g. trim= and fill's give.names= are not yet optimized
+            if (length(q[[argnum]])!=1L || !is.atomic(q[[argnum]])) return(FALSE)  # test 173.1: DT[,B[B>3],by=A], and test 823: sum(b,a)
+            if (q1=="[[" && !eval(call('is.atomic', q2), envir=x)) return(FALSE)  # test 1581.16: dt[, .(l=l[[1L]]), by=a]
           }
-          if (length(q)>=3L && q[[1L]] == "weighted.mean") return(TRUE)  #3977
-          # otherwise there must be three arguments
-          length(q)==3L && length(q3 <- q[[3L]])==1L && is.numeric(q3) &&
-            ( (q1 %chin% c("head", "tail")) || ((q1 == "[" || (q1 == "[[" && eval(call('is.atomic', q[[2L]]), envir=x))) && q3>0L) )
+          TRUE
         }
         if (jsub[[1L]]=="list") {
           GForce = TRUE
@@ -1862,22 +1878,23 @@ replace_dot_alias = function(e) {
       }
     }
 
-    # adding ghead/gtail(n) support for n > 1 #5060 #523
-    q3 = 0
+    # ghead/gtail/gfirst/glast(n) support for n>1 #5060 #523 #5168
+    qn = 0L
     if (!is.symbol(jsub)) {
       headTail_arg = function(q) {
-        if (length(q)==3L && length(q3 <- q[[3L]])==1L && is.numeric(q3) &&
-         (q1 <- q[[1L]]) %chin% c("ghead", "gtail") && q3!=1) q3
-        else 0
+        if (length(q)<3L || !(q1=q[[1L]]) %chin% c("ghead", "gtail", "gfirst", "glast")) return(0L)
+        q = match.call(get(q1), q)
+        qn = q[["n"]] # not q$n because partial argument matching matches to na.rm when n isn't present
+        if (length(qn)==1L && is.numeric(qn) && qn!=1L) qn else 0L
       }
       if (jsub[[1L]] == "list"){
-        q3 = max(sapply(jsub, headTail_arg))
-      } else if (length(jsub)==3L) {
-        q3 = headTail_arg(jsub)
+        qn = max(sapply(jsub, headTail_arg))
+      } else if (length(jsub)>=3L) {
+        qn = headTail_arg(jsub)
       }
     }
-    if (q3 > 0) {
-      grplens = pmin.int(q3, len__)
+    if (qn > 0L) {
+      grplens = pmin.int(qn, len__)
       g = lapply(g, rep.int, times=grplens)
     } else if (.is_nrows(jsub)) {
       g = lapply(g, rep.int, times=len__)
@@ -2999,10 +3016,10 @@ rleidv = function(x, cols=seq_along(x), prefix=NULL) {
 gfuns = c("[", "[[", "head", "tail", "first", "last", "sum", "mean", "prod",
           "median", "min", "max", "var", "sd", ".N", "shift", "weighted.mean") # added .N for #334
 `g[` = `g[[` = function(x, n) .Call(Cgnthvalue, x, as.integer(n)) # n is of length=1 here.
-ghead = function(x, n) .Call(Cghead, x, as.integer(n)) # n is not used at the moment
-gtail = function(x, n) .Call(Cgtail, x, as.integer(n)) # n is not used at the moment
-gfirst = function(x) .Call(Cgfirst, x)
-glast = function(x) .Call(Cglast, x)
+ghead = function(x, n) .Call(Cghead, x, as.integer(n))
+gtail = function(x, n) .Call(Cgtail, x, as.integer(n))
+gfirst = function(x, n=1L, na.rm=FALSE) .Call(Cgfirst, x, as.integer(n), na.rm)
+glast = function(x, n=1L, na.rm=FALSE) .Call(Cglast, x, as.integer(n), na.rm)
 gsum = function(x, na.rm=FALSE) .Call(Cgsum, x, na.rm)
 gmean = function(x, na.rm=FALSE) .Call(Cgmean, x, na.rm)
 gweighted.mean = function(x, w, na.rm=FALSE) {

@@ -1,84 +1,80 @@
-# data.table defined last(x) with no arguments, just for last. If you need the last 10 then use tail(x,10).
-# for xts class objects it will dispatch to xts::last
-# reworked to avoid loading xts namespace (#3857) then again to fix dispatching of xts class (#4053)
-last = function(x, n=1L, ...) {
-  verbose = isTRUE(getOption("datatable.verbose", FALSE))
-  if (!inherits(x, "xts")) {
-    if (nargs()>1L) {
-      if ("package:xts" %chin% search()) {
-        if (verbose)
-          catf("%s: using %s: %s\n", "last", "xts::last", "!is.xts(x) & nargs>1 & 'package:xts'%in%search()")
-        xts::last(x, n=n, ...)
-      } else {
-        # nocov start
-        if (verbose)
-          catf("%s: using %s: %s\n", "last", "utils::tail", "!is.xts(x) & nargs>1 & !'package:xts'%in%search()")
-        utils::tail(x, n=n, ...)
-        # nocov end
-      }
-    } else {
-      dx = dim(x)
-      if (is.null(dx)) {
-        if (verbose)
-          catf("%s: using %s: %s\n", "last", "'x[[length(x)]]'", "!is.xts(x) & !nargs>1 & is.null(dim(x))")
-        lx = length(x)
-        if (!lx) x else x[[lx]]
-      } else if (is.data.frame(x)) {
-        if (verbose)
-          catf("%s: using %s: %s\n", "last", "'x[nrow(x),]'", "!is.xts(x) & !nargs>1 & is.data.frame(x)")
-        x[dx[1L], , drop=FALSE]
-      } else {
-        if (verbose)
-          catf("%s: using %s: %s\n", "last", "utils::tail", "!is.xts(x) & !nargs>1 & !is.null(dim(x)) & !is.data.frame(x)")
-        utils::tail(x, n=n, ...)
-      }
-    }
-  } else {
-    if (!requireNamespace("xts", quietly=TRUE))
-      stopf("'xts' class passed to %s function but 'xts' is not available, you should have 'xts' installed already", "data.table::last") # nocov
-    if (verbose)
-      catf("%s: using %s: %s\n", "last", "xts::last", "is.xts(x)")
-    xts::last(x, n=n, ...)
-  }
+# data.table originally defined first(x) and last(x) with no arguments just for the single
+# first/last observation. Over time n= has been added since xts::last has n so now it makes
+# sense to support n. The difference to head/tail is the default n=1 vs n=6, and
+# that first/last are not generic for speed by group.
+
+first = function(x, n=1L, na.rm=FALSE, ...) {
+  .firstlast(x, n=n, na.rm=na.rm, first=TRUE, ...)
+}
+
+last = function(x, n=1L, na.rm=FALSE, ...) {
+  .firstlast(x, n=n, na.rm=na.rm, first=FALSE, ...)
 }
 
-first = function(x, n=1L, ...) {
-  verbose = isTRUE(getOption("datatable.verbose", FALSE))
-  if (!inherits(x, "xts")) {
-    if (nargs()>1L) {
-      if ("package:xts" %chin% search()) {
-        if (verbose)
-          catf("%s: using %s: %s\n", "first", "xts::first", "!is.xts(x) & nargs>1 & 'package:xts'%in%search()")
-        xts::first(x, n=n, ...)
-      } else {
-        # nocov start
-        if (verbose)
-          catf("%s: using %s: %s\n", "first", "utils::head", "!is.xts(x) & nargs>1 & !'package:xts'%in%search()")
-        utils::head(x, n=n, ...)
-        # nocov end
+.firstlast = function(x, n=1L, na.rm=FALSE, first=TRUE, ...) {
+  if (inherits(x, "xts")) {
+    if (isTRUE(getOption("datatable.verbose", FALSE)))
+      catf("using %s\n", if (first) "xts::first" else "xts::last")
+    return((if (first) xts::first else xts::last)(x, n=n, na.rm=na.rm, ...))
+  }
+  stopifnot(isTRUEorFALSE(na.rm) || identical(na.rm,"row"))
+  stopifnot(is.numeric(n), length(n)==1L, n>=0L)
+  n = as.integer(n)
+  .headtail = if (first) utils::head else utils::tail
+  if (isFALSE(na.rm) || n==0L)
+    return(.headtail(x, n=n, ...))
+  if (is.data.frame(x)) {
+    if (!nrow(x)) return(x)
+    if (identical(na.rm, "row")) {   # any NA on the row removes that row
+      nna = which_(.Call(Cdt_na, x, seq_along(x)), bool=FALSE)
+      # from na.omit.data.table without calling na.omit which would subset all non-NA rows
+      # TODO: n and first/last could be passed to Cdt_na and it could stop after finding n
+      nna = .headtail(nna, n=n)
+      if (length(nna) < min(n,nrow(x))) {
+        # to match optimized na.rm=TRUE behavior; e.g. when .SD is one column
+        # TODO: extra argument all.na=NA|NULL (or pad.na=) could control this
+        pad = rep.int(NA, min(n,nrow(x))-length(nna))
+        # returning min(n,nrow(x)) is what optimized one-column does because GForce needs to be deterministic by group
+        # currently; i.e. number of items per group doesn't depend on how many NA there are
+        nna = if (first) c(nna, pad) else c(pad, nna)
       }
-    } else {
-      dx = dim(x)
-      if (is.null(dx)) {
-        if (verbose)
-          catf("%s: using %s: %s\n", "first", "'x[[1L]]'", "!is.xts(x) & !nargs>1 & is.null(dim(x))")
-        lx = length(x)
-        if (!lx) x else x[[1L]]
-      } else if (is.data.frame(x)) {
-        if (verbose)
-          catf("%s: using %s: %s\n", "first", "'x[1L,]'", "!is.xts(x) & !nargs>1 & is.data.frame(x)")
-        if (!dx[1L]) x else x[1L, , drop=FALSE]
-      } else {
-        if (verbose)
-          catf("%s: using %s: %s\n", "first", "utils::head", "!is.xts(x) & !nargs>1 & !is.null(dim(x)) & !is.data.frame(x)")
-        utils::head(x, n=n, ...)
+      ans = x[nna,,drop=FALSE]
+      # DT[NA] returns NULL for list columns. TODO: change [.data.table to return NA for list columns
+      # In the meantime, fix up the NULLs here in first/last
+      for (col in which(vapply_1b(ans, is.list))) {
+        for (i in which(vapply_1b(ans[[col]], is.null))) {
+          set(ans, i, col, NA)
+        }
       }
+      return(ans)
+    }
+    # else na.rm==TRUE; select the first/last non-NA within each column
+    ans = lapply(x, .narmVector, n=n, first=first)
+    l = vapply_1i(ans, length)
+    m = min(n, nrow(x))
+    for (i in which(l<m)) {  # pad with NA
+      ans[[i]] = if (first) c(ans[[i]], rep(NA, m-l[i]))
+                 else       c(rep(NA, m-l[i]), ans[[i]])
     }
-  } else {
-    if (!requireNamespace("xts", quietly=TRUE))
-      stopf("'xts' class passed to %s function but 'xts' is not available, you should have 'xts' installed already", "data.table::first") # nocov
-    if (verbose)
-      catf("%s: using %s: %s\n", "first", "xts::first", "is.xts(x)")
-    xts::first(x, n=n, ...)
+    if (is.data.table(x)) setDT(ans) else setDF(ans)
+    setattr(ans, "class", class(x))
+    return(ans)
   }
+  if (!length(x))
+    return(x)
+  if (is.vector(x) && !isFALSE(na.rm))
+    return(.narmVector(x, n=n, first=first))
+  if (!isFALSE(na.rm))
+    stopf("na.rm=TRUE|'row' is not currently supported for '%s'", class(x)[1L])
+  .headtail(x, n=n, ...)  
+  # TODO when n=1, return(x[length(x)]) would save method dispatch overhead
+  # TODO and previous version had lx = length(x); if (!lx) x else x[[lx]].  So empty input returned empty
 }
+
+.narmVector = function(x, n, first) {
+  nna = which_(is.na(x) | (is.list(x) & vapply_1b(x,is.null)), bool=FALSE)   # TODO: again, n and first/last could be passed to C here
+  if (!length(nna)) if (is.list(x)) list(NA) else x[NA_integer_]
+  else if (n==1L)   x[nna[if (first) 1L else length(nna)]]
+  else              x[(if (first) utils::head else utils::tail)(nna, n)]  # TODO: avoid dispatch here and do ourselves since just a vector
+}
+