last/first get argument na.rm #5168

ben-schwen · 2021-09-18T14:09:38Z

Closes #4446
Closes #4239

Adds an na.rm argument with default na.rm=FALSE to first/last and their respective GForce-optimized versions gfirst/glast.

Note that gfirst(na.rm=TRUE) and glast(na.rm=TRUE) return the first and last non-NA value, respectively, if such a value exists. If no such value exists, gfirst/glast still return NA, similar to gmedian(), gvar() and gsd().
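The semantics described above can be sketched in base R. This is only an illustration under stated assumptions, not the PR's implementation: `first_non_na()` and `last_non_na()` are hypothetical helper names that are not part of data.table.

```r
# Hypothetical helpers (not part of data.table) sketching the described
# na.rm=TRUE semantics: return the first/last non-NA value, or an NA of the
# input's type when every value is NA or the input is empty.
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep)) keep[1L] else x[NA_integer_]
}

last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep)) keep[length(keep)] else x[NA_integer_]
}

first_non_na(c(NA, 2L, 3L))      # first non-NA value
last_non_na(c(1L, NA, NA))       # last non-NA value
last_non_na(c(NA_integer_, NA))  # all NA: the result is still NA
```

Indexing with `NA_integer_` is used here so that the fallback NA keeps the type of the input vector.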

consistent behavior across optimization levels

ben-schwen · 2021-09-18T14:12:27Z

edit: consistency issue fixed

```r
options(datatable.verbose=TRUE)
DT = data.table(a=c(1:3,NA), b=1:2, c=c(1L, rep(NA, 3)))
options(datatable.optimize=2L)
DT[, last(.SD, na.rm=TRUE), b]
#> Argument 'by' after substitute: b
#> Finding groups using forderv ... forder.c received 4 rows and 1 columns
#> 0.000s elapsed (0.001s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 2
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization changed j from 'last(.SD, na.rm = TRUE)' to 'list(last(a, na.rm = TRUE), last(c, na.rm = TRUE))'
#> GForce optimized j to 'list(glast(a, na.rm = TRUE), glast(c, na.rm = TRUE))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> gforce eval took 0.000
#> 0.001s elapsed (0.001s cpu)
#>    b a  c
#> 1: 1 3  1
#> 2: 2 2 NA
options(datatable.optimize=1L)
DT[, last(.SD, na.rm=TRUE), b]
#> Argument 'by' after substitute: b
#> Finding groups using forderv ... forder.c received 4 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 2
#> 0.000s elapsed (0.001s cpu) 
#> lapply optimization changed j from 'last(.SD, na.rm = TRUE)' to 'list(last(a, na.rm = TRUE), last(c, na.rm = TRUE))'
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> 
#>   collecting discontiguous groups took 0.000s for 2 groups
#>   eval(j) took 0.000s for 2 calls
#> 0.000s elapsed (0.000s cpu)
#>    b a  c
#> 1: 1 3  1
#> 2: 2 2 NA
options(datatable.optimize=0L)
DT[, last(.SD, na.rm=TRUE), b]
#> Argument 'by' after substitute: b
#> Finding groups using forderv ... forder.c received 4 rows and 1 columns
#> 0.000s elapsed (0.001s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 2
#> 0.000s elapsed (0.000s cpu) 
#> All optimizations are turned off
#> Making each group and running j (GForce FALSE) ... last: using x[, lapply(.SD, last, na.rm=TRUE)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
#> last: using x[, lapply(.SD, last, na.rm=TRUE)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> 
#>   collecting discontiguous groups took 0.000s for 2 groups
#>   eval(j) took 0.001s for 2 calls
#> 0.001s elapsed (0.001s cpu)
#>    b a  c
#> 1: 1 3  1
#> 2: 2 2 NA
```

codecov · 2021-09-18T14:16:21Z

Codecov Report

Merging #5168 (29c6825) into master (2f67531) will decrease coverage by 0.01%.
The diff coverage is 98.29%.

❗ Current head 29c6825 differs from pull request most recent head 966f00b. Consider uploading reports for the commit 966f00b to get more accurate results

```diff
@@            Coverage Diff             @@
##           master    #5168      +/-   ##
==========================================
- Coverage   99.51%   99.49%   -0.02%     
==========================================
  Files          78       77       -1     
  Lines       14756    14675      -81     
==========================================
- Hits        14684    14601      -83     
- Misses         72       74       +2
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| R/last.R | `95.55% <95.55%> (-4.45%)` | ⬇️ |
| R/data.table.R | `99.89% <100.00%> (+<0.01%)` | ⬆️ |
| R/test.data.table.R | `100.00% <100.00%> (ø)` | |
| src/gsumm.c | `100.00% <100.00%> (ø)` | |
| src/init.c | `100.00% <0.00%> (ø)` | |
| R/IDateTime.R | `100.00% <0.00%> (ø)` | |
| src/idatetime.c | | |

ben-schwen · 2021-10-17T14:56:29Z

@mattdowle I fixed the expected behavior by turning off lapply optimization for first/last. I also clarified in the man page that first/last always select at least 1 element/row. We might also add an "if present" to the former to exclude the special cases of empty vectors and empty data.tables.

mattdowle · 2021-12-21T21:37:23Z

R/data.table.R

```diff
+      }
+      # only lapply optimize if first/last has na.rm=FALSE see also #5168
+      headopt =  jsub[[1L]] == "head"  || jsub[[1L]] == "tail"
+      firstopt = jsub[[1L]] == "first" || jsub[[1L]] == "last" && !narm_arg(first, jsub) ## fix for #2030
```


IIUC, the || part here needs parens around it. If so, there's a test missing, because DT[, first(.SD, na.rm=TRUE), by=] is still being optimized, which is not intended. Will do ...
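The precedence concern can be checked in plain R, where `&&` binds more tightly than `||`. The sketch below uses simplified stand-ins for the condition in the diff: `fn` and `narm` are hypothetical placeholders for `jsub[[1L]]` and the result of the na.rm check, and the two function names are made up for illustration.

```r
# `a || b && c` parses as `a || (b && c)`: the top-level call is `||`
# and its right-hand operand is the `&&` call.
expr <- quote(a || b && c)
as.character(expr[[1L]])        # "||"
as.character(expr[[3L]][[1L]])  # "&&"

# Hypothetical stand-ins for the firstopt condition, without and with parens.
firstopt_as_written  <- function(fn, narm) fn == "first" || fn == "last" && !narm
firstopt_with_parens <- function(fn, narm) (fn == "first" || fn == "last") && !narm

firstopt_as_written("first", narm = TRUE)   # TRUE: first() would still be optimized
firstopt_with_parens("first", narm = TRUE)  # FALSE: optimization is skipped
firstopt_as_written("last", narm = TRUE)    # FALSE: only last() is guarded as written
```

As written, the na.rm check only guards the `"last"` branch, which is consistent with the observation that the `first(.SD, na.rm=TRUE)` case is still optimized.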
In fact I'm about to change it so that first(DT, na.rm=TRUE) does it column-wise, so then first(.SD, na.rm=TRUE) can be optimized with a consistent result. na.rm="row" can remove rows containing any NA.
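The difference between the column-wise and row-wise readings can be illustrated in base R. This is a rough sketch of one interpretation of the comment, not the PR's code: a plain data.frame stands in for a data.table, and `first_non_na()` is a hypothetical helper.

```r
# Hypothetical helper: first non-NA value, or a typed NA if there is none.
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep)) keep[1L] else x[NA_integer_]
}

df <- data.frame(a = c(NA, 2L, 3L), c = c(1L, NA, 3L))

# Column-wise reading (na.rm=TRUE as described): each column independently
# contributes its own first non-NA value, so values may come from different rows.
colwise <- as.data.frame(lapply(df, first_non_na))

# Row-wise reading (na.rm="row" as described): drop rows containing any NA,
# then take the first remaining row.
rowwise <- df[complete.cases(df), , drop = FALSE][1L, , drop = FALSE]

colwise  # a = 2, c = 1
rowwise  # a = 3, c = 3
```

Because the column-wise result is computed one column at a time, it matches what a per-column rewrite of `first(.SD, na.rm=TRUE)` would produce, which appears to be the consistency the comment is after.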

MichaelChirico · 2022-08-04T02:23:54Z

we should really build something for this into test() itself...

filed #5429 to revisit this later

mattdowle · 2022-08-04T02:35:36Z

Agree. One thing is that we test verbose output for specific strings to ensure optimization is on or off when we think it should be. Maybe just a tweak to test() would overcome that. Plus moving the 2 or 3 relatively longer running sections into benchmark.Rraw. Those could be left in test.Rraw with a reduced size so they're still tested for correctness. Nothing else springs to mind. It's at under 1 minute so doubling that should be ok on CRAN.

MichaelChirico · 2022-08-04T02:40:04Z

If it's only output to test differently we can do c(1 = "output to match under optimize=1") (or reverse). adding to issue thread...

jangorecki · 2022-10-06T08:30:41Z

"wip shift multiple n return data.table rather than list" commit.
@mattdowle it looks like a serious breaking change. Was that even requested anywhere? Current approach worked well, and also has smaller overhead. If one uses shift output as a list (even: cols := shift(...)), then it is actually better to keep it as a list rather than data.table, at least by default.

mattdowle · 2022-10-06T17:23:57Z

@jangorecki shift multiple n returning list rather than data.table was causing problems in this PR yes. Since a data.table is a list too, I don't see much of an issue. But as I wrote it is wip. Since the list returned contains vectors of the same length, marking it as a data.table so anything using it knows it has a regular shape makes some sense to me. I don't think calling it out as a 'serious' breaking change helps.

MichaelChirico · 2024-02-19T03:15:47Z

@ben-schwen do you want to update this PR to master to proceed with review for 1.16.0?

MichaelChirico · 2024-07-27T19:00:05Z

Ping @ben-schwen appreciate your help resolving conflicts here :)

ben-schwen · 2024-07-28T07:07:23Z

> Ping @ben-schwen appreciate your help resolving conflicts here :)

Will do. Not sure if I find the time today but definitely over the next week!

MichaelChirico · 2024-08-02T23:40:02Z

~~Got through part of the conflict resolution but ran out of time. committed with conflict markers still in place for now.~~

Done all but gsumm.c. The changes there are pretty substantial and hard to untangle. Maybe a rebase would be easier to do (untangling the changes one commit at a time), but will definitely consume a ton of time. I'm pulling this off the 1.16.0 milestone for now, but feel free to take it up & ping for review.

HughParsonage

SET_TRUELENGTH is not API

ben-schwen · 2024-08-03T07:35:12Z

> ~~Got through part of the conflict resolution but ran out of time. committed with conflict markers still in place for now.~~
>
> Done all but gsumm.c. The changes there are pretty substantial and hard to untangle. Maybe a rebase would be easier to do (untangling the changes one commit at a time), but will definitely consume a ton of time. I'm pulling this off the 1.16.0 milestone for now, but feel free to take it up & ping for review.

That was my plan. Matt's changes in gsumm.c look favorable and make sense, but I want to split them into multiple PRs since they are hard to grasp.

new syntax

f976102

ben-schwen marked this pull request as draft September 18, 2021 14:10

ben-schwen added 3 commits September 18, 2021 16:21

add coverage

a2c067a

added consistency

8968895

add coverage

69d90a7

ben-schwen marked this pull request as ready for review September 18, 2021 21:53

mattdowle reviewed Sep 24, 2021

NEWS.md (outdated, resolved)

ben-schwen added 7 commits October 17, 2021 14:49

update .SD optimization

6355d63

merge master

16b16f6

coverage

dcf49ba

update docs

7966d57

fix NEWS

c5d2bc0

update tests

be5a1f1

turn on lapply optimization for head/tail

c6c1d42

merge master

b04b51f

mattdowle added this to the 1.14.3 milestone Dec 13, 2021

mattdowle added 4 commits December 15, 2021 17:42

Merge branch 'master' into last_narm

91bae5e

Merge branch 'master' into last_narm

3a45476

Merge branch 'master' into last_narm

6d52148

creating 'c' variable in new tests caused cc() to fail on test 1035.01 (melt) with 'match.call(fun, orig_call): invalid definition argument' after running the new tests at the prompt; so removed variables and created DT directly in new tests

384da9b

mattdowle reviewed Dec 21, 2021

mattdowle added 2 commits December 31, 2021 16:46

remove && !narm_arg(first, jsub) temporarily to confirm it wasn't tested

dad0537

rework tests & last.R, add na.rm='row'

58195f4

mattdowle added the WIP label Jan 20, 2022

mattdowle added 3 commits January 24, 2022 16:31

add TRUE/'row' to news item, add last(.SD) tests by group

d8e76e4

.(last(one col), first(another col)) by group optimized

75676b7

n>1 with na.rm=TRUE too added to gfirst and glast, reworked gforce_ok to use named arguments now that n and na.rm can be items 3 and 4 with either appearing first

e520cab

mattdowle added 3 commits August 2, 2022 02:00

Added gforce_dynamic attrib returned by gfirstlast and gshift to gforce at C level which now reps grpcols instead of R level

94eb6ed

merge master

394556a

repeat tests with optimization off and on

4964908

mattdowle added 3 commits August 4, 2022 01:08

top align

c9f5507

first/last return 'true vectors' so dogroups.c knows not to recycle length-1 and pad with NA

b0e6ba1

comments only

9256ad7

tdhock mentioned this pull request Aug 8, 2022

first function not working as fun.aggregate for dcast if fill argument is not provided #5390

Closed

mattdowle added 2 commits August 10, 2022 23:22

catch first/last n>1 := by group with helpful error

beded95

wip shift multiple n return data.table rather than list

966f00b

jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023

MichaelChirico mentioned this pull request Mar 26, 2024

Add options= to test(), convert most Rraw scripts #5845

Draft

Merge branch 'master' into last_narm

324ce38

MichaelChirico requested a review from HughParsonage as a code owner August 2, 2024 23:38

one more resolution

0fb58b5

MichaelChirico force-pushed the last_narm branch from 324ce38 to 966f00b Compare August 3, 2024 04:24

more resolution

48281d5

MichaelChirico modified the milestones: 1.16.0, 1.17.0 Aug 3, 2024

HughParsonage reviewed Aug 3, 2024
