Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

last/first get argument na.rm #5168

Open
wants to merge 49 commits into
base: master
Choose a base branch
from
Open

last/first get argument na.rm #5168

wants to merge 49 commits into from

Conversation

ben-schwen
Copy link
Member

@ben-schwen ben-schwen commented Sep 18, 2021

Closes #4446
Closes #4239

Adds an na.rm argument with default na.rm=FALSE to first/last and their respective GForce optimized versions gfirst/glast.

Here should be noted that gfirst(na.rm=TRUE) and glast(na.rm=TRUE) return the first resp. last non NA value if such a value exists. If such a value does not exist gfirst/glast will still return NA similar to gmedian(), gvar() and gsd().

  • consistent behavior across optimization levels

@ben-schwen ben-schwen marked this pull request as draft September 18, 2021 14:10
@ben-schwen
Copy link
Member Author

ben-schwen commented Sep 18, 2021

edit: consistency issue fixed

options(datatable.verbose=TRUE)
DT = data.table(a=c(1:3,NA), b=1:2, c=c(1L, rep(NA, 3)))
options(datatable.optimize=2L)
DT[, last(.SD, na.rm=TRUE), b]
#> Argument 'by' after substitute: b
#> Finding groups using forderv ... forder.c received 4 rows and 1 columns
#> 0.000s elapsed (0.001s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 2
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization changed j from 'last(.SD, na.rm = TRUE)' to 'list(last(a, na.rm = TRUE), last(c, na.rm = TRUE))'
#> GForce optimized j to 'list(glast(a, na.rm = TRUE), glast(c, na.rm = TRUE))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> gforce eval took 0.000
#> 0.001s elapsed (0.001s cpu)
#>    b a  c
#> 1: 1 3  1
#> 2: 2 2 NA
options(datatable.optimize=1L)
DT[, last(.SD, na.rm=TRUE), b]
#> Argument 'by' after substitute: b
#> Finding groups using forderv ... forder.c received 4 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 2
#> 0.000s elapsed (0.001s cpu) 
#> lapply optimization changed j from 'last(.SD, na.rm = TRUE)' to 'list(last(a, na.rm = TRUE), last(c, na.rm = TRUE))'
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> 
#>   collecting discontiguous groups took 0.000s for 2 groups
#>   eval(j) took 0.000s for 2 calls
#> 0.000s elapsed (0.000s cpu)
#>    b a  c
#> 1: 1 3  1
#> 2: 2 2 NA
options(datatable.optimize=0L)
DT[, last(.SD, na.rm=TRUE), b]
#> Argument 'by' after substitute: b
#> Finding groups using forderv ... forder.c received 4 rows and 1 columns
#> 0.000s elapsed (0.001s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 2
#> 0.000s elapsed (0.000s cpu) 
#> All optimizations are turned off
#> Making each group and running j (GForce FALSE) ... last: using x[, lapply(.SD, last, na.rm=TRUE)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
#> last: using x[, lapply(.SD, last, na.rm=TRUE)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> last: using last(x[!is.na(x)]): na.rm=TRUE
#> 
#>   collecting discontiguous groups took 0.000s for 2 groups
#>   eval(j) took 0.001s for 2 calls
#> 0.001s elapsed (0.001s cpu)
#>    b a  c
#> 1: 1 3  1
#> 2: 2 2 NA

@codecov
Copy link

codecov bot commented Sep 18, 2021

Codecov Report

Merging #5168 (29c6825) into master (2f67531) will decrease coverage by 0.01%.
The diff coverage is 98.29%.

❗ Current head 29c6825 differs from pull request most recent head 966f00b. Consider uploading reports for the commit 966f00b to get more accurate results

@@            Coverage Diff             @@
##           master    #5168      +/-   ##
==========================================
- Coverage   99.51%   99.49%   -0.02%     
==========================================
  Files          78       77       -1     
  Lines       14756    14675      -81     
==========================================
- Hits        14684    14601      -83     
- Misses         72       74       +2     
Impacted Files Coverage Δ
R/last.R 95.55% <95.55%> (-4.45%) ⬇️
R/data.table.R 99.89% <100.00%> (+<0.01%) ⬆️
R/test.data.table.R 100.00% <100.00%> (ø)
src/gsumm.c 100.00% <100.00%> (ø)
src/init.c 100.00% <0.00%> (ø)
R/IDateTime.R 100.00% <0.00%> (ø)
src/idatetime.c

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ben-schwen ben-schwen marked this pull request as ready for review September 18, 2021 21:53
NEWS.md Outdated Show resolved Hide resolved
@ben-schwen
Copy link
Member Author

@mattdowle I fixed expected behavior by turning off apply optimization for first/last. I also clarified in man that first/last always select at least 1 element/row. We might also add an "if present" to the former to exclude the special cases about empty vectors and empty data.tables.

@mattdowle mattdowle added this to the 1.14.3 milestone Dec 13, 2021
…1 (melt) with 'match.call(fun, orig_call): invalid definition argument' after running the new tests at the prompt; so removed variables and created DT directly in new tests
R/data.table.R Outdated
}
# only lapply optimize if first/last has na.rm=FALSE see also #5168
headopt = jsub[[1L]] == "head" || jsub[[1L]] == "tail"
firstopt = jsub[[1L]] == "first" || jsub[[1L]] == "last" && !narm_arg(first, jsub) ## fix for #2030
Copy link
Member

@mattdowle mattdowle Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiuc, the || part here needs parens around it. and if so, there's a test missing because DT[, first(.SD, na.rm=TRUE), by=] is still being optimized but that's not intended. Will do ...
In fact I'm about to change it so that first(DT, na.rm=TRUE) does it column-wise, so then first(.SD, na.rm=TRUE) can be optimized with a consistent result. na.rm="row" can remove rows containing any NA.

@mattdowle mattdowle added the WIP label Jan 20, 2022
@MichaelChirico
Copy link
Member

we should really build something for this into test() itself...

filed #5429 to revisit this later

@mattdowle
Copy link
Member

mattdowle commented Aug 4, 2022

Agree. One thing is that we test verbose output for specific strings to ensure optimization is on or off when we think it should be. Maybe just a tweak to test() would overcome that. Plus moving the 2 or 3 relatively longer running sections into benchmark.Rraw. Those could be left in test.Rraw with a reduced size so they're still tested for correctness. Nothing else springs to mind. It's at under 1 minute so doubling that should be ok on CRAN.

@MichaelChirico
Copy link
Member

If it's only output to test differently we can do c(1 = "output to match under optimize=1") (or reverse). adding to issue thread...

@jangorecki
Copy link
Member

jangorecki commented Oct 6, 2022

"wip shift multiple n return data.table rather than list" commit.
@mattdowle it looks like a serious breaking change. Was that even requested anywhere? Current approach worked well, and also has smaller overhead. If one uses shift output as a list (even: cols := shift(...)), then it is actually better to keep it as a list rather than data.table, at least by default.

@mattdowle
Copy link
Member

mattdowle commented Oct 6, 2022

@jangorecki shift multiple n returning list rather than data.table was causing problems in this PR yes. Since a data.table is a list too, I don't see much of an issue. But as I wrote it is wip. Since the list returned contains vectors of the same length, marking it as a data.table so anything using it knows it has a regular shape makes some sense to me. I don't think calling it out as a 'serious' breaking change helps.

@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@MichaelChirico
Copy link
Member

@ben-schwen do you want to update this PR to master to proceed with review for 1.16.0?

@MichaelChirico
Copy link
Member

Ping @ben-schwen appreciate your help resolving conflicts here :)

@ben-schwen
Copy link
Member Author

Ping @ben-schwen appreciate your help resolving conflicts here :)

Will do. Not sure if I find the time today but definitely over the next week!

@MichaelChirico
Copy link
Member

MichaelChirico commented Aug 2, 2024

Got through part of the conflict resolution but ran out of time. committed with conflict markers still in place for now.

Done all but gsumm.c. The changes there are pretty substantial and hard to untangle. Maybe a rebase would be easier to do (untangling the changes one commit at a time), but will definitely consume a ton of time. I'm pulling this off the 1.16.0 milestone for now, but feel free to take it up & ping for review.

@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Aug 3, 2024
Copy link
Member

@HughParsonage HughParsonage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SET_TRUELENGTH is not API

@ben-schwen
Copy link
Member Author

Got through part of the conflict resolution but ran out of time. committed with conflict markers still in place for now.

Done all but gsumm.c. The changes there are pretty substantial and hard to untangle. Maybe a rebase would be easier to do (untangling the changes one commit at a time), but will definitely consume a ton of time. I'm pulling this off the 1.16.0 milestone for now, but feel free to take it up & ping for review.

That was my plan. Matt's change in gsumm.c look favorable and make sense but I want to split them into multiple PRs since they are hard to grasp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a na.rm argument to first()/last() FR: na.rm in first/last
5 participants