Skip sorting already sorted #4501

Merged: 18 commits merged into master on Jul 29, 2024
Conversation

@ColeMiller1 (Contributor) commented May 27, 2020

Closes #4498

library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.3

x <- as.data.table(as.character(rnorm(200000,1,0.5)))
setkey(x,V1)

##before fix
system.time(a1 <- x[,.(V1)])
#>    user  system elapsed 
#>    0.94    0.50    1.48
system.time(a2 <- x[, c('V1')])
#>    user  system elapsed 
#>       0       0       0
system.time(x[, 1])
#>    user  system elapsed 
#>       0       0       0
system.time(x[, .SD])
#>    user  system elapsed 
#>    0.72    0.52    1.41
system.time(x[, c(.SD, .(V1))])
#>    user  system elapsed 
#>    0.75    0.61    1.53

##after fix

system.time(a1 <- x[,.(V1)])
#>    user  system elapsed 
#>    0.01    0.00    0.01
system.time(a2 <- x[, c('V1')])
#>    user  system elapsed 
#>       0       0       0
system.time(x[, 1])
#>    user  system elapsed 
#>       0       0       0
system.time(x[, .SD])
#>    user  system elapsed 
#>       0       0       0
system.time(x[, c(.SD, .(V1))]) ##still slow!
#>    user  system elapsed 
#>    0.75    0.61    1.53
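For completeness, the point of #4498 is that the key should survive these selections; a minimal check (not part of the original timings; the key was retained before the fix too, the fix only skips the verification scan):

key(x)           # "V1" -- set by setkey() above
key(x[, .(V1)])  # "V1"
key(x[, .SD])    # "V1"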

codecov bot commented May 27, 2020
Codecov Report

Merging #4501 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #4501   +/-   ##
=======================================
  Coverage   99.60%   99.61%           
=======================================
  Files          73       73           
  Lines       14102    14116   +14     
=======================================
+ Hits        14047    14061   +14     
  Misses         55       55           
Impacted file: R/data.table.R, coverage 100.00% (ø, not affected)


R/data.table.R Outdated
- if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x)))) # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient)
+ if (haskey(x) &&
+     all(key(x) %chin% names(jval)) &&
+     ((is.null(irows) && jsub %iscall% "list" && all(vapply_1b(as.list(jsub)[-1L], is.name))) || ## fix for #4498
Member commented:
This misses cases like key=V1 with j=.(V1, f(V2)), as well as j=c(list(V1=V1), .SD).

The latter is messy; maybe it is best to let is.sorted happen there.

I was thinking to add this check much earlier, around where j is first analyzed; there must be some similar loop over j where we can tack this on.

FWIW, both logics miss something like key=V1, j=.(newV = V1), but again, I am not sure how much that is worth pursuing.
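To make those missed cases concrete (a hypothetical illustration; f stands in for an arbitrary per-column call):

library(data.table)
DT = data.table(V1 = 1:5, V2 = 5:1, key = "V1")
f  = sqrt   # stand-in; any column-wise function works here

# V1 is selected unchanged, so the result is still sorted by V1,
# but the all-bare-names check fails because f(V2) is a call:
DT[, .(V1, f(V2))]

# jsub is a call to c(), not list(), so the check misses this too:
DT[, c(list(V1 = V1), .SD)]

# and the renamed key column:
DT[, .(newV = V1)]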

@jangorecki (Member) commented May 27, 2020:

Agreed, we need a more reliable way to figure that out. Better to be more conservative and not retain the key when we cannot be sure, instead of recalculating the order.

R/data.table.R Outdated
  if (is.null(irows) && !is.null(shared_keys)) {
    setattr(jval, 'sorted', shared_keys)
    # potentially inefficient backup -- check if jval is sorted by key(x)
  } else if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x)))) { # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient)
Member commented:
I would have dropped this branch, but there are a few tests relying on it, e.g. when lapply(.SD, as.numeric) is applied to an integer column the result can still be sorted by the key, or when irows is present but is an ordered subset.

Do we check anywhere else whether irows is an ordered subset? Maybe an opportunity to leverage ALTREP to quickly determine that 1:n subsets are ordered.

@ColeMiller1 please have a look; I think the logic is improved, but maybe it could be improved further.
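In R terms, that ordered-subset idea could look like the following sketch (a hypothetical helper; CisOrderedSubset is the actual C routine, which also validates bounds):

# A row subset keeps key order when the indices are non-decreasing.
# For ALTREP compact sequences such as 1:n (R >= 3.5), base R can often
# answer is.unsorted() from metadata without scanning the vector.
irows_ordered <- function(irows) {
  is.null(irows) ||                        # no i clause: trivially ordered
    (!anyNA(irows) && !is.unsorted(irows))
}

irows_ordered(NULL)       # TRUE
irows_ordered(1:1000000)  # TRUE, cheap for a compact sequence
irows_ordered(c(3L, 1L))  # FALSE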

Member commented:

I see this above:

orderedirows = .Call(CisOrderedSubset, irows, nrow(x))  # TRUE when irows is NULL (i.e. no i clause).

but only in the !missingby region

@ColeMiller1 (Contributor, Author) commented May 27, 2020:

I like your new function. We may want to check for is.name(.SD) so we can pull out those .SD vars as well.

I think the is.sorted() call could be avoided if:

  1. There is a join and both X[Y] have keys in common, or
  2. !length(forder(irows)); edit: or, as you point out, we could do .Call(CisOrderedSubset, irows, nrow(x)) here, or outside of just the !missingby region.

I can work on it some this evening.

@@ -3641,7 +3641,7 @@ test(1118, dt[, lapply(.SD, function(y) weighted.mean(y, b2, na.rm=TRUE)), by=x]

 # a(nother) test of #295
 DT <- data.table(x=5:1, y=1:5, key="y")
-test(1119, is.null(key(DT[, list(z = y, y = 1/y)])))
+test(1119, key(DT[, list(z = y, y = 1/y)]), 'z')
@MichaelChirico (Member) commented May 27, 2020:
Before (as in, the reason for this test), I think [ was getting confused by the redefinition of y; now we re-map the key to z, correctly.

@jangorecki (Member) commented May 27, 2020

A possible alternative to the approach taken so far in this PR is to have a flag that is threaded all the way through [.data.table and set whenever the order can change. The current approach tries to figure that out in one place, which is prone to error.

In the first place, we should not try to retain the key in every case we do now, because that wouldn't be possible if we want to reduce the overhead.

  1. If we agree to be more conservative and not call is.sorted at all, then this will be a breaking change. I would postpone that to the next release.
  2. If we want to always retain the key, then we can only optimize the "easy cases" to not call is.sorted, which IMO is error prone. As this is not a regression in current devel, I would postpone it to the next release anyway.

Ultimately I would go for 1., as there is no point retaining the key at the cost of an extra sort: it is unknown whether that is desired, so imposing overhead in every case like this is unjustified.

@MichaelChirico (Member) commented:
Agree with the above. We can leave the simple static case as added here, which should cover most use cases, and plan to deprecate the is.sorted part in later releases -- it feels quite fragile and like it could be a huge gotcha on some queries unexpectedly.

Eventually we could make the logic more sophisticated -- I think it makes more sense to start with limited logic and add new cases, rather than start with broad logic and narrow it down. I would file a follow-up for that.

NEWS.md Outdated
@@ -109,6 +109,8 @@ unit = "s")

13. A relatively rare case of segfault when combining non-equi joins with `by=.EACHI` is now fixed, closes [#4388](https://github.com/Rdatatable/data.table/issues/4388).

14. Selecting keyed list columns will retain key without a performance penalty, closes [#4498](https://github.com/Rdatatable/data.table/issues/4498). Thanks to @user9439449 on StackOverflow for the report.
Member commented:

no need to link user via @ because it could eventually point to a different user here on github

Member replied:
I pinged them on SO to add their name if possible. will leave this for now

@mattdowle (Member) commented:
Thanks, @ColeMiller1, for having a good go at this one! Really bad performance here that you fixed -- high priority. But yes, I agree with the others that the direction in this PR is getting a bit fragile.
Root cause, iiuc, is that is.sorted is calling forder at all. I was surprised to see that. I seem to remember I might have added that -- sorry all. The comment on the line that calls is.sorted contains "# TO DO: ... tidy up and make efficient". What that meant was: make is.sorted efficient internally, by not calling forderv. Currently Cfsorted works efficiently on a single column. Like R's is.unsorted, it just loops through the vector and stops on the first >. That needs to be extended to multi-column, and made to support passing a subset of columns, too. Then is.sorted would be (finally) implemented as originally intended (multi-column and single-sweep) and we could use it without a performance penalty.
There's also a comment in Cfsorted that it can be parallelized. If n blocks are each sorted, and the boundaries of the blocks are in order too, then the whole vector is sorted. But just getting rid of the awful forderv call inside is.sorted should solve 99% of this bad performance. I'll take a look at expanding Cfsorted.
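As a sketch of what that extended Cfsorted would compute (written in R for readability; the real version would be C, and this ignores NAs and data.table's C-locale string comparison):

# Lexicographic single-sweep check over adjacent row pairs,
# stopping at the first out-of-order pair, like R's is.unsorted.
is_sorted_sweep <- function(DT, by = names(DT)) {
  if (nrow(DT) < 2L) return(TRUE)
  for (i in seq_len(nrow(DT) - 1L)) {
    for (col in by) {
      a = DT[[col]][i]
      b = DT[[col]][i + 1L]
      if (a < b) break          # this pair is ordered; move to the next pair
      if (a > b) return(FALSE)  # first violation: stop immediately
      # a == b: tie on this column, so compare the next key column
    }
  }
  TRUE
}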

@jangorecki (Member) commented May 27, 2020

@mattdowle even when improved as you described, is.sorted should be taken out of there, IMO. In the worst-case scenario, when the input is already sorted, is.sorted needs to loop over all elements anyway. The case reported in this issue is a single column, so it should already use the faster is.sorted, but because it still needs to iterate over all elements, it is slow. The user might not need a key, so we should not impose the overhead by default.

@mattdowle (Member) commented:
@jangorecki I think you're saying that is.sorted is dispatching to Cfsorted at the moment. If so, I don't think that's correct. is.sorted is being passed a list (a data.table), so the first branch happens, which calls forderv. The warning is happening and being suppressed by the caller. What's taking the huge amount of time in this one-column example is sorting the large number of unique strings.

@jangorecki (Member) replied:
Ah yes, you are right: it is a single column, but wrapped in a list. Even then, though, when using the faster is.sorted we need to loop over the whole input in case it is sorted.

@mattdowle (Member) commented May 27, 2020

Yes, but looping contiguously over the input (not jumping around it) is usually insignificant. In this case there is a huge number of unique strings, so when the CHARSXP pointers are not equal the strings have to be compared, which may not be insignificant here, but let's see.
If it isn't sorted, it'll return quickly.
If it is sorted, then even the insignificant contiguous scan is probably worth it, for the onward benefit of retaining the key. But let's see what the timing is for this worst-case very-many-unique-strings example before deciding.
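Both branches of that argument can be seen with base R's is.unsorted as a stand-in for the single-column fast path (a rough sketch; actual timings will vary by machine and locale):

x_sorted   = sort(as.character(rnorm(2e6, 1, 0.5)))
x_reversed = rev(x_sorted)

system.time(is.unsorted(x_sorted))    # worst case: a full contiguous scan
system.time(is.unsorted(x_reversed))  # returns at the first out-of-order pair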

R/data.table.R Outdated
  if (is.null(irows) && !is.null(shared_keys)) {
    setattr(jval, 'sorted', shared_keys)
    # potentially inefficient backup -- check if jval is sorted by key(x)
- } else if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x)))) { # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient)
+ } else if (haskey(x) && all(key(x) %chin% names(jval)) && (is.null(irows) || (!roll && .Call(CisOrderedSubset, irows, nrow(x))) || suppressWarnings(is.sorted(jval, by=key(x))))) { # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient)
Member commented:

Hmm, isn't this condition wrong, though? Since all we've checked is that key(x) shows up in names(jval), a query like this would pass, right?

DT = data.table(a=1, b=2, c=3, key='a,b,c')
DT[, .(a = 10:1, b = 1:10, c = 10:1)]

(1) DT is keyed
(2) a, b, c all show up as names in jval
(3) is.null(irows)

We still need to run is.sorted in this case, right?

Member replied:
^ confirmed on this example

Member replied:
@ColeMiller1 I guess we want this new condition inside the earlier branch, right?

if ((is.null(irows) || (!roll && .Call(CisOrderedSubset, irows, nrow(x)))) && !is.null(shared_keys)) {

@ColeMiller1 (Contributor, Author) commented May 28, 2020:

It is a fragile approach. I wanted to capture the example of c(.SD, .(newvar = 3+3)), but based on Matt's work on is.sorted, I do not think we need NSE to try to capture it. Will remove now.

Edit: note, it was too fragile and would have been changed regardless of Matt's work.

@MichaelChirico (Member) commented:

Redirecting @mattdowle's comment from #4508, since that discussion should be on this issue:

Are we comfortable it is safe, i.e. won't result in an invalid key? I was going to look at it next. This speed issue was very bad in this edge case, but at least the result was correct.

That's the rub, yes... with that in mind I'm tempted to roll back @ColeMiller1's latest commit: 21e68c0

I'm confident get_shared_keys is working as intended, and it's pretty low-risk -- when x is keyed by K1, ..., KN we re-apply the existing keys K1, ..., Kn (n <= N) when:

  • i is empty (is.null(irows))
  • j is a call to list
  • K1, ..., Kn show up in j as names directly, like .(a = K1) or .(K2)

K(n+1) is the first key that doesn't show up as a name.

That seems safe to me.
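A rough R paraphrase of that rule (my sketch, not the actual get_shared_keys source; it ignores .SD handling and only accepts a plain list() call):

shared_keys_sketch <- function(keys, jsub) {
  if (!identical(jsub[[1L]], as.name("list"))) return(NULL)
  args = as.list(jsub)[-1L]
  # the column each j element refers to, when it is a bare symbol
  refs = vapply(args, function(e) if (is.name(e)) as.character(e) else "", "")
  # output names: explicit names where given, else the symbol itself
  outs = names(args)
  if (is.null(outs)) outs = refs else outs[outs == ""] = refs[outs == ""]
  new_key = character(0)
  for (k in keys) {
    hit = match(k, refs)
    if (is.na(hit)) break  # K(n+1): first key column not referenced; stop
    new_key = c(new_key, outs[hit])
  }
  if (length(new_key)) new_key else NULL
}

shared_keys_sketch("y", quote(list(z = y, y = 1/y)))  # "z", matching test 1119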

21e68c0 adds a few more cases that pass the smell test but should at least be vetted more thoroughly:

  • j is .SD and the keys are available from .SDcols
  • i may be non-NULL, but it's an ordered subset

@MichaelChirico (Member) commented:

@ColeMiller1 want to have a go at adding a performance regression test here? :)

github-actions bot commented Jul 11, 2024

Comparison Plot

Generated via commit b594be7

Download link for the artifact containing the test results: atime-results.zip

Time taken to finish the standard R installation steps: 11 minutes and 58 seconds

Time taken to run atime::atime_pkg on the tests: 5 minutes and 7 seconds

@ColeMiller1 (Contributor, Author) commented:

I can no longer reproduce such large differences. I have been doing more with WSL, and that might be part of it. With 2 million rows, the patch is super fast (0.001s) while the parent is not the worst (0.1s) at performing x[, .SD].

I tried to put together a regression test, and while atime seems great, I don't know the best route to test on my local machine. Nor do I know if this follows best practices for the tests. For example, the smallest exponent I use is 5 (N = 10^5) because I am not sure we get much information from results much below 10,000 rows for this particular PR / issue.

  # Issue reported in https://github.com/Rdatatable/data.table/issues/4498
  # To be fixed in: https://github.com/Rdatatable/data.table/pull/4501
  "Skip sorting keyed values" = atime::atime_test(
    N = 10^seq(5, 7),
    setup = {
      L = as.data.table(as.character(rnorm(N, 1L, 0.5)))
      setkey(L, V1)
    },
    expr = {
      L[, .SD]  # setup defines L, not x
    },
    Before = "3ca83738d70d5597d9e168077f3768e32569c790", # Parent of the first commit in the PR that fixes the issue (https://github.com/Rdatatable/data.table/commit/3ca83738d70d5597d9e168077f3768e32569c790)
    After = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461"   # Last commit in the PR that fixes the issue (https://github.com/Rdatatable/data.table/pull/4501/commits)
  )

@tdhock (Member) commented Jul 12, 2024

Usually it is better to start with a very small N (10 = 10^1).
When in doubt, you can leave N blank, because the default N is usually pretty good: 2^seq(1, 20).
You can test on your local machine by just using the same arguments with atime::atime_versions:

vinfo <- atime::atime_versions(
  pkg.edit.fun = pkg.edit.fun,
  N = 10^seq(1, 7, by=0.5),
  setup = {
    L = as.data.table(as.character(rnorm(N, 1L, 0.5)))
    setkey(L, V1)
  },
  expr = {
    data.table:::`[.data.table`(L, , .SD)
  },
  Slow = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
  Fast = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461"  # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
)
plot(vinfo)
refs <- atime::references_best(vinfo)
plot(refs)
pred <- predict(refs)
plot(pred)

Also, you should use data.table:::`[.data.table`(x, i, j) instead of x[i, j], because the different versions work by substituting data.table:: with data.table.someSHAversion:: (a different package name is created and installed for each version, and to have that package work you need to provide pkg.edit.fun, https://github.com/Rdatatable/data.table/blob/master/.ci/atime/tests.R#L43).
Also, is there a historical commit that caused a regression? If so, please add that commit with the name "Regression".
Otherwise, if there is no regression (the code was always slow in the past), please change Before to Slow and After to Fast.
I have updated the wiki page https://github.com/Rdatatable/data.table/wiki/Performance-testing to document this.
In particular, please read https://github.com/Rdatatable/data.table/wiki/Performance-testing#documentation-of-historical-commits, which explains how to document the historical commits.
It looks to me like your version and documentation comment for the Before commit were incorrect; see my reasoning below.

@ColeMiller1 (Contributor, Author) replied:

Very helpful, Toby. Thanks!

I do not believe there was a specific commit that caused it. Rather, Michael identified that there was room to improve the logic so we could re-use an existing key in certain circumstances. Around the same time, Matt found a way to fix the poor performance of ordering so that the is.sorted() call was less expensive. That made this PR less important.

I am still learning about git. I assumed that when Michael merged master into this branch, the master commit at the time of the merge would serve as the "parent". Two parents were listed on that commit: one was the last commit on this branch and the other was the last commit on master. Maybe I'm overthinking it.
353dc7a

I actually had changed my sample from Slow to Before and Fast to After, so I guess I'm still getting the hang of this. For general regression testing, I'm not sure we would need additional commits, right? In this case, base=master and HEAD=fix_sorting_on_sorted is what we would need to review whether this commit is working. If I understand the intention of the graphs, adding Slow and Fast might clutter up future PRs. Alternatively, it might be nice if the additional information were shown only on the branch the test was made for. E.g., on this PR we might see something like "Skip sorting keyed values" with Slow/Fast commits on the graph, but on other PR branches we wouldn't.

Anyway, I couldn't get atime to work on my local machine; it wants a pkg.path. I looked at the documentation and I think I know what to do. But... should I continue with a performance test for this PR?

@tdhock (Member) commented Jul 13, 2024

Hi Cole, sorry about forgetting pkg.path in the code above; it should be the path to your clone of the data.table git repo.
I hear that you don't think this PR is a huge performance improvement after all, but the atime result I computed still looks like a pretty big improvement:
[atime comparison plot]

pkg.edit.fun = function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
vinfo <- atime::atime_versions(
  pkg.path="~/R/data.table",
  N=2^seq(1,20),
  pkg.edit.fun = pkg.edit.fun, 
  setup = {
    L = as.data.table(as.character(rnorm(N, 1, 0.5)))
    setkey(L, V1)
  },
  expr = {
    data.table:::`[.data.table`(L, , .SD)
  },
  Slow = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue 
  Fast = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue 
)
plot(vinfo)

@tdhock (Member) commented Jul 13, 2024

About the commit ids/docs, my issue was with this line:

    Before = "3ca83738d70d5597d9e168077f3768e32569c790" # Parent of the first commit in the PR that fixes the issue (https://github.com/Rdatatable/data.table/commit/3ca83738d70d5597d9e168077f3768e32569c790)

which is problematic because 3ca8373 is not the parent of the first commit in this PR; it is a parent of the last commit in this PR. Both commits are interesting, but make sure to document them correctly with the comments so we can verify their provenance. Here is the result I computed with them:
[atime comparison plot]
In agreement with your prediction that Matt's changes had already resulted in some speedups, we see that "master parent" is faster than "Slow parent of first". But we also see that "Fast last" and "second to last" are faster still, which means this PR is actually helpful relative to current master (not just old master at the time of the first commit in this PR). We should probably include both as historical Slow references. Maybe "Slow old" and "Slow new"?

vinfo <- atime::atime_versions(
  pkg.path="~/R/data.table",
  N=2^seq(1,20),
  pkg.edit.fun = pkg.edit.fun, 
  setup = {
    L = as.data.table(as.character(rnorm(N, 1, 0.5)))
    setkey(L, V1)
  },
  expr = {
    data.table:::`[.data.table`(L, , .SD)
  },
  "master\nparent"="3ca83738d70d5597d9e168077f3768e32569c790",
  "second\nto last"="739abe53d0722c2e71d59c14b414d6d56f213173",
  "Slow\nparent\nof first" = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue 
  "Fast\nLast" = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue 
)
plot(vinfo)

@tdhock (Member) commented Jul 13, 2024

I'm still a little confused about how the expr data.table:::`[.data.table`(L, , .SD) involves sorting; maybe you can clarify in the comments next to this performance test case?

@tdhock (Member) commented Jul 13, 2024

For general regression testing, I'm not sure we would need additional commits, right? In this case, base=master and HEAD=fix_sorting_on_sorted is what we would need to review if this commit is working. If I understand the intention of the graphs, adding Slow and Fast might clutter up future PRs.

It is true that Slow and Fast (and other historical references) are not strictly necessary to see future regressions, and they could add visual clutter, but they are also essential to answer an important question: is the current performance as good/bad as (or better than) the bad/good performance we have seen in the past (at the time the issue was created/fixed)?

@tdhock (Member) commented Jul 13, 2024

Running the predict method and then the plot method shows the data size N that the code is able to handle for a given time limit; 0.01 seconds is the default used here. We see that the PR (Fast Last) is more than 10x faster than current master (master parent), which is great!
[atime prediction plot]

@ColeMiller1 (Contributor, Author) replied:

I'm still a little confused about how the expr data.table:::`[.data.table`(L, , .SD) involves sorting, maybe you can clarify in the comments next to this performance test case?

Yeah, I can add some comments. The previous behavior was to be very cautious about setting keys: if dt had a key, all names were in jsub, and the result was verified to be sorted, then the key would be retained. In this PR, the physical verification of sorted-ness is removed, and the key is retained for a subset of key columns when there is no row subset (i.e., irows is NULL). This is the old method, which is still used:

if (haskey(x) && all(key(x) %chin% names(jval)) && is.sorted(jval, by=key(x)))

About the commit ids/docs my issue was with this line ... which is problematic because 3ca8373 is not the parent of the first commit in this PR. It is a parent of the last commit in this PR.

You are right - I should have updated the comment! I was too focused on what the branch parent was.

It is true that Slow and Fast (and other historical references) are not strictly necessary to see future regressions, and they could add visual clutter, but they also essential to answer an important question: is the current performance as good/bad (or better than) the bad/good performance we have seen in the past? (at the time the issue was created/fixed)

For me, the bad performance might still clutter things up in other PRs. After all, we fixed it! I can see your point about showcasing the good performance, i.e. the best it has been, although I would be OK without the "good" performance as well.

@tdhock (Member) commented Jul 14, 2024

If dt had a key, all names were in jsub, and it was verified to be sorted, a key would be retained. In this PR, the physical verification of sorted-ness is removed as well as retaining the key for a subset of key columns when there is no subset ...

So for the performance test, j=.SD means all names are in jsub, so the old code does a linear scan of the data to verify the soundness of the key, whereas the new code avoids that linear scan? For me it is surprising that the key is involved at all when there is no i nor by arg.

the bad performance might still clutter things up in other PRs. After all, we fixed it! ...

You are right about the clutter, but the benefit of being able to easily and immediately see the historical references outweighs the drawback of the clutter.

We want to be able to see whether future changes undo the historical fix. Most of the time, all of the different versions in a PR will be the same as the "Fast" or "Fixed" commit, but if there is some difference, we want to be able to easily and immediately see whether it is as bad as what we saw when the fix PR was merged.

A great example is the setDT performance test, which currently shows the result below. Without the "Slow" reference we might wonder whether CRAN is faster or slower than the historical bad performance we had seen in the past. But with that reference it is clear that CRAN performance is the same as the historical bad performance that we are trying to fix.
[setDT performance test plot]

@ColeMiller1 (Contributor, Author) replied:

So for the performance test, j=.SD so all names are in jsub, so the old code does a linear scan of the data to verify the sound-ness of the key? Whereas the new code avoids that linear scan? For me it is surprising that the key is involved at all when there is no i nor by arg.

The idea is more that the new DT can safely retain a key. I mostly understand why this was implemented: someone spent the time to create a key initially, and it would be good to retain it if possible.

dt = data.table(a = 1:2, key = 'a')
new_dt = dt[, .SD]   # or dt[, .(a)]

new_dt

## Key: <a>
##         a
##     <int>
##  1:     1
##  2:     2

address(dt) == address(new_dt)
## [1] FALSE

A test case (1102.3) where maybe we went too far in trying to retain keys is the one below. I would propose at this point that the user can decide to recreate the key.

dt[, lapply(.SD, function(x) as.integer(x * 1000))]

##  Key: <a>
##         a
##     <int>
##  1:  1000
##  2:  2000

@tdhock (Member) commented Jul 17, 2024

Hi, I added a performance test and resolved the conflicts; please review. In particular, the comment I wrote to explain the performance test expr arg is "New DT can safely retain key." Do you think that is sufficient?

@Anirban166 (Member) left a review comment:
LGTM for the test!

.ci/atime/tests.R: two review threads (outdated, resolved)
@MichaelChirico MichaelChirico added this to the 1.16.0 milestone Jul 28, 2024
@MichaelChirico (Member) commented:

Is the atime test here ready to go? I see some activity but comments still unresolved. @tdhock / @Anirban166

@tdhock (Member) left a review:
yes I approve

@MichaelChirico merged commit 680b5e8 into master on Jul 29, 2024 (5 checks passed).
@MichaelChirico deleted the fix_sorting_on_sorted branch on July 29, 2024 at 16:55.
@tdhock restored the fix_sorting_on_sorted branch on July 29, 2024 at 19:02.
@tdhock (Member) commented Jul 29, 2024

Restoring the branch because it is required in an atime performance test, which errored recently: https://github.com/Rdatatable/data.table/actions/runs/10149565942/job/28064770069?pr=6288

Error in value[[3L]](cond) : 
  Error in revparse_single(object, branch): Error in 'git2r_revparse_single': Requested object could not be found
 when trying to checkout 353dc7a6b66563b61e44b2fa0d7b73a0f97ca461

Restoring (un-deleting) this branch should fix that error.
Please try to remember not to delete branches which have historical commits that are relevant to performance testing.

@MichaelChirico (Member) replied:

Ah, thanks. It's hard to remember :)

Successfully merging this pull request may close these issues.

is.sorted is slow for determining whether jval is sorted by key