Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data.table melt memory efficiency #27

Closed
tdhock opened this issue Jan 30, 2024 · 6 comments
Closed

data.table melt memory efficiency #27

tdhock opened this issue Jan 30, 2024 · 6 comments

Comments

@tdhock
Copy link
Owner

tdhock commented Jan 30, 2024

verify that this PR actually is more memory efficient Rdatatable/data.table#5054

@Anirban166
Copy link
Collaborator

I created a test here but it shows no difference:

image

  # Improvement brought by: https://github.com/Rdatatable/data.table/pull/5054
  "melt improved in #5054" = atime::atime_test(
    N = 10^seq(1, 10),
    setup = {
      dt = data.table(x = sample(c(1:N, NA), N, replace = TRUE),
                      y = sample(c(letters, NA), N, replace = TRUE),
                      z = sample(c(NA, 1:N), N, replace = TRUE),
                      w = replicate(N, paste0(sample(letters, 20, replace = TRUE), collapse = "")),
                      v = I(lapply(1:N, function(i) list(sample(1:100, 10, replace = TRUE))))
                      )
    },
    expr = data.table:::melt(dt, id.vars = c("x", "z"), na.rm = TRUE),
    Slow = "fd24a3105953f7785ea7414678ed8e04524e6955", # Parent of the merge commit (https://github.com/Rdatatable/data.table/commit/ed72e398df76a0fcfd134a4ad92356690e4210ea) of the PR (https://github.com/Rdatatable/data.table/pull/5054) that brought the improvement
    Fast = "ed72e398df76a0fcfd134a4ad92356690e4210ea") # Merge commit of the PR (https://github.com/Rdatatable/data.table/pull/5054) that brought the improvement

Would you suggest me to revise this? (also, anything particular that I should test for here? I tried to have a lot of NAs as input, with na.rm = TRUE for melt)

@tdhock
Copy link
Owner Author

tdhock commented Sep 20, 2024

NAs in the input data table do not affect this PR, which improved efficiency when there were NAs in measure.vars argument of melt.

Using the code below

pkg.edit.fun = function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
vres <- atime::atime_versions(
  pkg.path="~/R/data.table",
  pkg.edit.fun=pkg.edit.fun,
  setup={
    DT <- as.data.table(as.list(1:N))
    measure.vars <- lapply(1:N,function(i){
      x=rep(NA,N)
      x[i]=i
      x
    })
  },
  expr=data.table:::melt(DT, measure.vars=measure.vars),
  Slow = "fd24a3105953f7785ea7414678ed8e04524e6955", # Parent of the merge commit (https://github.com/Rdatatable/data.table/commit/ed72e398df76a0fcfd134a4ad92356690e4210ea) of the PR (https://github.com/Rdatatable/data.table/pull/5054) that brought the improvement
  Fast = "ed72e398df76a0fcfd134a4ad92356690e4210ea")

I get the result below, which indicates a small constant factor speedup, but I don't see any difference in memory usage.

image

@Anirban166
Copy link
Collaborator

Ah I should have had them in measure.vars my bad!

I get the result below, which indicates a small constant factor speedup, but I don't see any difference in memory usage.

Do you think it's worth adding a test case just for that improvement in time when we are anticipating a memory improvement? (I think we can skip this one, but up to you)
Thanks for having a look and making an example quick!

@tdhock
Copy link
Owner Author

tdhock commented Sep 20, 2024

yes please submit a PR

@Anirban166
Copy link
Collaborator

Sure!

@Anirban166
Copy link
Collaborator

Completed in Rdatatable/data.table#6526

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants