data.table melt memory efficiency #27

tdhock · 2024-01-30T18:26:30Z

verify that this PR actually is more memory efficient Rdatatable/data.table#5054

Anirban166 · 2024-09-20T17:40:11Z

I created a test here but it shows no difference:

  # Improvement brought by: https://github.com/Rdatatable/data.table/pull/5054
  "melt improved in #5054" = atime::atime_test(
    N = 10^seq(1, 10),
    setup = {
      dt = data.table(x = sample(c(1:N, NA), N, replace = TRUE),
                      y = sample(c(letters, NA), N, replace = TRUE),
                      z = sample(c(NA, 1:N), N, replace = TRUE),
                      w = replicate(N, paste0(sample(letters, 20, replace = TRUE), collapse = "")),
                      v = I(lapply(1:N, function(i) list(sample(1:100, 10, replace = TRUE))))
                      )
    },
    expr = data.table:::melt(dt, id.vars = c("x", "z"), na.rm = TRUE),
    Slow = "fd24a3105953f7785ea7414678ed8e04524e6955", # Parent of the merge commit (https://github.com/Rdatatable/data.table/commit/ed72e398df76a0fcfd134a4ad92356690e4210ea) of the PR (https://github.com/Rdatatable/data.table/pull/5054) that brought the improvement
    Fast = "ed72e398df76a0fcfd134a4ad92356690e4210ea") # Merge commit of the PR (https://github.com/Rdatatable/data.table/pull/5054) that brought the improvement

Would you suggest me to revise this? (also, anything particular that I should test for here? I tried to have a lot of NAs as input, with na.rm = TRUE for melt)

tdhock · 2024-09-20T18:13:03Z

NAs in the input data table do not affect this PR, which improved efficiency when there were NAs in measure.vars argument of melt.

Using the code below

pkg.edit.fun = function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
vres <- atime::atime_versions(
  pkg.path="~/R/data.table",
  pkg.edit.fun=pkg.edit.fun,
  setup={
    DT <- as.data.table(as.list(1:N))
    measure.vars <- lapply(1:N,function(i){
      x=rep(NA,N)
      x[i]=i
      x
    })
  },
  expr=data.table:::melt(DT, measure.vars=measure.vars),
  Slow = "fd24a3105953f7785ea7414678ed8e04524e6955", # Parent of the merge commit (https://github.com/Rdatatable/data.table/commit/ed72e398df76a0fcfd134a4ad92356690e4210ea) of the PR (https://github.com/Rdatatable/data.table/pull/5054) that brought the improvement
  Fast = "ed72e398df76a0fcfd134a4ad92356690e4210ea")

I get the result below, which indicates a small constant factor speedup, but I don't see any difference in memory usage.

Anirban166 · 2024-09-20T18:22:08Z

Ah I should have had them in measure.vars my bad!

I get the result below, which indicates a small constant factor speedup, but I don't see any difference in memory usage.

Do you think it's worth adding a test case just for that improvement in time when we are anticipating a memory improvement? (I think we can skip this one, but up to you)
Thanks for having a look and making an example quick!

tdhock · 2024-09-20T18:25:45Z

yes please submit a PR

Anirban166 · 2024-09-20T18:26:24Z

Sure!

Anirban166 · 2024-09-24T01:49:52Z

Completed in Rdatatable/data.table#6526

Anirban166 closed this as completed Sep 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data.table melt memory efficiency #27

data.table melt memory efficiency #27

tdhock commented Jan 30, 2024

Anirban166 commented Sep 20, 2024

tdhock commented Sep 20, 2024

Anirban166 commented Sep 20, 2024

tdhock commented Sep 20, 2024

Anirban166 commented Sep 20, 2024

Anirban166 commented Sep 24, 2024

data.table melt memory efficiency #27

data.table melt memory efficiency #27

Comments

tdhock commented Jan 30, 2024

Anirban166 commented Sep 20, 2024

tdhock commented Sep 20, 2024

Anirban166 commented Sep 20, 2024

tdhock commented Sep 20, 2024

Anirban166 commented Sep 20, 2024

Anirban166 commented Sep 24, 2024