
Fix fwrite length for gzip output #6393

Open
wants to merge 43 commits into base: master

Conversation

philippechataignon
Contributor

@philippechataignon commented Aug 23, 2024

Closes #6356. Closes #5506.

This PR is an attempt to create a better gzip file with fwrite. Its an important rewrite because it includes some refactoring of actual code.

zlib

  • use Z_SYNC_FLUSH instead of Z_FINISH
  • write the gzip header manually and put the CRC and length in the trailer
  • compute the length and CRC per thread and combine them in the main thread
  • the length field in the gzip specification is 32 bits, so it is stored modulo 2^32 for uncompressed sizes > 4 GiB

C code

  • simplify the implementation with a single #pragma omp parallel for for the chunk loop and #pragma omp ordered for the writing and summarizing part.
  • Matt Dowle introduced the use of a pool of buffers; the idea is generalized here. The pools are created at the beginning and then used for writing headers and rows. All mallocs occur early, and no separate header buffer is needed.
  • Deobfuscate some parts: notably, if () is followed by a new line, and no =- or =*. A lot of work remains. Use the indent command?
  • Remove some old debug code (msg).


github-actions bot commented Aug 23, 2024

Comparison Plot

Generated via commit cdf4277

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 3 minutes and 29 seconds

Time taken to run atime::atime_pkg on the tests: 9 minutes and 22 seconds

@oliverfoster

oliverfoster commented Aug 23, 2024

len in gzip specification is 32 bits and then is false for uncompressed size > 4GiB

Is it not meant to be the length modulo 4 GiB rather than false? That is, the remainder after dividing by 4 GiB?

[Screenshot: ISIZE field definition from the gzip file format specification]

@philippechataignon
Contributor Author

philippechataignon commented Aug 24, 2024

You're right: this PR version stores the length modulo 2**32 as specified, but it's not the real size.
Note that in new versions of gzip (1.12 and later), the length stored in the file is no longer used. See the note on gzip -l at https://savannah.gnu.org/news/?id=10121

7z l # (use header)
    Date      Time    Attr         Size   Compressed  Name
                     .....   1298954309   2212092276  mtcars.csv

% gzip --version                                                                                                                                                    
gzip 1.12
% gzip -lv mtcars.csv.gz  # (decompresses, and takes much more time)
method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 5cd79282 Aug 24 10:19          2212092276          9888888901  77.6% mtcars.csv

@philippechataignon
Contributor Author

Merged PR #5513 into this PR, with the new param compressLevel.

@Anirban166
Member

Anirban166 commented Sep 2, 2024

It definitely is appropriate!

we want to make sure it hasn't slowed down.

For me, this is the main point of the regression tests (i.e., not strictly aiming to target historical cases, but rather future-proofing the current performance and ensuring things do not become slower in the long run), so not having a 'Slow' version/label at present sounds acceptable.

@MichaelChirico
Member

not having a 'Slow' version/label at present sounds acceptable

OK, I think that's where I was stuck -- which labels are appropriate for such a use case.

I guess I thought some of them were required, but looking at atime_test() help I guess all of them are optional:

https://github.com/tdhock/atime/blob/main/man/atime_versions.Rd

What labels are best here, then, Before and After? But in 2 years it becomes "before and after what?"

@Anirban166
Member

I suppose we'll have to see if this makes it faster or just retains the same performance - if it's the former, then clearly 'Slow'/'Fast', but if it's the latter (which is what we're at right now), then I'd say maybe 'Before' and 'Current'?

But in 2 years it becomes "before and after what?"

Yup 'After' will become redundant soon in context

@Anirban166
Member

'Before' too would become redundant, as you noted. Since the main reason for having these tests is a significant refactor, and we could have any number of those in the long run, I think adding more context to the label names makes sense (it makes them longer, but I think we can find a fine line between readability and enough context to follow up)

Currently, I'm thinking 'Baseline' and 'Post-gzip refactor'

@tdhock
Member

tdhock commented Sep 3, 2024

What labels are best here, then, Before and After? But in 2 years it becomes "before and after what?"

I documented the current process here, https://github.com/Rdatatable/data.table/wiki/Performance-testing
it says "make sure to use the names Slow/Fast, or Before/Regression/Fixed, which are assigned special colors, whereas any other names are shown in grey."

So in this case, since there is no regression, Before/Regression/Fixed should not be used, to avoid confusion with other cases that actually are real historical regressions.

I would suggest Fast for the old commit before this PR. (no Slow necessary)

@Anirban166
Member

Sounds reasonable for this PR (to compare 'Fast' and 'HEAD'), but then will we only be running the old commit for the regression test we make for this case after this PR has been merged?

@MichaelChirico
Member

I documented the current process here,

Thanks Toby, I had looked at the .ci/atime/tests.R script and some {atime} documentation directly and didn't think to check the Wiki. Should we maybe (1) migrate that documentation into .ci/atime directly, (2) add a .ci/atime/README.md pointing to the Wiki, or (3) point to the Wiki from the first line of .ci/atime/tests.R?

@MichaelChirico
Member

@philippechataignon do you want to have a go at adding an atime performance regression test? Totally fine if not -- what would help at least would be a simple benchmark of gzipped fwrite that you think captures the important pieces of what's changed here. Does that make sense?

@tdhock
Member

tdhock commented Sep 3, 2024

yes that would be great to Point to the Wiki from the first line of .ci/atime/tests.R

I would suggest keeping docs on the wiki, which is easier to update, and include screenshots/graphics.

@philippechataignon
Contributor Author

@philippechataignon do you want to have a go at adding an atime performance regression test? Totally fine if not -- what would help at least would be a simple benchmark of gzipped fwrite that you think captures the important pieces of what's changed here. Does that make sense?

OK for testing regression, but note that the core of fwrite hasn't changed: same buffer sizes, same number of jobs, same number of rows per job. Personally, I observe timings similar to the previous version's.

One point of discussion: I notice that #2020 introduced a change that I had never realized before this PR. By default, fwrite's scipen parameter uses the value of the R option scipen. Like many people, I have options(scipen = 999) in my Rprofile because I don't like scientific notation on screen, but I never realized that fwrite was penalized. Why? Because of maxLineLen, which can become very high and produces a lot of batches with few rows and barely used chunks. In fwrite, scipen is capped at 350, but that's still a very high limit. And maxLineLen is used to determine the number and sizes of fwrite chunks.

For testing the impact, I use this little program:

library(data.table)
n = 10000
ncol = 1000
dt <- data.table(i=1:n)
dt[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) as.numeric(sample(1:n, replace=TRUE)))]
print(sessionInfo())
system.time(fwrite(dt, "/dev/null", compress="gzip", verbose=TRUE, scipen=0))
system.time(fwrite(dt, "/dev/null", compress="gzip", verbose=TRUE, scipen=999))

With scipen = 0

maxLineLen=61026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 73 batches of 137 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=21719798 (20 MiB), ratio=44.4%, crc=26a23b0a
Written 10000 rows in 1.569 secs using 4 threads. MaxBuffUsed=7%

With scipen = 999

maxLineLen=761026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 910 batches of 11 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=22745125 (21 MiB), ratio=46.5%, crc=26a23b0a
Written 10000 rows in 1.202 secs using 4 threads. MaxBuffUsed=0%

In the last case the real mean line length is ~5000, but it is estimated at 761026. The compression ratio is higher because the buffers are barely used. Surprisingly, the timing is better despite the OpenMP threading overhead.

In my opinion, the scipen parameter should default to 0 in fwrite rather than to the scipen option, which is tied to the digits option; digits is not present in fwrite, and the number of digits in fwrite output is higher (20 ?).

I use this little benchmark for the scipen impact and I think it can be used for atime. I've tried to add this:

  # Issue with fwrite length for gzip output, fixed in: https://github.com/Rdatatable/data.table/pull/6393
  # No regression timing test
  "No regression fwrite" = atime::atime_test(
    N = 10^seq(2, 8),
    setup = {
      set.seed(1)
      ncol = 1000
      L <- data.table(i=1:N)
      L[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) rnorm(N))]
    },
    expr = {
fwrite(L, "/dev/null", compress="gzip")
    },
    Fast = "117ab45674f1e56304abca83f9f0df50ab0274be", # Close-to-last merge commit in the PR
    Slow = "e73c2c849f921cf4ef51e3809842e0fee9b9f52c"),

but I'm not sure that /dev/null is portable, and if we write a real file, the file I/O dominates the timing.

OK to add another test to continue and check that there is no timing regression.

@Anirban166
Member

I documented the current process here,

Should we maybe (1) migrate that documentation into .ci/atime directly (2) add .ci/atime/README.md pointing to the Wiki (3) Point to the Wiki from the first line of .ci/atime/tests.R?

2 and 3 sound good to me

yes that would be great to Point to the Wiki from the first line of .ci/atime/tests.R

Should I go ahead and make a PR for this quick addition?

I would suggest keeping docs on the wiki, which is easier to update, and include screenshots/graphics.

I agree, both for being able to include images and because, if we miss something that other people notice, they should be able to fill in points quickly

@MichaelChirico
Member

I notice that #2020 introduces a change that I never realized before this PR. By default scipen fwrite parameter uses the value of R option scipen.

Moved this to #6457 for further discussion, I think it's out of scope here. Thanks!

@tdhock
Member

tdhock commented Sep 3, 2024

but I'm not sure that /dev/null is portable

this only has to run on the GitHub Actions Ubuntu VM, so /dev/null should be OK in principle, but I changed it to tempfile(), which should be fine too.

Thanks for sharing your code for scipen benchmarking. I adapted it to get the following atime result, which indicates little to no impact on computation time, but a small constant-factor increase in memory usage.

[atime comparison plot: time and memory vs N for the Fast and PR versions]

edit.data.table = function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
out.csv <- tempfile()
issue6393 <- atime::atime_versions(
  "~/R/data.table",
  N = 2^seq(1, 20),
  pkg.edit.fun=edit.data.table,
  setup = {
    set.seed(1)
    NC = 10
    L <- data.table(i=1:N)
    L[, paste0("V", 1:NC) := replicate(NC, rnorm(N), simplify=FALSE)]
  },
  expr = {
    data.table::fwrite(L, out.csv, compress="gzip")
  },
  Fast="f339aa64c426a9cd7cf2fcb13d91fc4ed353cd31", # Parent of the first commit https://github.com/Rdatatable/data.table/commit/fcc10d73a20837d0f1ad3278ee9168473afa5ff1 in the PR https://github.com/Rdatatable/data.table/pull/6393/commits with major change to fwrite with gzip.
  PR = "117ab45674f1e56304abca83f9f0df50ab0274be") # Close-to-last merge commit in the PR.
plot(issue6393)


codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 91.85185% with 11 lines in your changes missing coverage. Please review.

Project coverage is 98.55%. Comparing base (3734726) to head (cdf4277).
Report is 2 commits behind head on master.

Files with missing lines    Patch %    Lines
src/fwrite.c                91.12%     11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6393      +/-   ##
==========================================
- Coverage   98.62%   98.55%   -0.08%     
==========================================
  Files          79       79              
  Lines       14448    14503      +55     
==========================================
+ Hits        14249    14293      +44     
- Misses        199      210      +11     

