
Fix fwrite length for gzip output #6393

Open
wants to merge 43 commits into base: master

Conversation

philippechataignon
Contributor

@philippechataignon commented Aug 23, 2024

Closes #6356. Closes #5506.

This PR is an attempt to create a better gzip file with fwrite. Its an important rewrite because it includes some refactoring of actual code.

zlib

  • use Z_SYNC_FLUSH instead of Z_FINISH
  • write the gzip header manually and put the CRC and length in the trailer
  • compute the length and CRC per thread and combine them in the main thread
  • the length field in the gzip specification is 32 bits, so it is stored modulo 2^32 for uncompressed sizes > 4 GiB

C code

  • simplify the implementation with a single #pragma omp parallel for for the chunk loop and #pragma omp ordered for the writing and summarizing part.
  • Matt Dowle introduced the use of a pool of buffers; the idea is generalized here. The pools are created at the beginning and then used for writing headers and rows. All mallocs occur early, and no separate header buffer is needed.
  • Deobfuscate some parts: notably, if () is followed by a new line, and no =- or =*. A lot of work remains. Use the indent command?
  • Remove some old debug code (msg).


github-actions bot commented Aug 23, 2024

Comparison Plot

Generated via commit cdf4277

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 3 minutes and 29 seconds

Time taken to run atime::atime_pkg on the tests: 9 minutes and 22 seconds

@oliverfoster

oliverfoster commented Aug 23, 2024

len in gzip specification is 32 bits and then is false for uncompressed size > 4GiB

Is it not meant to be the length modulo 4 GiB rather than false? That is, the remainder after dividing by 4 GiB?

[Screenshot: ISIZE field definition from the gzip file format specification]

@philippechataignon
Contributor Author

philippechataignon commented Aug 24, 2024

You're right: this PR version stores the length modulo 2**32 as specified, but it's not the real size.
Note that in new versions of gzip (1.12 and later), the length stored in the file is no longer used. See the note on gzip -l at https://savannah.gnu.org/news/?id=10121

7z l # (use header)
    Date      Time    Attr         Size   Compressed  Name
                     .....   1298954309   2212092276  mtcars.csv

% gzip --version                                                                                                                                                    
gzip 1.12
% gzip -lv mtcars.csv.gz  # (decompresses, and takes much more time)
method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 5cd79282 Aug 24 10:19          2212092276          9888888901  77.6% mtcars.csv

@philippechataignon
Contributor Author

Merged PR #5513 into this PR, with the new param compressLevel.

@Anirban166
Member

Anirban166 commented Sep 2, 2024

It definitely is appropriate!

we want to make sure it hasn't slowed down.

For me, this is the main point of the regression tests (i.e., not strictly aiming to target historical cases, but rather future-proofing the current performance and ensuring things do not become slower in the long run), so not having a 'Slow' version/label at present sounds acceptable.

@MichaelChirico
Member

not having a 'Slow' version/label at present sounds acceptable

OK, I think that's where I was stuck -- which labels are appropriate for such a use case.

I guess I thought some of them were required, but looking at atime_test() help I guess all of them are optional:

https://github.com/tdhock/atime/blob/main/man/atime_versions.Rd

What labels are best here, then, Before and After? But in 2 years it becomes "before and after what?"

@Anirban166
Member

I suppose we'll have to see if this makes it faster or just retains the same performance - if it's the former, then clearly 'Slow'/'Fast', but if it's the latter (which is what we're at right now), then I'd say maybe 'Before' and 'Current'?

But in 2 years it becomes "before and after what?"

Yup 'After' will become redundant soon in context

@Anirban166
Member

'Before' too would become redundant, as you noted. Since the main reason for having these tests is a significant refactor, and we could have any number of those in the long run, I think adding more context to the label names makes sense (it makes them longer, but I think we can find a fine line between readability and enough context to follow up)

Currently, I'm thinking 'Baseline' and 'Post-gzip refactor'

@tdhock
Member

tdhock commented Sep 3, 2024

What labels are best here, then, Before and After? But in 2 years it becomes "before and after what?"

I documented the current process here, https://github.com/Rdatatable/data.table/wiki/Performance-testing
it says "make sure to use the names Slow/Fast, or Before/Regression/Fixed, which are assigned special colors, whereas any other names are shown in grey."

So in this case, since there is no regression, Before/Regression/Fixed should not be used, to avoid confusion with other cases that actually are real historical regressions.

I would suggest Fast for the old commit before this PR. (no Slow necessary)

@Anirban166
Member

Sounds reasonable for this PR (to compare 'Fast' and 'HEAD'), but then will we only be running the old commit for the regression test we make for this case after this PR has been merged?

@MichaelChirico
Member

I documented the current process here,

Thanks Toby, I had looked at the .ci/atime/tests.R script and some {atime} documentation directly and didn't think to check the Wiki. Should we maybe (1) migrate that documentation into .ci/atime directly, (2) add a .ci/atime/README.md pointing to the Wiki, or (3) point to the Wiki from the first line of .ci/atime/tests.R?

@MichaelChirico
Member

@philippechataignon do you want to have a go at adding an atime performance regression test? Totally fine if not -- what would help at least would be a simple benchmark of gzipped fwrite that you think captures the important pieces of what's changed here. Does that make sense?

@tdhock
Member

tdhock commented Sep 3, 2024

yes that would be great to Point to the Wiki from the first line of .ci/atime/tests.R

I would suggest keeping docs on the wiki, which is easier to update, and include screenshots/graphics.

@philippechataignon
Contributor Author

@philippechataignon do you want to have a go at adding an atime performance regression test? Totally fine if not -- what would help at least would be a simple benchmark of gzipped fwrite that you think captures the important pieces of what's changed here. Does that make sense?

OK for testing regression, but note that the core of fwrite hasn't changed: same buffer sizes, same number of jobs, same number of rows per job. Personally, I observe timings similar to the previous version's.

One point of discussion: I notice that #2020 introduced a change that I had never realized before this PR. By default, fwrite's scipen parameter uses the value of the R option scipen. Like many people, I have options(scipen = 999) in my Rprofile because I don't like scientific notation on screen, but I never realized that fwrite was penalized. Why? Because of maxLineLen, which can become very high and produces a lot of batches with few rows and barely used chunks. In fwrite, scipen is capped at 350, but that's still a very high limit. And maxLineLen is used to determine the number and sizes of fwrite chunks.

For testing the impact, I use this little program:

library(data.table)
n = 10000
ncol = 1000
dt <- data.table(i=1:n)
dt[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) as.numeric(sample(1:n, replace=TRUE)))]
print(sessionInfo())
system.time(fwrite(dt, "/dev/null", compress="gzip", verbose=TRUE, scipen=0))
system.time(fwrite(dt, "/dev/null", compress="gzip", verbose=TRUE, scipen=999))

With scipen = 0

maxLineLen=61026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 73 batches of 137 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=21719798 (20 MiB), ratio=44.4%, crc=26a23b0a
Written 10000 rows in 1.569 secs using 4 threads. MaxBuffUsed=7%

With scipen = 999

maxLineLen=761026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 910 batches of 11 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=22745125 (21 MiB), ratio=46.5%, crc=26a23b0a
Written 10000 rows in 1.202 secs using 4 threads. MaxBuffUsed=0%

In the last case the real mean line length is ~5000, but it is estimated at 761026. The compression ratio is higher because the buffers are barely used. Surprisingly, the timing is better despite the OpenMP threading overhead.

In my opinion, the scipen parameter should default to 0 in fwrite rather than to the scipen option, which is tied to the digits option; digits is not present in fwrite, and the number of digits in fwrite output is higher (20 ?).

I use this little benchmark for the scipen impact and I think it can be used for atime. I've tried to add this:

  # Issue with fwrite length for gzip output, fixed in: https://github.com/Rdatatable/data.table/pull/6393
  # No regression timing test
  "No regression fwrite" = atime::atime_test(
    N = 10^seq(2, 8),
    setup = {
      set.seed(1)
      ncol = 1000
      L <- data.table(i=1:N)
      L[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) rnorm(N))]
    },
    expr = {
fwrite(L, "/dev/null", compress="gzip")
    },
    Fast = "117ab45674f1e56304abca83f9f0df50ab0274be", # Close-to-last merge commit in the PR
    Slow = "e73c2c849f921cf4ef51e3809842e0fee9b9f52c"),

but I'm not sure that /dev/null is portable, and if we write a real file, the file I/O dominates the timing.

OK to add another test to continue and check that there is no timing regression.

@Anirban166
Member

I documented the current process here,

Should we maybe (1) migrate that documentation into .ci/atime directly (2) add .ci/atime/README.md pointing to the Wiki (3) Point to the Wiki from the first line of .ci/atime/tests.R?

2 and 3 sound good to me

yes that would be great to Point to the Wiki from the first line of .ci/atime/tests.R

Should I go ahead and make a PR for this quick addition?

I would suggest keeping docs on the wiki, which is easier to update, and include screenshots/graphics.

I agree, both for being able to include images and because, if we miss something that other people notice, they should be able to fill in points quickly

@MichaelChirico
Member

I notice that #2020 introduces a change that I never realized before this PR. By default scipen fwrite parameter uses the value of R option scipen.

Moved this to #6457 for further discussion, I think it's out of scope here. Thanks!

@tdhock
Member

tdhock commented Sep 3, 2024

but I'm not sure that /dev/null is portable

this only has to run on the GitHub Actions Ubuntu VM, so /dev/null should be OK in principle, but I changed it to tempfile(), which should be fine too.

Thanks for sharing your code for scipen benchmarking. I adapted it to get the following atime result, which indicates little to no impact on computation time, but a small constant-factor increase in memory usage.

[atime comparison plot: time and memory vs N for the Fast and PR versions]

edit.data.table = function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
out.csv <- tempfile()
issue6393 <- atime::atime_versions(
  "~/R/data.table",
  N = 2^seq(1, 20),
  pkg.edit.fun=edit.data.table,
  setup = {
    set.seed(1)
    NC = 10
    L <- data.table(i=1:N)
    L[, paste0("V", 1:NC) := replicate(NC, rnorm(N), simplify=FALSE)]
  },
  expr = {
    data.table::fwrite(L, out.csv, compress="gzip")
  },
  Fast="f339aa64c426a9cd7cf2fcb13d91fc4ed353cd31", # Parent of the first commit https://github.com/Rdatatable/data.table/commit/fcc10d73a20837d0f1ad3278ee9168473afa5ff1 in the PR https://github.com/Rdatatable/data.table/pull/6393/commits with major change to fwrite with gzip.
  PR = "117ab45674f1e56304abca83f9f0df50ab0274be") # Close-to-last merge commit in the PR.
plot(issue6393)


codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 91.85185% with 11 lines in your changes missing coverage. Please review.

Project coverage is 98.55%. Comparing base (3734726) to head (cdf4277).
Report is 2 commits behind head on master.

Files with missing lines    Patch %    Lines
src/fwrite.c                91.12%     11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6393      +/-   ##
==========================================
- Coverage   98.62%   98.55%   -0.08%     
==========================================
  Files          79       79              
  Lines       14448    14503      +55     
==========================================
+ Hits        14249    14293      +44     
- Misses        199      210      +11     

