Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

control compression level for fwrite with gzip #5513

Closed
wants to merge 2 commits into from
Closed

Conversation

ben-schwen
Copy link
Member

@ben-schwen ben-schwen commented Nov 4, 2022

Closes #5506

Still need to discuss the name of new argument and argument order for fwrite

  • implementation
  • tests
  • NEWS
library(data.table)
DT = data.table(a=sample(10L, 1000L, TRUE), b=rep(1:4,each=250))
fwrite(DT, file=f1<-tempfile(fileext=".gz"), gzip_ratio=1L)
fwrite(DT, file=f2<-tempfile(fileext=".gz"), gzip_ratio=2L)
fwrite(DT, file=f3<-tempfile(fileext=".gz"), gzip_ratio=3L)
fwrite(DT, file=f4<-tempfile(fileext=".gz"), gzip_ratio=4L)
fwrite(DT, file=f5<-tempfile(fileext=".gz"), gzip_ratio=5L)
fwrite(DT, file=f6<-tempfile(fileext=".gz"), gzip_ratio=6L)
fwrite(DT, file=f7<-tempfile(fileext=".gz"), gzip_ratio=7L)
fwrite(DT, file=f8<-tempfile(fileext=".gz"), gzip_ratio=8L)
fwrite(DT, file=f9<-tempfile(fileext=".gz"), gzip_ratio=9L)
file.size(f1, f2, f3, f4, f5, f6, f7, f8, f9)
#> [1] 1055 1021 1009  934  900  871  871  863  863

@jangorecki
Copy link
Member

I think compression ratio is something that fits well into use of a global option in R. So it can be applied to all packages using dt's fwrite compression. Then we don't need new argument. I am not saying to amend PR for that yet, just raising an idea for discussion.

@ben-schwen
Copy link
Member Author

I think compression ratio is something that fits well into use of a global option in R.

That was also my first thought, but I think I read smth about discouraging the introduction of new global options. A new global argument would also make everything cleaner since I do not throw any errors if a gzip_ratio is chosen with another or no compression.

@MichaelChirico
Copy link
Member

Global option looks less appealing to me since this is only used in one place / doesn't appear to correspond neatly to any R-side concept (it's just an argument to deflateInit2() from the C library).

So we could do gzip_ratio = getOption('datatable.gzip_ratio', NULL), but it's not clear where else that option would be relevant.

PS,

(1) Why 'ratio'? The manual names the deflateInit2() argument as level, why not gzip_level? https://www.zlib.net/manual.html
(2) Are we sure we're making a back-compatible change here? IINM the default per this PR is 0, which is not Z_DEFAULT_COMPRESSION, which the manual gives as "currently 6" https://www.zlib.net/manual.html

@ben-schwen
Copy link
Member Author

(1) Why 'ratio'? The manual names the deflateInit2() argument as level, why not gzip_level? https://www.zlib.net/manual.html (2) Are we sure we're making a back-compatible change here? IINM the default per this PR is 0, which is not Z_DEFAULT_COMPRESSION, which the manual gives as "currently 6" https://www.zlib.net/manual.html

Ad (1) true that, gzip_level is a better name, will change
Ad (2) a compression level of 0 is currently set to Z_DEFAULT_COMPRESSION (in the deflateInit call) to stay in par if zlib ever changes the default level. I misread the documentation and thought that only values in 1:9 are valid, but apparently, 0 is also possible. Will change default value to -1 and set it to Z_DEFAULT_COMPRESSION

R/fwrite.R Outdated
@@ -10,6 +10,7 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
buffMB=8, nThread=getDTthreads(verbose),
showProgress=getOption("datatable.showProgress", interactive()),
compress = c("auto", "none", "gzip"),
gzip_ratio = 0:9,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to use camel case rather than underline here

@codecov
Copy link

codecov bot commented Nov 15, 2022

Codecov Report

Merging #5513 (b8bbb72) into master (a4c2b01) will decrease coverage by 1.97%.
The diff coverage is 76.82%.

❗ Current head b8bbb72 differs from pull request most recent head 57b4d3a. Consider uploading reports for the commit 57b4d3a to get more accurate results

@@            Coverage Diff             @@
##           master    #5513      +/-   ##
==========================================
- Coverage   99.51%   97.53%   -1.98%     
==========================================
  Files          80       80              
  Lines       14773    14805      +32     
==========================================
- Hits        14701    14440     -261     
- Misses         72      365     +293     
Impacted Files Coverage Δ
src/init.c 97.95% <0.00%> (-2.05%) ⬇️
R/test.data.table.R 92.06% <58.33%> (-7.94%) ⬇️
R/tables.R 94.28% <93.54%> (-5.72%) ⬇️
R/fwrite.R 80.24% <100.00%> (-19.76%) ⬇️
R/utils.R 100.00% <100.00%> (ø)
src/freadR.c 100.00% <100.00%> (ø)
src/fwrite.c 92.38% <100.00%> (-4.73%) ⬇️
src/fwriteR.c 100.00% <100.00%> (ø)
R/xts.R 0.00% <0.00%> (-100.00%) ⬇️
R/last.R 47.91% <0.00%> (-52.09%) ⬇️
... and 8 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@MichaelChirico
Copy link
Member

Superseded by #6393, PTAL there.

@MichaelChirico
Copy link
Member

MichaelChirico commented Sep 4, 2024

It looks like the only differences are:

  1. Argument name, Fix fwrite length for gzip output #6393 went with compressLevel, which I like a bit better
  2. Argument default, Fix fwrite length for gzip output #6393 went with 6L vs. -1:9 here. I think just using 6L is more straightforward, even though the one here does illustrate all valid values concisely. OTOH, it suggests 0L is valid, which it's not, really we'd want c(-1L, 1:9) which isn't so nice. In that case c(6L, 1:5, 7:9) might be better, but still I think just 6L is fine.

Let's continue discussion there.

@MichaelChirico MichaelChirico deleted the gzip_ratio branch September 4, 2024 05:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Set compression level for gzipped output in fwrite
3 participants