-
Hi, and thanks for moving this over. The rzip pass DOES remove redundancies; in fact, all compression programs try to do this. rzip is special because it works over much longer ranges. The rzip application uses two passes, and its first pass is identical to the one in lrzip-next. I assume you read the Wikipedia article on rzip and the rzip page. One suggestion, to help you appreciate what the rzip pass can do, is to use the -R (rzip compression level) option. Here's output from the enwik9 compression test file:

-R1: [output elided]
-R9: [output elided]
There's a lot to learn, but you can see the impact of the rzip pass above. The hash indexes are always stored in Stream 0; Stream 1 contains the actual data.
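To make the two-stream idea concrete, here is a tiny, hypothetical Python sketch of an rzip-style long-range pass. It is not lrzip-next's actual code (the real pass uses a rolling hash over a multi-gigabyte window); the function names and the 16-byte minimum match are made up for illustration. Repeats of earlier data become (offset, length) records in a "stream 0", and bytes with no earlier copy land in a "stream 1":

```python
MIN_MATCH = 16  # toy value; real rzip finds matches with a rolling hash

def rzip_like_pass(data: bytes):
    """Split data into match records (stream 0) and literals (stream 1)."""
    stream0 = []           # match records: (literal_run_len, offset, match_len)
    stream1 = bytearray()  # literal bytes that had no earlier copy
    seen = {}              # small block -> earliest position it occurred at
    i = lit_start = 0
    while i + MIN_MATCH <= len(data):
        key = data[i:i + MIN_MATCH]
        j = seen.setdefault(key, i)
        if j < i:                       # this block occurred before: a match
            k = MIN_MATCH               # greedily extend the match
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            stream1 += data[lit_start:i]            # flush pending literals
            stream0.append((i - lit_start, j, k))   # record the long-range copy
            i += k
            lit_start = i
        else:
            i += 1
    stream1 += data[lit_start:]         # trailing literals
    return stream0, stream1

def rebuild(stream0, stream1):
    """Invert rzip_like_pass: replay literals and long-range copies."""
    out = bytearray()
    pos = 0
    for lit_len, offset, match_len in stream0:
        out += stream1[pos:pos + lit_len]
        pos += lit_len
        for k in range(match_len):      # byte-wise copy allows overlapping matches
            out.append(out[offset + k])
    out += stream1[pos:]
    return bytes(out)

data = (b"long-range redundancy: " * 4 + b"unique tail") * 2
s0, s1 = rzip_like_pass(data)
assert rebuild(s0, s1) == data
print(f"{len(data)} bytes -> {len(s1)} literal bytes + {len(s0)} match records")
```

The rebuild step shows why the match records alone are enough to reinflate the deduplicated regions: Stream 0 only has to say where the data already appeared.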
-
Not quite. And I hate big words. The first pass produces both streams, and the backend receives both the hashed data and the Stream 1 data. The backend will even compress the Stream 0 hashes.
rzip --> backend = lrzip-next file

But this is limited by total system RAM. So the hypothetical example of a 1 TB disk image will be processed in chunks, each limited in size to the total compression window. Each chunk has its own pass 1; chunks are not connected, and hashing starts over with each one. Consider this output from AOSP, which was 22 GB in size (each Stream 0 is processed separately):

[output elided]

The net result is that the original data was reduced by 34% prior to being sent to the backend. The backend reduced the already-deduped data by a further 59%. Overall, there was a 73% reduction in size. One other very significant enhancement of …
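As a sanity check on those percentages (illustrative arithmetic only, not a real measurement), note that the two stages compound multiplicatively on what remains, rather than adding:

```python
# Two-stage reduction quoted above: rzip pass first, backend second.
rzip_reduction    = 0.34   # rzip pass removed 34% of the original data
backend_reduction = 0.59   # backend removed 59% of what was left

remaining = (1 - rzip_reduction) * (1 - backend_reduction)   # 0.66 * 0.41
print(f"overall reduction: {1 - remaining:.0%}")             # -> 73%
```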
-
Following on from @pete4abw's last reply in ckolivas/lrzip#214:
OK, so there's no way any software could remove this kind of redundancy?
I thought the rzip compression pass meant only redundancy retrieval and removal, but I really don't know enough about compression at all.