
HPCC-29917 Refactor compressToBuffer to support different compression methods #18009

Merged

Conversation

@richardkchapman (Member) commented Nov 8, 2023

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@richardkchapman force-pushed the compressToBuffer branch 2 times, most recently from 8949410 to c25ff6e on November 15, 2023 15:04
@richardkchapman changed the title from "Compress to buffer" to "HPCC-29917 Refactor compressToBuffer to support different compression methods" on Nov 15, 2023
@richardkchapman marked this pull request as ready for review on November 15, 2023 15:06

@ghalliday (Member) left a comment

One comment, and one issue to investigate

if ((method & ~COMPRESS_METHOD_AES) < COMPRESS_METHOD_LAST)
{
    if (method & COMPRESS_METHOD_AES)
        AESbyMethod[method & ~COMPRESS_METHOD_AES] = handler;

Member:

= nullptr
and below.
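
One possible reading of this suggestion (an assumption on my part, not stated in the thread) is to give the method-indexed handler tables explicit nullptr initialisation so an unregistered method is easy to detect, along these lines:

// Sketch only - assumes the registration tables are simple static arrays
// (byMethod is an assumed companion table name).
static ICompressHandler * byMethod[COMPRESS_METHOD_LAST] = { nullptr };
static ICompressHandler * AESbyMethod[COMPRESS_METHOD_LAST] = { nullptr };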

void *newData = out.reserve(newSize);
compressor->open(newData, newSize);
if (compressor->write(src, len)==len)
try

Member:

Future: Could use compressToBlock() if supported by the compressor. There are also a few bytes wasted, which would be more significant for small sizes, but that requires substantial refactoring.
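
A rough sketch of that idea; the capability query and one-shot call below are assumed names, not verified against the jlib ICompressor interface:

// Sketch only: a one-shot block compress would avoid the open/write/close
// round trip and the slack bytes reserved by the streaming path.
if (compressor->supportsBlockCompression())                                // assumed capability query
{
    void *dst = out.reserve(newSize);
    size32_t written = compressor->compressBlock(newSize, dst, len, src);  // assumed one-shot API
    // ...then record `written` in the header instead of compressor->buflen()
}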

@ghalliday (Member) commented Nov 28, 2023:

I have created a separate jira for this, HPCC-30928 (and a branch gch/issue30928). It has two effects:
i) it compresses better due to the daft way the current lz4 compress works.
ii) it is faster.
Before:

12:24:58.258158 419929 compressToBuffer 5 size 0 did not compress
12:24:58.258387 419929 compressToBuffer 5 size 102 compressed to 66 in 1985ns
12:24:58.258803 419929 compressToBuffer 5 size 1000 compressed to 53 in 1503ns

After:

12:25:21.154246 420137 compressToBuffer 5 size 0 did not compress
12:25:21.154313 420137 compressToBuffer 5 size 34 compressed to 32 in 2434ns
12:25:21.154783 420137 compressToBuffer 5 size 1000 compressed to 35 in 1082ns

We could remove 4 bytes from the size, but that would be trickier (the format wouldn't be compatible).

With that change, switching to a less repetitive text (Gen 1 from 0drvb10), LZW is better at compressing, but significantly slower:

12:37:32.829295 422255 compressToBuffer 2 size 512 compressed to 357 in 5490ns
12:37:32.837024 422255 compressToBuffer 2 size 1000 compressed to 627 in 9808ns
12:37:32.838347 422255 compressToBuffer 5 size 512 compressed to 411 in 1893ns
12:37:32.841931 422255 compressToBuffer 5 size 1000 compressed to 717 in 2885ns

testSome(0);
testSome(1);
testSome(16);
testSome(32);

Member:

Adding testSome(48) or 64 causes a realloc error.

@jakesmith (Member) left a comment

@richardkchapman - looks good, some trivial comments.

if (max<1024)
throw MakeStringException(-1,"CFcmpCompressor::open - block size (%d) not large enough", max);
// if (max<1024)
// throw MakeStringException(-1,"CFcmpCompressor::open - block size (%d) not large enough", max);

Member:

would be good to delete if no longer applicable (and similarly on line 106-7)

if (!handler)
{
    VStringBuffer s("Unknown compression method %x requested in compressToBuffer", (byte) method);
    throw makeStringException(0, s.str());

Member:

trivial: use makeStringExceptionV?
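
For reference, the suggested form would be along these lines (a sketch; makeStringExceptionV formats the message directly, avoiding the temporary VStringBuffer):

if (!handler)
    throw makeStringExceptionV(0, "Unknown compression method %x requested in compressToBuffer", (byte) method);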

}
catch (IException *E)
{
E->Release();

Member:

should there be a debug log of the exception here?

Member:

I don't think so - it will only happen for aes encryption of small buffers.

compressor->close();
size32_t compressedLen = compressor->buflen();
out.setWritePos(originalLength + sizeof(byte));
out.append(compressedLen);

Member:

trivial: could use a DelayedMarker<size32_t> instead of line 777, and a marker.write(compressedLen) here
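
A sketch of that suggestion, assuming DelayedMarker<T> takes the target MemoryBuffer in its constructor and back-patches the value via write() (signatures not verified here):

DelayedMarker<size32_t> sizeMarker(out);        // reserves room for the compressed length up front
void *newData = out.reserve(newSize);
compressor->open(newData, newSize);
if (compressor->write(src, len)==len)
{
    compressor->close();
    sizeMarker.write(compressor->buflen());     // patch the real length back into the header
}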

if (!prevResult)
DBGLOG("compressToBuffer %x size %u compressed to %u", (byte) method, len, compressed.length());
ret = true;
}
CPPUNIT_ASSERT(compressed.length() <= len+5);
MemoryBuffer out;
decompressToBuffer(out, compressed, options);
CPPUNIT_ASSERT(out.length() == len);
CPPUNIT_ASSERT(memcmp(out.bytes(), in, len) == 0);

Member:

only check if len != 0, otherwise you fail a sanity check in debug
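
That is, something along these lines in the test (sketch):

CPPUNIT_ASSERT(out.length() == len);
if (len)    // memcmp on an empty buffer trips a debug-build sanity check
    CPPUNIT_ASSERT(memcmp(out.bytes(), in, len) == 0);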

@ghalliday (Member) left a comment

@richardkchapman one problem with the sanity check, but if that is fixed it looks ok to merge.
Please can you target it to 9.4.x. (I agree it is too painful to merge further back.)

@@ -73,7 +72,7 @@ class CLZ4Compressor final : public CFcmpCompressor
if (toflush == 0)
return;

if (toflush < 256)
if (false && toflush < 256)

Member:

Should clean up and delete this code.


@richardkchapman changed the base branch from master to candidate-9.4.x on December 4, 2023 11:09

@richardkchapman (Member, Author) commented:

@ghalliday I have fixed the sanity check issue with len=0, removed the disabled code, retargeted to 9.4.x and squashed. Please re-review and merge if accepted.

@ghalliday (Member) left a comment

Unfortunately this has created a regression. If you create workunits (and deploy them to roxie) with the old code, and then read them with the new code, the system crashes.

It is because
void decompressToBuffer(MemoryBuffer & out, const void * src)

does not have a leading version - it is processed in decompressResource. I think the correct fix is to pass the version (and len) from decompressResource into a shared function.
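
A sketch of the suggested shape (names are illustrative, not the actual jlib code): decompressResource already knows the stream version and length, so both entry points can forward to one shared helper rather than re-deriving the format.

// Illustrative only: a single helper that is told the version and length
// explicitly (doDecompressToBuffer is a hypothetical name).
static void doDecompressToBuffer(MemoryBuffer & out, const void * src, size32_t len, unsigned version);

void decompressToBuffer(MemoryBuffer & out, const void * src)
{
    // legacy stream: no leading version byte, so supply the implied version here,
    // while decompressResource passes the version/length it has already parsed.
}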

@ghalliday (Member) commented:

Tested with that change and it fixed the regression. The compressed file size varies slightly (due to the removal of the 256 limit). For the performance suite (admittedly not representative) the timings did not vary; the sizes were as follows:

Before:
-rw-r--r-- 1 gavin gavin  4120023230 Dec  4 12:13 padded_compressed._1_of_1
-rw-r--r-- 1 gavin gavin  3650671057 Dec  4 12:13 simple_compressed._1_of_1
After:
-rw-r--r-- 1 gavin gavin  4119595073 Dec  4 17:02 padded_compressed._1_of_1
-rw-r--r-- 1 gavin gavin  3650689570 Dec  4 17:02 simple_compressed._1_of_1

The first is slightly smaller. The second is very slightly bigger - which is interesting but insignificant. I would expect the files to be slightly smaller, but occasionally bigger because it will change the boundaries of the blocks of data being compressed.

@ghalliday (Member) left a comment

@richardkchapman please squash.

HPCC-29917 Refactor compressToBuffer to support different compression methods

Also add some tests, and fix some issues that the tests showed up with small
input sizes.

Ensure back compatibility for LZW LE compression

Signed-off-by: Richard Chapman <rchapman@hpccsystems.com>

@richardkchapman (Member, Author) commented:

@ghalliday Squashed

@ghalliday merged commit 08425ea into hpcc-systems:candidate-9.4.x on Dec 15, 2023
47 checks passed