-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-29917 Refactor compressToBuffer to support different compression methods #18009
HPCC-29917 Refactor compressToBuffer to support different compression methods #18009
Conversation
8949410
to
c25ff6e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment, and one issue to investigate
system/jlib/jlzw.cpp
Outdated
if ((method & ~COMPRESS_METHOD_AES) < COMPRESS_METHOD_LAST) | ||
{ | ||
if (method & COMPRESS_METHOD_AES) | ||
AESbyMethod[method & ~COMPRESS_METHOD_AES] = handler; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
= nullptr
and below.
void *newData = out.reserve(newSize); | ||
compressor->open(newData, newSize); | ||
if (compressor->write(src, len)==len) | ||
try |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future: Could use compressToBlock() if supported by the compressor. There is also a few bytes wasted, which would be more significant for small sizes, but that requires substantial refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created a separate jira for this HPCC-30928 (and a branch gch/issue30928). It has two effects
i) it compresses better due to the daft way the current lz4 compress works.
ii) it is faster.
Before:
12:24:58.258158 419929 compressToBuffer 5 size 0 did not compress
12:24:58.258387 419929 compressToBuffer 5 size 102 compressed to 66 in 1985ns
12:24:58.258803 419929 compressToBuffer 5 size 1000 compressed to 53 in 1503ns
After:
12:25:21.154246 420137 compressToBuffer 5 size 0 did not compress
12:25:21.154313 420137 compressToBuffer 5 size 34 compressed to 32 in 2434ns
12:25:21.154783 420137 compressToBuffer 5 size 1000 compressed to 35 in 1082ns
We could remove 4 bytes from the size, but that would be trickier (the format wouldn't be compatible).
With that change, switching to a less repetitive text (Gen 1 from 0drvb10), LZW is better at compressing, but significantly slower:
12:37:32.829295 422255 compressToBuffer 2 size 512 compressed to 357 in 5490ns
12:37:32.837024 422255 compressToBuffer 2 size 1000 compressed to 627 in 9808ns
12:37:32.838347 422255 compressToBuffer 5 size 512 compressed to 411 in 1893ns
12:37:32.841931 422255 compressToBuffer 5 size 1000 compressed to 717 in 2885ns
testing/unittests/unittests.cpp
Outdated
testSome(0); | ||
testSome(1); | ||
testSome(16); | ||
testSome(32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding testSome(48) or 64 causes a realloc error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardkchapman - looks good, some trivial comments.
system/jlib/jfcmp.hpp
Outdated
if (max<1024) | ||
throw MakeStringException(-1,"CFcmpCompressor::open - block size (%d) not large enough", max); | ||
// if (max<1024) | ||
// throw MakeStringException(-1,"CFcmpCompressor::open - block size (%d) not large enough", max); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be good to delete if no longer applicable (and similarly on line 106-7)
if (!handler) | ||
{ | ||
VStringBuffer s("Unknown compression method %x requested in compressToBuffer", (byte) method); | ||
throw makeStringException(0, s.str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trivial: use makeStringExceptionV?
} | ||
catch (IException *E) | ||
{ | ||
E->Release(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should there a debug log of the exception here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so - it will only happen for aes encryption of small buffers.
compressor->close(); | ||
size32_t compressedLen = compressor->buflen(); | ||
out.setWritePos(originalLength + sizeof(byte)); | ||
out.append(compressedLen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trivial: could use a DelayedMarker<size32_t> instead of line 777, and a marker.write(compressedLen) here
de1f5ba
to
35bab01
Compare
testing/unittests/unittests.cpp
Outdated
if (!prevResult) | ||
DBGLOG("compressToBuffer %x size %u compressed to %u", (byte) method, len, compressed.length()); | ||
ret = true; | ||
} | ||
CPPUNIT_ASSERT(compressed.length() <= len+5); | ||
MemoryBuffer out; | ||
decompressToBuffer(out, compressed, options); | ||
CPPUNIT_ASSERT(out.length() == len); | ||
CPPUNIT_ASSERT(memcmp(out.bytes(), in, len) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only check if len != 0, otherwise you fail a sanity check in debug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardkchapman one problem with sanity check, but if that is fixed it looks ok to merge.
Please can you target it to 9.4.x. (I agree it is too painful to merge further back.)
system/jlib/jlz4.cpp
Outdated
@@ -73,7 +72,7 @@ class CLZ4Compressor final : public CFcmpCompressor | |||
if (toflush == 0) | |||
return; | |||
|
|||
if (toflush < 256) | |||
if (false && toflush < 256) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should clean up and delete this code.
void *newData = out.reserve(newSize); | ||
compressor->open(newData, newSize); | ||
if (compressor->write(src, len)==len) | ||
try |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created a separate jira for this HPCC-30928 (and a branch gch/issue30928). It has two effects
i) it compresses better due to the daft way the current lz4 compress works.
ii) it is faster.
Before:
12:24:58.258158 419929 compressToBuffer 5 size 0 did not compress
12:24:58.258387 419929 compressToBuffer 5 size 102 compressed to 66 in 1985ns
12:24:58.258803 419929 compressToBuffer 5 size 1000 compressed to 53 in 1503ns
After:
12:25:21.154246 420137 compressToBuffer 5 size 0 did not compress
12:25:21.154313 420137 compressToBuffer 5 size 34 compressed to 32 in 2434ns
12:25:21.154783 420137 compressToBuffer 5 size 1000 compressed to 35 in 1082ns
We could remove 4 bytes from the size, but that would be trickier (the format wouldn't be compatible).
With that change, switching to a less repetitive text (Gen 1 from 0drvb10), LZW is better at compressing, but significantly slower:
12:37:32.829295 422255 compressToBuffer 2 size 512 compressed to 357 in 5490ns
12:37:32.837024 422255 compressToBuffer 2 size 1000 compressed to 627 in 9808ns
12:37:32.838347 422255 compressToBuffer 5 size 512 compressed to 411 in 1893ns
12:37:32.841931 422255 compressToBuffer 5 size 1000 compressed to 717 in 2885ns
} | ||
catch (IException *E) | ||
{ | ||
E->Release(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so - it will only happen for aes encryption of small buffers.
35bab01
to
053ac0b
Compare
053ac0b
to
1a2e915
Compare
@ghalliday I have fixed the sanitize check issue with len=0, removed disabled code, retargeted to 9.4.x and squashed. Please re-review and merge if accepted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately this has created a regression. If you create workunits (and deploy to roxie) with the old code, and then read with the new code the system, crashes.
It is beause
void decompressToBuffer(MemoryBuffer & out, const void * src)
does not have a leading version - it is processed in decompressResource. I think the correct fix is to pass version (and len) from decompress resource into a shared function.
Tested with that change and it fixed the regression. The compressed file size varies slightly (due to the removal of the 256 limit). For the performance suite (admittedly not representative) the timings did not vary, sizes were as follows:
The first is slightly smaller. The second is very slightly bigger - which is interesting but insignificant. I would expect the files to be slightly smaller, but occasionally bigger because it will change the boundaries of the blocks of data being compressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardkchapman please squash.
… methods Also add some tests, and fix some issues that the tests showed up with small input sizes. Ensure back compatibility for LZW LE compression Signed-off-by: Richard Chapman <rchapman@hpccsystems.com>
2e6ca90
to
f59328b
Compare
@ghalliday Squashed |
Type of change:
Checklist:
Smoketest:
Testing: