[Feature] Enable the capability to specify zstd and lz4 segment compression via config #14008
Changes from all commits
fcb896c
045130e
0f1fd0c
1022788
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -69,6 +69,12 @@ public class TarCompressionUtils { | |
private TarCompressionUtils() { | ||
} | ||
|
||
/** | ||
* This generic compressed tar file extension does not bind to a particular compressor. Decompression determines the | ||
* appropriate compressor at run-time based on the file's magic number irrespective of the file extension. | ||
* Compression uses the default compressor automatically if this generic extension is used. | ||
*/ | ||
public static final String TAR_COMPRESSED_FILE_EXTENSION = ".tar.compressed"; | ||
public static final String TAR_GZ_FILE_EXTENSION = ".tar.gz"; | ||
public static final String TAR_LZ4_FILE_EXTENSION = ".tar.lz4"; | ||
public static final String TAR_ZST_FILE_EXTENSION = ".tar.zst"; | ||
|
@@ -77,6 +83,13 @@ private TarCompressionUtils() { | |
CompressorStreamFactory.LZ4_FRAMED, TAR_ZST_FILE_EXTENSION, CompressorStreamFactory.ZSTANDARD); | ||
private static final CompressorStreamFactory COMPRESSOR_STREAM_FACTORY = CompressorStreamFactory.getSingleton(); | ||
private static final char ENTRY_NAME_SEPARATOR = '/'; | ||
private static String _defaultCompressorName = CompressorStreamFactory.GZIP; | ||
|
||
public static void setDefaultCompressor(String compressorName) { | ||
if (COMPRESSOR_NAME_BY_FILE_EXTENSIONS.containsKey(compressorName)) { | ||
_defaultCompressorName = compressorName; | ||
} | ||
} | ||
|
||
/** | ||
* Creates a compressed tar file from the input file/directory to the output file. The output file must have | ||
|
@@ -93,15 +106,29 @@ public static void createCompressedTarFile(File inputFile, File outputFile) | |
*/ | ||
public static void createCompressedTarFile(File[] inputFiles, File outputFile) | ||
throws IOException { | ||
String compressorName = null; | ||
for (String supportedCompressorExtension : COMPRESSOR_NAME_BY_FILE_EXTENSIONS.keySet()) { | ||
if (outputFile.getName().endsWith(supportedCompressorExtension)) { | ||
compressorName = COMPRESSOR_NAME_BY_FILE_EXTENSIONS.get(supportedCompressorExtension); | ||
break; | ||
if (outputFile.getName().endsWith(TAR_COMPRESSED_FILE_EXTENSION)) { | ||
createCompressedTarFile(inputFiles, outputFile, _defaultCompressorName); | ||
} else { | ||
String compressorName = null; | ||
for (String supportedCompressorExtension : COMPRESSOR_NAME_BY_FILE_EXTENSIONS.keySet()) { | ||
if (outputFile.getName().endsWith(supportedCompressorExtension)) { | ||
Can outputFile end with ".tar.compressed", which is not a supported compressor file extension? |
||
compressorName = COMPRESSOR_NAME_BY_FILE_EXTENSIONS.get(supportedCompressorExtension); | ||
createCompressedTarFile(inputFiles, outputFile, compressorName); | ||
We can move the common call createCompressedTarFile(inputFiles, outputFile, compressorName) after the precondition check. |
||
return; | ||
} | ||
Use a break inside the if block here? |
||
} | ||
Preconditions.checkState(null != compressorName, | ||
"Output file: %s does not have a supported compressed tar file extension", outputFile); | ||
} | ||
Preconditions.checkState(null != compressorName, | ||
"Output file: %s does not have a supported compressed tar file extension", outputFile); | ||
} | ||
|
||
public static void createCompressedTarFile(File inputFile, File outputFile, String compressorName) | ||
throws IOException { | ||
createCompressedTarFile(new File[]{inputFile}, outputFile, compressorName); | ||
} | ||
|
||
public static void createCompressedTarFile(File[] inputFiles, File outputFile, String compressorName) | ||
throws IOException { | ||
try (OutputStream fileOut = Files.newOutputStream(outputFile.toPath()); | ||
BufferedOutputStream bufferedOut = new BufferedOutputStream(fileOut); | ||
OutputStream compressorOut = COMPRESSOR_STREAM_FACTORY.createCompressorOutputStream(compressorName, | ||
|
Mostly good. We are able to compress and decompress different segment formats (tar.gz, tar.zst, tar.compressed) even when they appear in one table.
Curious: if we expose and update the value of _defaultCompressorName (line 86), how can we make sure that existing .tar.compressed files can still be decompressed after the default compressor changes? In other words, do we have a test covering the scenario of changing the default compressor and confirming that existing .tar.compressed segments can still be decompressed?
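For reference, the write-side resolution this question concerns can be sketched roughly as follows. This is a simplified, hypothetical stand-in for the logic in the diff, not the PR's code: the class and method names are assumptions, and plain strings stand in for the `CompressorStreamFactory` constants. It shows that the generic extension resolves to whatever the default is at write time, so the extension alone never records which compressor produced the file. The sketch validates a new default against the map's values (compressor names), since the map is keyed by file extension.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not part of Pinot. */
final class CompressorResolver {
  static final String GENERIC_EXTENSION = ".tar.compressed";

  // Extension -> compressor name (string values are assumptions for this sketch).
  private static final Map<String, String> NAME_BY_EXTENSION = new LinkedHashMap<>();

  static {
    NAME_BY_EXTENSION.put(".tar.gz", "gz");
    NAME_BY_EXTENSION.put(".tar.lz4", "lz4-framed");
    NAME_BY_EXTENSION.put(".tar.zst", "zstd");
  }

  private String _defaultName = "gz";

  /** Accepts only known compressor names; returns whether the default changed. */
  boolean setDefault(String compressorName) {
    if (NAME_BY_EXTENSION.containsValue(compressorName)) {
      _defaultName = compressorName;
      return true;
    }
    return false;
  }

  /** Picks the compressor for an output file name, or fails on an unknown extension. */
  String resolve(String fileName) {
    if (fileName.endsWith(GENERIC_EXTENSION)) {
      return _defaultName;
    }
    for (Map.Entry<String, String> entry : NAME_BY_EXTENSION.entrySet()) {
      if (fileName.endsWith(entry.getKey())) {
        return entry.getValue();
      }
    }
    throw new IllegalStateException(
        "Output file: " + fileName + " does not have a supported compressed tar file extension");
  }
}
```

With this shape, the single call to the compressing overload can sit after the resolution step, so the precondition failure and the common call each appear once.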
The unit test is here in the Apache Commons library. You can name the file extension whatever you want, such as .tar.deemoliu, and you'd still be able to decompress the segment. Decompression does not rely on the file extension to figure out which compressor to use.

Sounds good, so we use the first bytes of the file to identify which decompressor to use, and the Apache library initializes the correct one for us.
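To make the "first bytes" point concrete, here is a minimal sketch of signature-based detection using only the JDK. It is not how Commons Compress is implemented internally and not code from this PR: the class name, method names, and returned labels are assumptions for illustration. The signatures are the standard leading bytes of the gzip, Zstandard, and LZ4 frame formats.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch; not part of Pinot or Commons Compress. */
final class CompressorSniffer {
  // Leading signature bytes of each format, in on-disk order.
  private static final byte[] GZIP_MAGIC = {(byte) 0x1f, (byte) 0x8b};
  private static final byte[] ZSTD_MAGIC = {(byte) 0x28, (byte) 0xb5, (byte) 0x2f, (byte) 0xfd};
  private static final byte[] LZ4_FRAME_MAGIC = {(byte) 0x04, (byte) 0x22, (byte) 0x4d, (byte) 0x18};

  private CompressorSniffer() {
  }

  /** Returns a label for the detected format, or null when no known signature matches. */
  static String detect(byte[] header) {
    if (startsWith(header, GZIP_MAGIC)) {
      return "gz";
    }
    if (startsWith(header, ZSTD_MAGIC)) {
      return "zstd";
    }
    if (startsWith(header, LZ4_FRAME_MAGIC)) {
      return "lz4-framed";
    }
    return null;
  }

  private static boolean startsWith(byte[] data, byte[] prefix) {
    if (data.length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if (data[i] != prefix[i]) {
        return false;
      }
    }
    return true;
  }

  /** Compresses the payload with the JDK's gzip stream so there are real gzip bytes to inspect. */
  static byte[] gzip(byte[] payload) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(payload);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }
}
```

Because `detect` looks only at content, bytes produced by `gzip(...)` are recognized regardless of what the file holding them is named, which is why a `.tar.compressed` segment written under an earlier default stays readable after the default changes.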