-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC 30381 Parquet plugin function names and member variables should be more consistent. #18024
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor changes, all in comments. Looks good!
plugins/parquet/parquetembed.cpp
Outdated
* | ||
* @param option The read or write option as well as information about partitioning. | ||
* @param _location The location to read a parquet file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it read, "The full path from which to read a Parquet file"?
Additionally: Recommend capitalizing "Parquet" in all comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capitalized Parquet and reworded to clarify it is a full path to the target.
plugins/parquet/parquetembed.cpp
Outdated
* | ||
* @param option The read or write option as well as information about partitioning. | ||
* @param _location The location to read a parquet file. | ||
* @param _batchSize The size of the batches when converting parquet columns to rows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What scale is this? Bytes? Megabytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
batchSize is the number of rows in the RecordBatch. Changed comment to be more descriptive.
plugins/parquet/parquetembed.cpp
Outdated
} | ||
|
||
/** | ||
* @brief Contructs a ParquetWriter for the target location and checks for existing data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: 'Contructs'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
plugins/parquet/parquetembed.cpp
Outdated
* @brief Contructs a ParquetWriter for the target location and checks for existing data. | ||
* | ||
* @param option The read or write option as well as information about partitioning. | ||
* @param _destination The destination to write a parquet file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a full path, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to specify this is the full path.
plugins/parquet/parquetembed.cpp
Outdated
* | ||
* @param option The read or write option as well as information about partitioning. | ||
* @param _destination The destination to write a parquet file. | ||
* @param _rowSize The max row group size when creating RecordBatches for output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the scale? Bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworded to clarify that rowSize is the maximum number of rows in a RecordBatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I suggest _maxRowCountInBatch or something along those lines as a variable name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to maxRowCountInBatch. That is much more readable and causes less confusion.
plugins/parquet/parquetembed.cpp
Outdated
} | ||
|
||
std::unordered_map<std::string, std::shared_ptr<arrow::Array>> &ParquetHelper::next() | ||
/** | ||
* @brief convert a vector of rapidjson::Documents containing single rows to an arrow::RecordBatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Capital 'convert'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
plugins/parquet/parquetembed.cpp
Outdated
__int64 rowSize = 40000; // Size of the row groups when writing to parquet files | ||
__int64 batchSize = 40000; // Size of the batches when converting parquet columns to rows | ||
bool overwrite = false; // If true overwrite file with no error. The default is false and will throw an error if the file already exists. | ||
arrow::Compression::type compressionOption = arrow::Compression::UNCOMPRESSED; // Compression option that supports all arrow compression types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is a bit confusing. I think it is referring to 'what is compressionOption' but it somehow makes me think it is describing arrow::Compression::UNCOMPRESSED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworded description of compressionOption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few very minor changes. Looking really good!
plugins/parquet/parquetembed.cpp
Outdated
* | ||
* @param option The read or write option as well as information about partitioning. | ||
* @param _destination The destination to write a parquet file. | ||
* @param _rowSize The max row group size when creating RecordBatches for output. | ||
* @param _destination The full path for which to write a Parquet file or partitioned dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trivial: 'full path for which' -> 'full path to which'
Also: If the path can represent either a file or a directory, depending on the Parquet file type, then you might want to note that. (virtually copy this comment to other 'path' documentation lines).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added better descriptions to location and destination comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Please squash.
@dcamper Squashed. |
@ghalliday This is ready to merge. |
@jackdelv I should have checked. For the future, the format of the commit should be "HPCC-NNNN" rather than "HPCC NNNN". |
Type of change:
Checklist:
Smoketest:
Testing: