HPCC 30381 Parquet plugin function names and member variables should be more consistent. #18024

jackdelv · 2023-11-13T20:11:08Z

Type of change:

This change is a bug fix (non-breaking change which fixes an issue).
This change is a new feature (non-breaking change which adds functionality).
This change improves the code (refactor or other change that does not change the functionality)
This change fixes warnings (the fix does not alter the functionality or the generated code)
This change is a breaking change (fix or feature that will cause existing behavior to change).
This change alters the query API (existing queries will have to be recompiled)

Checklist:

Smoketest:

Send notifications about my Pull Request position in Smoketest queue.
Test my draft Pull Request.

Testing:

dcamper

A few minor changes, all in comments. Looks good!

dcamper · 2023-11-14T14:53:13Z

plugins/parquet/parquetembed.cpp

 *
+ * @param option The read or write option as well as information about partitioning.
+ * @param _location The location to read a parquet file.


Should it read, "The full path from which to read a Parquet file"?

Additionally: Recommend capitalizing "Parquet" in all comments.

Capitalized Parquet and reworded to clarify it is a full path to the target.

dcamper · 2023-11-14T14:54:51Z

plugins/parquet/parquetembed.cpp

 *
+ * @param option The read or write option as well as information about partitioning.
+ * @param _location The location to read a parquet file.
 * @param _batchSize The size of the batches when converting parquet columns to rows.


What scale is this? Bytes? Megabytes?

batchSize is the number of rows in the RecordBatch. Changed comment to be more descriptive.
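
To illustrate the answer above, here is a minimal sketch of what "batch size as a row count" means in practice. The helper name `splitIntoBatches` and its signature are hypothetical and are not taken from the plugin; the sketch only shows how a total row count divides into batches of at most `batchSize` rows, independent of how many bytes each row occupies.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the plugin: splits `totalRows` rows into
// consecutive [start, end) ranges holding at most `batchSize` rows each.
std::vector<std::pair<std::int64_t, std::int64_t>> splitIntoBatches(std::int64_t totalRows, std::int64_t batchSize)
{
    // batchSize is a number of rows, not a byte count, so it must be positive.
    assert(totalRows >= 0 && batchSize > 0);

    std::vector<std::pair<std::int64_t, std::int64_t>> ranges;
    for (std::int64_t start = 0; start < totalRows; start += batchSize)
    {
        // The final batch may hold fewer rows than batchSize.
        std::int64_t end = (start + batchSize < totalRows) ? start + batchSize : totalRows;
        ranges.emplace_back(start, end);
    }
    return ranges;
}
```

With the default of 40000 quoted in this review, 100000 rows would fall into three ranges, the last one holding the remaining 20000 rows.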

dcamper · 2023-11-14T14:56:56Z

plugins/parquet/parquetembed.cpp

+}
+
+/**
+ * @brief Contructs a ParquetWriter for the target location and checks for existing data.


Typo: 'Contructs'

dcamper · 2023-11-14T15:02:13Z

plugins/parquet/parquetembed.cpp

+ * @brief Contructs a ParquetWriter for the target location and checks for existing data.
+ *
+ * @param option The read or write option as well as information about partitioning.
+ * @param _destination The destination to write a parquet file.


That is a full path, correct?

Changed to specify this is the full path.

dcamper · 2023-11-14T15:02:33Z

plugins/parquet/parquetembed.cpp

+ *
+ * @param option The read or write option as well as information about partitioning.
+ * @param _destination The destination to write a parquet file.
+ * @param _rowSize The max row group size when creating RecordBatches for output.


What is the scale? Bytes?

Reworded to clarify that rowSize is the maximum number of rows in a RecordBatch.

May I suggest _maxRowCountInBatch or something along those lines as a variable name?

Changed to maxRowCountInBatch. That is much more readable and causes less confusion.

dcamper · 2023-11-14T15:05:24Z

plugins/parquet/parquetembed.cpp

 }

-std::unordered_map<std::string, std::shared_ptr<arrow::Array>> &ParquetHelper::next()
+/**
+ * @brief convert a vector of rapidjson::Documents containing single rows to an arrow::RecordBatch


Minor: Capital 'convert'

dcamper · 2023-11-14T15:08:13Z

plugins/parquet/parquetembed.cpp

+    __int64 rowSize = 40000;            // Size of the row groups when writing to parquet files
+    __int64 batchSize = 40000;          // Size of the batches when converting parquet columns to rows
+    bool overwrite = false;             // If true overwrite file with no error. The default is false and will throw an error if the file already exists.
+    arrow::Compression::type compressionOption = arrow::Compression::UNCOMPRESSED; // Compression option that supports all arrow compression types.


The comment is a bit confusing. I think it is referring to 'what is compressionOption' but it somehow makes me think it is describing arrow::Compression::UNCOMPRESSED

Reworded description of compressionOption.
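
For reference, one possible shape for these member comments after the review feedback is sketched below. This is an illustration only, not the wording that was merged: the struct name `WriterOptionsSketch` and the stand-in enum `CompressionType` are assumptions introduced so the example compiles without Arrow, and `maxRowCountInBatch` follows the rename agreed earlier in this thread.

```cpp
#include <cstdint>

// Stand-in for arrow::Compression::type so this sketch compiles without Arrow.
enum class CompressionType { Uncompressed, Snappy, Gzip };

// Hypothetical grouping of the options discussed in the review.
struct WriterOptionsSketch
{
    std::int64_t maxRowCountInBatch = 40000; // Maximum number of rows in a RecordBatch when writing
    std::int64_t batchSize = 40000;          // Number of rows in each RecordBatch when converting Parquet columns to rows
    bool overwrite = false;                  // If true, replace an existing file; if false, an existing file is an error
    CompressionType compressionOption = CompressionType::Uncompressed; // Codec used when writing; the default is no compression
};
```

Each comment states the unit (rows) or describes the option itself rather than its default value, which is the ambiguity the reviewer pointed out.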

dcamper

A few very minor changes. Looking really good!

dcamper · 2023-11-28T15:04:51Z

plugins/parquet/parquetembed.cpp

 *
 * @param option The read or write option as well as information about partitioning.
- * @param _destination The destination to write a parquet file.
- * @param _rowSize The max row group size when creating RecordBatches for output.
+ * @param _destination The full path for which to write a Parquet file or partitioned dataset.


Trivial: 'full path for which' -> 'full path to which'

Also: If the path can represent either a file or a directory, depending on the Parquet file type, then you might want to note that. (virtually copy this comment to other 'path' documentation lines).

Added better descriptions to location and destination comments.

dcamper

Looks good! Please squash.

jackdelv · 2023-11-28T18:21:03Z

@dcamper Squashed.

jackdelv · 2023-11-29T12:35:02Z

@ghalliday This is ready to merge.

ghalliday · 2023-11-29T14:05:06Z

@jackdelv I should have checked. For the future, the format of the commit should be "HPCC-NNNN" rather than "HPCC NNNN".
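
The format mentioned above could be checked mechanically. The sketch below assumes the rule is simply "the title starts with `HPCC-` followed by digits"; the function name `hasJiraStylePrefix` is hypothetical, and the project may enforce additional rules not stated in this thread.

```cpp
#include <regex>
#include <string>

// Hypothetical check: returns true when the commit title begins with
// "HPCC-" followed by one or more digits (e.g. "HPCC-30381 ...").
bool hasJiraStylePrefix(const std::string &title)
{
    static const std::regex prefix(R"(^HPCC-[0-9]+\b)");
    return std::regex_search(title, prefix);
}
```

Under this assumption, a title beginning "HPCC-30381" passes, while the "HPCC 30381" form used in this pull request does not.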

jackdelv requested a review from dcamper November 13, 2023 20:12

dcamper requested changes Nov 14, 2023

jackdelv requested a review from dcamper November 28, 2023 15:01

dcamper requested changes Nov 28, 2023

jackdelv requested a review from dcamper November 28, 2023 18:13

dcamper approved these changes Nov 28, 2023

HPCC 30381 Parquet plugin functions should be more consistent.

72225e2

jackdelv force-pushed the HPCC-30381 branch from 219b488 to 72225e2 Compare November 28, 2023 18:20

ghalliday merged commit 8c6bfc2 into hpcc-systems:candidate-9.4.x Nov 29, 2023
47 checks passed
