A simple Spark benchmark #8813
philerooski started this conversation in General
I did a sort of naive benchmark of GE running in an AWS Glue Spark environment using the `RuntimeDataConnector` and `RuntimeBatchRequest` APIs. I won't go into too much implementation detail (it would probably only complicate things), but I noticed that GE was slightly, but not insignificantly, slower than PySpark on a simple test: checking a single field for null values.

Effectively, I had PySpark doing something like this:
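Roughly along these lines; the column name and dataset path below are placeholders, not the exact code from the benchmark:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the Parquet dataset and count nulls in a single column.
# "my_column" and the S3 path are placeholders.
spark_df = spark.read.parquet("s3://my-bucket/my-dataset/")
null_count = spark_df.filter(F.col("my_column").isNull()).count()
print(f"Null values in my_column: {null_count}")
```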
And GE was validating an expectation suite containing a single not-null expectation against a `pyspark.sql.DataFrame` (e.g., passing `runtime_parameters={"batch_data": spark_df}` to the `RuntimeBatchRequest`), roughly like the sketch below.
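A minimal sketch of that GE side, assuming a Spark datasource with a `RuntimeDataConnector` is already configured in the data context and an expectation suite containing `expect_column_values_to_not_be_null` already exists; the datasource, connector, asset, identifier, and suite names are placeholders:

```python
import great_expectations as gx
from great_expectations.core.batch import RuntimeBatchRequest

context = gx.get_context()

# Wrap the in-memory Spark DataFrame in a runtime batch request.
# All names below must match whatever is configured in the context.
batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",
    data_connector_name="my_runtime_data_connector",
    data_asset_name="my_parquet_dataset",
    runtime_parameters={"batch_data": spark_df},
    batch_identifiers={"run_id": "null_check_benchmark"},
)

# Validate against a suite that contains the single not-null expectation.
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="null_check_suite",
)
results = validator.validate()
print(results.success)
```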
GE was ~25-100% slower than PySpark for Parquet datasets with an on-disk size of ~5-15 GB. I used log messages to verify that I was only counting time during the actual null-value check, and not counting any setup time. I thought that GE might be slower because it is loading the entire row into memory, whereas PySpark is operating on a single column. But this is little more than a guess.
I would love to hear from someone more familiar with GE whether that is the case, and whether there are other factors I'm not considering that could explain why GE is slower than PySpark in this example.