A simple Spark benchmark #8813
philerooski started this conversation in General
I did a sort of naive benchmark of GE running in an AWS Glue Spark environment using the `RuntimeDataConnector` and `RuntimeBatchRequest` APIs. I won't go into too much implementation detail (it would probably only complicate things), but I noticed that GE was slightly, but not insignificantly, slower than PySpark on a simple test: checking a single field for null values.

Effectively, I had PySpark doing something like this:
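Roughly along these lines; the column name and dataset path below are placeholders, not the exact code from the benchmark:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the Parquet dataset and count nulls in a single column.
# "my_column" and the S3 path are placeholders.
spark_df = spark.read.parquet("s3://my-bucket/my-dataset/")
null_count = spark_df.filter(F.col("my_column").isNull()).count()
print(f"Null values in my_column: {null_count}")
```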
And GE was validating an expectation suite containing a single not-null expectation against a `pyspark.sql.DataFrame` (e.g., passing `runtime_parameters={"batch_data": spark_df}` to the `RuntimeBatchRequest`), roughly like the sketch below.
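A minimal sketch of that GE side, assuming a Spark datasource with a `RuntimeDataConnector` is already configured in the data context and an expectation suite containing `expect_column_values_to_not_be_null` already exists; the datasource, connector, asset, identifier, and suite names are placeholders:

```python
import great_expectations as gx
from great_expectations.core.batch import RuntimeBatchRequest

context = gx.get_context()

# Wrap the in-memory Spark DataFrame in a runtime batch request.
# All names below must match whatever is configured in the context.
batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",
    data_connector_name="my_runtime_data_connector",
    data_asset_name="my_parquet_dataset",
    runtime_parameters={"batch_data": spark_df},
    batch_identifiers={"run_id": "null_check_benchmark"},
)

# Validate against a suite that contains the single not-null expectation.
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="null_check_suite",
)
results = validator.validate()
print(results.success)
```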
GE was ~25-100% slower than PySpark for Parquet datasets with an on-disk size of ~5-15 GB. I used log messages to verify that I was only counting time during the actual null-value check, and not counting any setup time. I thought that GE might be slower because it is loading the entire row into memory, whereas PySpark is operating on a single column. But this is little more than a guess.
I would love to hear from someone more familiar with GE whether that is the case, and whether there are other factors I'm not considering that could explain why GE is slower than PySpark in this example.