[SPARK-51739][PYTHON] Validate Arrow schema from mapInArrow & mapInPandas & DataSource #50531
Conversation
@@ -75,6 +77,12 @@ class MapInBatchEvaluatorFactory(
     val unsafeProj = UnsafeProjection.create(output, output)

     columnarBatchIter.flatMap { batch =>
       // Ensure the schema matches the expected schema
       val actualSchema = batch.column(0).dataType()
       if (!outputSchema.sameType(actualSchema)) { // Ignore nullability mismatch for now
@HyukjinKwon Should ArrowEvalPythonExec also ignore nullability, and other similar Execs if any? Or should this also NOT ignore it?
ah, but if this doesn't ignore nullability, it could introduce a breaking change?
Hmmm .. yeah .. maybe let's just ignore it for now ..
Yeah, not ignoring nullability breaks many existing tests (e.g. where the expected field is not nullable but the actual field is nullable with no null values, so the actual schema is technically wrong but causes no problems).
Otherwise, LGTM.
The test failure seems unrelated, but mind re-triggering to make sure?
qq: do we have corresponding validation on the Python side?
@zhengruifeng
Force-pushed from bf9e707 to 12b8fd0
@@ -518,8 +528,7 @@ def _create_struct_array(self, df, arrow_struct_type, spark_type=None):
             for i, field in enumerate(arrow_struct_type)
         ]

-        struct_names = [field.name for field in arrow_struct_type]
-        return pa.StructArray.from_arrays(struct_arrs, struct_names)
+        return pa.StructArray.from_arrays(struct_arrs, fields=list(arrow_struct_type))
Correctly handle non-nullable fields required by the arrow_struct_type schema.
Merged to master.
Thanks for the fix! But this is a breaking change. Can we document this in the migration guide?
What changes were proposed in this pull request?
Check the actual Arrow batch schema against the declared schema in MapInBatchEvaluator, throwing an error if they don't match.
Also fix the Pandas-to-Arrow conversion in ArrowStreamPandasUDFSerializer to respect the nullability of output schema fields.
Why are the changes needed?
To improve error messages and reject suspicious usage.
Does this PR introduce any user-facing change?
Yes.
Behaviour change
Some previously suspicious but accepted schema mismatches are no longer valid.
This includes:
Example:
Before:
Now:
For other schema mismatches, the error changed from an internal error to a clearer ARROW_TYPE_MISMATCH error.
This includes:
Example:
Before:
Now:
How was this patch tested?
End-to-end tests in
python/pyspark/sql/tests/arrow/test_arrow_map.py
Was this patch authored or co-authored using generative AI tooling?
No