
[SPARK-51739][PYTHON] Validate Arrow schema from mapInArrow & mapInPandas & DataSource #50531


Closed · wengh wants to merge 10 commits into master from validate-arrow-type

Conversation

@wengh wengh (Contributor) commented Apr 7, 2025

What changes were proposed in this pull request?

Check the actual Arrow batch schema against the declared schema in MapInBatchEvaluator, throwing an error if they don't match.

Also fix the Pandas-to-Arrow conversion in ArrowStreamPandasUDFSerializer to respect the nullability of the output schema fields.
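
For illustration, here is a minimal PyArrow-level sketch of the kind of check being added. The real validation lives in the Scala MapInBatchEvaluator (see the diff below) and uses Spark's sameType; the names strip_nullability and check_batch_schema are hypothetical, and only top-level fields are handled for brevity:

import pyarrow as pa

def strip_nullability(schema: pa.Schema) -> pa.Schema:
    # Drop the nullable flag from each top-level field (nested fields are not
    # handled here) so the comparison mirrors the "ignore nullability"
    # semantics of Spark's sameType.
    return pa.schema([pa.field(f.name, f.type) for f in schema])

def check_batch_schema(batch: pa.RecordBatch, expected: pa.Schema) -> None:
    # Field names, order, and types must all match; nullability is ignored.
    if strip_nullability(batch.schema) != strip_nullability(expected):
        raise ValueError(
            f"[ARROW_TYPE_MISMATCH] expected {expected}, got {batch.schema}"
        )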

Why are the changes needed?

To improve error messages and reject suspicious usage.

Does this PR introduce any user-facing change?

Yes.

Behaviour change

  1. Some schema mismatches that were previously suspicious but accepted are now rejected.

    This includes:

    • extraneous fields (previously ignored)
    • fields of the same type in the wrong order (previously matched by position, silently mislabeling the columns)
    • a field that is expected to be non-nullable but is actually nullable (previously ignored)

    Example:

    from pyspark.sql.datasource import DataSource, DataSourceReader
    from pyspark.sql.pandas.types import to_arrow_schema
    from pyspark.sql.types import StructType
    import pyarrow as pa
    
    expected = StructType.fromDDL("a int, b int")
    actual = StructType.fromDDL("b int, a int")  # wrong order of fields
    
    class TestDataSource(DataSource):
        def schema(self):
            return expected
        def reader(self, schema):
            return TestReader()
    
    class TestReader(DataSourceReader):
        def read(self, partition):
            schema = to_arrow_schema(actual)
            yield pa.record_batch([[1], [2]], schema=schema)
    
    spark.dataSource.register(TestDataSource)
    spark.read.format("TestDataSource").load().show()

    Before:

    +---+---+
    |  a|  b|
    +---+---+
    |  1|  2|
    +---+---+
    

    Now:

    org.apache.spark.SparkException: [ARROW_TYPE_MISMATCH] Invalid schema from pandas_udf(): expected StructType(StructField(a,IntegerType,true),StructField(b,IntegerType,true)), got StructType(StructField(b,LongType,true),StructField(a,LongType,true)). SQLSTATE: 42K0G
    
  2. For other schema mismatches, the error changed from an internal error to a clearer ARROW_TYPE_MISMATCH error.

    This includes:

    • wrong field types
    • fewer fields than expected

    Example:

    from pyspark.sql.pandas.types import to_arrow_schema
    from pyspark.sql.types import StructType, StructField, IntegerType
    import pyarrow as pa
    
    expected = StructType([StructField("a", IntegerType()), StructField("b", IntegerType())])
    actual = StructType([StructField("a", IntegerType())])  # missing a column
    
    def fun(iterator):
        for batch in iterator:
            schema = to_arrow_schema(actual)
            yield pa.record_batch([[1]], schema=schema)
    
    spark.range(2).mapInArrow(fun, expected).show()

    Before:

    java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
        at org.apache.spark.sql.vectorized.ArrowColumnVector.getChild(ArrowColumnVector.java:134)
        at org.apache.spark.sql.execution.python.MapInBatchEvaluatorFactory$MapInBatchEvaluator.$anonfun$eval$3(MapInBatchEvaluatorFactory.scala:82)
        ...
    

    Now:

    org.apache.spark.SparkException: [ARROW_TYPE_MISMATCH] Invalid schema from pandas_udf(): expected StructType(StructField(a,IntegerType,true),StructField(b,IntegerType,true)), got StructType(StructField(a,IntegerType,true)). SQLSTATE: 42K0G
    

How was this patch tested?

End-to-end tests in python/pyspark/sql/tests/arrow/test_arrow_map.py
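
For reference, a minimal sketch of the shape such a test can take, based on the mapInArrow example above (the test name, the pytest style, and the spark fixture are illustrative assumptions, not the actual test code):

import pytest
import pyarrow as pa
from pyspark.sql.pandas.types import to_arrow_schema
from pyspark.sql.types import StructType, StructField, IntegerType

def test_map_in_arrow_rejects_missing_column(spark):
    expected = StructType([StructField("a", IntegerType()), StructField("b", IntegerType())])
    actual = StructType([StructField("a", IntegerType())])  # one field short

    def fun(iterator):
        for _ in iterator:
            yield pa.record_batch([[1]], schema=to_arrow_schema(actual))

    # The mismatch should now surface as ARROW_TYPE_MISMATCH instead of an
    # ArrayIndexOutOfBoundsException.
    with pytest.raises(Exception, match="ARROW_TYPE_MISMATCH"):
        spark.range(2).mapInArrow(fun, expected).collect()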

Was this patch authored or co-authored using generative AI tooling?

No

@wengh wengh changed the title [WIP][PYTHON] Validate type in MapInBatchEvaluatorFactory [WIP][SPARK-51739][PYTHON] Validate Arrow schema from mapInArrow & mapInPandas & DataSource Apr 7, 2025
@wengh wengh changed the title [WIP][SPARK-51739][PYTHON] Validate Arrow schema from mapInArrow & mapInPandas & DataSource [SPARK-51739][PYTHON] Validate Arrow schema from mapInArrow & mapInPandas & DataSource Apr 7, 2025
@wengh wengh marked this pull request as ready for review April 7, 2025 22:02
@@ -75,6 +77,12 @@ class MapInBatchEvaluatorFactory(
      val unsafeProj = UnsafeProjection.create(output, output)

      columnarBatchIter.flatMap { batch =>
+       // Ensure the schema matches the expected schema
+       val actualSchema = batch.column(0).dataType()
+       if (!outputSchema.sameType(actualSchema)) { // Ignore nullability mismatch for now
Member

@HyukjinKwon Should ArrowEvalPythonExec also ignore nullability, and other similar Execs if any? Or should this also NOT ignore it?

@ueshin ueshin (Member) commented Apr 7, 2025

ah, but if this doesn't ignore nullability, it could introduce a breaking change?

Member

Hmmm .. yeah .. maybe let's just ignore it for now ..

@wengh wengh (Contributor, Author) commented Apr 8, 2025

Yeah, not ignoring nullability breaks many existing tests (e.g. where the expected field is non-nullable but the actual field is nullable yet contains no null values, so the actual schema is technically wrong but causes no problems).
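
An illustrative standalone PyArrow snippet of the situation described above (not from the PR):

import pyarrow as pa

# The field is declared nullable but the data contains no nulls: harmless in
# practice, yet a strict schema comparison would reject it on nullability alone.
actual = pa.schema([pa.field("a", pa.int64(), nullable=True)])
expected = pa.schema([pa.field("a", pa.int64(), nullable=False)])

assert actual != expected                                    # strict equality fails
assert actual.field("a").type == expected.field("a").type    # the types still match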

@ueshin ueshin (Member) left a comment

Otherwise, LGTM.

@HyukjinKwon (Member)

The test failure seems unrelated, but mind triggering again to make sure?

@zhengruifeng zhengruifeng (Contributor) left a comment

qq: do we have corresponding validation on the Python side?

@wengh wengh (Contributor, Author) commented Apr 8, 2025

do we have corresponding validation on the Python side?

@zhengruifeng
No. DataSource only checks that top-level columns have matching names; mapInArrow & mapInPandas don't check at all.
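
One plausible shape of such a name-only check, for illustration (hypothetical helper; the actual Spark check may differ in details such as ordering):

import pyarrow as pa

def column_names_match(expected: pa.Schema, batch: pa.RecordBatch) -> bool:
    # Catches missing or misnamed top-level columns, but not wrong field
    # types or nullability mismatches.
    return set(expected.names) == set(batch.schema.names)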

@wengh wengh force-pushed the validate-arrow-type branch from bf9e707 to 12b8fd0 on April 9, 2025 00:14
@@ -518,8 +528,7 @@ def _create_struct_array(self, df, arrow_struct_type, spark_type=None):
            for i, field in enumerate(arrow_struct_type)
        ]

-       struct_names = [field.name for field in arrow_struct_type]
-       return pa.StructArray.from_arrays(struct_arrs, struct_names)
+       return pa.StructArray.from_arrays(struct_arrs, fields=list(arrow_struct_type))
@wengh (Contributor, Author)

Correctly handle non-nullable fields required by the arrow_struct_type schema.
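
An illustrative standalone comparison of the two calls (assuming a non-nullable field in the target schema):

import pyarrow as pa

arrs = [pa.array([1, 2], type=pa.int64())]
struct_fields = [pa.field("a", pa.int64(), nullable=False)]

# Passing only names drops nullability: the resulting field is nullable.
by_name = pa.StructArray.from_arrays(arrs, ["a"])
assert by_name.type.field("a").nullable is True

# Passing the fields preserves the declared non-nullable flag.
by_field = pa.StructArray.from_arrays(arrs, fields=struct_fields)
assert by_field.type.field("a").nullable is False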

@HyukjinKwon (Member)

Merged to master.

@allisonwang-db (Contributor)

Thanks for the fix! But this is a breaking change. Can we document this in the migration guide?

@wengh wengh (Contributor, Author) commented Apr 25, 2025

Thanks for the fix! But this is a breaking change. Can we document this in the migration guide?

@allisonwang-db #50722
