
Are Apache Sedona geometry functions compatible with Spark Connect? #1764

Open
barrieca opened this issue Jan 18, 2025 · 7 comments
@barrieca

Hello.
We are trying to use geometry data with Apache Spark Connect and Apache Sedona. We are able to convert binary geometry data to Sedona geometry types using ST_GeomFromWKB on a local Apache Sedona instance, but when attempting to do this via a remote Spark Connect server, the ST_GeomFromWKB function is unable to be found (see below error). Are Sedona operations compatible with a Spark Connect server?

pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `ST_GeomFromWKB` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].; line 1 pos 0

Actual behavior

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from sedona.spark import *


spark = SparkSession.builder.remote("sc://<spark_connect_address>:<port>").getOrCreate()
url = "jdbc:postgresql://<database_address>"

sedona = SedonaContext.create(spark)
df = (
    sedona.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "nyc_neighborhoods")
    .load()
    .withColumn("geom", f.expr("ST_GeomFromWKB(geom)"))
)

df.show()

Running this code produces the above error at df.show(). When we use Sedona Spark in conjunction with our Spark Connect server without geospatial data (i.e., we don't use .withColumn("geom", f.expr("ST_GeomFromWKB(geom)"))), there is no error; the data is loaded and made available with the geom column in the original binary form.
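For context on what the untransformed geom column contains: it is well-known binary (WKB), the format ST_GeomFromWKB parses. A minimal stdlib-only sketch of decoding a 2D WKB point (for illustration; Sedona handles all geometry types and both byte orders in full):

```python
import struct

def parse_wkb_point(wkb: bytes):
    # WKB layout for a 2D point: 1-byte byte-order flag (1 = little-endian),
    # a 4-byte unsigned geometry type (1 = Point), then two 8-byte doubles.
    byte_order = "<" if wkb[0] == 1 else ">"
    (geom_type,) = struct.unpack(byte_order + "I", wkb[1:5])
    assert geom_type == 1, "only Point is handled in this sketch"
    x, y = struct.unpack(byte_order + "dd", wkb[5:21])
    return x, y

# Little-endian WKB for POINT (1 2)
wkb = bytes.fromhex("0101000000000000000000F03F0000000000000040")
print(parse_wkb_point(wkb))  # (1.0, 2.0)
```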

Note: We are using the PostGIS demo database found here.

Steps to reproduce the problem

  1. Start the Spark Connect server:
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0,org.apache.sedona:sedona-spark-shaded-3.5_2.12:1.7.0,org.datasyslab:geotools-wrapper:1.7.0-28.5,org.postgresql:postgresql:42.7.4 --repositories https://artifacts.unidata.ucar.edu/repository/unidata-all --executor-memory 28G

2. Run the Python code above.

Settings

Sedona version = 1.7.0

Apache Spark version = 3.5.0

Scala version = 2.12

Python version = 3.8


@jiayuasu
Member

@barrieca Sedona should be able to work with spark-connect and we have tests for it.

@barrieca
Author

@jiayuasu, do you happen to see anything obviously incorrect about how we are starting the Connect server or how we are connecting to it from Python? Our hunch is that the jars are not being loaded correctly when the server starts.

Additionally, would you mind pointing us to the tests?

@jiayuasu
Member

@barrieca In this commit, we added spark-connect support for Sedona DataFrame API: #1639

@Kontinuation
Member

You need to add additional configuration options when starting the Spark Connect server to load Sedona's Spark SQL extension:

./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0,org.apache.sedona:sedona-spark-shaded-3.5_2.12:1.7.0,org.datasyslab:geotools-wrapper:1.7.0-28.5,org.postgresql:postgresql:42.7.4 --repositories https://artifacts.unidata.ucar.edu/repository/unidata-all --executor-memory 28G \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=org.apache.sedona.core.serde.SedonaKryoRegistrator \
--conf spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions

This will make ST_ functions available in Spark Connect sessions.
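If you prefer not to pass these flags on every launch, the same settings can live in Spark's conf/spark-defaults.conf, which start-connect-server.sh picks up automatically (a sketch under the assumption that a defaults file fits your deployment):

```
spark.serializer         org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator   org.apache.sedona.core.serde.SedonaKryoRegistrator
spark.sql.extensions     org.apache.sedona.sql.SedonaSqlExtensions
```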

@barrieca
Author

barrieca commented Feb 4, 2025

Thanks, @Kontinuation! That worked!

As an aside, is there documentation for this somewhere that we missed?

@jiayuasu
Member

jiayuasu commented Feb 4, 2025

@barrieca I don't think this is documented as spark-connect is a pretty new feature. Maybe you can help us improve the documentation here: https://sedona.apache.org/latest/setup/cluster/
