[SPARK-51740][SQL] Allow get_json_object to consider leading spaces in paths #50533

Open
wants to merge 7 commits into
base: master

harshmotw-db
Contributor

What changes were proposed in this pull request?

This PR makes get_json_object respect leading spaces and tabs in path keys. Previously, leading whitespace in a key was silently dropped, so the lookup matched a different key, as shown below:

scala> spark.sql("""select get_json_object('{" a b c ": " leading space present", "a b c ": "leading space absent"}', "$[' a b c ']")""").show()
+------------------------------------------------------------------------------------------------------+
|get_json_object({" a b c ": " leading space present", "a b c ": "leading space absent"}, $[' a b c '])|
+------------------------------------------------------------------------------------------------------+
|                                                                                  leading space absent|
+------------------------------------------------------------------------------------------------------+
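
To make the difference concrete, here is a minimal, self-contained sketch of the two behaviors. It is not Spark's path parser: `LeadingSpaceSketch`, `bracketKey`, and `lookup` are hypothetical names used only for illustration, and the JSON document is modeled as a plain `Map` rather than parsed JSON.

```scala
object LeadingSpaceSketch {
  // Hypothetical helper: pulls the key out of a bracketed path step such as "[' a b c ']".
  // With skipLeadingSpaces = true it mimics the old behavior by dropping leading spaces/tabs.
  def bracketKey(step: String, skipLeadingSpaces: Boolean): Option[String] = {
    val pattern = """\['([^']+)'\]""".r
    step match {
      case pattern(raw) =>
        Some(if (skipLeadingSpaces) raw.dropWhile(c => c == ' ' || c == '\t') else raw)
      case _ => None
    }
  }

  // Hypothetical lookup: resolves a single-step path like "$['key']" against a flat map.
  def lookup(doc: Map[String, String], path: String, skipLeadingSpaces: Boolean): Option[String] =
    if (!path.startsWith("$")) None
    else bracketKey(path.drop(1), skipLeadingSpaces).flatMap(doc.get)

  def main(args: Array[String]): Unit = {
    // Same two keys as in the example above; they differ only by the leading space.
    val doc = Map(
      " a b c " -> " leading space present",
      "a b c " -> "leading space absent"
    )
    // Old behavior: the leading space is dropped, so the other key matches.
    println(lookup(doc, "$[' a b c ']", skipLeadingSpaces = true))
    // Proposed behavior: the key is matched exactly as written.
    println(lookup(doc, "$[' a b c ']", skipLeadingSpaces = false))
  }
}
```

In this sketch, skipping leading whitespace returns the value of the key without the leading space, which mirrors the output above; preserving it returns the value of the key that was actually requested.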

Why are the changes needed?

JSON keys can have leading spaces, but there is currently no way to extract such keys using get_json_object. Worse, if the document also contains a nearly identical key without the leading whitespace, the query silently returns the value of that other key.

Does this PR introduce any user-facing change?

Yes. This is a behavioral change guarded by a feature flag.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Apr 7, 2025
@HyukjinKwon HyukjinKwon changed the title [SPARK-51740] Allow get_json_object to consider leading spaces in paths [SPARK-51740][SQL] Allow get_json_object to consider leading spaces in paths Apr 8, 2025
Contributor

@gene-db gene-db left a comment

@harshmotw-db Thanks for finding and fixing this bug!

val GET_JSON_OBJECT_SKIP_LEADING_SPACES =
  buildConf("spark.sql.getJsonObject.skipLeadingSpaces")
    .internal()
    .doc("When true, paths in the getJsonObject will skip leading spaces/tabs in the path.")
Contributor

What is a scenario where we would want to skip leading whitespace? Wouldn't that just be incorrect behavior?

Contributor Author

Yeah, but this is a very old expression and the fix changes its behavior, so some workloads might depend on the old behavior. I could remove the flag if accommodating those workloads is not important.

@harshmotw-db harshmotw-db requested a review from gene-db April 18, 2025 18:11
2 participants