Skip to content

[SPARK-51752][SQL] Enable rCTE referencing from within a CTE #50546

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

Pajaraja
Copy link
Contributor

@Pajaraja Pajaraja commented Apr 9, 2025

What changes were proposed in this pull request?

Enabling the possibility of a CTE referencing the recursive CTE it is inside of. This is done by modifying the CTESubstitution file, consisting of two main parts:

  • If traverseAndSubstituteCTE is called from resolveCTERelations when attempting to resolve a recursive CTE to resolve all the CTEs it references, we remember this ancestor rCTE in case any of the child CTEs want to reference it. If we encounter another rCTE inside of the rCTE (which is only allowed in the anchor), we define it to be the new anchor rCTE.
  • Even though the first part is enough to resolve these CTEs, a new problem arises when trying to identify whether a CTE is recursive or not, since if CTE0 is recursive and CTE1 is a CTE inside CTE0 that references CTE0, the only way to tell whether CTE0 is recursive is to check inside CTE1. For this reason we do in-place CTESubstitution inside of recursive CTEs
  • Additionally we modify the rules to identify cases with in-place CTE substitutions in ResolveWithCTE.

Why are the changes needed?

To make queries that self reference work. An example of such a query is:

WITH RECURSIVE t1 AS (
  SELECT 1 AS n
  UNION ALL
  WITH t2 AS (SELECT n + 1 FROM t1 WHERE n < 5)
  SELECT * FROM t2
) SELECT * FROM t1;

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing golden file tests for this that didn't work before.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Apr 9, 2025
@@ -83,8 +83,21 @@ object ResolveWithCTE extends Rule[LogicalPlan] {
cteDef.copy(child = alias.copy(child = loop))
}

// Simple case of duplicating (UNION ALL) clause, with CTEs inside.
case alias @ SubqueryAlias(_, withCTE @ WithCTE(
Union(Seq(anchor, recursion), false, false), _)) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't care about the query pattern in CTE relations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we do, yes!
Since these CTEs refer to the whole inside of the initial CTE an can be called from the recursion we need to add logic to replace CteRelationRef with UnionLoopRef inside of these CTEs (and not just inside of the recursion). I've changed this now and added some tests to test it.

Pajaraja and others added 4 commits April 11, 2025 14:34
…ysis/ResolveWithCTE.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…ysis/CTESubstitution.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
@@ -255,6 +273,21 @@ object CTESubstitution extends Rule[LogicalPlan] {
outerCTEDefs
}
for ((name, relation) <- relations) {
// If recursion is allowed (RECURSIVE keyword specified)
// then it has higher priority than outer or previous relations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question about semantics here. Let's see an example

WITH RECURSIVE t AS ...
  WITH t AS ...
    SELECT ... FROM t

The t references to the non-recursive CTE def t or the recursive one? It looks more reasonable to reference the non-recursive one as it's closer in the scope. It's weird to make recursive CTE def to have higher priority.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It references the inner non recursive one (just checked). The comment just says that the recursive reference has a higher priority than the outer ones or the ones within in the same WITH statement.

Pajaraja and others added 2 commits April 13, 2025 01:19
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in ece4bc0 Apr 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants