Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new rule PinotSeminJoinDistinctProjectRule to apply a distinct to a semi join right side project #14758

Merged
merged 1 commit into from
Jan 11, 2025

Conversation

xiangfu0
Copy link
Contributor

@xiangfu0 xiangfu0 commented Jan 6, 2025

Add a new rule PinotSeminJoinDistinctProjectRule to apply a distinct to a semi join right side project using hint: append_distinct_to_semi_join_project='true'

The goal is to apply a distinct on the Semi Join Right side Project Node.
E.g. in ColocatedJoinEngineQuickStart :
the query:

SET useMultistageEngine = true;
SET maxRowsInJoin = 100000000;
SET timeoutMs = 300000;
explain plan for
SELECT /*+ joinOptions(append_distinct_to_semi_join_project = 'true') */ 
  distinctCount(a.userUUID),
  a.deviceOS
from userAttributes a
  JOIN (
    SELECT DISTINCT userUUID
    FROM userGroups
    WHERE groupUUID = 'group-1'
  ) g ON a.userUUID = g.userUUID
GROUP BY a.deviceOS

The old query plan is:

Execution Plan
LogicalProject(EXPR$0=[$1], deviceOS=[$0])
  PinotLogicalAggregate(group=[{0}], agg#0=[DISTINCTCOUNT($1)], aggType=[FINAL])
    PinotLogicalExchange(distribution=[hash[0]])
      PinotLogicalAggregate(group=[{0}], agg#0=[DISTINCTCOUNT($1)], aggType=[LEAF])
        LogicalJoin(condition=[=($1, $2)], joinType=[semi])
          LogicalProject(deviceOS=[$4], userUUID=[$6])
            LogicalTableScan(table=[[default, userAttributes]])
          PinotLogicalExchange(distribution=[hash[0]], relExchangeType=[PIPELINE_BREAKER])
            LogicalProject(userUUID=[$4])
              LogicalFilter(condition=[=($3, _UTF-8'group-1')])
                LogicalTableScan(table=[[default, userGroups]])

The new query plan is:

Execution Plan
LogicalProject(EXPR$0=[$1], deviceOS=[$0])
  PinotLogicalAggregate(group=[{0}], agg#0=[DISTINCTCOUNT($1)], aggType=[FINAL])
    PinotLogicalExchange(distribution=[hash[0]])
      PinotLogicalAggregate(group=[{0}], agg#0=[DISTINCTCOUNT($1)], aggType=[LEAF])
        LogicalJoin(condition=[=($1, $2)], joinType=[semi])
          LogicalProject(deviceOS=[$4], userUUID=[$6])
            LogicalTableScan(table=[[default, userAttributes]])
          PinotLogicalExchange(distribution=[hash[0]], relExchangeType=[PIPELINE_BREAKER])
            PinotLogicalAggregate(group=[{0}], aggType=[FINAL])
              PinotLogicalExchange(distribution=[hash[0]])
                PinotLogicalAggregate(group=[{4}], aggType=[LEAF])
                  LogicalFilter(condition=[=($3, _UTF-8'group-1')])
                    LogicalTableScan(table=[[default, userGroups]])

@xiangfu0 xiangfu0 force-pushed the semi-join-distinct-project-rule branch from 6ff9572 to df34109 Compare January 6, 2025 04:25
@xiangfu0 xiangfu0 requested a review from Jackie-Jiang January 6, 2025 04:41
@xiangfu0 xiangfu0 added query multi-stage Related to the multi-stage query engine labels Jan 6, 2025
@xiangfu0 xiangfu0 requested a review from kishoreg January 6, 2025 04:46
@codecov-commenter
Copy link

codecov-commenter commented Jan 6, 2025

Codecov Report

Attention: Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.

Project coverage is 63.87%. Comparing base (59551e4) to head (83870e1).
Report is 1557 commits behind head on master.

Files with missing lines Patch % Lines
...e/rel/rules/PinotSeminJoinDistinctProjectRule.java 94.44% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14758      +/-   ##
============================================
+ Coverage     61.75%   63.87%   +2.12%     
- Complexity      207     1612    +1405     
============================================
  Files          2436     2705     +269     
  Lines        133233   150995   +17762     
  Branches      20636    23323    +2687     
============================================
+ Hits          82274    96445   +14171     
- Misses        44911    47336    +2425     
- Partials       6048     7214    +1166     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.83% <94.44%> (+2.12%) ⬆️
java-21 63.76% <94.44%> (+2.13%) ⬆️
skip-bytebuffers-false 63.86% <94.44%> (+2.12%) ⬆️
skip-bytebuffers-true 63.74% <94.44%> (+36.01%) ⬆️
temurin 63.87% <94.44%> (+2.12%) ⬆️
unittests 63.86% <94.44%> (+2.12%) ⬆️
unittests1 56.31% <94.44%> (+9.42%) ⬆️
unittests2 34.13% <16.66%> (+6.40%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@xiangfu0 xiangfu0 force-pushed the semi-join-distinct-project-rule branch from df34109 to 18efb4f Compare January 6, 2025 17:56
@Jackie-Jiang
Copy link
Contributor

Will this rule prevent us from doing:

SELECT ... FROM t1 WHERE t1.col IN (SELECT t2.col FROM t2)

We don't always want to pay overhead of distinct when we already know the value is unique, or there are very few duplicates.

Even without distinct, the result is still correct, just not necessary the most efficient way to execute. For the example query, in order to achieve the desired query plan, we can write it as:

SELECT
  distinctCount(a.userUUID),
  a.deviceOS
FROM userAttributes a WHERE a.userUUID IN
  (
    SELECT DISTINCT userUUID
    FROM userGroups
    WHERE groupUUID = 'group-1'
  )
GROUP BY a.deviceOS

@xiangfu0
Copy link
Contributor Author

xiangfu0 commented Jan 8, 2025

Will this rule prevent us from doing:

SELECT ... FROM t1 WHERE t1.col IN (SELECT t2.col FROM t2)

We don't always want to pay overhead of distinct when we already know the value is unique, or there are very few duplicates.

Even without distinct, the result is still correct, just not necessary the most efficient way to execute. For the example query, in order to achieve the desired query plan, we can write it as:

SELECT
  distinctCount(a.userUUID),
  a.deviceOS
FROM userAttributes a WHERE a.userUUID IN
  (
    SELECT DISTINCT userUUID
    FROM userGroups
    WHERE groupUUID = 'group-1'
  )
GROUP BY a.deviceOS

I agree with you on the rewrite part.
I feel the main issue here is that the plan generated is a semi join without distinct. Even though the inner query has a distinct.

Technically this should be controllable or we may use a hint to turn on/off the rule?

@xiangfu0 xiangfu0 force-pushed the semi-join-distinct-project-rule branch 2 times, most recently from 56cdf1e to 753eef2 Compare January 8, 2025 19:35
@xiangfu0
Copy link
Contributor Author

xiangfu0 commented Jan 9, 2025

Amended this PR to add a custom rule to add the distinct on the right side PROJECT.

@Jackie-Jiang
Copy link
Contributor

Only the queries with right DISTINCT and no right project regular join can be rewritten to SEMI join. IMO we shouldn't allow such auto-rewrite because it doesn't guarantee better performance. Queries written as regular join should be executed as regular join; written as semi join should be executed as semi join.

@Jackie-Jiang
Copy link
Contributor

I'm not sure if this hint is useful. If user knows to add this hint, they should just write the query as semi join

@xiangfu0
Copy link
Contributor Author

xiangfu0 commented Jan 9, 2025

Only the queries with right DISTINCT and no right project regular join can be rewritten to SEMI join. IMO we shouldn't allow such auto-rewrite because it doesn't guarantee better performance. Queries written as regular join should be executed as regular join; written as semi join should be executed as semi join.

The query mentioned in the PR description is already planned as a Semi join in the query plan.

@xiangfu0
Copy link
Contributor Author

xiangfu0 commented Jan 9, 2025

From user perspective, adding rule is definitely simpler than rewrite query.

And this kind of thing cqn only work for power users who knows how to check query plans and rewrite queries as well.

I'm not sure if this hint is useful. If user knows to add this hint, they should just write the query as semi join

@Jackie-Jiang
Copy link
Contributor

Only the queries with right DISTINCT and no right project regular join can be rewritten to SEMI join. IMO we shouldn't allow such auto-rewrite because it doesn't guarantee better performance. Queries written as regular join should be executed as regular join; written as semi join should be executed as semi join.

The query mentioned in the PR description is already planned as a Semi join in the query plan.

Because it matches the conditions above: right side of JOIN has distinct values; there is no project to right side after join.
The rewrite is done by CoreRules.PROJECT_TO_SEMI_JOIN, so I'm thinking maybe we should remove this rule, and only allow explicit SEMI join by using IN clause.

From user perspective, adding rule is definitely simpler than rewrite query. And this kind of thing can only work for power users who knows how to check query plans and rewrite queries as well.

What I meant is that currently in order to apply the rule, user needs to provide a query hint. From my own experience, knowing how to apply the hint is harder than rewriting the query.

@Jackie-Jiang
Copy link
Contributor

When there is no project to the right table, there are 4 ways to execute:

  • Regular join without distinct on right table
  • Regular join with distinct on right table (not possible, always rewritten to SEMI join)
  • SEMI join without distinct on right table (won't be possible with the new rule)
  • SEMI join with distinct on right table

Ideally, we want to able to control which one to use because each of them might be best for certain use cases.

@xiangfu0 xiangfu0 force-pushed the semi-join-distinct-project-rule branch from 753eef2 to 919c784 Compare January 9, 2025 23:30
@xiangfu0 xiangfu0 force-pushed the semi-join-distinct-project-rule branch from 919c784 to 83870e1 Compare January 10, 2025 06:54
@xiangfu0 xiangfu0 requested a review from Jackie-Jiang January 10, 2025 17:10
@xiangfu0 xiangfu0 merged commit ba21d53 into apache:master Jan 11, 2025
21 checks passed
@xiangfu0 xiangfu0 deleted the semi-join-distinct-project-rule branch January 11, 2025 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multi-stage Related to the multi-stage query engine query
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants