Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite BETWEEN to >= AND <= for SV columns in the v1 query engine #14111

Closed
wants to merge 1 commit into from

Conversation

yashmayya
Copy link
Collaborator

@yashmayya yashmayya commented Sep 30, 2024

  • Rewrite BETWEEN to >= AND <= #14102 attempted to add this rewrite in the PredicateComparisonRewriter since that would fix at least two issues with the BETWEEN filter predicate - WHERE intCol BETWEEN 1.5 AND 2.5 currently doesn't work (since the NumericalFilterOptimizer is not applied to BETWEEN) and WHERE col1 BETWEEN col2 AND col3 also doesn't work (since Pinot's v1 query engine only supports filter predicates with literals as operands other than the LHS column identifier).
  • However, the rewrite can't be applied to MV columns since that changes the semantics of the filter predicate. Since the query rewriters are applied during the query compilation phase in v1 which does not have access to the schema, we can't add the column type check there.
  • This patch moves the rewrite to a new FilterOptimizer since the optimizers do have access to the schema. However, this won't be able to solve the WHERE col1 BETWEEN col2 AND col3 kind of use case since that re-write is only applied to supported filter predicates in PredicateComparisonRewriter which happens before the optimizers are applied.
  • We can't simply move that rewrite into this new optimizer (i.e., replacing col1 BETWEEN col2 AND col3 with col1 - col2 >= 0 AND col1 - col3 <= 0) since the assumption that filter predicates only have literals in all operands (other than the LHS column identifier) is baked into multiple other places - see here, here for instance.
  • Note that we can't apply the rewrites in NumericalFilterOptimizer without reducing the BETWEEN since intCol BETWEEN 5.5 AND 10.5 / intCol >= 5.5 AND intCol <= 10.5 / intCol > 5 AND intCol <= 10 are not equivalent to intCol BETWEEN 5 AND 10 (which would be intCol >=5 AND intCol <= 10).

@codecov-commenter
Copy link

codecov-commenter commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 64.04%. Comparing base (59551e4) to head (762605f).
Report is 1101 commits behind head on master.

Files with missing lines Patch % Lines
...query/optimizer/filter/BetweenFilterOptimizer.java 82.60% 1 Missing and 3 partials ⚠️
.../parsers/rewriter/PredicateComparisonRewriter.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14111      +/-   ##
============================================
+ Coverage     61.75%   64.04%   +2.29%     
- Complexity      207     1538    +1331     
============================================
  Files          2436     2597     +161     
  Lines        133233   143376   +10143     
  Branches      20636    21962    +1326     
============================================
+ Hits          82274    91832    +9558     
+ Misses        44911    44783     -128     
- Partials       6048     6761     +713     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 64.02% <80.00%> (+2.31%) ⬆️
java-21 63.92% <80.00%> (+2.29%) ⬆️
skip-bytebuffers-false 64.03% <80.00%> (+2.29%) ⬆️
skip-bytebuffers-true 63.89% <80.00%> (+36.16%) ⬆️
temurin 64.04% <80.00%> (+2.29%) ⬆️
unittests 64.04% <80.00%> (+2.29%) ⬆️
unittests1 55.72% <80.00%> (+8.83%) ⬆️
unittests2 34.50% <70.00%> (+6.77%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yashmayya yashmayya force-pushed the between-filter-optimizer branch from deff3cc to 762605f Compare September 30, 2024 10:17
@yashmayya yashmayya marked this pull request as ready for review September 30, 2024 11:17
@yashmayya yashmayya changed the title Rewrite BETWEEN to >= AND <= for SV columns Rewrite BETWEEN to >= AND <= for SV columns in the v1 query engine Sep 30, 2024
@Jackie-Jiang
Copy link
Contributor

I'm a little bit confused here. Why can't we rewrite BETWEEN within NumericalFilterOptimizer? intCol BETWEEN 5.5 AND 10.5 should be equivalent to intCol BETWEEN 6 AND 10 per the logic in NumericalFilterOptimizer if we combine intCol >= 5.5 and intCol <= 10.5.
What is the standard behavior of intCol BETWEEN 5.5 AND 10.5?

@yashmayya
Copy link
Collaborator Author

The current logic in NumericalFilterOptimizer casts the RHS literal to the LHS column type and then depending on whether the converted value is less than or greater than the original literal value, the comparison operator is changed accordingly (if the cast value is equal, no change is required to the operator). Basically we're modifying the operator to maintain the semantics of the filter predicate rather than modifying the literal. It's definitely possible to add more custom logic to convert intCol BETWEEN 5.5 AND 10.5 to intCol BETWEEN 6 AND 10 which will indeed maintain the filter predicate semantics. However, this seems more brittle and might introduce more edge cases along with not being able to reduce cases like intCol BETWEEN -2147483649 AND 2147483648 (i.e., literals that are long values outside of int range) which is easily doable when reducing the BETWEEN to >= AND <=. Although originally I wanted to do this BETWEEN rewrite to also be able to reap the benefits of the predicate comparison rewrites to support comparison between columns which is no longer the case with this move from the query rewrite phase to the filter optimizer phase. So, if you prefer to remove this rewrite altogether and only update NumericalFilterOptimizer for now, I think that should be fine too.

@Jackie-Jiang
Copy link
Contributor

I'm thinking to use the existing logic in NumericalFilterOptimizer to process the lower bound and upper bound, then assemble them back to BETWEEN, >=, <=, true or false. The main benefit of this approach is that it also works on MV columns. wdyt?

@yashmayya
Copy link
Collaborator Author

I'm thinking to use the existing logic in NumericalFilterOptimizer to process the lower bound and upper bound, then assemble them back to BETWEEN, >=, <=, true or false. The main benefit of this approach is that it also works on MV columns. wdyt?

Makes sense, I'm closing this PR since it's superseded by #14163. I'll raise a small separate PR with the tests for PredicateComparisonRewriter that I originally wrote in #14102.

@yashmayya yashmayya closed this Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants