Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BETWEEN support to NumericalFilterOptimizer #14163

Merged

Conversation

yashmayya
Copy link
Collaborator

  • Currently, a filter predicate like WHERE intCol BETWEEN 0.0 AND 100.0 results in an error on the v1 query engine. However, a filter predicate like WHERE intCol >= 0.0 AND intCol <= 100.0 (semantically equivalent to the BETWEEN filter predicate) works fine.
  • The reason for this is that the optimizations in NumericalFilterOptimizer don't apply to the BETWEEN filter predicate. In the v2 query engine, Calcite implicitly applies the appropriate casts to the column for such heterogeneous expressions, so they work fine (although there is a big performance penalty and indexes won't be usable). In the v1 engine, to avoid such performance penalties, the filter predicates are rewritten to use the appropriate literal types with conversion.
  • This patch applies the optimization to the BETWEEN filter predicate as well.
  • Earlier attempts to solve this issue involved rewriting the BETWEEN filter predicate to the equivalent >= AND <= filter predicates (similar to what Calcite does in the v2 engine), however we ran into issues with MV columns - Rewrite BETWEEN to >= AND <= #14102, Rewrite BETWEEN to >= AND <= for SV columns in the v1 query engine #14111.

@codecov-commenter
Copy link

codecov-commenter commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 83.33333% with 13 lines in your changes missing coverage. Please review.

Project coverage is 63.84%. Comparing base (59551e4) to head (454fa24).
Report is 1170 commits behind head on master.

Files with missing lines Patch % Lines
...ery/optimizer/filter/NumericalFilterOptimizer.java 83.33% 3 Missing and 10 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14163      +/-   ##
============================================
+ Coverage     61.75%   63.84%   +2.09%     
- Complexity      207     1535    +1328     
============================================
  Files          2436     2622     +186     
  Lines        133233   144459   +11226     
  Branches      20636    22127    +1491     
============================================
+ Hits          82274    92233    +9959     
- Misses        44911    45414     +503     
- Partials       6048     6812     +764     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.78% <83.33%> (+2.07%) ⬆️
java-21 63.73% <83.33%> (+2.11%) ⬆️
skip-bytebuffers-false 63.79% <83.33%> (+2.05%) ⬆️
skip-bytebuffers-true 63.71% <83.33%> (+35.98%) ⬆️
temurin 63.84% <83.33%> (+2.09%) ⬆️
unittests 63.84% <83.33%> (+2.09%) ⬆️
unittests1 55.46% <83.33%> (+8.57%) ⬆️
unittests2 34.37% <19.23%> (+6.64%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yashmayya yashmayya force-pushed the between-numerical-filter-optimizer branch from e77f503 to 0975b60 Compare October 4, 2024 12:27
@yashmayya
Copy link
Collaborator Author

I've removed the logic to prune BETWEEN filter predicates where the lower bound is larger than the upper bound since a number of tests cases that have been generated using the query generator have such BETWEEN filter predicates and currently Pinot returns an empty response in such cases, without even a result table (see #13057). This breaks the testing logic; we can add this filter pruning logic back once #13831 (or something similar) is merged.

switch (lower.getLiteral().getSetField()) {
case LONG_VALUE: {
long actual = lower.getLiteral().getLongValue();
// Other data types can be converted on the server side.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is a reduced version of the existing rewriteRangeExpression(). To keep the behavior consistent, are we able to extract out the logic of rewriting lower bound and upper bound, then use the common logic for both rewriteRangeExpression() and rewriteBetweenExpression()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider consolidating the two but the major difference is that in rewriteRangeExpression we're rewriting the operators too (and avoiding manually modifying the literal value other than the cast) which we don't want to do for BETWEEN (as discussed in #14111). The other alternative would be to change rewriteRangeExpression to avoid rewriting the operators but I thought it'd be better not to change the behavior of the existing optimizer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part I still don't follow. Do you mean we should not rewrite BETWEEN the same way as other range filters, or is it too complicated? As long as we computed lower and upper bound, we should be able to assemble it back to a BETWEEN.
It is also fine to do it separately

Copy link
Collaborator Author

@yashmayya yashmayya Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we should not rewrite BETWEEN the same way as other range filters

Yeah, basically this. For instance, taking intCol >= 2.5 as an example. 2.5 is cast to 2 (int), and then the >= is rewritten to > because actual - converted > 0 resulting in intCol > 2. For BETWEEN, we want to instead rewrite intCol BETWEEN 2.5 AND y to intCol BETWEEN 3 AND y. We could change the logic for regular range filter to rewrite intCol >= 2.5 to intCol >= 3 instead to match the BETWEEN rewrite logic - is that what you're suggesting? There are some other differences too though. For instance, floatCol < longLiteral can be rewritten to floatCol <= castedFloatLiteral depending on the comparison between longLiteral and castedFloatLiteral. We can't do the same for BETWEEN though, and we simply skip any conversion in these cases, allowing the server to do the cast. Given these differences, it seemed better overall to keep these rewrites separate, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation. Ideally we want the behavior to be consistent across all range filters. Given the difficulties and the scope of this PR, let's add a TODO to revisit in the future

@@ -101,6 +111,15 @@ Expression optimizeChild(Expression filterExpression, @Nullable Schema schema) {
}
}
}

if (kind == FilterKind.BETWEEN) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not introduced in this PR, but we have multiple levels of switch and if on kind, which is quite hard to read. We didn't check whether lower/upper bound is numeric literal here, which is different with other filter kinds. Is this intentional?

Actually we probably shouldn't check that and also allow rewrite on string range. E.g. intCol = '1.0' should be valid

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We didn't check whether lower/upper bound is numeric literal here, which is different with other filter kinds. Is this intentional?

The check is in the rewriteBetweenExpression method itself, since we want to do the rewrite even if one of lower bound / upper bound is a numeric literal ideally (and I didn't want to add this into the common logic for other filter kinds).

Actually we probably shouldn't check that and also allow rewrite on string range. E.g. intCol = '1.0' should be valid

Postgres doesn't allow this btw, do we still want to add support for this?

Not introduced in this PR, but we have multiple levels of switch and if on kind, which is quite hard to read

Agreed, I've pushed a commit trying to clean this up - let me know if this looks good or if you had something else in mind.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

@@ -1005,6 +1005,16 @@ public QueryFragment generatePredicate(String columnName, boolean useMultistageE
List<String> columnValues = _columnToValueList.get(columnName);
String leftValue = pickRandom(columnValues);
String rightValue = pickRandom(columnValues);

if (_singleValueNumericalColumnNames.contains(columnName)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you add this in order for the test to pass? We probably want to test scenarios when lower is larger than higher (always false scenario)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd added it before this commit that removed the BETWEEN filter pruning for lower > higher due to this issue - #14163 (comment). I've removed this change too now.

switch (lower.getLiteral().getSetField()) {
case LONG_VALUE: {
long actual = lower.getLiteral().getLongValue();
// Other data types can be converted on the server side.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part I still don't follow. Do you mean we should not rewrite BETWEEN the same way as other range filters, or is it too complicated? As long as we computed lower and upper bound, we should be able to assemble it back to a BETWEEN.
It is also fine to do it separately

case BETWEEN: {
// Verify that value is a numeric column before rewriting.
List<Expression> operands = function.getOperands();
Expression value = operands.get(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) Suggest naming it lhs. value is a little bit misleading here as it is usually a column or a function.

Currently for other filter kinds we check whether rhs is numeric but not here. That check itself seems unnecessary, and we can consider consolidating the logic by only checking if lhs is numeric

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently for other filter kinds we check whether rhs is numeric but not here. That check itself seems unnecessary, and we can consider consolidating the logic by only checking if lhs is numeric

Good call, it looks a lot cleaner now. I've just retained the rhs literal check since the rewrite methods assume that rhs is a literal and then it handles all the data types appropriately anyway (although it should already be guaranteed that it is a literal due to the compile time functions invoker and the predicate comparison query rewriter that are run before these optimizers).

@Jackie-Jiang
Copy link
Contributor

Please take a look at the failed tests

@yashmayya
Copy link
Collaborator Author

Please take a look at the failed tests

They all look unrelated:

Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.15.1:install-node-and-npm (install node and npm) on project pinot-controller: Could not download Node.js: Could not download https://nodejs.org/dist/v16.15.0/node-v16.15.0-linux-x64.tar.gz: Premature end of Content-Length delimited message body (expected: 32,989,472; received: 3,874,816)
Error:  Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.13.0:compile (default-compile) on project pinot-common: Compilation failure: Compilation failure: 
Error:  /home/runner/work/pinot/pinot/pinot-common/src/main/java/org/apache/pinot/sql/parsers/CalciteSqlParser.java:[69,42] error: cannot find symbol
Error:    symbol:   class SqlParserImpl
Error:    location: package org.apache.pinot.sql.parsers.parser
Error:  /home/runner/work/pinot/pinot/pinot-common/src/main/java/org/apache/pinot/sql/parsers/CalciteSqlParser.java:[413,9] error: cannot find symbol
Error:    symbol:   class SqlParserImpl
Error:    location: class CalciteSqlParser
Error:  Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.3.0:test (default-test) on project pinot-flink-connector: There are test failures.

Triggering a rerun.

@yashmayya yashmayya force-pushed the between-numerical-filter-optimizer branch from f05689d to 454fa24 Compare October 10, 2024 05:22
@Jackie-Jiang Jackie-Jiang merged commit 782e769 into apache:master Oct 11, 2024
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants