-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BETWEEN support to NumericalFilterOptimizer #14163
Add BETWEEN support to NumericalFilterOptimizer #14163
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #14163 +/- ##
============================================
+ Coverage 61.75% 63.84% +2.09%
- Complexity 207 1535 +1328
============================================
Files 2436 2622 +186
Lines 133233 144459 +11226
Branches 20636 22127 +1491
============================================
+ Hits 82274 92233 +9959
- Misses 44911 45414 +503
- Partials 6048 6812 +764
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
e77f503
to
0975b60
Compare
I've removed the logic to prune |
switch (lower.getLiteral().getSetField()) { | ||
case LONG_VALUE: { | ||
long actual = lower.getLiteral().getLongValue(); | ||
// Other data types can be converted on the server side. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this is a reduced version of the existing rewriteRangeExpression()
. To keep the behavior consistent, are we able to extract out the logic of rewriting lower bound and upper bound, then use the common logic for both rewriteRangeExpression()
and rewriteBetweenExpression()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider consolidating the two but the major difference is that in rewriteRangeExpression
we're rewriting the operators too (and avoiding manually modifying the literal value other than the cast) which we don't want to do for BETWEEN
(as discussed in #14111). The other alternative would be to change rewriteRangeExpression
to avoid rewriting the operators but I thought it'd be better not to change the behavior of the existing optimizer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part I still don't follow. Do you mean we should not rewrite BETWEEN
the same way as other range filters, or is it too complicated? As long as we computed lower and upper bound, we should be able to assemble it back to a BETWEEN
.
It is also fine to do it separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we should not rewrite BETWEEN the same way as other range filters
Yeah, basically this. For instance, taking intCol >= 2.5
as an example. 2.5
is cast to 2
(int), and then the >=
is rewritten to >
because actual - converted > 0
resulting in intCol > 2
. For BETWEEN
, we want to instead rewrite intCol BETWEEN 2.5 AND y
to intCol BETWEEN 3 AND y
. We could change the logic for regular range filter to rewrite intCol >= 2.5
to intCol >= 3
instead to match the BETWEEN
rewrite logic - is that what you're suggesting? There are some other differences too though. For instance, floatCol < longLiteral
can be rewritten to floatCol <= castedFloatLiteral
depending on the comparison between longLiteral
and castedFloatLiteral
. We can't do the same for BETWEEN
though, and we simply skip any conversion in these cases, allowing the server to do the cast. Given these differences, it seemed better overall to keep these rewrites separate, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed explanation. Ideally we want the behavior to be consistent across all range filters. Given the difficulties and the scope of this PR, let's add a TODO to revisit in the future
@@ -101,6 +111,15 @@ Expression optimizeChild(Expression filterExpression, @Nullable Schema schema) { | |||
} | |||
} | |||
} | |||
|
|||
if (kind == FilterKind.BETWEEN) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not introduced in this PR, but we have multiple levels of switch
and if
on kind
, which is quite hard to read. We didn't check whether lower/upper bound is numeric literal here, which is different with other filter kinds. Is this intentional?
Actually we probably shouldn't check that and also allow rewrite on string range. E.g. intCol = '1.0'
should be valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We didn't check whether lower/upper bound is numeric literal here, which is different with other filter kinds. Is this intentional?
The check is in the rewriteBetweenExpression
method itself, since we want to do the rewrite even if one of lower bound / upper bound is a numeric literal ideally (and I didn't want to add this into the common logic for other filter kinds).
Actually we probably shouldn't check that and also allow rewrite on string range. E.g. intCol = '1.0' should be valid
Postgres doesn't allow this btw, do we still want to add support for this?
Not introduced in this PR, but we have multiple levels of switch and if on kind, which is quite hard to read
Agreed, I've pushed a commit trying to clean this up - let me know if this looks good or if you had something else in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise
@@ -1005,6 +1005,16 @@ public QueryFragment generatePredicate(String columnName, boolean useMultistageE | |||
List<String> columnValues = _columnToValueList.get(columnName); | |||
String leftValue = pickRandom(columnValues); | |||
String rightValue = pickRandom(columnValues); | |||
|
|||
if (_singleValueNumericalColumnNames.contains(columnName)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you add this in order for the test to pass? We probably want to test scenarios when lower is larger than higher (always false scenario)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd added it before this commit that removed the BETWEEN
filter pruning for lower > higher due to this issue - #14163 (comment). I've removed this change too now.
switch (lower.getLiteral().getSetField()) { | ||
case LONG_VALUE: { | ||
long actual = lower.getLiteral().getLongValue(); | ||
// Other data types can be converted on the server side. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part I still don't follow. Do you mean we should not rewrite BETWEEN
the same way as other range filters, or is it too complicated? As long as we computed lower and upper bound, we should be able to assemble it back to a BETWEEN
.
It is also fine to do it separately
case BETWEEN: { | ||
// Verify that value is a numeric column before rewriting. | ||
List<Expression> operands = function.getOperands(); | ||
Expression value = operands.get(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) Suggest naming it lhs
. value
is a little bit misleading here as it is usually a column or a function.
Currently for other filter kinds we check whether rhs
is numeric but not here. That check itself seems unnecessary, and we can consider consolidating the logic by only checking if lhs
is numeric
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently for other filter kinds we check whether rhs is numeric but not here. That check itself seems unnecessary, and we can consider consolidating the logic by only checking if lhs is numeric
Good call, it looks a lot cleaner now. I've just retained the rhs
literal check since the rewrite methods assume that rhs
is a literal and then it handles all the data types appropriately anyway (although it should already be guaranteed that it is a literal due to the compile time functions invoker and the predicate comparison query rewriter that are run before these optimizers).
Please take a look at the failed tests |
…als has correct ordering since returned result table will be empty otherwise
They all look unrelated:
Triggering a rerun. |
f05689d
to
454fa24
Compare
WHERE intCol BETWEEN 0.0 AND 100.0
results in an error on the v1 query engine. However, a filter predicate likeWHERE intCol >= 0.0 AND intCol <= 100.0
(semantically equivalent to theBETWEEN
filter predicate) works fine.NumericalFilterOptimizer
don't apply to theBETWEEN
filter predicate. In the v2 query engine, Calcite implicitly applies the appropriate casts to the column for such heterogeneous expressions, so they work fine (although there is a big performance penalty and indexes won't be usable). In the v1 engine, to avoid such performance penalties, the filter predicates are rewritten to use the appropriate literal types with conversion.BETWEEN
filter predicate as well.BETWEEN
filter predicate to the equivalent >= AND <= filter predicates (similar to what Calcite does in the v2 engine), however we ran into issues with MV columns - Rewrite BETWEEN to >= AND <= #14102, Rewrite BETWEEN to >= AND <= for SV columns in the v1 query engine #14111.