-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Timestamp in MSE #14690
Timestamp in MSE #14690
Conversation
… into timestamp-v2
cc @bziobrowski |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #14690 +/- ##
============================================
+ Coverage 61.75% 63.88% +2.12%
- Complexity 207 1607 +1400
============================================
Files 2436 2703 +267
Lines 133233 150771 +17538
Branches 20636 23296 +2660
============================================
+ Hits 82274 96314 +14040
- Misses 44911 47259 +2348
- Partials 6048 7198 +1150
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
+ " AggregateFiltered(aggregations=[[sum('2'), count(*)]])\n" | ||
+ " Transform(expressions=[['2']])\n" | ||
+ " Project(columns=[[]])\n" | ||
+ " DocIdSet(maxDocs=[120000])\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to go over 120 chars limit here with //@Formatter:off ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, but I didn't try
+ " FilterRangeIndex(predicate=[$ArrTime$SECOND > '1'], " | ||
+ "indexLookUp=[range_index], operator=[RANGE])\n"); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth checking queries with :
- GROUP BY datetrunc('SECOND',ArrTime)
- HAVING datetrunc('SECOND',ArrTime) > 1
- HAVING sum(case when datetrunc('SECOND',ArrTime) > 1 then 2 else 0 end) > 1
- UNION-ed second query ?
- WHERE EXISTS (select * from tab t2 where datetrunc('SECOND',ArrTime) > 1 )
- WHERE EXISTS (select * from tab t2 where t1.x = datetrunc('SECOND',t2.ArrTime) > 1 ) )
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GROUP BY datetrunc('SECOND',ArrTime)
Added
HAVING datetrunc('SECOND',ArrTime) > 1
HAVING sum(case when datetrunc('SECOND',ArrTime) > 1 then 2 else 0 end) > 1
In these cases the optimization is not supported in SSE and it isn't easy to support it. The main problem here is that InstancePlanMakerImplV2.rewriteQueryContextWithHints
rewrites the query using a dirty trick: it modifies the lists returned by queryContext.getSelectExpressions()
, queryContext.getGroupByExpressions()
, etc. In the queryContext.getAggregationFunctions()
case, it returns a list of AggregationFunction
which have their own AggregationFunction.getInputExpressions()
. But it is not guaranteed that modifying the list returned by that method actually modifies the AggregationFunction
. In fact in cases most cases it doesn't.
UNION-ed second query ?
Unions don't have expressions right? In case one of the unioned-queries have expressions, it should be treated by the other cases using recursion.
This tests class is testing more than this feature. It is testing that the substitution is being done, which means that the overrideHint is added (what this PR changes) and that the override is actually applied (what is failing in some cases). We could create a test that just verifies that the overrideHint is added, but that may be a complex test to add, so I think the tests we have right now are good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WHERE EXISTS (select * from tab t2 where datetrunc('SECOND',ArrTime) > 1 )
WHERE EXISTS (select * from tab t2 where t1.x = datetrunc('SECOND',t2.ArrTime) > 1 ) )
This is tricky. Would we want to test all possible SQL constructors? In theory... maybe... But I don't think it is worth it. We know we are applying the hint with a visitor that walks through the relational nodes. Therefore we know we will visit the sub-query filters. Which means that it is the same case as a select in the main query.
Obviously that is a grey area. A future refactor may not use a visitor and therefore may end up not visiting the inner query. But I don't think it is worth to test so deep. Probably, as said before, a lower level test (ie a unit test on the visitor) should make it more explicit why that case isn't worth to be tested, but here I'm using an integration test to test the overrideHint logic, which isn't great but is easier than testing ServerPlanRequestVisitor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked HAVING on a few queries locally and think it's fine because HAVING expression is applied after aggregation by matching expression to group by output (so no additional aggregates are computed if it matches group by element).
Regarding UNION and WHERE EXISTS - I was thinking of checking if rule applies to all subqueries. As you mentioned - it's likely better to test on a lower level, for all rules/types .
@@ -31,6 +31,8 @@ | |||
*/ | |||
public interface RexExpression { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these methods used anywhere ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact they are not. Having the visitor for the future may be useful, but it is also very simple to add again, so I'm removing it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be useful to also have an actual integration test to verify the query execution path (and not just the explain plan output).
...rc/main/java/org/apache/pinot/broker/requesthandler/BaseSingleStageBrokerRequestHandler.java
Outdated
Show resolved
Hide resolved
pinot-common/src/main/java/org/apache/pinot/common/utils/request/RequestUtils.java
Outdated
Show resolved
Hide resolved
...n-test-base/src/test/java/org/apache/pinot/integration/tests/BaseClusterIntegrationTest.java
Outdated
Show resolved
Hide resolved
List<Pair<AggregationFunction, FilterContext>> filtAggrFuns = queryContext.getFilteredAggregationFunctions(); | ||
if (filtAggrFuns != null) { | ||
for (Pair<AggregationFunction, FilterContext> filteredAggregationFunction : filtAggrFuns) { | ||
FilterContext right = filteredAggregationFunction.getRight(); | ||
if (right != null) { | ||
Predicate predicate = right.getPredicate(); | ||
predicate.setLhs(overrideWithExpressionHints(predicate.getLhs(), indexSegment, expressionOverrideHints)); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add an integration test that verifies this filtered aggregation on timestamp column with timestamp index execution path on both the query engines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test timestampIndexSubstitutedInAggregateFilter already tested this code in MME. I've just added the same test class for SSE.
private void applyTimestampIndex(Expression expression, PinotQuery pinotQuery) { | ||
RequestUtils.applyTimestampIndex(expression, pinotQuery); | ||
Function functionCall = expression.getFunctionCall(); | ||
if (expression.isSetFunctionCall()) { | ||
for (Expression operand : functionCall.getOperands()) { | ||
applyTimestampIndex(operand, pinotQuery); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here we're adding the expression override hints regardless of the actual timestamp index configurations and relying on the server to only apply the correct ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is a bit strange. That is the way V1 does it. I don't completely understand why we do that on brokers instead of doing that on servers depending on whether we have the column + index or not, the same way we do that with other indexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is a bit strange. That is the way V1 does it
In v1, the broker only adds the expression override hints if there is an appropriately configured timestamp index right?
Lines 950 to 953 in 397fd98
if (timestampIndexColumns.contains(timeColumnWithGranularity)) { | |
pinotQuery.putToExpressionOverrideHints(expression, | |
RequestUtils.getIdentifierExpression(timeColumnWithGranularity)); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answering my own question, in v2 the expression override won't be applied on the server side due to the check here -
pinot/pinot-core/src/main/java/org/apache/pinot/core/plan/maker/InstancePlanMakerImplV2.java
Lines 395 to 396 in 0e915ed
if (overrideExpression != null && overrideExpression.getIdentifier() != null && indexSegment.getColumnNames() | |
.contains(overrideExpression.getIdentifier())) { |
I suppose this isn't ideal since in v1 we're only adding the expression override hint to the query when there is an appropriate timestamp index configured, but I've verified that things work as expected regardless due to the server side checks.
...tion-tests/src/test/java/org/apache/pinot/integration/tests/ExplainIntegrationTestTrait.java
Show resolved
Hide resolved
I think all discussions have been resolved or are waiting for further comments. Yash, feel free to take a second look |
It looks like the new sse test if failing during setup. I'm having troubles to understand why given the setup process is identical in mse. |
Yeah that's quite odd - the new tests pass locally for me. |
I was able to reproduce the error in the test. It looks like some other test (probably the MSE one) id modifying the config and leaving it in an incorrect state for the SSE test |
Change the tests to return a mutable list instead.
I was correct. MSE and SSE tests were interfering with each other. The cause was the change I made to make default columns mutable, which was incorrect. First MSE was adding the new range indexes and then SSE wanted to add them again. Instead of changing a static reference, which is never a good idea, I'm changing the code so all tests return a new ArrayList for these columns, so each test can modify them independently. Changing the columns is not great, but that is how these features have been implemented and isn't easy to change that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Checkstyle is currently failing due to an unused import in BaseClusterIntegrationTest
though.
This PR adds the ability to use timestamp indexes in MSE. As we know, timestamp indexes are not actual indexes but syntactic sugar. When a timestamp index is created, a set of granularities have to be defined. Then when segments are created, a new column is created for each granularity. The content of each column is equal to
datetrunc(GRANULARITY, originalColumn)
.Then in SSE, the broker enriches queries adding possible overrides on the query. This means that for each usage of
datetrunc(GRANULARITY, originalColumn)
, the broker suggest to transform the call to the generated column$originalColumn$GRANULARITY
. Whether that suggestion is applied or not is defined by the server.In MSE these suggested rewrites were not populated, which means that servers never applied the rewrite.
Specifically, queries like:
Can now be optimized.
While creating some tests that verify this PR I've found that cases like
were not optimized. Therefore this PR modifies
InstancePlanMakerImplV2
to apply the rewrite in that case as well.