Dynamic evaluation of GroupBy Initial Capacity #14001
Conversation
Codecov Report — Attention: Patch coverage is

```
@@            Coverage Diff             @@
##             master   #14001    +/-   ##
============================================
+ Coverage     61.75%   63.78%    +2.02%
- Complexity      207     1555     +1348
============================================
  Files          2436     2660      +224
  Lines        133233   145740    +12507
  Branches      20636    22308     +1672
============================================
+ Hits          82274    92954    +10680
- Misses        44911    45915     +1004
- Partials       6048     6871      +823
```
...re/src/main/java/org/apache/pinot/core/query/aggregation/groupby/DefaultGroupByExecutor.java (outdated review comments, resolved)
These queries are very uncommon, and I don't know if it is worth taking the overhead for all group-by queries. Alternatively, we might just add a query option to override the initial capacity if desired.
@Jackie-Jiang Thank you for your input! We currently optimize this behavior specifically for … However, I see your concern and can certainly add a query option to enable this optimization if desired.
Discussed offline. I think a more general problem here is when we group by on columns that have partial or no constraints (in the filter clause), but are themselves low-cardinality. @praveenc7 could you maybe try to explore whether it is easy to do something like min(cartesian_product, default_min_value)? The cartesian_product part can be the product of the in/eq filter sizes or the column cardinalities. In that sense I feel it would probably be more generally applicable. In https://github.com/apache/pinot/pull/9420/files we made some of the cardinality accessible; you can maybe take a look.
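The min(cartesian_product, default_min_value) idea above can be sketched roughly as follows. This is an illustrative standalone snippet, not Pinot's actual API: each entry in `perColumnBounds` is assumed to be either the IN/EQ filter value count for a constrained group-by column or the dictionary cardinality for an unconstrained one.

```java
// Hypothetical sketch of the reviewer's suggestion: clamp the initial
// result-holder capacity at min(cartesian product of per-column bounds, default).
final class GroupByCapacityEstimator {
  static final int DEFAULT_INITIAL_CAPACITY = 10_000;

  static int estimateInitialCapacity(int[] perColumnBounds) {
    long product = 1L;
    for (int bound : perColumnBounds) {
      product *= bound;
      if (product >= DEFAULT_INITIAL_CAPACITY) {
        // Clamp early; this also avoids long overflow on wide group-bys.
        return DEFAULT_INITIAL_CAPACITY;
      }
    }
    return (int) product;
  }
}
```

Clamping inside the loop keeps the product from ever exceeding the default, so a query like `WHERE col1 IN (a, b, c) GROUP BY col1, col2` with a small `col2` cardinality gets a tight allocation, while anything large falls back to today's behavior.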
@jasperjiaguo Looking more closely at the code, it looks like this condition is already handled in DictionaryBasedGroupKeyGenerator, and when initializing the size we do the …
@praveenc7 I see, thanks for checking! I think the only (small) gap here is when we have … Do we currently handle this mixture of bounded + unbounded?
Force-pushed from 5191b1e to 593c86c (compare)
@jasperjiaguo Agreed, this might create a tighter bound. Currently the cardinality is available for DictionaryBasedGroupKey, so the unbounded case would mostly be applicable to dictionary-based columns; supporting this might need some more changes. I was thinking of doing it in a follow-up PR.
Let's think about the edge case of …
@jasperjiaguo Added a condition that checks for a top-level AND, and also added mixed support for both bounded & unbounded columns.
mostly lgtm, minor comments
```java
Set<Predicate> predicateColumns = new HashSet<>();
if (filterContext.getType() == FilterContext.Type.AND) {
```
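The AND/PREDICATE/OR handling discussed here can be sketched as a small recursive walk. The types below (`Filter`, `FilterType`) are hypothetical stand-ins for Pinot's FilterContext, not the real classes; the point is only the shape of the logic: AND children can all contribute bounds, an EQ/IN predicate pins a column to a known number of values, and an OR forces a conservative bail-out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for Pinot's FilterContext; illustrative only.
enum FilterType { AND, OR, PREDICATE }

class Filter {
  final FilterType type;
  final List<Filter> children; // for AND/OR
  final String column;         // for PREDICATE
  final int valueCount;        // EQ -> 1, IN -> number of listed values

  Filter(FilterType type, List<Filter> children, String column, int valueCount) {
    this.type = type;
    this.children = children;
    this.column = column;
    this.valueCount = valueCount;
  }
}

class FilterBoundsCollector {
  // Returns per-column bounds, or null when the filter gives no usable bound
  // (an OR can match rows outside any single branch's listed values).
  static Map<String, Integer> collectBounds(Filter filter) {
    Map<String, Integer> bounds = new HashMap<>();
    return collect(filter, bounds) ? bounds : null;
  }

  private static boolean collect(Filter filter, Map<String, Integer> bounds) {
    switch (filter.type) {
      case AND:
        // Each AND child only restricts the result further, so bounds from
        // every child can be kept together.
        for (Filter child : filter.children) {
          if (!collect(child, bounds)) {
            return false;
          }
        }
        return true;
      case PREDICATE:
        // An EQ/IN predicate pins the column to at most valueCount distinct
        // values; keep the tightest bound seen per column.
        bounds.merge(filter.column, filter.valueCount, Math::min);
        return true;
      default:
        // OR (or anything unrecognized): bail out conservatively.
        return false;
    }
  }
}
```

A real implementation would also skip predicate types other than EQ/IN rather than counting them as bounded; that is elided here for brevity.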
Can you add some comments to this part of the code to explain the logic that handles AND/PREDICATE/OR? Similarly for the parts where cardinalities are processed?
Added comment
...re/src/main/java/org/apache/pinot/core/query/aggregation/groupby/DefaultGroupByExecutor.java (outdated review comments, resolved)
...va/org/apache/pinot/core/query/aggregation/groupby/DictionaryBasedGroupKeyGeneratorTest.java (outdated review comments, resolved)
lgtm
Summary
For GroupBy queries, the default size of the GroupByResultHolder is set to 10K (or the limit length), which can lead to inefficient resource usage in cases where fewer group-by keys are expected, such as queries with highly selective filters:

```sql
select column1, sum(column2) from testTable where column1 in ("123") group by column1 limit 20000
```
Description
This update dynamically adjusts the initial capacity of the GroupByResultHolder based on the filter predicates for such queries. By aligning the result holder size with the filter, we aim to optimize resource allocation and improve performance for filtered group-by queries.
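Putting the pieces of this description together, the sizing decision can be sketched end to end. The names below (`ResultHolderSizing`, `initialCapacity`) are hypothetical, not Pinot's actual API: for each group-by column, take the EQ/IN filter bound when one exists, otherwise fall back to the dictionary cardinality (the "mixed bounded + unbounded" case discussed above), and clamp the product to the existing 10K default.

```java
import java.util.Map;

// Illustrative sketch of the capacity decision described in this PR;
// class and method names are hypothetical.
final class ResultHolderSizing {
  static final int DEFAULT_CAPACITY = 10_000;

  static int initialCapacity(String[] groupByColumns,
                             Map<String, Integer> filterBounds,
                             Map<String, Integer> dictionaryCardinality) {
    long product = 1L;
    for (String column : groupByColumns) {
      // Bounded column: limited by the EQ/IN predicate's value count.
      Integer bound = filterBounds.get(column);
      if (bound == null) {
        // Unbounded column: fall back to dictionary cardinality if known.
        bound = dictionaryCardinality.get(column);
      }
      if (bound == null) {
        // No information at all: keep today's default behavior.
        return DEFAULT_CAPACITY;
      }
      product *= bound;
      if (product >= DEFAULT_CAPACITY) {
        return DEFAULT_CAPACITY;
      }
    }
    return (int) product;
  }
}
```

For example, `WHERE col1 = '123' GROUP BY col1, col2` with a dictionary cardinality of 20 for col2 would allocate a holder of size 20 instead of 10K.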
Testing
We ran performance testing on some production queries and observed a 1.5x improvement in P99.9 latency on the set of queries that benefited from the optimization. Below are sample query formats tested:
```sql
SELECT column1, SUM(column2) FROM testTable WHERE column1 IN ("123") GROUP BY column1 LIMIT 20000
SELECT column1, column2, SUM(column3) FROM testTable WHERE column1 IN ("123") AND column2 = '123' GROUP BY column1, column2 LIMIT 20000
```
The flame graph also looks much more normal.
For queries processed by the new logic but ultimately exiting the optimization due to unmatched conditions (e.g., insufficient GROUP BY clauses), no significant latency overhead was observed. The flame graph results showed performance similar to the existing logic:
```sql
SELECT column1, column2, SUM(column3) FROM testTable WHERE column1 IN ("123") GROUP BY column1, column2 LIMIT 20000
```