
Dynamic evaluation of GroupBy Initial Capacity #14001

Merged: 16 commits from groupby into apache:master on Oct 31, 2024

Conversation

@praveenc7 (Contributor) commented Sep 15, 2024

Summary

For GroupBy queries, the default initial capacity of the GroupByResultHolder is set to 10K (or the query limit), which can lead to inefficient resource usage in cases where far fewer group-by keys are expected, such as in queries with highly selective filters:

select column1, sum(column2) from testTable where column1 in ("123") group by column1 limit 20000

Description

This update dynamically adjusts the initial capacity of the GroupByResultHolder based on the filter predicates for such queries. By aligning the result holder size with the filter, we aim to optimize resource allocation and improve performance for filtered group-by queries.
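The heuristic can be sketched as follows (a minimal illustration with hypothetical names, not Pinot's actual implementation): when the group-by columns are constrained by EQ/IN predicates, the product of the predicate value counts is an upper bound on the number of distinct groups, so the result holder only needs that many slots.

```java
import java.util.List;

// Minimal sketch of the capacity heuristic; class and method names are
// illustrative, not Pinot's actual API.
public class GroupByCapacitySketch {
  static final int DEFAULT_INITIAL_CAPACITY = 10_000;

  /**
   * filterValueCounts holds, per group-by column, how many values the
   * EQ/IN predicate on that column allows (e.g. IN ("1", "2") -> 2).
   * Their product is an upper bound on the number of distinct groups.
   */
  static int initialCapacity(List<Integer> filterValueCounts, int limit) {
    long product = 1;
    for (int count : filterValueCounts) {
      product *= count;
      if (product >= DEFAULT_INITIAL_CAPACITY) {
        // Bound already exceeds the default; fall back to existing sizing.
        return Math.min(DEFAULT_INITIAL_CAPACITY, limit);
      }
    }
    return (int) Math.min(product, limit);
  }

  public static void main(String[] args) {
    // column1 IN ("123") GROUP BY column1 LIMIT 20000 -> 1 slot instead of 10K
    System.out.println(initialCapacity(List.of(1), 20_000));
  }
}
```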

Testing

We ran performance tests on several production queries and saw a significant improvement in p99 latency.

  1. Performance Improvement
    We observed a 1.5x improvement in P99.9 latency on a set of queries that benefited from the optimization. Below are sample query formats tested:

SELECT column1, SUM(column2) FROM testTable WHERE column1 IN ("123") GROUP BY column1 LIMIT 20000

SELECT column1, column2, SUM(column3) FROM testTable WHERE column1 IN ("123") AND column2 = '123' GROUP BY column1, column2 LIMIT 20000

The flame graph also looks much healthier.

  2. Impact on Queries with Unmet Conditions
    For queries processed by the new logic but ultimately exiting the optimization due to unmatched conditions (e.g., insufficient GROUP BY clauses), no significant latency overhead was observed. The flame graph results showed performance similar to the existing logic:

SELECT column1, column2, SUM(column3) FROM testTable WHERE column1 IN ("123") GROUP BY column1, column2 LIMIT 20000


@codecov-commenter commented Sep 15, 2024

Codecov Report

Attention: Patch coverage is 67.16418% with 22 lines in your changes missing coverage. Please review.

Project coverage is 63.78%. Comparing base (59551e4) to head (325121b).
Report is 1256 commits behind head on master.

Files with missing lines | Patch % | Lines
...upby/NoDictionaryMultiColumnGroupKeyGenerator.java | 25.00% | 6 missing and 3 partials ⚠️
...ry/aggregation/groupby/DefaultGroupByExecutor.java | 80.00% | 1 missing and 5 partials ⚠️
...pby/NoDictionarySingleColumnGroupKeyGenerator.java | 33.33% | 3 missing and 1 partial ⚠️
...tion/groupby/DictionaryBasedGroupKeyGenerator.java | 83.33% | 1 missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14001      +/-   ##
============================================
+ Coverage     61.75%   63.78%   +2.02%     
- Complexity      207     1555    +1348     
============================================
  Files          2436     2660     +224     
  Lines        133233   145740   +12507     
  Branches      20636    22308    +1672     
============================================
+ Hits          82274    92954   +10680     
- Misses        44911    45915    +1004     
- Partials       6048     6871     +823     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.73% <67.16%> (+2.02%) ⬆️
java-21 63.67% <67.16%> (+2.04%) ⬆️
skip-bytebuffers-false 63.75% <67.16%> (+2.00%) ⬆️
skip-bytebuffers-true 63.65% <67.16%> (+35.92%) ⬆️
temurin 63.78% <67.16%> (+2.02%) ⬆️
unittests 63.77% <67.16%> (+2.02%) ⬆️
unittests1 55.47% <67.16%> (+8.58%) ⬆️
unittests2 34.18% <0.00%> (+6.45%) ⬆️


@praveenc7 praveenc7 changed the title Dynamic evaluation of GroupBy Initial Capacity [WIP]Dynamic evaluation of GroupBy Initial Capacity Sep 15, 2024
@praveenc7 praveenc7 changed the title [WIP]Dynamic evaluation of GroupBy Initial Capacity Dynamic evaluation of GroupBy Initial Capacity Sep 24, 2024
@Jackie-Jiang (Contributor):

These queries are very uncommon, and I don't know if it is worth taking the overhead for all group-by queries. Alternatively, we might just add a query option to override the initial capacity if desired

@praveenc7 (Contributor, Author) commented Sep 29, 2024

These queries are very uncommon, and I don't know if it is worth taking the overhead for all group-by queries. Alternatively, we might just add a query option to override the initial capacity if desired

@Jackie-Jiang Thank you for your input! We currently optimize this behavior specifically for GROUP BY queries with filters, and based on our initial performance testing, the overhead was comparable.

However, I see your concern and can certainly add a query option to enable this optimization if desired

@praveenc7 praveenc7 marked this pull request as ready for review September 29, 2024 22:21
@jasperjiaguo (Contributor) left a comment

Discussed offline. I think a more general problem here is when we group by on columns that have partial or no constraints in the filter clause but are themselves low cardinality. @praveenc7 could you maybe explore whether it is easy to do something like min(cartesian_product, default_min_value)? The cartesian_product part can be the product of the IN/EQ filter sizes or the column cardinalities. In that sense I feel it would be more generally applicable. In https://github.com/apache/pinot/pull/9420/files we made some of the cardinality accessible; you can maybe take a look.

@praveenc7 (Contributor, Author) commented Oct 5, 2024

Discussed offline. I think a more general problem here is when we group by on columns that have partial or no constraints in the filter clause but are themselves low cardinality. @praveenc7 could you maybe explore whether it is easy to do something like min(cartesian_product, default_min_value)? The cartesian_product part can be the product of the IN/EQ filter sizes or the column cardinalities. In that sense I feel it would be more generally applicable. In https://github.com/apache/pinot/pull/9420/files we made some of the cardinality accessible; you can maybe take a look.

@jasperjiaguo Looking more closely at the code, it looks like this condition is already handled in DictionaryBasedGroupKeyGenerator: when initializing the size in DefaultGroupByResultHolder we take min(cardinality_product, default_product), or min(cardinality_product, optimized_value) if our optimization is in place.

@jasperjiaguo (Contributor):

@praveenc7 I see, thanks for checking! I think the only (small) gap here is that when we have

select ... where A in (a1, a2, a3) group by A,B

Do we currently handle this mixture of bounded + unbounded?
Ideally, the effective cardinality estimate could be calculated as cardinality_product / card(A) * filterlen(A).
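The mixed bounded/unbounded estimate described above can be sketched like this (illustrative code assuming dictionary cardinalities are available; the types and names are hypothetical, not Pinot's actual API): each group-by column contributes its filter size when it is bounded by an EQ/IN predicate, and its dictionary cardinality otherwise.

```java
import java.util.List;
import java.util.Map;

// For "where A in (a1, a2, a3) group by A, B" this yields
// cardinality_product / card(A) * filterlen(A) = 3 * card(B).
public class EffectiveCardinalitySketch {
  static long effectiveGroupCount(List<String> groupByColumns,
                                  Map<String, Integer> dictionaryCardinality,
                                  Map<String, Integer> filterValueCount) {
    long product = 1;
    for (String column : groupByColumns) {
      // Bounded column -> number of values in its EQ/IN filter;
      // unbounded column -> its full dictionary cardinality.
      int contribution = filterValueCount.getOrDefault(
          column, dictionaryCardinality.get(column));
      product = Math.multiplyExact(product, contribution);
    }
    return product;
  }

  public static void main(String[] args) {
    // card(A) = 1000, card(B) = 50, filter A IN (a1, a2, a3)
    long estimate = effectiveGroupCount(
        List.of("A", "B"),
        Map.of("A", 1000, "B", 50),
        Map.of("A", 3));
    System.out.println(estimate); // 150 rather than 1000 * 50 = 50000
  }
}
```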

@praveenc7 praveenc7 force-pushed the groupby branch 2 times, most recently from 5191b1e to 593c86c Compare October 9, 2024 04:43
@praveenc7 (Contributor, Author) commented Oct 9, 2024

@praveenc7 I see, thanks for checking! I think the only (small) gap here is that when we have

select ... where A in (a1, a2, a3) group by A,B

Do we currently handle this mixture of bounded + unbounded? Ideally, the effective cardinality estimate could be calculated as cardinality_product / card(A) * filterlen(A).

@jasperjiaguo Agreed, this might create a tighter bound. Currently the cardinality is only available for DictionaryBasedGroupKeyGenerator, so the unbounded case would mostly be applicable to dictionary-based columns, and supporting it would need some more changes. I was thinking of doing this in a follow-up PR.

@jasperjiaguo (Contributor):

Let's think about the edge case of

select ... where (A in (a1, a2, a3) or C > c) group by A

@praveenc7 (Contributor, Author):

Let's think about the edge case of

select ... where (A in (a1, a2, a3) or C > c) group by A

@jasperjiaguo Added a condition that checks for a top-level AND, and also added mixed support for both bounded & unbounded columns.
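The top-level AND check can be illustrated with a small sketch (hypothetical types standing in for Pinot's FilterContext tree): only predicates that must hold for every matching row, i.e. a bare root predicate or a direct child of a root-level AND, may bound the group count. Anything under an OR cannot, since rows may qualify through the other branch.

```java
import java.util.ArrayList;
import java.util.List;

public class TopLevelAndSketch {
  enum Type { AND, OR, PREDICATE }

  // Minimal stand-in for Pinot's FilterContext (illustrative only).
  record Filter(Type type, String description, List<Filter> children) {
    static Filter predicate(String description) {
      return new Filter(Type.PREDICATE, description, List.of());
    }
    static Filter and(Filter... children) {
      return new Filter(Type.AND, null, List.of(children));
    }
    static Filter or(Filter... children) {
      return new Filter(Type.OR, null, List.of(children));
    }
  }

  /**
   * Collect predicates guaranteed to hold for every matching row: the root
   * itself if it is a bare predicate, or direct children of a root-level
   * AND. Predicates under an OR are skipped, since a row may match via the
   * other branch, so they cannot bound the number of groups.
   */
  static List<String> boundingPredicates(Filter root) {
    List<String> result = new ArrayList<>();
    if (root.type() == Type.PREDICATE) {
      result.add(root.description());
    } else if (root.type() == Type.AND) {
      for (Filter child : root.children()) {
        if (child.type() == Type.PREDICATE) {
          result.add(child.description());
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // where (A in (a1, a2, a3) or C > c) group by A: nothing is bounding
    Filter edgeCase = Filter.or(
        Filter.predicate("A IN (a1, a2, a3)"), Filter.predicate("C > c"));
    System.out.println(boundingPredicates(edgeCase)); // prints []
  }
}
```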

@jasperjiaguo (Contributor) left a comment

mostly lgtm, minor comments

}

Set<Predicate> predicateColumns = new HashSet<>();
if (filterContext.getType() == FilterContext.Type.AND) {
@jasperjiaguo (Contributor):

Can you add some comments to this part of the code to explain the logic for handling AND/PREDICATE/OR? Similarly for the parts where cardinalities are processed?

@praveenc7 (Contributor, Author):

Added comment

@jasperjiaguo (Contributor) left a comment

lgtm

@jasperjiaguo jasperjiaguo merged commit e23d17a into apache:master Oct 31, 2024
21 checks passed