Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ROLLUP, CUBE grouping expansion #783

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 35 additions & 0 deletions proto/substrait/algebra.proto
Original file line number Diff line number Diff line change
Expand Up @@ -353,6 +353,30 @@ message AggregateRel {
// A list of zero or more references to grouping expressions, i.e., indices
// into the `grouping_expression` list.
repeated uint32 expression_references = 2;

// An optional list of zero or more grouping id of `expression_references`
// at corresponding index. The expressions sharing the same non-zero values
// in this field formulates a composite grouping expression (i.e., those
// columns considered as one grouping expression).
// For GROUNPING_EXPANSION_UNSPECIFIED, this field is ignored.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Typo: GROUNPING_EXPANSION_UNSPECIFIED

//
// Example:
// {
// expression_references: [ 3, 1, 4, 5, 2 ],
// composite_grouping_expressions: [ 1, 1, 0, 2, 2 ],
// expansion: GROUPING_EXPANSION_CUBE
// },
//
// means CUBE((3, 1), 4, (5, 2)), thus expands to following grouping sets:
//
// (),
// (3, 1), (4), (5, 2),
// (3, 1, 4), (3, 1, 5, 2), (4, 5, 2),
// (3, 1, 4, 5, 2)
repeated uint32 composite_grouping_expressions = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to create an alternative representation rather than the double array representation. This feels a bit complicated.

As part of that, it might be best if it also includes the indiciator value explicitly rather than the more complex pattern you describe below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that this is proposal, not a demand. What are you thoughts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was never happy with this representation but just put out here to get some feedback. :-) The reason why I chose to this was because of the abundant usage of index/offset in other places. If this feels complex to you, we should make it more explicit, perhaps, use oneof and more descriptive message to encode these albeit it will introduce another level of nesting.

I don't know whether we will like it but I will draft that. Let's see how we like it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacques-n BTW, could you elaborate the indicator value in your comment? You mean the indicator value used to bundle (i'll use different word as "group" is very overloaded here) the grouping columns (i.e., the values in the composite_grouping_expressions)? Or something else?


// Optional grouping expansion method. Default: no expansion.
GroupingExpansion expansion = 4;
}

message Measure {
Expand All @@ -364,6 +388,17 @@ message AggregateRel {
// Helps to support SUM(<c>) FILTER(WHERE...) syntax without masking opportunities for optimization
Expression filter = 2;
}

enum GroupingExpansion {
// No grouping expansion. Default.
GROUPING_EXPANSION_UNSPECIFIED = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in other locations NONE is specified separately from UNSPECIFICED. Can you confirm? If so, we should match that pattern here.


// N grouping expression groups are expanded into N+1 groupings.
GROUPING_EXPANSION_ROLLUP = 1;

// N grouping expression groups are expanded into 2^N grounpings.
GROUPING_EXPANSION_CUBE = 2;
}
}

// ConsistentPartitionWindowRel provides the ability to perform calculations across sets of rows
Expand Down
6 changes: 5 additions & 1 deletion site/docs/relations/logical_relations.md
Original file line number Diff line number Diff line change
Expand Up @@ -381,10 +381,14 @@ Grouping sets can be used for finer-grained control over which records are folde

It is possible to specify multiple grouping sets in a single aggregate operation. The grouping sets behave more or less independently, with each returned record belonging to one of the grouping sets. The values for the grouping expression columns that are not part of the grouping set for a particular record will be set to null. The columns for grouping expressions that do *not* appear in *all* grouping sets will be nullable (regardless of the nullability of the type returned by the grouping expression) to accomodate the null insertion.

To further disambiguate which record belongs to which grouping set, an aggregate relation with more than one grouping set receives an extra `i32` column on the right-hand side. The value of this field will be the zero-based index of the grouping set that yielded the record.
To further disambiguate which record belongs to which grouping set, an aggregate relation with more than one grouping set receives an extra `i32` indicator column on the right-hand side. The value of this field will be the zero-based index of the grouping set that yielded the record.

If at least one grouping expression is present, the aggregation is allowed to not have any aggregate expressions. An aggregate relation is invalid if it would yield zero columns.

Optionally, a grouping set can be specified with rollup or cube expansions rather than fully spelling out the grouping sets. When either rollup or cube is specified for a grouping set, permutations of grouping expressions will be generated accordingly. Given `n` grouping expressions for a grouping set, ROLLUP expands to n + 1 grouping sets. CUBE expands to 2^`n` grouping sets (i.e., all subsets of the `n` grouping expressions). Given `i`th grouping set with ROLLUP expansion, the indicator values are assigned from `i` (i.e., an empty grouping) to `i + n + 1` (i.e., the grouping as if no expansion was done). For CUBE, the indicator values are assigned from `i` (i.e., an empty grouping) to `i + 2^n` (i.e., the grouping as if no expanson was done) where the values are assigned as `n`-bit integer where the bit is set when a grouping expression at corresponding position is being grouped.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some concrete examples in a table or similar here?


With ROLLUP and CUBE, the grouping expressions can form a group and treated as a unit of expansion. To represent such grouping of expressions within grouping set, an optional list of grouping information can be specified. For example, given grouping expressions `[3, 1, 2, 4, 5]` with the list `[1, 1, 0, 1, 1]` represents that there are actually 3 grouping expressions: `(3, 1), 2, (4, 5)`. That is, grouping expressions `3, 1` together formulate one composite grouping expression, and so does `4, 5`. `2`, specified with value `0` do not forumate such composite grouping.

### Aggregate Properties

| Property | Description | Required |
Expand Down
Loading