-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallel Distinct SimpleAggregate #4934
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #4934 +/- ##
==========================================
+ Coverage 86.55% 86.58% +0.02%
==========================================
Files 1409 1413 +4
Lines 60915 61192 +277
Branches 7492 7523 +31
==========================================
+ Hits 52727 52984 +257
- Misses 8019 8036 +17
- Partials 169 172 +3 ☔ View full report in Codecov by Sentry. |
Benchmark ResultMaster commit hash:
|
Benchmark ResultMaster commit hash:
|
83356c2
to
b1e7640
Compare
b1e7640
to
8503584
Compare
Benchmark ResultMaster commit hash:
|
Extends the same method used to parallelize the hash aggregate and distinct hash aggregate to the simple, non-grouped aggregates.
There is still a reasonable amount of work being done single-threaded since after doing the parallel build + finalize of the distinct hash table we have to then scan it to actually compute the function result and that's currently being done in the single-threaded finalize since it can't be started until the parallel finalization is complete.It would be possible to add another operator which does that in parallel, which I think could yield up to another 2x improvement to the runtime (skipping this step dropped both of the below benchmarks to ~0.9s, so this single-threaded part is clearly a significant part of the total runtime). But I think that's best left for a different PR.
(Edit: this can actually be done in the current operator fairly easily, we just need to store a state for each partition and combine them in finalize instead).
MATCH (h:hits) RETURN COUNT(DISTINCT h.UserID);
MATCH (h:hits) RETURN COUNT(DISTINCT h.SearchPhrase);
Probably needs a bit more cleanup so I'm opening as a draft for now.(Should be good now; I updated the benchmarks twice between what was mentioned above as well as a small optimization to make it use fixed-sized hash tables prior to partitioning like in the hash aggregate).