feat: streamed execution in MERGE #3145

Draft
ion-elgreco wants to merge 8 commits into base: main


ion-elgreco
Collaborator

@ion-elgreco ion-elgreco commented Jan 19, 2025

Description

This adds support for streamed execution in MERGE, with two caveats:

  • If CDF is enabled, we have to materialize the source, since a stream can only be consumed once. I could make it do this only in streaming mode, but I think it makes sense in general.
  • Initially, MERGE derives statistics about the source to prune the target more effectively and allow for more concurrent writers. The problem is that deriving statistics about the source means consuming it, and caching it that early would defeat the purpose of streamed execution. So I've disabled this part of the early filter construction when streaming mode is on.
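
To make the two caveats concrete, here is a minimal Python sketch that uses generators as a stand-in for a one-shot stream. It is not the code in this PR: `source_stream`, `prepare_source`, and `early_filter_bounds` are hypothetical names, and the min/max key range is only an assumed example of the kind of source statistics used for pruning.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[int, str]


def source_stream() -> Iterator[Row]:
    """Hypothetical one-shot source: a generator can be iterated only once."""
    yield from [(1, "a"), (5, "b"), (9, "c")]


def prepare_source(source: Iterable[Row], cdf_enabled: bool) -> Iterable[Row]:
    """Materialize the source only when it must be read more than once (CDF)."""
    return list(source) if cdf_enabled else source


def early_filter_bounds(
    source: Iterable[Row], streaming: bool
) -> Optional[Tuple[int, int]]:
    """Return (min_key, max_key) for pruning, or None in streaming mode.

    Computing the bounds consumes the source, so the step is skipped
    when streaming (assumed behaviour, mirroring the caveat above).
    """
    if streaming:
        return None
    keys = [key for key, _ in source]
    return (min(keys), max(keys)) if keys else None


# A one-shot stream yields nothing on a second pass.
stream = source_stream()
first_pass = list(stream)
second_pass = list(stream)

# With CDF enabled, the materialized source can be read twice.
materialized = prepare_source(source_stream(), cdf_enabled=True)
merge_pass = list(materialized)
cdf_pass = list(materialized)

# In streaming mode the statistics step is skipped, so the stream stays intact.
untouched = source_stream()
bounds_streaming = early_filter_bounds(untouched, streaming=True)
rows_after_skip = list(untouched)

# Outside streaming mode, the bounds come from an already materialized source.
bounds_eager = early_filter_bounds(list(source_stream()), streaming=False)
```

The trade-off is visible in the last two cases: skipping the statistics keeps the stream unread for the merge itself, at the cost of losing the pruning bounds.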

In the future we can look into the work of influxdb or lancedb. The lancedb work is especially interesting since it allows replaying the same stream, which would be less stressful on memory when CDF is enabled in MERGE. I think this should work well with this PR: #3142
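
As a rough picture of what a replayable stream buys us, the sketch below keeps a factory that re-creates the stream instead of caching its rows. This only shows the general idea; it makes no claim about how lancedb implements replay, and `ReplayableSource` and `read_source` are hypothetical names.

```python
from typing import Callable, Iterator


class ReplayableSource:
    """Hypothetical wrapper that stores a stream factory, not cached rows."""

    def __init__(self, factory: Callable[[], Iterator[int]]) -> None:
        self._factory = factory

    def stream(self) -> Iterator[int]:
        # Each call builds a fresh stream, so nothing is held in memory
        # between passes.
        return self._factory()


def read_source() -> Iterator[int]:
    # Stand-in for re-reading the source from wherever it lives.
    return iter(range(3))


source = ReplayableSource(read_source)
merge_rows = list(source.stream())  # first pass, e.g. the merge itself
cdf_rows = list(source.stream())    # second pass, e.g. writing CDF
```

This only works when the source can be read again with the same result, which is an assumption of the sketch.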

I've actually already based some of the work here on top of #3142, so that has to be merged first. I will also rewrite the commits then, since they are messy now (:

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
@github-actions github-actions bot added the binding/python (Issues for the Python package) and binding/rust (Issues for the Rust crate) labels Jan 19, 2025
@ion-elgreco ion-elgreco marked this pull request as draft January 19, 2025 20:55
@ion-elgreco
Collaborator Author

I broke CDF by projecting too early and removing the before state. I need to cache while everything is still there, but we can't cache a dataframe when there are table references.
