Prototype expansion of SQL transforms for single-node execution #59

Open
pabloem opened this issue Dec 8, 2022 · 5 comments

@pabloem
Collaborator

pabloem commented Dec 8, 2022

One of the main targets for the Ray Beam Runner is to support SQL (and streaming SQL).

Beam's SQL support is implemented in Java. There are two parts to the execution of SQL transforms in Beam:

  • Expansion: Beam expands multi-language transforms through an ExpansionService interface (sample of the gRPC implementation; it seems way too complicated, to be honest).

My idea:

  • Implement a class, "RayJavaExpansionService", that receives the expansion request. The request can be relatively simple. It must contain:
    • Schema of the input PCollection (what are schemas)
    • Identifier of the transform to apply (these identifiers are provided by SchemaTransformProvider implementations; see a few examples)
      • Note: I will implement a SQL one this week: SqlSchemaTransformProvider, with id "beam:schematransform:org.apache.beam:sql:v1".
    • Parameters for the transform (in this case, just the SQL statement)

The RayJavaExpansionService should then return the schema of the resulting PCollection, as well as the expanded graph of operations in protobuf format (the proto format).

  • Java dependencies:
    • "org.apache.beam:beam-sdks-java-core"
    • "org.apache.beam:beam-sdks-java-extensions-sql"
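
To make the request/response shape above concrete, here is a minimal sketch in Python. It is only an illustration of the data involved, not the proposed Java class and not Beam's API: the class names, the field names, and the "query" parameter key are all assumptions made up for this example. Only the SQL transform id is taken from the comment above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Identifier quoted above for the planned SqlSchemaTransformProvider.
SQL_TRANSFORM_ID = "beam:schematransform:org.apache.beam:sql:v1"


@dataclass
class ExpansionRequest:
    """Hypothetical request shape; all field names are illustrative."""
    input_schema: List[Tuple[str, str]]  # (field name, type name) pairs
    transform_id: str                    # SchemaTransformProvider identifier
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class ExpansionResponse:
    """Hypothetical response shape; all field names are illustrative."""
    output_schema: List[Tuple[str, str]]  # schema of the resulting PCollection
    expanded_graph: bytes                 # expanded operations, serialized proto


def validate_request(request: ExpansionRequest, known_ids: Set[str]) -> List[str]:
    """Return a list of problems with the request; an empty list means it looks usable."""
    problems = []
    if not request.input_schema:
        problems.append("input schema is empty")
    if request.transform_id not in known_ids:
        problems.append(f"unknown transform id: {request.transform_id}")
    # Assumption: the SQL statement travels under a "query" key.
    if request.transform_id == SQL_TRANSFORM_ID and not request.parameters.get("query"):
        problems.append("SQL transform requires a 'query' parameter")
    return problems


request = ExpansionRequest(
    input_schema=[("word", "STRING"), ("count", "INT64")],
    transform_id=SQL_TRANSFORM_ID,
    parameters={"query": "SELECT word FROM PCOLLECTION"},
)
print(validate_request(request, {SQL_TRANSFORM_ID}))
```

The point of the sketch is that the request carries only three things (schema, transform id, parameters), so the service boundary can stay small even if the expansion itself is done in Java.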

The expansion is not enough to execute SQL, but it's the first step. The next step is to recognize Java stages and execute them in a Java process rather than a Python process (basically, a Java implementation of this code, where we return some kind of JavaWorkerHandler).
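
The stage-routing step could look roughly like the sketch below. Everything in it is hypothetical: the Stage class, the sdk_language tag, and both handler classes are stand-ins invented for illustration. A real runner would presumably work out the language from the stage's environment rather than from a plain string, and the handlers would actually start and talk to worker processes.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """Hypothetical stand-in for a fused pipeline stage."""
    name: str
    sdk_language: str  # assumed tag, e.g. "python" or "java"


class PythonWorkerHandler:
    """Placeholder for the existing in-process Python path."""
    def describe(self) -> str:
        return "runs in the Python process"


class JavaWorkerHandler:
    """Placeholder for the handler that would front a Java process."""
    def describe(self) -> str:
        return "runs in a separate Java process"


# Registry mapping the (assumed) language tag to a handler class.
HANDLER_FACTORIES = {
    "python": PythonWorkerHandler,
    "java": JavaWorkerHandler,
}


def handler_for(stage: Stage):
    """Pick a worker handler for a stage, failing loudly on unknown languages."""
    try:
        factory = HANDLER_FACTORIES[stage.sdk_language]
    except KeyError:
        raise ValueError(
            f"no worker handler for language {stage.sdk_language!r}"
        ) from None
    return factory()


for stage in [Stage("ReadWords", "python"), Stage("SqlTransform", "java")]:
    print(stage.name, "->", handler_for(stage).describe())
```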

@pabloem
Collaborator Author

pabloem commented Dec 8, 2022

Ray Java resources:

fyi @iasoon @valiantljk: this issue is more complex than the other stuff you've tried, but it should help move one of our big features forward. Would either of you be interested? : )

@wilsonwang371
Contributor

I don't fully understand this issue. You mentioned that these SQL transforms are done in Java. Does this mean that we are adding Java support to our Beam runner?

@pabloem
Collaborator Author

pabloem commented Dec 14, 2022

Yes, we would have to add support for expanding Java PTransforms. I think we can limit the scope of this quite a bit while still delivering SQL execution.

@wilsonwang371
Contributor

> yes, we would have to add support for expanding java PTransforms. I think we can limit the scope of this quite a bit while still delivering SQL execution.

This sounds cool if we are also targeting Java. I may ask a colleague to take a look, in case he is interested in joining us.

@wilsonwang371
Contributor

@Evan2022TT
