Introduce a "join" operator for simpler join execution #1237

Open
rcap107 opened this issue Feb 10, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

@rcap107
Contributor

rcap107 commented Feb 10, 2025

Problem Description

The current implementations of Joiner and MultiAggJoiner have some limitations, caused in part by the need to follow the scikit-learn estimator template:

  • They implement only the left join. This makes sense for an estimator, because the number of samples must remain constant. However, a user may expect to be able to perform other kinds of join (inner, outer, anti...), since the pandas and polars join/merge operators support them.
  • They are hard to put in production, because the join tables are defined in the init and may change between the init and the moment the join is executed (@Vincent-Maladiere).
  • In general, the fit/transform structure makes them clunky to use when the user only needs to perform multiple joins and does not care about putting them into a pipeline.
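
To make the second point concrete, here is a minimal sketch in plain Python (lists of dicts, no skrub or pandas). ToyLeftJoiner is a hypothetical stand-in written for this issue, not skrub's actual Joiner code; it only assumes the usual scikit-learn convention that the init stores its arguments untouched and the work happens in fit/transform.

```python
class ToyLeftJoiner:
    """Toy estimator-style left joiner (hypothetical, not skrub's code)."""

    def __init__(self, aux_table, main_key, aux_key):
        # scikit-learn convention: __init__ only stores references, no copies.
        self.aux_table = aux_table
        self.main_key = main_key
        self.aux_key = aux_key

    def fit(self, X, y=None):
        # The auxiliary table is read here, in whatever state it is in now.
        self.lookup_ = {row[self.aux_key]: row for row in self.aux_table}
        self.columns_ = sorted(
            {c for row in self.aux_table for c in row} - {self.aux_key}
        )
        return self

    def transform(self, X):
        # Left join: one output row per input row, None where nothing matches.
        out = []
        for row in X:
            match = self.lookup_.get(row[self.main_key], {})
            out.append({**row, **{c: match.get(c) for c in self.columns_}})
        return out


aux = [{"id": 1, "city": "Paris"}]
main = [{"key": 1}, {"key": 2}]

joiner = ToyLeftJoiner(aux, main_key="key", aux_key="id")
aux.append({"id": 2, "city": "Rome"})  # the table changes after the init

result = joiner.fit(main).transform(main)
```

In this toy, the row appended after the init ends up in the join result, and the output always has exactly as many rows as the input, which is the constraint that rules out inner or anti joins.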

I think it would be useful to have a more lightweight "join operator" that implements the join without the constraints of the estimator.

Feature Description

Rather than the current implementation, a join_tables operator would look similar to this:

joined_table = skrub.join_tables(
    main_table,
    aux_tables=[aux_table_1, aux_table_2, ...],
    left_on=["key1", "key2"],
    right_on=["id1", "id2"],
    how="inner",
)

I am calling this an "operator" because it would operate directly on the given tables and would be stateless.

It should be possible to reuse most of the machinery that has already been implemented in the Joiners, so it should not be too complicated to implement.
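
As a rough illustration of the intended behavior, here is a sketch of such a stateless function in plain Python, with tables represented as lists of dicts. It does not reuse any skrub machinery and skrub.join_tables does not exist yet; the supported `how` values ("left", "inner", "anti") and the handling of several auxiliary tables (joined one after the other) are assumptions made for this example only.

```python
def join_tables(main_table, aux_tables, left_on, right_on, how="inner"):
    """Stateless sketch: join each auxiliary table onto main_table in turn.

    Tables are lists of dicts; `how` is one of "left", "inner", "anti".
    Columns with the same name in several tables overwrite each other.
    """
    if how not in {"left", "inner", "anti"}:
        raise ValueError(f"unsupported join type: {how!r}")

    joined = [dict(row) for row in main_table]  # copy, inputs are not mutated
    for aux in aux_tables:
        # Index the auxiliary rows by their key tuple.
        index = {}
        for aux_row in aux:
            key = tuple(aux_row[c] for c in right_on)
            index.setdefault(key, []).append(aux_row)
        extra_cols = [c for c in (aux[0] if aux else {}) if c not in right_on]

        result = []
        for row in joined:
            matches = index.get(tuple(row[c] for c in left_on), [])
            if how == "anti":
                if not matches:  # keep only rows without a match
                    result.append(row)
            elif matches:
                for m in matches:
                    result.append({**row, **{c: m[c] for c in extra_cols}})
            elif how == "left":  # unmatched rows are kept, filled with None
                result.append({**row, **dict.fromkeys(extra_cols)})
        joined = result
    return joined


main = [{"key1": 1, "key2": "a"}, {"key1": 2, "key2": "b"}]
aux_1 = [{"id1": 1, "id2": "a", "x": 10}]
aux_2 = [{"id1": 1, "id2": "a", "y": 20}, {"id1": 2, "id2": "b", "y": 30}]
keys = {"left_on": ["key1", "key2"], "right_on": ["id1", "id2"]}

inner = join_tables(main, [aux_1, aux_2], how="inner", **keys)
left = join_tables(main, [aux_1, aux_2], how="left", **keys)
anti = join_tables(main, [aux_1, aux_2], how="anti", **keys)
```

Because nothing is stored between calls, the tables are read at the moment of the call, and the number of output rows is free to differ from the number of rows in the main table.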

Alternative Solutions

No response

Additional Context

No response

@rcap107 rcap107 added the enhancement New feature or request label Feb 10, 2025