-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Splink 4 dialect agnostic blocking rules #1890
Conversation
…R to test_preceding_blocking_rules
To do next: Split up creator file and library file |
|
||
self.base_dialect_str = base_dialect_str | ||
|
||
def create_sql(self, sql_dialect: SplinkDialect) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've copied this across rather than attempt to re-use the code from the CustomLevel
That was deliberate, to reduce indirection, but i don't feel particularly strongly about this
_clause = "AND" | ||
|
||
|
||
class Or(_Merge): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left Or
in to get tests passing, but i think there's an argument for dropping support for it: it's very rare you want to use an OR in a blocking rule because it usually results in a full cartesian product of the input dataframe being created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, this works nicely and fits in well with the rest of the stuff we've been doing. Very nice being able to write stuff like block_on(ColumnExpression("surname").substr(1, 4))
def block_on( | ||
*col_names_or_exprs: Union[str, ColumnExpression], | ||
salting_partitions=None, | ||
arrays_to_explode=None, | ||
) -> BlockingRuleCreator: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it might be helpful to give an error here if people use the old syntax, given that this is possibly a more subtle change compared to other API tweaks?
For instance currently if you use block_on(["first_name", "surname"])
you will hit AttributeError: 'NoneType' object has no attribute 'sql_dialect'
at some point once the linker is doing its thing, so it might be handy to nudge users in the right direction rather than them trying to unpick that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - done
In this PR, I have tried to mirror the work on ComparisonLevelCreator, but for blocking rules.
Hence the user passes BlockingRuleCreator objects into the settings dictionary. These are only turned into actual BlockingRules via blocking_rule_to_obj once the dialect is known via the DB API.
Example script