Update sampling logic for density #1831
Conversation
…ervices/splink into strat_sample_by_density
I have tested this with the following script:
Test script
import splink.duckdb.comparison_library as cl
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.linker import DuckDBLinker
df = splink_datasets.historical_50k
settings_dict = {
"link_type": "dedupe_only",
"blocking_rules_to_generate_predictions": [
block_on(["postcode_fake", "first_name"]),
block_on(["first_name", "surname"]),
block_on(["dob", "substr(postcode_fake,1,2)"]),
block_on(["postcode_fake", "substr(dob,1,3)"]),
block_on(["postcode_fake", "substr(dob,4,5)"]),
],
"comparisons": [
cl.exact_match(
"first_name",
term_frequency_adjustments=True,
),
cl.jaro_winkler_at_thresholds(
"surname", distance_threshold_or_thresholds=[0.9, 0.8]
),
cl.levenshtein_at_thresholds(
"postcode_fake", distance_threshold_or_thresholds=[1, 2]
),
],
"retain_intermediate_calculation_columns": True,
}
linker = DuckDBLinker(df, settings_dict)
linker.estimate_u_using_random_sampling(target_rows=1e6)
linker.estimate_parameters_using_expectation_maximisation(
block_on(["first_name", "surname"])
)
linker.estimate_parameters_using_expectation_maximisation(
block_on(["dob", "substr(postcode_fake, 1,3)"])
)
df_e = linker.predict()
df_cluster = linker.cluster_pairwise_predictions_at_threshold(
df_e, threshold_match_probability=0.9
)
nm = linker._compute_metrics_nodes(
df_predict=df_e, df_clustered=df_cluster, threshold_match_probability=0.9
)
cm = linker._compute_metrics_clusters()
cm.as_pandas_dataframe()
linker.cluster_studio_dashboard(
df_e,
df_cluster,
out_path="temp_del.html",
sampling_method="lowest_density_clusters_by_size",
overwrite=True,
_df_cluster_metrics=cm,
)
and it looks good to me. I've added a comment. I think I'll leave it to Andy to give it a once-over and final approval because he's closer to this work than I am.
Great, looks useful 👍
Just a couple of minor comments, but happy for you to go ahead and merge when you're happy to
Type of PR
Is your Pull Request linked to an existing Issue or Pull Request?
This PR follows on from PR #1754, which introduced the option to view lowest density clusters in the cluster studio dashboard.
Give a brief description for the solution you have provided
In this PR the sampling logic is updated so that the lowest density clusters are sampled across clusters of different sizes, addressing the conversation here. This is done by dividing the clusters into strata according to size, just like in the _get_cluster_id_of_each_size() method. And, as is the default there, the number of rows in each stratum (rows_per_partition) is set to 1.

Additionally:
- The minimum cluster size for sampling (min_nodes) has been set to 3, as density is undefined for clusters of size 1 and is trivially 1 for clusters of size 2.
- Within each stratum, clusters are ordered by cluster_id (which I believe is in line with the logic for sampling by cluster size). This is to ensure the same 'lowest density clusters' are passed to the dashboard in the event the clustering is rerun (assuming that a cluster's ID hasn't been updated by new cluster membership, or that len(cluster_ids) <= sample_size, which would otherwise trigger random sampling).
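To make the selection logic concrete, here is a minimal pandas sketch of stratified "lowest density cluster per size stratum" sampling. This is illustrative only, not the actual Splink implementation: the function name and the column names (cluster_id, n_nodes, n_edges) are hypothetical, and strata are taken to be exact cluster sizes for simplicity.

import pandas as pd

def lowest_density_clusters_by_size(
    cluster_metrics: pd.DataFrame,
    rows_per_partition: int = 1,
    min_nodes: int = 3,
) -> pd.DataFrame:
    # Drop clusters that are too small: density is undefined for a single node
    # (zero possible edges) and is always 1 for a connected cluster of two nodes.
    df = cluster_metrics[cluster_metrics["n_nodes"] >= min_nodes].copy()
    # Density = observed edges / possible edges in an undirected graph.
    df["density"] = df["n_edges"] / (df["n_nodes"] * (df["n_nodes"] - 1) / 2)
    # Sort within each size stratum by ascending density, breaking ties on
    # cluster_id so repeated runs return the same clusters.
    df = df.sort_values(["n_nodes", "density", "cluster_id"])
    # Keep rows_per_partition rows (default 1) from each size stratum.
    return df.groupby("n_nodes").head(rows_per_partition)

With rows_per_partition=1 this returns one lowest-density cluster for each cluster size present in the data, mirroring the behaviour described above.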
PR Checklist