Update sampling logic for density #1831

Merged
merged 18 commits into master from strat_sample_by_density
Jan 29, 2024
Conversation

zslade
Contributor

@zslade zslade commented Jan 9, 2024

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

This PR follows on from PR #1754, which introduced the option to view lowest density clusters in the cluster studio dashboard.

Give a brief description for the solution you have provided

In this PR the sampling logic is updated so that the lowest density clusters are sampled across clusters of different sizes, addressing the conversation here. This is done by dividing the clusters into strata according to size, just as in the _get_cluster_id_of_each_size() method. As is the default there, the number of rows in each stratum (rows_per_partition) is set to 1.

Additionally:

  • A sensible default of 3 has been set for the minimum cluster size (min_nodes), as density is undefined for clusters of size 1 and is trivially 1 for clusters of size 2.
  • As well as ordering by density, I have imposed ordering by cluster_id within each stratum (which I believe is in line with the logic for sampling by cluster size). This ensures the same 'lowest density clusters' are passed to the dashboard if the clustering is rerun (assuming a cluster's ID hasn't been updated by new cluster membership, and that len(cluster_ids) > sample_size, since the latter would otherwise trigger random sampling).
  • The density score is now displayed in the dashboard's dropdown menu beside the cluster ID. I've truncated the displayed density to 4dp to help with readability (the full value is still used in calculations, i.e. the ordering).
  • Cluster size could also be added to the menu alongside density, but I've left it off for now as it might make the menu items too long/busy. Happy to add it in if others think it useful, though.
  • Updated test 🧪
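The stratified sampling described above can be sketched roughly as follows. This is an illustrative approximation, not Splink's actual implementation: the helper name sample_lowest_density_clusters and the column names cluster_id, n_nodes and density are assumptions for the sketch.

```python
import pandas as pd

def sample_lowest_density_clusters(
    cluster_metrics: pd.DataFrame,
    sample_size: int = 10,
    min_nodes: int = 3,  # density is undefined for size 1, trivially 1 for size 2
    rows_per_partition: int = 1,
) -> list:
    """Pick the lowest-density cluster(s) from each cluster-size stratum."""
    eligible = cluster_metrics[cluster_metrics["n_nodes"] >= min_nodes]
    # Order by size (the stratum), then density ascending, then cluster_id,
    # so repeated runs yield the same sample when densities tie.
    ranked = eligible.sort_values(["n_nodes", "density", "cluster_id"])
    # Keep rows_per_partition clusters per size stratum.
    sampled = ranked.groupby("n_nodes").head(rows_per_partition)
    return sampled["cluster_id"].head(sample_size).tolist()
```

The secondary ordering by cluster_id is what gives the determinism discussed above: with ties on density, the sorted frame (and hence the sample) is stable across reruns as long as cluster IDs are unchanged.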

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@zslade zslade marked this pull request as ready for review January 11, 2024 16:43
@zslade zslade requested review from RobinL and ADBond January 11, 2024 16:43
@RobinL
Member

RobinL commented Jan 23, 2024

I have tested this with the following script:

Test script
import splink.duckdb.comparison_library as cl
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.historical_50k

settings_dict = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["postcode_fake", "first_name"]),
        block_on(["first_name", "surname"]),
        block_on(["dob", "substr(postcode_fake,1,2)"]),
        block_on(["postcode_fake", "substr(dob,1,3)"]),
        block_on(["postcode_fake", "substr(dob,4,5)"]),
    ],
    "comparisons": [
        cl.exact_match(
            "first_name",
            term_frequency_adjustments=True,
        ),
        cl.jaro_winkler_at_thresholds(
            "surname", distance_threshold_or_thresholds=[0.9, 0.8]
        ),
        cl.levenshtein_at_thresholds(
            "postcode_fake", distance_threshold_or_thresholds=[1, 2]
        ),
    ],
    "retain_intermediate_calculation_columns": True,
}


linker = DuckDBLinker(df, settings_dict)

linker.estimate_u_using_random_sampling(target_rows=1e6)

linker.estimate_parameters_using_expectation_maximisation(
    block_on(["first_name", "surname"])
)

linker.estimate_parameters_using_expectation_maximisation(
    block_on(["dob", "substr(postcode_fake, 1,3)"])
)

df_e = linker.predict()

df_cluster = linker.cluster_pairwise_predictions_at_threshold(
    df_e, threshold_match_probability=0.9
)

nm = linker._compute_metrics_nodes(
    df_predict=df_e, df_clustered=df_cluster, threshold_match_probability=0.9
)
cm = linker._compute_metrics_clusters()
cm.as_pandas_dataframe()

linker.cluster_studio_dashboard(
    df_e,
    df_cluster,
    out_path="temp_del.html",
    sampling_method="lowest_density_clusters_by_size",
    overwrite=True,
    _df_cluster_metrics=cm,
)

and it looks good to me.

I've added a comment. I think I'll leave it to Andy to give it a once-over and final approval, because he's closer to this work than I am.

Contributor

@ADBond ADBond left a comment


Great, looks useful 👍
Just a couple of minor comments, but happy for you to go ahead and merge when you're happy to

@zslade zslade merged commit b0c0b00 into master Jan 29, 2024
10 checks passed
@zslade zslade deleted the strat_sample_by_density branch January 29, 2024 14:23