Update sampling logic for density #1831

Merged
merged 18 commits into master from strat_sample_by_density
Jan 29, 2024
Conversation

zslade
Contributor

@zslade zslade commented Jan 9, 2024

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

This PR follows on from PR #1754, which introduced the option to view lowest density clusters in the cluster studio dashboard.

Give a brief description for the solution you have provided

In this PR the sampling logic is updated so that the lowest density clusters are sampled across clusters of different sizes, addressing the conversation here. This is done by dividing the clusters into strata according to size, just as in the _get_cluster_id_of_each_size() method. As is the default there, the number of rows in each stratum (rows_per_partition) is set to 1.

Additionally:

  • A sensible default of 3 has been set for the minimum cluster size (min_nodes), as density is undefined for clusters of size 1 and is trivially 1 for clusters of size 2.
  • As well as ordering by density, I have imposed ordering by cluster_id within each stratum (which I believe is in line with the logic for sampling by cluster size). This ensures the same 'lowest density clusters' are passed to the dashboard if the clustering is rerun (assuming a cluster's ID hasn't been updated by new cluster membership, and that len(cluster_ids) > sample_size, since the latter would otherwise trigger random sampling).
  • The density score is now displayed in the dashboard's dropdown menu beside the cluster ID. I've truncated the displayed density to 4dp to help with readability (the full value is still used in calculations, i.e. the ordering).
  • Cluster size could also be added to the menu alongside density, but I've left it off for now as it might make the menu items too long/busy. Happy to add it in if others think it useful, though.
  • Updated test 🧪
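The stratified sampling described above can be sketched roughly as follows. This is an illustrative approximation, not Splink's actual implementation: the helper name sample_lowest_density_clusters and the column names cluster_id, n_nodes and density are assumptions for the sketch.

```python
import pandas as pd

def sample_lowest_density_clusters(
    cluster_metrics: pd.DataFrame,
    sample_size: int = 10,
    min_nodes: int = 3,  # density is undefined for size 1, trivially 1 for size 2
    rows_per_partition: int = 1,
) -> list:
    """Pick the lowest-density cluster(s) from each cluster-size stratum."""
    eligible = cluster_metrics[cluster_metrics["n_nodes"] >= min_nodes]
    # Order by size (the stratum), then density ascending, then cluster_id,
    # so repeated runs yield the same sample when densities tie.
    ranked = eligible.sort_values(["n_nodes", "density", "cluster_id"])
    # Keep rows_per_partition clusters per size stratum.
    sampled = ranked.groupby("n_nodes").head(rows_per_partition)
    return sampled["cluster_id"].head(sample_size).tolist()
```

The secondary ordering by cluster_id is what gives the determinism discussed above: with ties on density, the sorted frame (and hence the sample) is stable across reruns as long as cluster IDs are unchanged.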

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@zslade zslade marked this pull request as ready for review January 11, 2024 16:43
@zslade zslade requested review from RobinL and ADBond January 11, 2024 16:43
@RobinL
Member

RobinL commented Jan 23, 2024

I have tested this with the following script:

Test script
import splink.duckdb.comparison_library as cl
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.historical_50k

settings_dict = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["postcode_fake", "first_name"]),
        block_on(["first_name", "surname"]),
        block_on(["dob", "substr(postcode_fake,1,2)"]),
        block_on(["postcode_fake", "substr(dob,1,3)"]),
        block_on(["postcode_fake", "substr(dob,4,5)"]),
    ],
    "comparisons": [
        cl.exact_match(
            "first_name",
            term_frequency_adjustments=True,
        ),
        cl.jaro_winkler_at_thresholds(
            "surname", distance_threshold_or_thresholds=[0.9, 0.8]
        ),
        cl.levenshtein_at_thresholds(
            "postcode_fake", distance_threshold_or_thresholds=[1, 2]
        ),
    ],
    "retain_intermediate_calculation_columns": True,
}


linker = DuckDBLinker(df, settings_dict)

linker.estimate_u_using_random_sampling(target_rows=1e6)

linker.estimate_parameters_using_expectation_maximisation(
    block_on(["first_name", "surname"])
)

linker.estimate_parameters_using_expectation_maximisation(
    block_on(["dob", "substr(postcode_fake, 1,3)"])
)

df_e = linker.predict()

df_cluster = linker.cluster_pairwise_predictions_at_threshold(
    df_e, threshold_match_probability=0.9
)

nm = linker._compute_metrics_nodes(
    df_predict=df_e, df_clustered=df_cluster, threshold_match_probability=0.9
)
cm = linker._compute_metrics_clusters()
cm.as_pandas_dataframe()

linker.cluster_studio_dashboard(
    df_e,
    df_cluster,
    out_path="temp_del.html",
    sampling_method="lowest_density_clusters_by_size",
    overwrite=True,
    _df_cluster_metrics=cm,
)

and it looks good to me.

I've added a comment. I think I'll leave it to Andy to give it a once-over and final approval, because he's closer to this work than I am.

Contributor

@ADBond ADBond left a comment


Great, looks useful 👍
Just a couple of minor comments, but happy for you to go ahead and merge when you're happy to

@zslade zslade merged commit b0c0b00 into master Jan 29, 2024
10 checks passed
@zslade zslade deleted the strat_sample_by_density branch January 29, 2024 14:23