Add multiple threshold clustering to linker #2617

Open: wants to merge 12 commits into base: master
5 changes: 4 additions & 1 deletion docs/api_docs/clustering.md
@@ -1,10 +1,13 @@
---
tags:
- API
- clustering
- Clustering
---

# Documentation for `splink.clustering`

Clustering at one or multiple thresholds is also available without a `linker` object:

::: splink.clustering
handler: python
options:
19 changes: 19 additions & 0 deletions docs/api_docs/linker_clustering.md
@@ -0,0 +1,19 @@
---
tags:
- API
- Clustering
---

# Methods in Linker.clustering

Use the result of your Splink model to group (cluster) records together. Accessed via `linker.clustering`

::: splink.internals.linker_components.clustering.LinkerClustering
handler: python
filters:
- "!^__init__$"
options:
show_root_heading: false
show_root_toc: false
show_source: false
members_order: source
3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -140,7 +140,7 @@ nav:
- Training: "api_docs/training.md"
- Visualisations: "api_docs/visualisations.md"
- Inference: "api_docs/inference.md"
- Clustering: "api_docs/clustering.md"
- Clustering: "api_docs/linker_clustering.md"
- Evaluation: "api_docs/evaluation.md"
- Table Management: "api_docs/table_management.md"
- Miscellaneous functions: "api_docs/misc.md"
@@ -151,6 +151,7 @@ nav:
- Exploratory: "api_docs/exploratory.md"
- Blocking rule creator: "api_docs/blocking.md"
- Blocking analysis: "api_docs/blocking_analysis.md"
- Clustering: "api_docs/clustering.md"
- SplinkDataFrame: "api_docs/splink_dataframe.md"
- EM Training Session API: "api_docs/em_training_session.md"
- SplinkDatasets: "api_docs/datasets.md"
10 changes: 8 additions & 2 deletions splink/clustering.py
@@ -1,3 +1,9 @@
from .internals.clustering import cluster_pairwise_predictions_at_threshold
from .internals.clustering import (
cluster_pairwise_predictions_at_multiple_thresholds,
cluster_pairwise_predictions_at_threshold,
)

__all__ = ["cluster_pairwise_predictions_at_threshold"]
__all__ = [
"cluster_pairwise_predictions_at_threshold",
"cluster_pairwise_predictions_at_multiple_thresholds",
]
2 changes: 1 addition & 1 deletion splink/internals/clustering.py
@@ -474,7 +474,7 @@ def cluster_pairwise_predictions_at_multiple_thresholds(
edge_id_column_name_left,
edge_id_column_name_right,
)

logger.info(f"--------Clustering at threshold {initial_threshold}--------")
# First cluster at the lowest threshold
cc = cluster_pairwise_predictions_at_threshold(
nodes=nodes_sdf,
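For reviewers who want a feel for what clustering at several thresholds means, here is a minimal pure-Python sketch. It is not Splink's implementation: the hunk above only shows that the first pass runs at the lowest threshold, so the way later thresholds refine the earlier result is an assumption made for illustration. The names `connected_components` and `cluster_at_multiple_thresholds` and the `(left_id, right_id, match_probability)` edge layout are hypothetical.

```python
def connected_components(node_ids, edges, threshold):
    """Map each node id to a cluster id, linking edges with probability >= threshold.

    Uses a simple union-find; the cluster id is the smallest node id in the cluster.
    """
    parent = {n: n for n in node_ids}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, prob in edges:
        if prob >= threshold:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                # Keep the smaller id as the representative of the merged cluster.
                parent[max(root_l, root_r)] = min(root_l, root_r)

    return {n: find(n) for n in node_ids}


def cluster_at_multiple_thresholds(node_ids, edges, thresholds):
    """Return {threshold: {node_id: cluster_id}} for every requested threshold."""
    if not thresholds:
        raise ValueError("at least one threshold is required")
    ordered = sorted(thresholds)

    # First cluster at the lowest threshold (the step visible in the diff).
    previous = connected_components(node_ids, edges, ordered[0])
    results = {ordered[0]: previous}

    # Assumed refinement: a higher threshold can only split existing clusters,
    # so only edges inside a previous cluster need to be reconsidered.
    for t in ordered[1:]:
        surviving = [
            (l, r, p) for l, r, p in edges if p >= t and previous[l] == previous[r]
        ]
        previous = connected_components(node_ids, surviving, t)
        results[t] = previous
    return results


nodes = [1, 2, 3, 4, 5]
pairwise = [(1, 2, 0.95), (2, 3, 0.6), (4, 5, 0.8)]
result = cluster_at_multiple_thresholds(nodes, pairwise, [0.9, 0.5])
```

In this toy input, the 0.5 pass groups records 1, 2, 3 and records 4, 5, while the 0.9 pass keeps only the 1-2 link. Reusing the coarser result to prune edges is one plausible reason to start at the lowest threshold, but the PR excerpt does not confirm that this is the motivation.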