Database API - updates and test conformance #1875

Merged
merged 485 commits into from
Jan 23, 2024
65ed77b
split metric method in twain and return a kind-of gross tuple
ADBond Dec 15, 2023
e5abccf
return a dict instead of tuple
ADBond Dec 15, 2023
25cf3f3
capture singleton nodes by joining clusters table
ADBond Dec 15, 2023
c4efefe
dialect-agnostic cluster metric test
ADBond Dec 15, 2023
7ff28a7
sqlite wants to keep things inty, so force to float
ADBond Dec 15, 2023
0c0cf17
consistent types for test
ADBond Dec 15, 2023
6233aeb
formatting
ADBond Dec 15, 2023
40232f6
use node metrics to compute size + density
ADBond Dec 15, 2023
f86b77a
compute cluster centralisation also
ADBond Dec 15, 2023
37d4f20
Changed argument name to lowest_density_clusters
zslade Dec 15, 2023
bf9e81b
Merge branch 'master' into sample_by_density
zslade Dec 15, 2023
03aad78
lint
zslade Dec 15, 2023
246dae9
Merge branch 'sample_by_density' of github.com:moj-analytical-service…
zslade Dec 15, 2023
5c95b77
centralisation also coerce to float
ADBond Dec 15, 2023
17b2612
test centralisation also
ADBond Dec 15, 2023
0fc5528
adjust test logic to fit with all backends
ADBond Dec 15, 2023
713293e
linting
ADBond Dec 15, 2023
f406719
typing, comments, and docstrings
ADBond Dec 15, 2023
9f10d9c
get rid of obsolete arguments
ADBond Dec 15, 2023
3298d16
Update function name
zslade Dec 18, 2023
98e4221
Update function name
zslade Dec 18, 2023
8e501d1
lint with black
zslade Dec 18, 2023
398d1ba
Merge branch 'master' into faster_duckdb
RobinL Dec 18, 2023
d4a9e9f
Merge branch 'splink4_dev' into migrate-tests
ADBond Dec 18, 2023
b2cbbe0
list -> typing.List
ADBond Dec 18, 2023
fef5378
lowercase in new style
ADBond Dec 18, 2023
9ab8600
further broken links
RossKen Dec 18, 2023
bf26035
small fixes
RossKen Dec 18, 2023
2913746
Merge pull request #1805 from moj-analytical-services/broken_links
RossKen Dec 18, 2023
c36be2d
handle tf adjustment column being str or ColumnExpression
ADBond Dec 18, 2023
a22eae6
simple repr for ColumnExpression for time being
ADBond Dec 18, 2023
2829c74
regex extract tests implemented
ADBond Dec 18, 2023
5f4183a
lint with black
ADBond Dec 18, 2023
ec83fd5
comment out test that we may not keep
ADBond Dec 18, 2023
b88173a
cast date newstyle
ADBond Dec 18, 2023
bf1a3b5
naming + kwargs
ADBond Dec 18, 2023
c75312e
fix how logic works in new style
ADBond Dec 19, 2023
815e28f
keyword args
ADBond Dec 19, 2023
9405a6c
Merge branch 'splink4_dev' into migrate-tests
ADBond Dec 19, 2023
981de52
relabel comparison with new casing
ADBond Dec 19, 2023
efdee7d
new comparisons translated
ADBond Dec 19, 2023
f2729f5
charts function no longer takes dialect
ADBond Dec 19, 2023
058af05
rename cl
ADBond Dec 19, 2023
b4437f0
Merge branch 'master' into sample_by_density
zslade Dec 19, 2023
058f490
Update tests/test_cluster_studio.py
zslade Dec 19, 2023
3a26d9e
Update tests/test_cluster_studio.py
zslade Dec 19, 2023
ac1cf23
Merge pull request #1754 from moj-analytical-services/sample_by_density
zslade Dec 19, 2023
9678a9c
(temp) custom array for duckdb
ADBond Dec 19, 2023
3301692
lint with black
ADBond Dec 19, 2023
5de2597
give levels an `is_exact_match_level` property, false by default
ADBond Dec 19, 2023
20de946
term frequencies on via property
ADBond Dec 19, 2023
b2f506f
term_frequency_adjustments for comparisons
ADBond Dec 19, 2023
5863318
Merge branch 'comparison-tf-adjustments' into migrate-tests-compariso…
ADBond Dec 19, 2023
8baa1f1
comparison tf adjustments translated
ADBond Dec 19, 2023
431c06b
remove special handling for ExactMatch
ADBond Dec 19, 2023
deaa548
Merge branch 'comparison-tf-adjustments' into migrate-tests-compariso…
ADBond Dec 19, 2023
3679e7b
fix syntax for ExactMatch in test
ADBond Dec 19, 2023
df00a73
+kwarg that no longer has default
ADBond Dec 19, 2023
734643f
remove ghost expressions :ghost:
ADBond Dec 19, 2023
ebb331d
build levels in `Linker.load_settings()`
ADBond Dec 19, 2023
922b013
ComparisonCreator - deal with list of `col_expressions` internally bu…
ADBond Dec 19, 2023
42573e6
DistanceInKMAtThresholds comparison
ADBond Dec 19, 2023
497962b
Merge branch 'distance-km-thresh-comparison' into migrate-tests-km-at…
ADBond Dec 19, 2023
57ceab6
translate comparisons, add helper arguments
ADBond Dec 19, 2023
4c96f5d
lint with black
ADBond Dec 20, 2023
815ea31
lint with black
ADBond Dec 20, 2023
325e4ba
Merge branch 'db-api' into tests-db-api
ADBond Dec 20, 2023
e797bd8
remove defaults for DistanceInKMAtThresholds
ADBond Dec 21, 2023
689e0d6
ComparisonCreator - column expressions as lists or strings
ADBond Dec 21, 2023
1e7a269
Merge pull request #1815 from moj-analytical-services/distance-km-thr…
ADBond Dec 21, 2023
afd6144
Merge pull request #1816 from moj-analytical-services/migrate-tests-k…
ADBond Dec 21, 2023
1eabe9e
Merge branch 'splink4_dev' into migrate-tests
ADBond Dec 21, 2023
77a588a
duckdb test using db api
ADBond Dec 21, 2023
67671cd
adjust test helper for db api
ADBond Dec 21, 2023
0c89132
Merge branch 'db-api' into tests-db-api
ADBond Dec 21, 2023
b13fa46
instantiate linker with simple settings
ADBond Dec 21, 2023
7a1f02e
Merge branch 'migrate-tests' into tests-db-api
ADBond Dec 21, 2023
605f633
Merge branch 'db-api' into tests-db-api
ADBond Dec 21, 2023
4ab3028
search-and-replace for new linker + api syntax
ADBond Dec 21, 2023
057f60c
fix imports + formatting
ADBond Dec 21, 2023
fe0b7bb
Merge branch 'db-api' into tests-db-api
ADBond Dec 21, 2023
fab98ce
mock db-api method instead of linker
ADBond Dec 21, 2023
d446596
fix tests that got missed
ADBond Dec 21, 2023
616bf7d
fix handling of a few more tests
ADBond Dec 21, 2023
56cca08
db_api has the connection now
ADBond Dec 21, 2023
71a6d05
linting
ADBond Dec 21, 2023
748efe9
rename graph metrics function to more clearly align with what it does
ADBond Dec 22, 2023
fc2b3f9
clearer variable name
ADBond Dec 22, 2023
c7a6d53
some docstrings for internal graph metric functions
ADBond Dec 22, 2023
ec4e8f7
correct docstring wording
ADBond Dec 22, 2023
1219c3c
adjust SQL formatting
ADBond Dec 22, 2023
d05036d
switch dict layout of metrics tables
ADBond Dec 22, 2023
939768d
Revert "handle tf adjustment column being str or ColumnExpression"
ADBond Dec 22, 2023
030abdd
convert any level_dict values from `ColumnExpression` to string
ADBond Dec 22, 2023
ed0fb3f
convert any level_dict values from `ColumnExpression` to string
ADBond Dec 22, 2023
d5c378e
Merge pull request #1818 from moj-analytical-services/comparison-leve…
ADBond Dec 22, 2023
2e27af4
don't need max degree
ADBond Dec 22, 2023
4b3afb6
keep table names direct
ADBond Dec 22, 2023
5a1d186
rename var for clarity
ADBond Dec 22, 2023
20d195d
rename physical tables in line with updated method name
ADBond Dec 22, 2023
b078468
Merge branch 'db-api' into tests-db-api
ADBond Dec 22, 2023
6832710
basic null-level validation logic ported
ADBond Dec 22, 2023
20c8fa7
Merge pull request #1819 from moj-analytical-services/null-level-vali…
ADBond Dec 23, 2023
dec3078
Add duckdb salting based on max_pairs
RobinL Dec 24, 2023
1d8b64b
Refactor _get_duckdb_salting to double the returned value
RobinL Dec 24, 2023
0c46720
revert change that doubled cpus. was only used for benchmarking
RobinL Jan 4, 2024
b80e589
Merge branch 'db-api-spark' into tests-db-api-spark
ADBond Jan 8, 2024
a199b87
+ test helper arg
ADBond Jan 8, 2024
b1f4827
Merge branch 'migrate-tests' into tests-db-api-spark
ADBond Jan 8, 2024
b492679
spark_api fixture and use in test
ADBond Jan 8, 2024
4a2c5a1
udf test to new api
ADBond Jan 9, 2024
271961a
test cl - m probs to configure
ADBond Jan 9, 2024
0dae6e7
Merge branch 'splink4_dev' into migrate-tests
ADBond Jan 9, 2024
9c7ccc1
postgres try_parse_date + fix datediff to work with ColumnExpression …
ADBond Jan 9, 2024
8c96ad7
concat -> ||
ADBond Jan 9, 2024
c733de6
lint with black
ADBond Jan 9, 2024
cc47fe1
Merge branch 'db-api-spark' into tests-db-api-spark
ADBond Jan 9, 2024
9f24dff
improve docstring wording
ADBond Jan 10, 2024
e74f12f
correct description of return type in docstring
ADBond Jan 10, 2024
e241004
Merge pull request #1806 from moj-analytical-services/node-degree
ADBond Jan 10, 2024
535e4c8
Merge branch 'db-api-spark' into tests-db-api-spark
ADBond Jan 10, 2024
980eccc
full spark test to new api
ADBond Jan 10, 2024
69d06a1
Merge branch 'db-api-spark' into tests-db-api-spark
ADBond Jan 10, 2024
0b95034
Refactor blocking and prediction SQL queries
RobinL Jan 10, 2024
fc683ed
Remove unnecessary blank line in SaltedBlockingRule class
RobinL Jan 10, 2024
0bca602
Update estimate_u.py: Import multiprocessing and remove unused function
RobinL Jan 10, 2024
7168154
Update cvv_hashed_tablename in test_correctness_of_convergence.py
RobinL Jan 10, 2024
cc3ac68
Update changelog
RobinL Jan 10, 2024
5c07616
Refactor blocking rule initialization in EMTrainingSession and Linker…
RobinL Jan 10, 2024
6e7f760
Merge pull request #1796 from moj-analytical-services/faster_duckdb
RobinL Jan 10, 2024
87fc2de
Merge branch 'master' into parallel_em_training
RobinL Jan 10, 2024
d83b6a6
salting test to new api
ADBond Jan 10, 2024
886a78e
update changelog
RobinL Jan 10, 2024
9417bcc
convert last SparkLinker tests to new api
ADBond Jan 10, 2024
0309b30
annotate tests with appropriate backend decorations
ADBond Jan 10, 2024
24ed793
ctl test to backend agnostic framework + rename ctl functions
ADBond Jan 11, 2024
74bcd02
adjust for simplified api + fix corresponding gamma levels
ADBond Jan 11, 2024
a4f8d40
null level - if we are using a regex or date transform we must also c…
ADBond Jan 11, 2024
280e026
initial EmailComparison translated
ADBond Jan 11, 2024
66dd487
Merge branch 'master' into fix_convergence_test
RobinL Jan 12, 2024
c265a36
Merge pull request #1798 from moj-analytical-services/fix_convergence…
RobinL Jan 12, 2024
805371b
email comparison description, and force optional args to be keyword
ADBond Jan 12, 2024
3e74e95
Merge pull request #1832 from moj-analytical-services/parallel_em_tra…
RobinL Jan 12, 2024
b5d1b7c
update python version + pin poetry version for lint/autoblack workflows
ADBond Jan 12, 2024
c8fd7be
pin all unpinned workflows to poetry 1.7.0
ADBond Jan 12, 2024
f058532
lint with black
ADBond Jan 12, 2024
f128871
Merge pull request #1836 from ADBond/fix-linting-workflows
ADBond Jan 12, 2024
a8d48ac
custom datdiff function for spark dialect
ADBond Jan 12, 2024
813e947
fix description, nice error if invalid metric
ADBond Jan 12, 2024
934705e
test for ctl functions - only EmailComparison currently
ADBond Jan 12, 2024
24e41e0
make ComparisonCreator init arg compulsory, note exception in custom …
ADBond Jan 15, 2024
7786f8f
change col_expressions property to be a dict to make it more readable…
ADBond Jan 15, 2024
675fdd8
Include capture_group passing to ColumnExpression.regex_extract
ADBond Jan 15, 2024
f69d31e
add regex_extract implementation for duckdb, spark, and a limited ver…
ADBond Jan 15, 2024
67fca98
Merge pull request #1837 from moj-analytical-services/fix-datediff-spark
ADBond Jan 16, 2024
a33b7e8
Merge pull request #1834 from moj-analytical-services/null-level-patt…
ADBond Jan 16, 2024
92c5052
don't need cll and cl to go via test helper anymore!
ADBond Jan 16, 2024
bb3f55f
exclude sqlite + postgres from ctl tests
ADBond Jan 16, 2024
499f0e1
lint with black
ADBond Jan 16, 2024
4b14546
Merge branch 'splink4_dev' into regexp-extract-capture-group
RobinL Jan 16, 2024
eaba210
change col_expressions property to be a dict to make it more readable…
RobinL Jan 16, 2024
2522fb0
Merge branch 'master' into merge_in_master
RobinL Jan 16, 2024
e2d3fdf
Merge pull request #1846 from moj-analytical-services/merge_in_master
RobinL Jan 16, 2024
59f02fc
Merge pull request #1845 from moj-analytical-services/fix_lat_lng
RobinL Jan 16, 2024
f93fc19
Merge pull request #1842 from moj-analytical-services/comparison-crea…
ADBond Jan 16, 2024
0867861
Merge pull request #1840 from moj-analytical-services/email-comparison
ADBond Jan 16, 2024
b54278c
Merge pull request #1828 from moj-analytical-services/postgres-datedi…
ADBond Jan 16, 2024
3c924f6
Merge branch 'splink4_dev' into migrate-tests
ADBond Jan 16, 2024
1ece5d7
Merge branch 'master' into refactor_ids_to_compare_creation
RobinL Jan 17, 2024
ca0e202
Move out of loop
RobinL Jan 17, 2024
288e8d1
exact match level - tf adjustments pass raw sql to be dealt with down…
ADBond Jan 17, 2024
d717d36
explanatory comment for leaving raw values in dict
ADBond Jan 17, 2024
3421cc8
lint with black
ADBond Jan 17, 2024
2661b08
Merge pull request #1692 from moj-analytical-services/refactor_ids_to…
RobinL Jan 17, 2024
d16fc2c
Add ability to block on array columns
RobinL Jan 17, 2024
a367f4a
Merge pull request #1847 from moj-analytical-services/update_changelog
RobinL Jan 17, 2024
6d0a64e
quick test highlighting issue with empty strings
ADBond Jan 17, 2024
a83c867
Merge branch 'master' into splink4_dev
RobinL Jan 17, 2024
5671bb8
wrap regex_extract in `nullif` to guard for empty strings
ADBond Jan 17, 2024
21f05dc
nullif wrapping try_parse_date also, use a helper function
ADBond Jan 17, 2024
b589bfe
test for null level validation
ADBond Jan 17, 2024
c248ca0
remove explicit null level empty string checking now that it is cover…
ADBond Jan 17, 2024
467e66e
Merge branch 'splink4_dev' into regexp-extract-capture-group
ADBond Jan 17, 2024
804589a
lint with black
ADBond Jan 17, 2024
dff4c6f
Merge pull request #1844 from moj-analytical-services/regexp-extract-…
ADBond Jan 17, 2024
8339efb
Merge pull request #1812 from moj-analytical-services/comparison-tf-a…
ADBond Jan 17, 2024
ab3f59f
Merge branch 'migrate-tests' into migrate-tests-comparison-tf
ADBond Jan 17, 2024
e1a87cd
Merge pull request #1813 from moj-analytical-services/migrate-tests-c…
ADBond Jan 17, 2024
2697523
Merge branch 'splink4_dev' into migrate-tests
ADBond Jan 17, 2024
f1682ba
lint with black
ADBond Jan 17, 2024
9a26348
helper.cl -> imported cl
ADBond Jan 17, 2024
8ff82a9
linting
ADBond Jan 17, 2024
38c070b
exclude incompatible backends from tests
ADBond Jan 17, 2024
f6322cc
tf from param -> configure with columns reversed
ADBond Jan 17, 2024
9744913
adjust tests for DateComparison
ADBond Jan 18, 2024
b44da75
adjust test for NameComparison (with new defaults)
ADBond Jan 18, 2024
1eda053
postcode comparison test update name
ADBond Jan 18, 2024
aa5c551
fornamesurname test adjust gamma name
ADBond Jan 18, 2024
18ccc87
adjust email comparison to be consistent with remaining future ctl fu…
ADBond Jan 18, 2024
99c5dd8
+ PostcodeComparison
ADBond Jan 18, 2024
3559902
+ ctl.DateComparison
ADBond Jan 18, 2024
764f292
ctl.NameComparison
ADBond Jan 18, 2024
7bdfe9c
ctl.ForenameSurnameComparison
ADBond Jan 18, 2024
1ab8991
linting
ADBond Jan 18, 2024
6737ae3
fix test call
ADBond Jan 18, 2024
59811d8
quicktests of a few ctl functions
ADBond Jan 18, 2024
0401696
remove false parameter
ADBond Jan 18, 2024
e68ee7d
fix column names and cast to date
ADBond Jan 18, 2024
7328e9f
lint with black
ADBond Jan 18, 2024
5636bed
Merge pull request #1856 from moj-analytical-services/splink4-ctl
ADBond Jan 19, 2024
f5e4527
Merge pull request #1857 from moj-analytical-services/migrate-tests-ctl
ADBond Jan 19, 2024
7d5d13a
Merge branch 'splink4_dev' into migrate-tests
ADBond Jan 19, 2024
0356c64
exclude sqlite from regex tests
ADBond Jan 19, 2024
95c0800
rename distance levels
ADBond Jan 19, 2024
0a2db69
+ DistanceFunctionLevel
ADBond Jan 19, 2024
61cb160
DistanceFunctionAtThresholds
ADBond Jan 19, 2024
2f3aeee
lint with black
ADBond Jan 19, 2024
99c0793
change expected error type
ADBond Jan 19, 2024
fadf617
make 'distance_threshold' validation generic numeric validation
ADBond Jan 19, 2024
2b93d2b
datediff and arrayintersect levels numerical validation
ADBond Jan 19, 2024
dbf9cf6
categorical validation - datediff metrics
ADBond Jan 19, 2024
e98eec5
basic type check numerical validation
ADBond Jan 19, 2024
dae328b
ComparisonCreator - validate in __init__ by creating levels
ADBond Jan 19, 2024
548145a
DatediffAtThresholds + DateComparison - validate threshold + metric a…
ADBond Jan 19, 2024
f6c449d
lint with black
ADBond Jan 19, 2024
8a7648b
fix test name
ADBond Jan 19, 2024
38141f0
remove aspect of test that postgres can't handle
ADBond Jan 19, 2024
c8a91de
lint with black
ADBond Jan 19, 2024
4dd7191
postgres udf to cast to date or return NULL if impossible
ADBond Jan 19, 2024
f1333ab
PostgresDialect use custom date-casting function
ADBond Jan 19, 2024
6bfbd0d
Add LiteralMatchLevel class to ComparisonLevelLibrary
RobinL Jan 22, 2024
156efff
Merge pull request #1863 from moj-analytical-services/fix-postgres-da…
ADBond Jan 22, 2024
9f63045
Merge pull request #1860 from moj-analytical-services/custom-distance…
ADBond Jan 22, 2024
fe45feb
Merge pull request #1861 from moj-analytical-services/comparison-vali…
ADBond Jan 22, 2024
f0c63d9
Merge branch 'splink4_dev' into migrate-tests
ADBond Jan 22, 2024
aa522a2
Add support for different literal datatypes in LiteralMatchLevel
RobinL Jan 22, 2024
686140b
Merge branch 'splink4_dev' into literal_match_level
RobinL Jan 22, 2024
a2ba28b
Refactor literal datatype validation in LiteralMatchLevel
RobinL Jan 22, 2024
ca8f9d0
fix 1st jan
RobinL Jan 22, 2024
d486e9f
Merge pull request #1870 from moj-analytical-services/fix_jan_1st
RobinL Jan 22, 2024
42b593d
Merge pull request #1714 from moj-analytical-services/migrate-tests
ADBond Jan 22, 2024
6cf4b2e
Merge branch 'splink4_dev' into literal_match_level
RobinL Jan 22, 2024
a22b9a3
Merge pull request #1869 from moj-analytical-services/literal_match_l…
RobinL Jan 22, 2024
89e6064
Merge branch 'splink4_dev' into db-api-all-tests
ADBond Jan 23, 2024
6a324d0
tests to splink4 compatibility
ADBond Jan 23, 2024
f5ef657
move _explode_arrays_sql to live on dialects
ADBond Jan 23, 2024
09fd30e
mark some dialect tests
ADBond Jan 23, 2024
efb0ffa
fix test call
ADBond Jan 23, 2024
fa534d6
skip (possibly) no-longer-relevant test
ADBond Jan 23, 2024
4f41605
linting
ADBond Jan 23, 2024
1 change: 1 addition & 0 deletions .github/workflows/auto_update_script_contents.yml
@@ -31,6 +31,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
9 changes: 7 additions & 2 deletions .github/workflows/autoblack.yml
@@ -1,5 +1,9 @@
name: autoblack
on: [pull_request]

env:
PYTHON_VERSION: "3.12.1"

jobs:
build:
runs-on: ubuntu-latest
@@ -9,10 +13,10 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.ref }}
repository: ${{ github.event.pull_request.head.repo.full_name }}
- name: Set up Python ${{ matrix.python-version }}
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
python-version: ${{ env.PYTHON_VERSION }}

- name: Load cached Poetry installation
uses: actions/cache@v2
@@ -22,6 +26,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
1 change: 1 addition & 0 deletions .github/workflows/documentation.yml
@@ -39,6 +39,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
12 changes: 6 additions & 6 deletions .github/workflows/lint.yml
@@ -1,20 +1,19 @@
name: Lint
on: [pull_request]

env:
PYTHON_VERSION: "3.12.1"

jobs:
build:
runs-on: ubuntu-latest
strategy:
max-parallel: 4
matrix:
python-version: [3.8]

steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
python-version: ${{ env.PYTHON_VERSION }}

- name: Load cached Poetry installation
uses: actions/cache@v2
@@ -24,6 +23,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
1 change: 0 additions & 1 deletion .github/workflows/poetry_pypi_release.yml
@@ -11,7 +11,6 @@ jobs:
- uses: actions/checkout@v3
with:
ref: master
token: ${{ secrets.SPLINK_TOKEN }}
- name: Install poetry
run: pipx install poetry
- uses: actions/setup-python@v4
1 change: 1 addition & 0 deletions .github/workflows/pytest_benchmark_comment.yml
@@ -37,6 +37,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
1 change: 1 addition & 0 deletions .github/workflows/pytest_benchmark_commit.yml
@@ -39,6 +39,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
1 change: 1 addition & 0 deletions .github/workflows/run_demos_examples.yml
@@ -41,6 +41,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
1 change: 1 addition & 0 deletions .github/workflows/run_demos_tutorials.yml
@@ -40,6 +40,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
22 changes: 21 additions & 1 deletion CHANGELOG.md
@@ -7,10 +7,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## Unreleased

### Added

- Ability to block on array columns by specifying `arrays_to_explode` in your blocking rule. ([#1692](https://github.com/moj-analytical-services/splink/pull/1692))

### Changed

- Splink now fully parallelises data linkage when using DuckDB ([#1796](https://github.com/moj-analytical-services/splink/pull/1796))

### Fixed

- Allow salting in EM training ([#1832](https://github.com/moj-analytical-services/splink/pull/1832))

## [3.9.10] - 2023-12-07

### Changed

- Remove unused code from Athena linker ([#1775](https://github.com/moj-analytical-services/splink/pull/1775))
- Add argument for `register_udfs_automatically` ([#1774](https://github.com/moj-analytical-services/splink/pull/1774))

### Fixed

- Fixed issue with `_source_dataset_col` and `_source_dataset_input_column` ([#1731](https://github.com/moj-analytical-services/splink/pull/1731))
- Delete cached tables before resetting the cache ([#1752](https://github.com/moj-analytical-services/splink/pull/1752))

## [3.9.9] - 2023-11-14

@@ -46,6 +65,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Corrected path for Spark `.jar` file containing UDFs to work correctly for Spark < 3.0 ([#1622](https://github.com/moj-analytical-services/splink/pull/1622))
- Spark UDF `damerau_levensthein` is now only registered for Spark >= 3.0, as it is not compatible with earlier versions ([#1622](https://github.com/moj-analytical-services/splink/pull/1622))

[unreleased]: https://github.com/moj-analytical-services/splink/compare/3.9.9...HEAD
[unreleased]: https://github.com/moj-analytical-services/splink/compare/3.9.10...HEAD
[3.9.10]: https://github.com/moj-analytical-services/splink/compare/v3.9.9...3.9.10
[3.9.9]: https://github.com/moj-analytical-services/splink/compare/v3.9.8...3.9.9
[3.9.8]: https://github.com/moj-analytical-services/splink/compare/v3.9.7...v3.9.8
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -18,7 +18,7 @@ Contributions to Splink are not limited to the code. Feedback and input on our d

Behind the scenes, the Splink documentation is split into 2 parts:

- The [Tutorials](./docs/demos/00_Tutorial_Introduction.ipynb) and [Example Notebooks](./docs/examples_index.md) are stored in a separate repo - [splink_demos](https://github.com/moj-analytical-services/splink_demos)
- The [Tutorials](./docs/demos/tutorials/00_Tutorial_Introduction.ipynb) and [Example Notebooks](./docs/demos/examples/examples_index.md) are stored in a separate repo - [splink_demos](https://github.com/moj-analytical-services/splink_demos)
- Everything else is stored in the Splink repo either in:
- the [docs folder](https://github.com/moj-analytical-services/splink/tree/master/docs)
- the Splink code itself. E.g. docstrings from [linker.py](https://github.com/moj-analytical-services/splink/blob/master/splink/linker.py) feed directly into the [Linker API docs](./docs/linker.md).
9 changes: 6 additions & 3 deletions README.md
@@ -109,7 +109,7 @@ Should you require a more bare-bones version of Splink **without DuckDB**, pleas

The following code demonstrates how to estimate the parameters of a deduplication model, use it to identify duplicate records, and then use clustering to generate an estimated unique person ID.

For more detailed tutorial, please see [here](https://moj-analytical-services.github.io/splink/demos/00_Tutorial_Introduction.html).
For more detailed tutorial, please see [here](https://moj-analytical-services.github.io/splink/demos/tutorials/00_Tutorial_Introduction.html).

```py
from splink.duckdb.linker import DuckDBLinker
@@ -166,13 +166,16 @@ To find the best place to ask a question, report a bug or get general advice, pl

## Awards

🥇 Analysis in Government Awards 2020: Innovative Methods: [Winner](https://www.gov.uk/government/news/launch-of-the-analysis-in-government-awards)
🥇 Analysis in Government Awards 2020: Innovative Methods - [Winner](https://www.gov.uk/government/news/launch-of-the-analysis-in-government-awards)

🥇 MoJ DASD Awards 2020: Innovation and Impact - Winner

🥇 Analysis in Government Awards 2022: People's Choice Award - [Winner](https://analysisfunction.civilservice.gov.uk/news/announcing-the-winner-of-the-first-analysis-in-government-peoples-choice-award/)

🥈 Analysis in Government Awards 2022: Innovative Methods [Runner up](https://twitter.com/gov_analysis/status/1616073633692274689?s=20&t=6TQyNLJRjnhsfJy28Zd6UQ)
🥈 Analysis in Government Awards 2022: Innovative Methods - [Runner up](https://twitter.com/gov_analysis/status/1616073633692274689?s=20&t=6TQyNLJRjnhsfJy28Zd6UQ)

🥈 Civil Service Awards 2023: Best Use of Data, Science, and Technology - [Runner up](https://www.civilserviceawards.com/best-use-of-data-science-and-technology-award-2/)


## Citation

4 changes: 2 additions & 2 deletions docs/blocking_rule_library.md
@@ -6,9 +6,9 @@ toc_depth: 2
---
# Documentation for `blocking_rules_library`

The `blocking_rules_library` contains a series of pre-made blocking rules available for use in the construction of blocking rule strategies and em training blocks [as described in this topic guide](./topic_guides/drivers_of_performance.html#blocking-rules).
The `blocking_rules_library` contains a series of pre-made blocking rules available for use in the construction of blocking rule strategies and em training blocks [as described in this topic guide](./topic_guides/blocking/blocking_rules.md).

These conform to a more performant standard that is outlined in detail [here](./topic_guides/drivers_of_performance.html#blocking-rules).
These conform to a more performant standard that is outlined in detail [here](./topic_guides/performance/drivers_of_performance.html#blocking-rules).


The detailed API for each of these are outlined below.
17 changes: 9 additions & 8 deletions docs/blog/.authors.yml
@@ -1,9 +1,10 @@
robin-l:
name: Robin Linacre
description: Creator
avatar: https://github.com/robinl.png
authors:
robin-l:
name: Robin Linacre
description: Creator
avatar: https://github.com/robinl.png

ross-k:
name: Ross Kennedy
description: Maintainer
avatar: https://github.com/rossken.png
ross-k:
name: Ross Kennedy
description: Maintainer
avatar: https://github.com/rossken.png
2 changes: 1 addition & 1 deletion docs/blog/posts/2023-07-27-feature_update.md
@@ -1,5 +1,5 @@
---
date: 2022-07-27
date: 2023-07-27
authors:
- ross-k
- robin-l
119 changes: 119 additions & 0 deletions docs/blog/posts/2023-12-06-feature_update.md
@@ -0,0 +1,119 @@
---
date: 2023-12-06
authors:
- ross-k
categories:
- Feature Updates
---

# Splink Updates - December 2023

Welcome to the second installment of the Splink Blog!

Here are some of the highlights from the second half of 2023, and a taste of what is in store for 2024!

<!-- more -->

Latest Splink version: [v3.9.10](https://github.com/moj-analytical-services/splink/releases/tag/v3.9.10)

## :bar_chart: Charts Gallery

The Splink docs site now has a [Charts Gallery](../../charts/index.md) to show off all of the charts that come out-of-the-box with Splink to make linking easier.

[![](../posts/img/charts_gallery.png){ width="400" }](../../charts/index.md)

Each chart now has an explanation of:

1. What the chart shows
2. How to interpret it
3. Actions to take as a result

This is the first step on a longer-term journey to provide more guidance on how to evaluate Splink models and linkages, so watch this space for more in the coming months!

## :chart_with_upwards_trend: New Charts

We are always adding more charts to Splink - to understand how these charts are built see our new [Charts Developer Guide](../../dev_guides/charts/understanding_and_editing_charts.md).

Two of our latest additions are:

### :material-matrix: Confusion Matrix

When evaluating any classification model, a confusion matrix is a useful tool for summarizing performance by representing counts of true positive, true negative, false positive, and false negative predictions.

Splink now has its own [confusion matrix chart](../../charts/confusion_matrix_from_labels_table.ipynb) to show how model performance changes with a given match weight threshold.

[![](./img/confusion_matrix.png){ width="400" }](../../charts/confusion_matrix_from_labels_table.ipynb)

Note, labelled data is required to generate this chart.
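
To make the idea concrete, here is a minimal pure-Python sketch of the counting behind a confusion matrix at a given match weight threshold. It does not use Splink itself: the function name `confusion_counts`, the made-up labelled pairs, and the choice to treat a score equal to the threshold as a predicted match are all assumptions for illustration.

```python
from typing import Dict, List, Tuple


def confusion_counts(
    scored_pairs: List[Tuple[float, bool]], threshold: float
) -> Dict[str, int]:
    """Count TP/FP/TN/FN, treating pairs scoring >= threshold as predicted matches."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for match_weight, is_true_match in scored_pairs:
        predicted_match = match_weight >= threshold  # assumed convention
        if predicted_match and is_true_match:
            counts["tp"] += 1
        elif predicted_match:
            counts["fp"] += 1
        elif is_true_match:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Made-up (match_weight, is_true_match) pairs, for illustration only
labelled_pairs = [(8.2, True), (3.1, True), (-1.5, False), (4.0, False), (-6.0, False)]

# Raising the threshold trades false positives for false negatives
for threshold in (0.0, 5.0):
    print(threshold, confusion_counts(labelled_pairs, threshold))
```

With this toy data, moving the threshold from 0 to 5 removes the one false positive but turns one true match into a false negative, which is the trade-off the chart lets you explore.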

### :material-table: Completeness Chart

When linking multiple datasets together, one of the most important factors for a successful linkage is the number of common fields across the datasets.

Splink now has the [completeness chart](../../charts/completeness_chart.ipynb) which gives a simple view of how well populated fields are across datasets.

[![](./img/completeness_chart.png)](../../charts/completeness_chart.ipynb)
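
As a rough illustration of what "completeness" means here, the sketch below computes the fraction of non-null values per column for each dataset. The records, dataset names and the `completeness` helper are hypothetical and are not how Splink computes the chart internally.

```py
# Hypothetical records from two datasets; None marks a missing value.
datasets = {
    "df_left": [
        {"first_name": "Ann", "surname": "Lee", "dob": None},
        {"first_name": "Bob", "surname": None, "dob": None},
    ],
    "df_right": [
        {"first_name": "Ann", "surname": "Lee", "dob": "1990-01-01"},
        {"first_name": None, "surname": "Roy", "dob": "1985-06-30"},
    ],
}


def completeness(records, columns):
    """Return the fraction of non-null values for each column."""
    return {
        col: sum(r.get(col) is not None for r in records) / len(records)
        for col in columns
    }


columns = ["first_name", "surname", "dob"]
result = {name: completeness(rows, columns) for name, rows in datasets.items()}

for name, scores in result.items():
    print(name, scores)
```

In this toy data `dob` is never populated in `df_left`, so it could not contribute to a link between the two datasets.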


## :clipboard: Settings Validation

The [Settings dictionary](../../settings_dict_guide.md) is central to everything in Splink. It defines everything from the SQL dialect of your backend to how features are compared in a Splink model.

A common sticking point with users is how easy it is to make small errors when defining the Settings dictionary, resulting in unhelpful error messages.

To address this issue, the [Settings Validator](../../dev_guides/settings_validation/settings_validation_overview.md) provides clear, user-friendly feedback on what the issue is and how to fix it.
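
To give a flavour of the kind of feedback this aims for, here is a minimal sketch of one check a validator might perform: flagging a column name that does not exist in the input data and suggesting a close match. The `check_columns` helper and its message wording are assumptions for illustration only, not the Settings Validator's actual code or output.

```py
from difflib import get_close_matches


def check_columns(referenced_columns, input_columns):
    """Return a readable message for each referenced column missing from the data."""
    messages = []
    for col in referenced_columns:
        if col not in input_columns:
            # Suggest the closest existing column name, if there is one.
            suggestion = get_close_matches(col, input_columns, n=1)
            hint = f" Did you mean '{suggestion[0]}'?" if suggestion else ""
            messages.append(f"Column '{col}' not found in input data.{hint}")
    return messages


input_columns = ["first_name", "surname", "dob"]
print(check_columns(["first_name", "surnam"], input_columns))
```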


## :simple-adblock: Blocking Rule Library (Improved)

In our [previous blog](../posts/2023-12-06-feature_update.md#no_entry_sign-drop-support-for-python-37) we introduced the Blocking Rule Library (BRL) built upon the `exact_match_rule` function. When testing this functionality we found it pretty verbose, particularly when blocking on multiple columns, so we figured we could do better. In Splink v3.9.6 we introduced the `block_on` function to supersede `exact_match_rule`.

For example, a block on `first_name` and `surname` now looks like:

```py
from splink.duckdb.blocking_rule_library import block_on
block_on(["first_name", "surname"])
```

as opposed to

```py
import splink.duckdb.blocking_rule_library as brl
brl.and_(
    brl.exact_match_rule("first_name"),
    brl.exact_match_rule("surname"),
)
```
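
As a rough mental model, blocking on several columns amounts to requiring exact equality on each of them in the SQL join condition. The sketch below uses a hypothetical helper, `exact_match_condition`, which is not part of Splink; the SQL that Splink actually generates may differ in quoting and dialect details.

```py
def exact_match_condition(columns):
    """Build a SQL join condition requiring equality on every column.

    `l` and `r` stand for the left and right records of a candidate pair.
    """
    return " AND ".join(f'l."{col}" = r."{col}"' for col in columns)


print(exact_match_condition(["first_name", "surname"]))
```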

All of the [tutorials](../../demos/tutorials/03_Blocking.ipynb), [example notebooks](../../demos/examples/examples_index.md) and [API docs](../../blocking_rule_library.md) have been updated to use `block_on`.

## :electric_plug: Backend Specific Installs

Some users have had difficulties downloading Splink due to additional dependencies, some of which may not be relevant for the backend they are using. To solve this, you can now install a minimal version of Splink for your given SQL engine.

For example, to install Splink purely for Spark use the command:

```bsh
pip install 'splink[spark]'
```

See the [Getting Started page](../../getting_started.md#backend-specific-installs) for further guidance.

## :no_entry_sign: Drop support for python 3.7

From Splink 3.9.7, support for Python 3.7 has been dropped. This decision was made to manage dependency clashes in the back end of Splink.

If you are working with Python 3.7, please revert to Splink 3.9.6.

```bsh
pip install splink==3.9.6
```

## :soon: What's in the pipeline?

* :four: Work on **Splink 4** is currently underway
* :material-thumbs-up-down: More guidance on how to evaluate Splink models and linkages




Binary file added docs/blog/posts/img/charts_gallery.png
Binary file added docs/blog/posts/img/completeness_chart.png
Binary file added docs/blog/posts/img/confusion_matrix.png
2 changes: 1 addition & 1 deletion docs/charts/profile_columns.ipynb
@@ -309,7 +309,7 @@
"\n",
"To take this skew into account, we can build Splink models with **Term Frequency Adjustments**. These adjustments will increase the amount of evidence for rare matching values and reduce the amount of evidence for common matching values.\n",
"\n",
"To understand how these work in more detail, check out the [Term Frequency Adjustments Topic Guide](../comparisons/term-frequency.md)\n",
"To understand how these work in more detail, check out the [Term Frequency Adjustments Topic Guide](../topic_guides/comparisons/term-frequency.md)\n",
"\n",
"<hr>"
]
2 changes: 1 addition & 1 deletion docs/comparison_helpers.md
@@ -5,7 +5,7 @@ tags:
---
# Documentation for `comparison_helpers` functions

The `comparison_helpers` functions are a set of functions to help users create better comparisons by helping them understand [string comparators](./topic_guides/choosing_comparators.ipynb#comparing-string-similarity-and-distance-scores) (fuzzy matching) and [phonetic matching](./topic_guides/choosing_comparators.ipynb#phonetic-matching).
The `comparison_helpers` functions are a set of functions to help users create better comparisons by helping them understand [string comparators](./topic_guides/choosing_comparators.ipynb#comparing-string-similarity-and-distance-scores) (fuzzy matching) and [phonetic matching](./topic_guides/comparisons/choosing_comparators.ipynb#phonetic-matching).

The detailed API for each of these are outlined below.

4 changes: 2 additions & 2 deletions docs/comparison_level_library.md
@@ -16,8 +16,8 @@ toc_depth: 2
# Documentation for `comparison_level_library`

The `comparison_level_library` contains pre-made comparison levels available for use to
construct custom comparisons [as described in this topic guide](./topic_guides/customising_comparisons.html#method-3-comparisonlevels).
However, not every comparison level is available for every [Splink-compatible SQL backend](./topic_guides/backends.html).
construct custom comparisons [as described in this topic guide](./topic_guides/comparisons/customising_comparisons.html#method-3-comparisonlevels).
However, not every comparison level is available for every [Splink-compatible SQL backend](./topic_guides/splink_fundamentals/backends.html).

The pre-made Splink comparison levels available for each SQL dialect are as given in this table:

4 changes: 2 additions & 2 deletions docs/comparison_library.md
@@ -12,8 +12,8 @@ toc_depth: 2
---
# Documentation for `comparison_library`

The `comparison_library` contains pre-made comparisons available for use directly [as described in this topic guide](./topic_guides/customising_comparisons.html#method-1-using-the-comparisonlibrary).
However, not every comparison is available for every [Splink-compatible SQL backend](./topic_guides/backends.html).
The `comparison_library` contains pre-made comparisons available for use directly [as described in this topic guide](./topic_guides/comparisons/customising_comparisons.html#method-1-using-the-comparisonlibrary).
However, not every comparison is available for every [Splink-compatible SQL backend](./topic_guides/splink_fundamentals/backends/backends.html).

The pre-made Splink comparisons available for each SQL dialect are as given in this table:
