perf(license): tf-idf based matching #99

cmdoret · 2023-11-13T13:35:56Z

Context

Up to now, license matching was done using the scancode-toolkit package, which has the following drawbacks:

It depends on compiled packages that are not available on arm64 (e.g. Apple Silicon MacBooks)
It is slow (2.7 seconds to match an Apache-2.0 license)
It has many dependencies

Proposal

This PR replaces the rule-based scancode matcher with a probabilistic matcher based on Term Frequency-Inverse Document Frequency (TF-IDF). Given an input license, this implementation works as follows:

Tokenize input license
Compute tf-idf vector
Compute cosine similarity against pre-computed tf-idf vectors of SPDX licenses
Pick the license with the highest similarity if it is above a (conservative) similarity threshold
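
The four steps above can be sketched in plain Python. This is a minimal illustration, not the code in gimie.utils.text: the function names, the dict-based sparse vectors and the toy IDF weights are all assumptions made for the example. Only the 0.9 threshold comes from this PR (commit ddf47b6).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens, idf):
    """Weight raw term counts by IDF; terms outside the vocabulary are dropped."""
    counts = Counter(tokens)
    return {term: n * idf[term] for term, n in counts.items() if term in idf}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_license(text, idf, reference_vectors, min_similarity=0.9):
    """Return (license_id, similarity); license_id is None below the threshold."""
    query = tfidf_vector(tokenize(text), idf)
    best_id, best_sim = None, 0.0
    for license_id, vector in reference_vectors.items():
        sim = cosine_similarity(query, vector)
        if sim > best_sim:
            best_id, best_sim = license_id, sim
    if best_sim < min_similarity:
        return None, best_sim
    return best_id, best_sim


# Toy stand-ins (hypothetical values) for the pre-computed data shipped with the package.
idf = {"the": 1.0, "license": 1.0, "mit": 2.0, "gpl3": 2.0}
reference_vectors = {
    "MIT": tfidf_vector(tokenize("The MIT license"), idf),
    "GPL-3.0": tfidf_vector(tokenize("The GPL3 license"), idf),
}
```

With these toy values, match_license("The MIT License", idf, reference_vectors) picks "MIT", while an input such as "The license" scores about 0.58 against both references and is rejected by the threshold.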

This implies:

We need to ship a matrix of pre-computed tf-idf vectors and a fitted tf-idf vectorizer with the package, making it a bit heavier
We need a script to re-compute these vectors

Visual representation of TFIDF

The process of computing TF-IDF vectors is illustrated below, with a corpus of 2 documents containing a single sentence each.

graph TD
  subgraph Corpus
    D1[The GPL3 license]
    D2[The MIT license]
    C1["the, gpl3, license"]
    C2["the, mit, license"]
  end



  subgraph "Term-Frequency Matrix"
    F1["the: 1, gpl3: 1, license: 1"]
    F2["the: 1, mit: 1, license: 1"]
    TF["`TF (n_docs x n_terms)`"]
  end

  subgraph "Inverse Document Frequency Vector"
    IDF["IDF (1 x n_terms)"]
  end

  subgraph "TF-IDF matrix"
    TFIDF[TF-IDF]
  end

  D1 -->|tokenization| C1
  D2 -->|tokenization| C2
  C1 -->|counts| F1
  C2 -->|counts| F2
  F1 -->|build matrix| TF
  F2 -->|build matrix| TF
  TF -->|1 / proportion of documents containing term| IDF
  TF -->|multiply| TFIDF
  IDF -->|multiply| TFIDF
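
The diagram can be reproduced numerically on the same two-document corpus. The sketch below follows the diagram literally (IDF as 1 / proportion of documents containing the term); it is an assumption that this matches the shipped vectorizer, and practical implementations such as scikit-learn's apply a logarithm and smoothing to the IDF instead.

```python
from collections import Counter

corpus = ["The GPL3 license", "The MIT license"]

# Tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in corpus]

# Vocabulary: the sorted unique terms define the matrix columns.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Term-frequency matrix (n_docs x n_terms): raw counts per document.
tf = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

# IDF vector (1 x n_terms): 1 / proportion of documents containing the term.
n_docs = len(corpus)
idf = [
    n_docs / sum(1 for row in tf if row[j] > 0)
    for j in range(len(vocabulary))
]

# TF-IDF matrix: scale each column of TF by the matching IDF weight.
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]

print(vocabulary)  # ['gpl3', 'license', 'mit', 'the']
print(tfidf)       # [[2.0, 1.0, 0.0, 1.0], [0.0, 1.0, 2.0, 1.0]]
```

The terms shared by both documents ("the", "license") get weight 1, while the discriminating terms ("gpl3", "mit") get weight 2, which is what lets the matcher tell the two licenses apart.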

Changes

This PR implements 3 elements:

A tf-idf vectorizer that can be serialized / parsed to json (in gimie.utils.text)
A script to download SPDX licenses and regenerate the pre-computed files (in scripts/generate_tfidf.py)
An adapted LicenseParser that uses this tf-idf vectorizer (in gimie.parsers.license)

It also:

Updates dependencies (-scancode, +pydantic, +scipy)
Updates the supported Python versions from 3.8-3.11 to 3.9-3.12
Embeds the pre-computed tf-idf vectors and fitted vectorizer in gimie/parsers/license/data
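
To illustrate why a JSON-serializable vectorizer stays small and quick to load, here is a rough sketch of a JSON round-trip for the fitted state. The class name FittedVectorizer and its two fields are hypothetical; the actual class in this PR may store different attributes and may rely on pydantic rather than dataclasses.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FittedVectorizer:
    """Hypothetical container for the state a fitted tf-idf vectorizer needs."""

    vocabulary: dict  # term -> column index
    idf: list  # one IDF weight per column

    def to_json(self) -> str:
        """Serialize the fitted state to a JSON string."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "FittedVectorizer":
        """Rebuild a vectorizer from a JSON string produced by to_json."""
        return cls(**json.loads(payload))


fitted = FittedVectorizer(
    vocabulary={"gpl3": 0, "license": 1, "mit": 2, "the": 3},
    idf=[2.0, 1.0, 2.0, 1.0],
)
restored = FittedVectorizer.from_json(fitted.to_json())
```

Because the payload contains only plain data (no executable objects), loading it does not carry the arbitrary-code-execution risk associated with pickle.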

Alternative solution

Implementing and testing a tf-idf vectorizer might be considered outside the scope of gimie.
The branch refactor/sklearn-tfidf drops the custom TfidfVectorizer and instead imports the scikit-learn implementation and uses skops to securely serialize / parse it (instead of pickle, which has security issues).

Both implementations yield the same results, but the serialized TfidfVectorizer from scikit-learn is much larger and slower to deserialize:

method	file size	deserialization time
custom-tfidf	24 kB	0.43 ms
sklearn+skops	7.8 MB	223 ms
sklearn+skops+zip-level9	564 kB	232 ms

Accuracy

Below are metrics computed on a sample of 2443 repositories from the paperswithcode links-between-papers-and-code dataset. The numbers are not exact for the following reasons:

Some repositories have multiple licenses; only one arbitrarily chosen license file was considered here.
When a license file contains multiple concatenated licenses, the GitHub API sometimes predicts only the first one (instead of returning NOASSERTION).

full results table: tfidf_predictions_pwc.csv

When comparing the tf-idf matches against the GitHub API results (excluding those where GitHub failed to identify the license), we get 97.2% accuracy.
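
The accuracy figure can be computed along these lines. This is a sketch under assumptions: the column names are taken from the results table in this PR, treating NOASSERTION or an empty value as "GitHub failed to identify the license" is a guess at the exclusion rule, and the rows are made-up examples rather than data from tfidf_predictions_pwc.csv.

```python
def accuracy_vs_github(rows):
    """Share of rows where the tf-idf prediction equals GitHub's label.

    Rows where GitHub did not identify a license are excluded.
    """
    compared = [
        r for r in rows if r["license_github"] not in (None, "", "NOASSERTION")
    ]
    if not compared:
        return float("nan")
    hits = sum(r["license_github"] == r["tfidf_pred"] for r in compared)
    return hits / len(compared)


# Made-up rows for illustration only.
rows = [
    {"license_github": "MIT", "tfidf_pred": "MIT"},
    {"license_github": "GPL-3.0", "tfidf_pred": "AGPL-3.0"},
    {"license_github": "NOASSERTION", "tfidf_pred": "Apache-2.0"},
    {"license_github": "Apache-2.0", "tfidf_pred": "Apache-2.0"},
]
```

Here the NOASSERTION row is dropped, so the accuracy is 2 matches out of 3 compared rows.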

detailed results

Confusion matrix on the most common licenses:

Repositories for which the tf-idf matcher confidently assigned a different license than GitHub (most have 2 or more licenses):

url	license_github	tfidf_pred	tfidf_cosine_similarity
https://github.com/HAWinther/MG-PICOLA-PUBLIC	GPL-2.0	GPL-3.0	0.9799652
https://github.com/zhongliliu/elastool	GPL-3.0	GPL-2.0	0.9751945
https://github.com/jerichooconnell/fastCAT	GPL-3.0	AGPL-3.0	0.9806217
https://github.com/SWIFTSIM/swiftsimio	LGPL-3.0	GPL-3.0	0.9799652
https://github.com/wenjiedu/brewpots	GPL-3.0	BSD-3-Clause	0.9227585
https://github.com/nilesh2797/zestxml	BSD-3-Clause	BSD-2-Clause	0.9029458
https://github.com/bgris/odl	MPL-2.0	OSET-PL-2.1	0.9216744
https://github.com/marco-oliva/afm	MIT	GPL-3.0	0.9799652
https://github.com/jsl03/apricot	GPL-3.0	AGPL-3.0	0.9805915
https://github.com/jakobrunge/tigramite	GPL-3.0	AGPL-3.0	0.9806152

Questions

Can we tolerate a small margin of error when attributing licenses? We can adjust the threshold if needed.
Do we prefer the faster and lighter custom implementation, or reducing the amount of code by using scikit-learn?
- Tradeoff: 0.23s and 5kb vs 173 lines of code (+150 lines of comments)

vancauwe

Really nice :)
My main concern is to make our custom TF-IDF class lighter. scikit-learn separates count vectorization from the tf-idf fit/transform step: could we refactor the code to do this too?
Since we are using a custom solution, it would also be OK to keep it as is, given that we plan to move to the sklearn implementation in the future. The question is also whether the refactoring would make that transition easier.

cmdoret added 17 commits November 11, 2023 01:12

refactor(utils): create gimie.utils.uri submodule

5c1c070

chore: add numpy + scipy to deps

e3a510c

feat: add tfidf vectorizer

4b62682

test: unit tests for tfidf vectorizer

98c9901

refactor(tfidf): more intuitive func names

c8bc3e5

fix(tfidf): correct ngrams tokenization for n>1, adjust doctests

ab14331

ci: change python versions 3.8-3.10 -> 3.9-3.12

a807c51

chore: add scipy to deps

b5c3bd3

refactor(license): use tfidf in LicenseParser

2849bee

chore: rm scancode from deps

bec8e9e

feat: add pre-computed license tf-idf

486cf11

feat: script to regen. tf-idf for all spdx licenses

bd979de

test(license): update docstrings for tfidf

2288932

test(tfidf): rm test corpus from module, adapt doctest

6bee671

refactor(license): only include osi-approved licenses in tfidf matrix

5a3472b

refactor(license): set min similarity to 0.9

ddf47b6

perf(tfidf): prune vectors to float16 to reduce memory footprint

907f6d9

cmdoret self-assigned this Nov 13, 2023

cmdoret added the enhancement New feature or request label Nov 13, 2023

cmdoret linked an issue Nov 13, 2023 that may be closed by this pull request

Implement license matcher #89

Closed

6 tasks

chore(license): black fmt

35ad2d9

cmdoret marked this pull request as ready for review November 13, 2023 15:29

cmdoret added 2 commits November 13, 2023 17:11

doc(license): mention tfidf in parser docstring

a2b6fae

chore: rename test_tfidf.py

2cb8601

cmdoret requested a review from vancauwe November 16, 2023 09:57

docs(tfidf): link to sklearn documentation

0b58c5c

vancauwe requested changes Nov 22, 2023

cmdoret added 3 commits November 23, 2023 10:09

refactor(tfidf): reorder methods

edf3683

refactor: rename utils text module

2a09c88

fix: gimie.utils.text_processing import paths

2100c90

cmdoret requested a review from vancauwe November 23, 2023 10:21

cmdoret merged commit 77a17f5 into main Nov 24, 2023
8 checks passed

cmdoret deleted the perf/license-tfidf branch December 14, 2023 08:48
