perf(license): tf-idf based matching #99

Merged
merged 24 commits into from
Nov 24, 2023

cmdoret
Member

@cmdoret cmdoret commented Nov 13, 2023

Context

Up to now, license matching was done using the scancode-toolkit package, which has the following drawbacks:

  • It depends on compiled packages that are not available on arm64 (i.e. newer macbooks)
  • It is slow (2.7 seconds to match an Apache-2.0 license)
  • It has many dependencies

Proposal

This PR replaces the rule-based scancode matcher with a probabilistic matcher based on Term Frequency-Inverse Document Frequency (TF-IDF). Given an input license, this implementation works as follows:

  1. Tokenize input license
  2. Compute tf-idf vector
  3. Compute cosine similarity against pre-computed tf-idf vectors of SPDX licenses
  4. Pick the license with the highest similarity if it is above a (conservative) similarity threshold
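The four steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code from this PR: the function names, the regex tokenizer, the smoothed IDF formula and the `threshold=0.9` default are all hypothetical, and the two-entry corpus stands in for the full SPDX license list.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional


def tokenize(text: str) -> List[str]:
    """Step 1: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(corpus: Dict[str, str]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(t for text in corpus.values() for t in set(tokenize(text)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Step 2: sparse tf-idf vector; terms unknown to the corpus are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Step 3: cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def match_license(
    text: str,
    reference_vectors: Dict[str, Dict[str, float]],
    idf: Dict[str, float],
    threshold: float = 0.9,  # hypothetical value, not the one used in the PR
) -> Optional[str]:
    """Step 4: return the best-matching license id, or None below the threshold."""
    query = tfidf(text, idf)
    best_id, best_sim = None, 0.0
    for license_id, vector in reference_vectors.items():
        sim = cosine(query, vector)
        if sim > best_sim:
            best_id, best_sim = license_id, sim
    return best_id if best_sim >= threshold else None


# Toy corpus standing in for the pre-computed SPDX vectors.
corpus = {
    "MIT": "Permission is hereby granted, free of charge, to any person obtaining a copy",
    "GPL-3.0": "This program is free software: you can redistribute it and/or modify it",
}
idf = fit_idf(corpus)
vectors = {license_id: tfidf(text, idf) for license_id, text in corpus.items()}
```

In this sketch, `match_license(corpus["MIT"], vectors, idf)` returns `"MIT"`, while a text sharing no term with the corpus yields `None`, which is what a conservative threshold is meant to guarantee.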

This implies:

  • We need to ship a matrix of pre-computed tf-idf vectors and a fitted tf-idf vectorizer with the package, making it a bit heavier
  • We need a script to re-compute these vectors

Visual representation of TF-IDF

The process of computing TF-IDF vectors is illustrated below, with a corpus of 2 documents containing a single sentence each.

graph TD
  subgraph Corpus
    D1[The GPL3 license]
    D2[The MIT license]
    C1["the, gpl3, license"]
    C2["the, mit, license"]
  end



  subgraph "Term-Frequency Matrix"
    F1["the: 1, gpl3: 1, license: 1"]
    F2["the: 1, mit: 1, license: 1"]
    TF["`TF (n_docs x n_terms)`"]
  end

  subgraph "Inverse Document Frequency Vector"
    IDF["IDF (1 x n_terms)"]
  end

  subgraph "TF-IDF matrix"
    TFIDF[TF-IDF]
  end

  D1 -->|tokenization| C1
  D2 -->|tokenization| C2
  C1 -->|counts| F1
  C2 -->|counts| F2
  F1 -->|build matrix| TF
  F2 -->|build matrix| TF
  TF -->|1 / proportion of documents containing term| IDF
  TF -->|multiply| TFIDF
  IDF -->|multiply| TFIDF
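
The same two-document corpus can be worked through numerically. The sketch below assumes whitespace tokenization and the classic `log(n_docs / doc_freq)` form of IDF (the diagram shows the bare ratio; the logarithm is an assumption here), so it may differ in detail from the vectorizer in this PR.

```python
import math

docs = ["The GPL3 license", "The MIT license"]

# Tokenization: lowercase and split on whitespace.
tokenized = [d.lower().split() for d in docs]

# Vocabulary: one column per distinct term, in sorted order.
vocab = sorted({t for doc in tokenized for t in doc})

# Term-frequency matrix (n_docs x n_terms).
tf = [[doc.count(term) for term in vocab] for doc in tokenized]

# Inverse document frequency vector (1 x n_terms).
n_docs = len(docs)
doc_freq = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]
idf = [math.log(n_docs / df) for df in doc_freq]

# TF-IDF matrix: each count scaled by the IDF of its term.
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]
```

With this formula, terms present in both documents ("the", "license") get a weight of zero, so only "gpl3" and "mit" distinguish the two documents.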

Changes

This PR implements 3 elements:

  • A tf-idf vectorizer that can be serialized to and parsed from JSON (in gimie.utils.text)
  • A script to download SPDX licenses and regenerate the pre-computed files (in scripts/generate_tfidf.py)
  • An adapted LicenseParser that uses this tf-idf vectorizer (in gimie.parsers.license)
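
To illustrate what JSON round-tripping of a fitted vectorizer can look like, here is a minimal sketch. The class name `TinyVectorizerState` and its two fields are hypothetical, and it uses standard-library dataclasses to stay self-contained, whereas this PR adds pydantic as a dependency.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class TinyVectorizerState:
    """Hypothetical fitted state: term -> column index, plus one IDF weight per column."""

    vocabulary: Dict[str, int] = field(default_factory=dict)
    idf: List[float] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the fitted state to a JSON string."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "TinyVectorizerState":
        """Rebuild the fitted state from a JSON string."""
        return cls(**json.loads(payload))


state = TinyVectorizerState(vocabulary={"license": 0, "mit": 1}, idf=[1.0, 1.4])
restored = TinyVectorizerState.from_json(state.to_json())
```

Because the state holds only plain dictionaries and lists, loading it does not execute arbitrary code, unlike unpickling.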

It also:

  • Updates dependencies (-scancode, +pydantic, +scipy)
  • Updates the supported Python versions from 3.8-3.11 to 3.9-3.12
  • Embeds the pre-computed tf-idf vectors and fitted vectorizer in gimie/parsers/license/data

Alternative solution

Implementing and testing a tf-idf vectorizer might be considered outside the scope of gimie.
The branch refactor/sklearn-tfidf drops the custom TfidfVectorizer and instead imports the scikit-learn implementation and uses skops to securely serialize / parse it (instead of pickle, which has security issues).

Both implementations yield the same results, but the serialized TfidfVectorizer from scikit-learn is much larger and slower to deserialize:

| method | file size | deserialization time |
|---|---|---|
| custom-tfidf | 24 kB | 0.43 ms |
| sklearn+skops | 7.8 MB | 223 ms |
| sklearn+skops+zip-level9 | 564 kB | 232 ms |
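
For reference, deserialization times of this kind can be measured with a small harness like the one below. It is a sketch under assumptions: the helper name `time_deserialization` is hypothetical, it only covers a JSON payload, and it is not the benchmark used to produce the numbers above.

```python
import json
import timeit


def time_deserialization(payload: str, repeats: int = 100) -> float:
    """Average time, in milliseconds, to parse `payload` with json.loads."""
    total_seconds = timeit.timeit(lambda: json.loads(payload), number=repeats)
    return total_seconds / repeats * 1000


# Stand-in payload; a real run would read the serialized vectorizer file instead.
payload = json.dumps({"vocabulary": {"license": 0, "mit": 1}, "idf": [1.0] * 1000})
elapsed_ms = time_deserialization(payload)
```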

Accuracy

Below are metrics computed on a sample of 2443 repositories from the paperswithcode links-between-papers-and-code source dataset. The numbers are approximate for the following reasons:

  • Some repositories have multiple licenses; only one arbitrary license file was considered here.
  • When a license file contains multiple concatenated licenses, the GitHub API sometimes predicts only the first one (instead of returning NOASSERTION).

full results table: tfidf_predictions_pwc.csv

When comparing the matched licenses against the GitHub API results (excluding those where GitHub failed to identify the license), we get 97.2% accuracy.
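
The accuracy computation can be sketched as follows. The function name is hypothetical, and the sketch assumes that a failed GitHub identification shows up as `None` or `"NOASSERTION"`; the toy pairs are illustrative and not taken from the results table.

```python
from typing import Iterable, Optional, Tuple


def accuracy_vs_github(pairs: Iterable[Tuple[Optional[str], Optional[str]]]) -> float:
    """Share of repositories where the tf-idf prediction equals GitHub's license.

    Each pair is (github_license, tfidf_prediction). Pairs for which GitHub
    did not identify a license are excluded from the denominator.
    """
    kept = [(gh, pred) for gh, pred in pairs if gh not in (None, "NOASSERTION")]
    if not kept:
        raise ValueError("no repository with a license identified by GitHub")
    return sum(gh == pred for gh, pred in kept) / len(kept)


toy_pairs = [
    ("MIT", "MIT"),
    ("GPL-3.0", "AGPL-3.0"),
    ("NOASSERTION", "MIT"),  # excluded: GitHub could not identify the license
    ("Apache-2.0", "Apache-2.0"),
]
toy_accuracy = accuracy_vs_github(toy_pairs)  # 2 matches out of 3 kept pairs
```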

detailed results

Confusion matrix on the most common licenses:
[Figure: confusion matrix on the most common licenses]

Below are the repositories for which the tf-idf matcher confidently assigned a different license than GitHub (most have 2 or more licenses):

| url | license_github | tfidf_pred | tfidf_cosine_similarity |
|---|---|---|---|
| https://github.com/HAWinther/MG-PICOLA-PUBLIC | GPL-2.0 | GPL-3.0 | 0.9799652 |
| https://github.com/zhongliliu/elastool | GPL-3.0 | GPL-2.0 | 0.9751945 |
| https://github.com/jerichooconnell/fastCAT | GPL-3.0 | AGPL-3.0 | 0.9806217 |
| https://github.com/SWIFTSIM/swiftsimio | LGPL-3.0 | GPL-3.0 | 0.9799652 |
| https://github.com/wenjiedu/brewpots | GPL-3.0 | BSD-3-Clause | 0.9227585 |
| https://github.com/nilesh2797/zestxml | BSD-3-Clause | BSD-2-Clause | 0.9029458 |
| https://github.com/bgris/odl | MPL-2.0 | OSET-PL-2.1 | 0.9216744 |
| https://github.com/marco-oliva/afm | MIT | GPL-3.0 | 0.9799652 |
| https://github.com/jsl03/apricot | GPL-3.0 | AGPL-3.0 | 0.9805915 |
| https://github.com/jakobrunge/tigramite | GPL-3.0 | AGPL-3.0 | 0.9806152 |

Questions

  • Can we tolerate a small margin of error when attributing licenses? We can adjust the threshold if needed.
  • Do we prefer the faster and lighter custom implementation, or reducing the amount of code by using scikit-learn?
    • Tradeoff: 0.23s and 5kb vs 173 lines of code (+150 lines of comments)

@cmdoret cmdoret self-assigned this Nov 13, 2023
@cmdoret cmdoret added the enhancement New feature or request label Nov 13, 2023
@cmdoret cmdoret linked an issue Nov 13, 2023 that may be closed by this pull request
6 tasks
@cmdoret cmdoret marked this pull request as ready for review November 13, 2023 15:29
@cmdoret cmdoret requested a review from vancauwe November 16, 2023 09:57
Contributor

@vancauwe vancauwe left a comment

Really nice :)
My main concern is making our custom TF-IDF class lighter. sklearn separates count vectorizing from fitting/transforming with tf-idf: could we refactor the code to do this too?
Since this is a custom solution and we plan to move to the sklearn implementation in the future, it would also be OK to keep it as is. Another question: would the refactoring make that transition easier?
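
A minimal sketch of the split suggested in this comment, with counting and tf-idf weighting in two separate classes. The class names `TokenCounter` and `TfidfWeighter` are hypothetical and are not the classes in gimie.utils.text; the IDF formula is the smoothed variant that scikit-learn uses by default.

```python
import math
from collections import Counter
from typing import Dict, List


class TokenCounter:
    """Learns a vocabulary and turns documents into rows of raw term counts."""

    def fit(self, docs: List[str]) -> "TokenCounter":
        terms = sorted({t for d in docs for t in d.lower().split()})
        self.vocabulary: Dict[str, int] = {term: i for i, term in enumerate(terms)}
        return self

    def transform(self, docs: List[str]) -> List[List[int]]:
        rows = []
        for d in docs:
            counts = Counter(d.lower().split())
            # Dict order matches the column indices assigned in fit().
            rows.append([counts.get(term, 0) for term in self.vocabulary])
        return rows


class TfidfWeighter:
    """Learns IDF weights from a count matrix and rescales counts with them."""

    def fit(self, counts: List[List[int]]) -> "TfidfWeighter":
        n_docs = len(counts)
        n_terms = len(counts[0]) if counts else 0
        doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
        self.idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
        return self

    def transform(self, counts: List[List[int]]) -> List[List[float]]:
        return [[c * w for c, w in zip(row, self.idf)] for row in counts]


docs = ["The GPL3 license", "The MIT license"]
counter = TokenCounter().fit(docs)
counts = counter.transform(docs)
weights = TfidfWeighter().fit(counts).transform(counts)
```

Keeping the two stages apart mirrors the `CountVectorizer` / `TfidfTransformer` split in scikit-learn, which may make a later switch to that library more mechanical.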

@cmdoret cmdoret requested a review from vancauwe November 23, 2023 10:21
@cmdoret cmdoret merged commit 77a17f5 into main Nov 24, 2023
8 checks passed
@cmdoret cmdoret deleted the perf/license-tfidf branch December 14, 2023 08:48
Successfully merging this pull request may close these issues.

Implement license matcher