perf(license): tf-idf based matching #99
Really nice :)
My main concern is making our custom TF-IDF class lighter. sklearn separates count vectorizing from fitting/transforming with tf-idf: could we refactor the code to do the same?
Since we are using a custom solution, it would also be OK to keep it as is, given that we plan to move to the sklearn implementation in the future. A related question: would the refactoring make that transition easier?
Context
Up to now, license matching was done using the scancode-toolkit package, which has the following drawbacks:
Proposal
This PR replaces the rule-based scancode matcher with a probabilistic matcher based on Term Frequency-Inverse Document Frequency (TF-IDF). Given an input license, this implementation works as follows:
This implies:
Visual representation of TFIDF
The process of computing TF-IDF vectors is illustrated below, with a corpus of 2 documents containing a single sentence each.
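As a rough textual counterpart to the illustration, the sketch below computes TF-IDF vectors for a two-sentence corpus in plain Python. The sentences, the function names (`tokenize`, `fit_idf`, `transform`) and the smoothed IDF formula (the one scikit-learn uses by default) are assumptions made for this example; the weighting used by the vectorizer in this PR may differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; the real vectorizer may tokenize differently.
    return text.lower().split()


def fit_idf(corpus):
    # Document frequency: number of documents in which each term appears.
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def transform(doc, idf):
    # Term frequency times IDF, then L2 normalization.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Hypothetical two-document corpus, one sentence each.
corpus = [
    "the license grants permission",
    "the software is provided as is",
]
idf = fit_idf(corpus)
vectors = [transform(doc, idf) for doc in corpus]

for doc, vec in zip(corpus, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

Terms shared by both documents (here "the") receive the lowest IDF and therefore a smaller weight than terms specific to one document, which is what makes the vectors useful for telling documents apart.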
Changes
This PR implements 3 elements:
- A tf-idf vectorizer (gimie.utils.text)
- A script to generate the tf-idf data (scripts/generate_tfidf.py)
- An update of LicenseParser to use this tf-idf vectorizer (in gimie.parsers.license)
It also:
- Updates dependencies (-scancode, +pydantic, +scipy)
- Updates the supported python versions from 3.8-3.11 to 3.9-3.12
- Embeds the pre-computed tf-idf vectors and fitted vectorizer in gimie/parsers/license/data
Alternative solution
Implementing and testing a tf-idf vectorizer might be considered outside the scope of gimie.
The branch refactor/sklearn-tfidf drops the custom TfidfVectorizer and instead imports the scikit-learn implementation and uses skops to securely serialize / parse it (instead of pickle, which has security issues).
Both implementations yield the same results, but the serialized TfidfVectorizer from scikit-learn is much larger and slower to deserialize:
Accuracy
Below are metrics computed on a sample of 2443 repositories from the paperswithcode links-between-papers-and-code dataset (source dataset link). The numbers are not exact for the following reasons:
Full results table: tfidf_predictions_pwc.csv
When comparing the matched licenses against the GitHub API results (excluding repositories where GitHub failed to identify the license), we get 97.2% accuracy.
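To make the exclusion rule concrete, here is a minimal sketch of how such an accuracy figure can be computed. The function name `accuracy_vs_reference` and the toy labels are hypothetical; they are not taken from the evaluation code or the results of this PR.

```python
def accuracy_vs_reference(predicted, reference):
    # Keep only repositories where the reference (e.g. GitHub) found a license.
    pairs = [(p, r) for p, r in zip(predicted, reference) if r is not None]
    if not pairs:
        raise ValueError("no repository has a reference license")
    # Fraction of the remaining repositories where both labels agree.
    return sum(p == r for p, r in pairs) / len(pairs)


# Toy data: None marks a repository where the reference found no license.
predicted = ["MIT", "Apache-2.0", "GPL-3.0", "MIT"]
reference = ["MIT", "Apache-2.0", None, "BSD-3-Clause"]

acc = accuracy_vs_reference(predicted, reference)
print(f"{acc:.1%}")
```

Repositories without a reference label are dropped from both the numerator and the denominator, so they neither help nor hurt the score.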
detailed results
Confusion matrix on the most common licenses:
And the repositories for which the license was confidently assigned differently than GitHub (most have 2 or more licenses):
Questions