Finding similar items with "Locality-Sensitive Hashing". Inspired by Chapter 3: "Finding Similar Items" of "Mining Massive Datasets" book by Jure Leskovec, Anand Rajaraman and Jeff Ullman.
See examples for use cases.
Steps in code:
#Shingle
- tokenize#Minhash
- signature matrix#LSH
- candidate pairs#Jaccard
- for jaccard similarity of candidate pairs
in CLI: ./lsh lsh -s <comma_separated_URLs>
. For example:
./lsh lsh -s https://stackoverflow.com,https://stackoverflow.com
shingling 2 sources:
[0]: https://stackoverflow.com - more stack exchange
[1]: https://stackoverflow.com - more stack exchange
hashing 2 sets
found 1 candidate pair(s)
[0_1]
This means that sources 0 and 1 ended up as a candidate pair, therefore they are suggested for similarity test.
Then ./lsh sim -s <two_comma_separated_URLs>
. For example:
./lsh sim -s https://stackoverflow.com,https://stackoverflow.com
shingling 2 sources:
[0]: https://stackoverflow.com - more stack exchange
[1]: https://stackoverflow.com - more stack exchange
similarity: 1.0000
- Jaccard
TODO.
See CHANGELOG.md
Released under the Apache License 2.0.