Big-Data-Matching

Introduction

Suppose we have a very large dataset and want to extract target information using queries that require exact matching or fuzzy matching.

Many excellent packages already exist to support this work. In this repository, I test the matching efficiency (running time) and the matching accuracy of a few chosen packages.

Project Structure

The project structure is as follows:
    ├─data: not uploaded
    ├─paper: some research & empirical papers
    ├─faiss: a package developed by Facebook AI Research
    │  └─tutorial: some code examples
    ├─string_grouper: a super-fast string matching package in Python
    │  └─tutorial: some code examples
    ├─splink: links datasets without a unique identifier (probabilistic model)
    │  └─tutorial: some code examples
    ├─further_check: identify the true matching pairs from fuzzy matching results
    │  └─tutorial: some code examples
    ...

Workflow

  1. First, run fuzzy matching to generate candidate matching pairs (see the string_grouper sketch after this list).
    • For fuzzy matching with multiple comparison levels (requirements), we prefer Splink (README-splink).
    • For fuzzy matching with a single comparison level, i.e., finding the best candidates for one kind of pair (company pair, full-name pair, etc.), we prefer string_grouper (README-string grouper).
  2. Then compute similarity scores to further check whether the candidate pairs are true matches (see the scoring/model sketch after this list).
    • We first compute several scores, including Levenshtein distance, n-gram distance, phonetic distance, etc. (Scores-code example). We also designed an algorithm that improves the predictive quality of the RF model by scoring name features: first/middle/last name, initials, name composition, and name "etymology" (Custom_scores-code example).
    • We then fit a random forest model to predict the matching likelihood.
    • As an aside, Claude is an interesting tool for judging whether candidate pairs are true matches (Claude-code example); it can assist the manual checking/labeling process.
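A minimal sketch of step 1 with string_grouper. The file paths and the column name are hypothetical; number_of_processes is the option varied in the benchmark tables further down.

    import pandas as pd
    from string_grouper import match_strings

    # Hypothetical inputs: two series of company names to be linked.
    left = pd.read_csv('data/left.csv')['company_raw']
    right = pd.read_csv('data/right.csv')['company_raw']

    # TF-IDF character n-grams + cosine similarity;
    # keep candidate pairs above the similarity threshold.
    candidates = match_strings(left, right,
                               min_similarity=0.8,
                               number_of_processes=4)
    print(candidates.head())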

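And a sketch of step 2, assuming a labeled candidate-pairs file. The score set and the libraries used here (rapidfuzz, jellyfish, scikit-learn) are illustrative stand-ins for the repo's own scoring code, not its exact implementation.

    import pandas as pd
    import jellyfish
    from rapidfuzz.distance import Levenshtein
    from sklearn.ensemble import RandomForestClassifier

    def pair_scores(a: str, b: str) -> dict:
        # Normalized edit similarity plus a phonetic-agreement flag.
        return {
            'lev': Levenshtein.normalized_similarity(a, b),
            'phonetic': int(jellyfish.metaphone(a) == jellyfish.metaphone(b)),
        }

    pairs = pd.read_csv('data/candidate_pairs.csv')  # hypothetical file
    X = pd.DataFrame([pair_scores(a, b)
                      for a, b in zip(pairs['left'], pairs['right'])])
    y = pairs['is_match']  # manual 0/1 labels

    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    pairs['match_prob'] = rf.predict_proba(X)[:, 1]  # matching likelihood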
Results

Testing the matching algorithms

FAISS

| Input scale (query × index) | SentenceTransformer: search time | TF-IDF: vectorization + training + search = total | GPU version |
|---|---|---|---|
| 10,000 × 10,000 | 0.4s | 1.4s + 0.9s + 1.2s = 3.5s | |
| 50,000 × 50,000 | 9.9s | 8.2s + 4.5s + 48.7s = 61.4s | |
| 100,000 × 100,000 | 40.6s | 17s + 6s + 241.7s = 264s | |
| 500,000 × 500,000 | 1043.5s | | |
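For reference, a minimal sketch of the TF-IDF + FAISS pipeline being timed above (toy data; the real runs use tens to hundreds of thousands of names):

    import faiss
    from sklearn.feature_extraction.text import TfidfVectorizer

    index_names = ['acme corp', 'globex inc', 'initech']  # toy stand-ins
    query_names = ['acme corporation', 'initech llc']

    # Character n-gram TF-IDF vectors, L2-normalized so that
    # inner product equals cosine similarity.
    vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
    xb = vec.fit_transform(index_names).toarray().astype('float32')
    xq = vec.transform(query_names).toarray().astype('float32')
    faiss.normalize_L2(xb)
    faiss.normalize_L2(xq)

    index = faiss.IndexFlatIP(xb.shape[1])  # exact inner-product index
    index.add(xb)
    scores, ids = index.search(xq, 1)       # best match per query
    for q, i, s in zip(query_names, ids[:, 0], scores[:, 0]):
        print(q, '->', index_names[i], round(float(s), 3))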

String Grouper

Compared with FAISS

| Input scale | Num_process = int(ncpu*3/4) | Num_process = 1 | FAISS speed (from above) |
|---|---|---|---|
| 10,000 × 10,000 | 2.4s | 2s | 3.5s |
| 50,000 × 50,000 | 10.6s | 21s | 61.4s |
| 100,000 × 100,000 | 21.1s | 65s | 264s |
| 500,000 × 500,000 | 127.5s | 1272s | |

Coding Tips

  1. I use Polars DataFrames to speed up I/O. If the Polars DataFrame contains list-typed columns (check data.dtypes), errors can occur when converting it to a pandas DataFrame.

    My fix is simply to cast the columns with list data type to string data type:

    import polars as pl

    # Render each list column as a "[a, b, c]" string so the conversion succeeds.
    for col in ['company_raw', 'location', 'naics_code', 'country', 'state', 'msa', 'company_ticker']:
        data = data.with_columns(
            pl.format("[{}]",
                      pl.col(col).cast(pl.List(pl.Utf8)).list.join(", ")).alias(col))

    pdf = data.to_pandas()  # now converts cleanly

Posts and News

  • Announcing ScaNN: Efficient Vector Similarity Search [Blog]
