Hybrid query processing aims to identify objects whose vectors are similar to that of the query object while satisfying the given attribute constraints [SIGMOD'20, VLDB'20, SIGMOD'21, KDD'21]; this pressing need is fueled by a massive number of emerging applications, such as visual product search [VLDB'20, SIGMOD'21, Middleware'18], academic expert finding, movie recommendation, etc. Our paper, "Navigable Proximity Graph-Driven Native Hybrid Queries with Structured and Unstructured Constraints", provides a Native Hybrid Query (NHQ) solution that outperforms the state-of-the-art competitors (10x faster at the same recall) on all experimental datasets.
This repo contains the code, dataset, optimal parameters, and other detailed information used for the experiments of our paper.
The compared methods fall into two categories: one verifies the effectiveness of the proposed Navigable Proximity Graphs (NPGs), i.e., NPG_nsw and NPG_kgraph; the other evaluates the proposed NPG-based hybrid query methods built on the NHQ framework.
## PGs
- HNSW (TPAMI'2018) is a hierarchical PG, which is widely used in various fields and has spawned many optimized versions based on hardware or machine learning.
- NSW (IS'2014) is the precursor of HNSW, and is a single-layer PG constructed by incrementally inserting data.
- KGraph (WWW'2011) is an approximate K-nearest neighbor graph quickly built via NN-Descent. It only considers the distance factor when selecting edges.
- DPG (TKDE'2019) maximizes the angle between neighbors on the basis of KGraph to alleviate redundant calculation.
- NSG (VLDB'2019) ensures the monotonicity of search paths by approximating the monotonic search network, thereby avoiding detours.
- NSSG (TPAMI'2021) is similar to DPG; it adjusts the angles between neighbors to adapt to different data characteristics, achieving an optimal trade-off between detours and shortcuts.
- NPG_nsw and NPG_kgraph are two NPGs proposed by our paper.
## Hybrid query methods
- ADBV (VLDB'2020) is a cost-based hybrid query method proposed by Alibaba. It implements PQ and linear scan for vector similarity search, thus forming four sub-plans, and the least-cost one is selected to answer a given hybrid query.
- Milvus (SIGMOD'2021) adopts a partition-based approach with respect to attributes; it divides the object set by frequently used attributes and deploys ADBV on each subset.
- Vearch (Middleware'2018) is a hybrid query system developed by Jingdong. It first recalls similar candidates under the unstructured constraint on HNSW, and then performs attribute filtering on those candidates to yield the final results.
- NGT (SISAP'2016) is a library for high-speed ANNS released by Yahoo Japan. We implement hybrid query processing that conducts attribute filtering atop the candidates recalled by NGT.
- Faiss (IVFPQ, TPAMI'2011) is a popular quantization-based vector similarity search library developed by Facebook. We implement its hybrid query processing based on IVFPQ and strategy A (Fig. 2 in our paper).
- SPTAG (ACM MM'2012, CVPR'2012, TPAMI'2014) is a PG-based vector similarity search library released by Microsoft, which answers a hybrid query with strategy B (Fig. 2 in our paper).
- NHQ-NPG_nsw and NHQ-NPG_kgraph are our hybrid query methods based on the NHQ framework and the two NPGs.
Our experiments involve eight publicly available real-world datasets and one in-house dataset. The eight public datasets consist of high-dimensional feature vectors extracted from unstructured information and do not originally contain attributes; they are used for the performance evaluation of PGs. Because no publicly available dataset thus far contains both structured and unstructured information [VLDB'20], we generate a corresponding set of attributes for each object in the public datasets following [SIGMOD'21]. For example, we add attributes such as date, location, and size to each image in SIFT1M to form an object with a feature vector and a set of attributes. In the in-house dataset, each object consists of a high-dimensional vector extracted from paper content as well as three structured attributes, i.e., affiliation, topic, and publication. The following table summarizes their main information.
Dataset | object_num | feature-vector_dim | query_num | type | download (vector) | download (attributes)
---|---|---|---|---|---|---
SIFT1M | 1000000 | 128 | 10000 | Image+Attribute | sift.tar.gz(161MB) | sift_attribute.tar.gz |
GIST1M | 1000000 | 960 | 1000 | Image+Attribute | gist.tar.gz(2.6GB) | gist_attribute.tar.gz |
GloVe | 1183514 | 100 | 10000 | Text+Attribute | glove-100.tar.gz(424MB) | glove-100_attribute.tar.gz |
Crawl | 1989995 | 300 | 10000 | Text+Attribute | crawl.tar.gz(1.7GB) | crawl_attribute.tar.gz |
Audio | 53387 | 192 | 200 | Audio+Attribute | audio.tar.gz(26MB) | audio_attribute.tar.gz |
Msong | 992272 | 420 | 200 | Audio+Attribute | msong.tar.gz(1.4GB) | msong_attribute.tar.gz |
Enron | 94987 | 1369 | 200 | Text+Attribute | enron.tar.gz(51MB) | enron_attribute.tar.gz |
UQ-V | 1000000 | 256 | 10000 | Video+Attribute | uqv.tar.gz(800MB) | uqv_attribute.tar.gz |
Paper | 2029997 | 200 | 10000 | Text+Attribute | paper.tar.gz(1.41GB) | paper_attribute.tar.gz |
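Since the public datasets carry no native attributes, attribute sets like those above have to be generated. Below is a minimal sketch of such generation, assuming simple random assignment from fixed attribute domains; the domains (`date`, `location`, `size`) and the sampling scheme are illustrative assumptions, not the exact procedure of [SIGMOD'21]:

```python
import random

# Hypothetical attribute domains (illustrative only; the actual
# generation in the paper follows [SIGMOD'21] and may differ).
ATTRIBUTE_DOMAINS = {
    "date": [f"20{y:02d}" for y in range(10, 22)],
    "location": ["US", "UK", "CN", "JP", "DE"],
    "size": ["small", "medium", "large"],
}

def generate_attributes(num_objects, seed=0):
    """Assign one random value per attribute to each object,
    producing a structured attribute set alongside each vector."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in ATTRIBUTE_DOMAINS.items()}
        for _ in range(num_objects)
    ]
```

Pairing each generated attribute set with the object's feature vector yields a hybrid object of the form used throughout the experiments.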
Note that all original objects and query objects are converted to `fvecs` format, and the groundtruth data is converted to `ivecs` format. Please refer here for a description of the `fvecs` and `ivecs` formats.
Because tuning parameters on the entire object set may cause overfitting, we randomly sample a certain percentage of data points from the original object set to form a validation set. We search for the optimal values of all adjustable parameters of each algorithm on each validation set, so that each algorithm's search performance reaches its optimal level. See the parameters page for more information.
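The validation-set construction can be sketched in a few lines; the 1% default sampling rate here is an illustrative assumption, not the percentage actually used:

```python
import random

def sample_validation_set(objects, fraction=0.01, seed=0):
    """Randomly sample a fraction of the object set, without
    replacement, to serve as a validation set for parameter tuning."""
    rng = random.Random(seed)
    k = max(1, int(len(objects) * fraction))
    return rng.sample(objects, k)
```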
## Prerequisites

- GCC 4.9+ with OpenMP
- CMake 2.8+
- Boost 1.55+
## Installation

You can run the `build.sh` script to install all algorithms, including NPG_kgraph, NPG_nsw, NHQ-NPG_kgraph, and NHQ-NPG_nsw.
## Usage

After performing the installation, you can test each algorithm via the script `test_hybrid_query.py` or `test_npg.py`; the specific parameter values can be found in parameters.
### Validation of the NHQ framework

To verify the capabilities of NHQ, we additionally implement hybrid queries based on the "first vector similarity search, then attribute filtering" strategy on NPG_kgraph and NPG_nsw. For hybrid queries on NHQ, you can test them by setting the following option in the script `test_hybrid_query.py`:

```
<index>: 'NHQ-NPG_kgraph' or 'NHQ-NPG_nsw' # build and search
```

For "first vector search, then attribute filtering", you can test it by setting the following option in `test_hybrid_query.py`:

```
<index>: 'NPG_kgraph' or 'NPG_nsw' # build and search
```
### NPGs' Performance

To verify the effectiveness of the proposed edge selection and routing strategies, you can evaluate the NPG algorithms (i.e., NPG_kgraph and NPG_nsw) by setting the following option in `test_npg.py`:

```
<index>: 'NPG_kgraph' or 'NPG_nsw' # build and search
```
### Evaluation of Hybrid Query Methods

You can test them by setting the following option in the script `test_hybrid_query.py`:

```
<index>: 'NHQ-NPG_kgraph' or 'NHQ-NPG_nsw' # build and search
```
### Parameter Sensitivity

For NHQ, the distance fusion can be adjusted in `/NHQ-NPG_kgraph/src/index_graph.cpp`:
```cpp
// Fuse the vector distance (dist) and the attribute mismatch count (cnt)
// into a single fusion distance; alternative weighting schemes are kept
// as comments.
void IndexGraph::fusion_distance(float &dist, float &cnt)
{
    // dist = cnt * dist * 2 / (cnt + dist);       // w_x=cnt/(dist + cnt), w_y=dist/(dist + cnt)
    // cnt *= 100;                                 // w_x=cnt*/(dist + cnt*), w_y=dist/(dist + cnt*)
    // if (cnt == 0) cnt = 1;
    // dist = cnt * dist * 2 / (cnt + dist);
    // dist = dist / 521675 + float(cnt) / 3.0;    // w_x=1/dist_max, w_y=1/cnt_max
    dist += dist * cnt / (float)attribute_number_; // w_x=1, w_y=dist/cnt_max
    // dist += 10000 * cnt;                        // w_x/w_y=c, where c is a constant
}
```
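The active scheme (w_x = 1, w_y = dist / cnt_max) can be mirrored in a few lines to see its effect; `attribute_number` stands in for the member `attribute_number_`, and the mismatch count `cnt` is computed naively over the attribute tuples:

```python
def mismatch_count(query_attrs, object_attrs):
    """Number of positions where the object's attributes differ
    from the query's structured constraints."""
    return sum(1 for q, o in zip(query_attrs, object_attrs) if q != o)

def fusion_distance(dist, cnt, attribute_number):
    """Mirror of the active C++ scheme:
    dist += dist * cnt / attribute_number.
    With cnt == 0 the fused distance equals the raw vector distance;
    each mismatched attribute inflates it proportionally."""
    return dist + dist * cnt / float(attribute_number)
```

Because a fully matching object keeps its raw vector distance while any mismatch inflates it, objects satisfying all attribute constraints are preferred among candidates at comparable vector distances.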
The implementation of NPG_kgraph draws on kgraph, ssg, and nns_benchmark, and the implementation of NPG_nsw draws on n2. Many thanks to them for the inspiration, and thanks to everyone who provided references for this project.