Hybrid query processing aims to identify objects whose vectors are similar to that of the query object while satisfying the given attribute constraints [SIGMOD'20, VLDB'20, SIGMOD'21, KDD'21]; this pressing need is fueled by a massive number of emerging applications, such as visual product search [VLDB'20, SIGMOD'21, Middleware'18], academic expert finding, movie recommendation, etc. Our paper, "Navigable Proximity Graph-Driven Native Hybrid Queries with Structured and Unstructured Constraints", provides a Native Hybrid Query (NHQ) solution that outperforms the state-of-the-art competitors (10x faster at the same recall) on all experimental datasets.
This repo contains the code, dataset, optimal parameters, and other detailed information used for the experiments of our paper.
The compared methods fall into two categories: one verifies the effectiveness of the proposed Navigable Proximity Graphs (NPGs), i.e., NPG_nsw and NPG_kgraph; the other evaluates the proposed NPG-based hybrid query methods built on the NHQ framework.
## PGs
- HNSW (TPAMI'2018) is a hierarchical PG, which is widely used in various fields and has spawned many optimized versions based on hardware or machine learning.
- NSW (IS'2014) is the precursor of HNSW, and is a single-layer PG constructed by incrementally inserting data.
- KGraph (WWW'2011) is an approximate K-nearest neighbor graph quickly built via NN-Descent. It only considers the distance factor when selecting edges.
- DPG (TKDE'2019) maximizes the angle between neighbors on the basis of KGraph to alleviate redundant calculation.
- NSG (VLDB'2019) ensures the monotonicity of search paths by approximating the monotonic search network, thereby avoiding detours.
- NSSG (TPAMI'2021) is similar to DPG; it adjusts the angles between neighbors to adapt to different data characteristics, achieving an optimal trade-off between detours and shortcuts.
- NPG_nsw and NPG_kgraph are two NPGs proposed by our paper.
## Hybrid query methods
- ADBV (VLDB'2020) is a cost-based hybrid query method proposed by Alibaba. It implements PQ and linear scan for vector similarity search, thus forming four sub-plans, and the least-cost one is selected to answer a given hybrid query.
- Milvus (SIGMOD'2021) adopts a partition-based approach with respect to attributes; it divides the object set by frequently used attributes and deploys ADBV on each subset.
- Vearch (Middleware'2018) is a hybrid query system developed by Jingdong. It first recalls similar candidates under the unstructured constraint on HNSW, and then performs attribute filtering on those candidates to yield the final results.
- NGT (SISAP'2016) is a library for high-speed ANNS released by Yahoo Japan. We implement hybrid query processing that conducts attribute filtering atop the candidates recalled by NGT.
- Faiss (IVFPQ, TPAMI'2011) is a popular quantization-based vector similarity search library developed by Facebook. We implement its hybrid query processing based on IVFPQ and strategy A (Fig. 2 in our paper).
- SPTAG (ACM MM'2012, CVPR'2012, TPAMI'2014) is a PG-based vector similarity search library released by Microsoft, which answers a hybrid query with strategy B (Fig. 2 in our paper).
- NHQ-NPG_nsw and NHQ-NPG_kgraph are our hybrid query methods based on the NHQ framework and the two NPGs.
Our experiments involve eight publicly available real-world datasets and one in-house dataset. The eight public datasets consist of high-dimensional feature vectors extracted from unstructured information and do not originally contain attributes; they are used for the performance evaluation of PGs. Because no publicly available dataset thus far contains both structured and unstructured information [VLDB'20], we generate a corresponding set of attributes for each object in the public datasets following [SIGMOD'21]. For example, we add attributes such as date, location, and size to each image in SIFT1M to form an object with a feature vector and a set of attributes. In the in-house dataset, each object consists of a high-dimensional vector extracted from paper content as well as three structured attributes, i.e., affiliation, topic, and publication. The following table summarizes their main information.
Dataset | object_num | feature-vector_dim | query_num | type | download (vector) | download (attributes)
---|---|---|---|---|---|---
SIFT1M | 1000000 | 128 | 10000 | Image+Attribute | sift.tar.gz(161MB) | sift_attribute.tar.gz |
GIST1M | 1000000 | 960 | 1000 | Image+Attribute | gist.tar.gz(2.6GB) | gist_attribute.tar.gz |
GloVe | 1183514 | 100 | 10000 | Text+Attribute | glove-100.tar.gz(424MB) | glove-100_attribute.tar.gz |
Crawl | 1989995 | 300 | 10000 | Text+Attribute | crawl.tar.gz(1.7GB) | crawl_attribute.tar.gz |
Audio | 53387 | 192 | 200 | Audio+Attribute | audio.tar.gz(26MB) | audio_attribute.tar.gz |
Msong | 992272 | 420 | 200 | Audio+Attribute | msong.tar.gz(1.4GB) | msong_attribute.tar.gz |
Enron | 94987 | 1369 | 200 | Text+Attribute | enron.tar.gz(51MB) | enron_attribute.tar.gz |
UQ-V | 1000000 | 256 | 10000 | Video+Attribute | uqv.tar.gz(800MB) | uqv_attribute.tar.gz |
Paper | 2029997 | 200 | 10000 | Text+Attribute | paper.tar.gz(1.41GB) | paper_attribute.tar.gz |
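Since the public datasets carry no native attributes, attribute sets like those above have to be generated. Below is a minimal sketch of such generation, assuming simple random assignment from fixed attribute domains; the domains (`date`, `location`, `size`) and the sampling scheme are illustrative assumptions, not the exact procedure of [SIGMOD'21]:

```python
import random

# Hypothetical attribute domains (illustrative only; the actual
# generation in the paper follows [SIGMOD'21] and may differ).
ATTRIBUTE_DOMAINS = {
    "date": [f"20{y:02d}" for y in range(10, 22)],
    "location": ["US", "UK", "CN", "JP", "DE"],
    "size": ["small", "medium", "large"],
}

def generate_attributes(num_objects, seed=0):
    """Assign one random value per attribute to each object,
    producing a structured attribute set alongside each vector."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in ATTRIBUTE_DOMAINS.items()}
        for _ in range(num_objects)
    ]
```

Pairing each generated attribute set with the object's feature vector yields a hybrid object of the form used throughout the experiments.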
Note that all original objects and query objects are converted to `fvecs` format, and the groundtruth data is converted to `ivecs` format. Please refer here for a description of the `fvecs` and `ivecs` formats.
Because tuning parameters on the entire object set may cause overfitting, we randomly sample a certain percentage of data points from the original object set to form a validation set. We search for the optimal values of all adjustable parameters of each algorithm on each validation set, so that each algorithm's search performance reaches its optimal level. See the parameters page for more information.
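The validation-set construction can be sketched in a few lines; the 1% default sampling rate here is an illustrative assumption, not the percentage actually used:

```python
import random

def sample_validation_set(objects, fraction=0.01, seed=0):
    """Randomly sample a fraction of the object set, without
    replacement, to serve as a validation set for parameter tuning."""
    rng = random.Random(seed)
    k = max(1, int(len(objects) * fraction))
    return rng.sample(objects, k)
```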
## Prerequisites

- GCC 4.9+ with OpenMP
- CMake 2.8+
- Boost 1.55+
## Installation

You can run the `build.sh` script to install all algorithms, including NPG_kgraph, NPG_nsw, NHQ-NPG_kgraph, and NHQ-NPG_nsw.
## Usage

After performing the installation, you can test each algorithm via the script `test_hybrid_query.py` or `test_npg.py`; the specific parameter values can be found in parameters.
### Validation of the NHQ framework

To verify the capabilities of NHQ, we additionally implement hybrid queries based on the "first vector similarity search, then attribute filtering" strategy on NPG_kgraph and NPG_nsw. For hybrid queries on NHQ, you can test them by setting the following option in the script `test_hybrid_query.py`:

```
<index>: 'NHQ-NPG_kgraph' or 'NHQ-NPG_nsw' # build and search
```

For "first vector search, then attribute filtering", you can test it by setting the following option in `test_hybrid_query.py`:

```
<index>: 'NPG_kgraph' or 'NPG_nsw' # build and search
```
### NPGs' Performance

To verify the effectiveness of the proposed edge selection and routing strategies, you can evaluate the NPG algorithms (i.e., NPG_kgraph and NPG_nsw) by setting the following option in `test_npg.py`:

```
<index>: 'NPG_kgraph' or 'NPG_nsw' # build and search
```
### Evaluation of Hybrid Query Methods

You can test them by setting the following option in the script `test_hybrid_query.py`:

```
<index>: 'NHQ-NPG_kgraph' or 'NHQ-NPG_nsw' # build and search
```
### Parameter Sensitivity

For NHQ, the distance fusion can be adjusted in `/NHQ-NPG_kgraph/src/index_graph.cpp`:
```cpp
// Fuse the vector distance (dist) and the attribute mismatch count (cnt)
// into a single fusion distance; alternative weighting schemes are kept
// as comments.
void IndexGraph::fusion_distance(float &dist, float &cnt)
{
    // dist = cnt * dist * 2 / (cnt + dist);       // w_x=cnt/(dist + cnt), w_y=dist/(dist + cnt)
    // cnt *= 100;                                 // w_x=cnt*/(dist + cnt*), w_y=dist/(dist + cnt*)
    // if (cnt == 0) cnt = 1;
    // dist = cnt * dist * 2 / (cnt + dist);
    // dist = dist / 521675 + float(cnt) / 3.0;    // w_x=1/dist_max, w_y=1/cnt_max
    dist += dist * cnt / (float)attribute_number_; // w_x=1, w_y=dist/cnt_max
    // dist += 10000 * cnt;                        // w_x/w_y=c, where c is a constant
}
```
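The active scheme (w_x = 1, w_y = dist / cnt_max) can be mirrored in a few lines to see its effect; `attribute_number` stands in for the member `attribute_number_`, and the mismatch count `cnt` is computed naively over the attribute tuples:

```python
def mismatch_count(query_attrs, object_attrs):
    """Number of positions where the object's attributes differ
    from the query's structured constraints."""
    return sum(1 for q, o in zip(query_attrs, object_attrs) if q != o)

def fusion_distance(dist, cnt, attribute_number):
    """Mirror of the active C++ scheme:
    dist += dist * cnt / attribute_number.
    With cnt == 0 the fused distance equals the raw vector distance;
    each mismatched attribute inflates it proportionally."""
    return dist + dist * cnt / float(attribute_number)
```

Because a fully matching object keeps its raw vector distance while any mismatch inflates it, objects satisfying all attribute constraints are preferred among candidates at comparable vector distances.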
The implementation of NPG_kgraph draws on kgraph, ssg, and nns_benchmark, and the implementation of NPG_nsw draws on n2. Many thanks to them for the inspiration, and thanks to everyone who provided references for this project.