Example project for measuring ANN (vs. exact kNN) recall at local developer-notebook scale.
In this example we use:
- ~6.7m documents of English Wikipedia (wikipedia/20230601.en from TensorFlow Datasets)
- 633 questions from WikiQA
- all-MiniLM-L6-v2 as the embedding model
- OpenSearch 2.17.1 with kNN plugin and the Lucene vector backend
Virtual environment and dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Start OpenSearch with Dashboards
docker compose up
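Before running the later steps it can help to confirm the cluster is reachable. The snippet below is a sketch under assumptions: the host, admin credentials, and TLS settings depend on how the compose file configures security and may need adjusting.

```python
# Quick connectivity check; credentials and certificate handling here are
# assumptions about the docker compose setup, not values from this repo.
from opensearchpy import OpenSearch

client = OpenSearch("https://localhost:9200", http_auth=("admin", "admin"), verify_certs=False)
print(client.info()["version"]["number"])  # expect 2.17.1
```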
To reproduce, run the following steps:
- Create text embeddings of the first 1000 characters of all 6.7m English Wikipedia articles. This takes about 10h on an M1 Max MacBook but only needs to be computed once.
python corpus_embed.py
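As a rough sketch of what this step does (the actual corpus_embed.py may batch, checkpoint, and store results differently; the batch size and output file name below are assumptions):

```python
# Sketch of the embedding step: embed the first 1000 characters of each article
# with all-MiniLM-L6-v2. Batch size and output path are assumptions.
import numpy as np
import tensorflow_datasets as tfds
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
ds = tfds.load("wikipedia/20230601.en", split="train")

batch, embeddings = [], []
for example in tfds.as_numpy(ds):
    batch.append(example["text"].decode("utf-8")[:1000])
    if len(batch) == 256:
        embeddings.append(model.encode(batch))
        batch = []
if batch:
    embeddings.append(model.encode(batch))

np.save("wikipedia_embeddings.npy", np.concatenate(embeddings))
```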
- Index the text embeddings to OpenSearch. We also index the title/text into text fields for possible later experiments. Indexing the whole dataset takes about 2h on the M1 Max. The index will take around 80GB of disk space.
python corpus_indexing.py
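The index mapping is roughly along these lines. This is a sketch under assumptions: the index name, field names, credentials, and bulk settings are illustrative, and the real corpus_indexing.py also streams the titles/texts alongside the vectors.

```python
# Sketch of the indexing step: a single-shard index with a Lucene HNSW
# knn_vector field plus text fields. Names and credentials are assumptions.
import numpy as np
from opensearchpy import OpenSearch, helpers

client = OpenSearch("https://localhost:9200", http_auth=("admin", "admin"), verify_certs=False)

client.indices.create("wikipedia-en", body={
    "settings": {"index": {"knn": True, "number_of_shards": 1, "number_of_replicas": 0}},
    "mappings": {"properties": {
        "title": {"type": "text"},
        "text": {"type": "text"},
        "embedding": {
            "type": "knn_vector",
            "dimension": 384,  # all-MiniLM-L6-v2 output dimension
            "method": {"name": "hnsw", "engine": "lucene", "space_type": "cosinesimil"},
        },
    }},
})

embeddings = np.load("wikipedia_embeddings.npy")
helpers.bulk(client, ({"_index": "wikipedia-en", "_id": i, "embedding": v.tolist()}
                      for i, v in enumerate(embeddings)), chunk_size=500)
```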
- Calculate the ANN vs. exact kNN metrics. We take the WikiQA questions and compute the exact nearest neighbours against the embeddings from step 1. We then run the questions against the OpenSearch index and compute the recall of the returned approximate nearest neighbours.
python calculate_ann_metrics.py
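The computation boils down to comparing brute-force neighbours with what the index returns. A minimal sketch follows; how calculate_ann_metrics.py loads the questions, names the index and field, and aggregates recall may differ.

```python
# Sketch of recall@k for a single question; index/field names are assumptions.
import numpy as np
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenSearch("https://localhost:9200", http_auth=("admin", "admin"), verify_certs=False)

corpus = np.load("wikipedia_embeddings.npy")  # embeddings from corpus_embed.py
corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)

def recall_at_k(question: str, k: int) -> float:
    query = model.encode(question)
    query = query / np.linalg.norm(query)

    # Exact nearest neighbours: brute-force cosine similarity over all embeddings.
    exact_ids = set(np.argsort(corpus @ query)[::-1][:k])

    # Approximate nearest neighbours from the OpenSearch Lucene HNSW index.
    body = {"size": k, "query": {"knn": {"embedding": {"vector": query.tolist(), "k": k}}}}
    hits = client.search(index="wikipedia-en", body=body)["hits"]["hits"]
    ann_ids = {int(hit["_id"]) for hit in hits}

    return len(exact_ids & ann_ids) / k

print(recall_at_k("who invented the telephone", k=10))
```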
- This example only evaluates one embedding model. Adjust corpus_embed.py to create embeddings for multiple models.
- This example only indexes into a single shard with no replicas. To try ANN against a multi-shard index, adjust corpus_indexing.py and create an index with multiple shards/replicas (see the sketch after this list). You might need to increase the available OpenSearch heap size to accommodate the additional overhead of having more than one shard.
- This example only iterates through the query-side kNN parameter k. To also try different server-side parameter values for efConstruction and m, adjust corpus_indexing.py.
- This example only uses the Lucene kNN backend for OpenSearch. To try Faiss or nmslib, again expand the indexing code.
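For reference, an index body covering the multi-shard, HNSW-parameter, and backend variants above might look like this. The shard count and parameter values are illustrative only, not values taken from this repo or recommendations.

```python
# Sketch of an alternative index body: multiple shards, a replica, and explicit
# HNSW build parameters. All values are illustrative.
index_body = {
    "settings": {"index": {"knn": True, "number_of_shards": 4, "number_of_replicas": 1}},
    "mappings": {"properties": {
        "embedding": {
            "type": "knn_vector",
            "dimension": 384,
            "method": {
                "name": "hnsw",
                "engine": "lucene",  # "faiss" or "nmslib" are alternative engines
                                     # (supported space types differ per engine)
                "space_type": "cosinesimil",
                "parameters": {"ef_construction": 128, "m": 24},
            },
        },
    }},
}
```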