Example project for measuring ANN (vs. exact kNN) recall at local developer-notebook scale.
In this example we use:
- ~6.7m documents of English Wikipedia (wikipedia/20230601.en from TensorFlow Datasets)
- 633 questions from WikiQA
- all-MiniLM-L6-v2 as the embedding model
- OpenSearch 2.17.1 with kNN plugin and the Lucene vector backend
Virtual environment and dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Start OpenSearch with Dashboards
docker compose up
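Before running the later steps it can help to confirm the cluster is reachable. The snippet below is a sketch under assumptions: the host, admin credentials, and TLS settings depend on how the compose file configures security and may need adjusting.

```python
# Quick connectivity check; credentials and certificate handling here are
# assumptions about the docker compose setup, not values from this repo.
from opensearchpy import OpenSearch

client = OpenSearch("https://localhost:9200", http_auth=("admin", "admin"), verify_certs=False)
print(client.info()["version"]["number"])  # expect 2.17.1
```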
To reproduce, run the following steps:
- Create text embeddings of the first 1000 characters of all 6.7m English Wikipedia articles. This takes about 10h on an M1 Max MacBook but only needs to be computed once.
python corpus_embed.py
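As a rough sketch of what this step does (the actual corpus_embed.py may batch, checkpoint, and store results differently; the batch size and output file name below are assumptions):

```python
# Sketch of the embedding step: embed the first 1000 characters of each article
# with all-MiniLM-L6-v2. Batch size and output path are assumptions.
import numpy as np
import tensorflow_datasets as tfds
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
ds = tfds.load("wikipedia/20230601.en", split="train")

batch, embeddings = [], []
for example in tfds.as_numpy(ds):
    batch.append(example["text"].decode("utf-8")[:1000])
    if len(batch) == 256:
        embeddings.append(model.encode(batch))
        batch = []
if batch:
    embeddings.append(model.encode(batch))

np.save("wikipedia_embeddings.npy", np.concatenate(embeddings))
```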
- Index the text embeddings to OpenSearch. We also index the title/text into text fields for possible later experiments. Indexing the whole dataset takes about 2h on the M1 Max. The index will take around 80GB of disk space.
python corpus_indexing.py
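The index mapping is roughly along these lines. This is a sketch under assumptions: the index name, field names, credentials, and bulk settings are illustrative, and the real corpus_indexing.py also streams the titles/texts alongside the vectors.

```python
# Sketch of the indexing step: a single-shard index with a Lucene HNSW
# knn_vector field plus text fields. Names and credentials are assumptions.
import numpy as np
from opensearchpy import OpenSearch, helpers

client = OpenSearch("https://localhost:9200", http_auth=("admin", "admin"), verify_certs=False)

client.indices.create("wikipedia-en", body={
    "settings": {"index": {"knn": True, "number_of_shards": 1, "number_of_replicas": 0}},
    "mappings": {"properties": {
        "title": {"type": "text"},
        "text": {"type": "text"},
        "embedding": {
            "type": "knn_vector",
            "dimension": 384,  # all-MiniLM-L6-v2 output dimension
            "method": {"name": "hnsw", "engine": "lucene", "space_type": "cosinesimil"},
        },
    }},
})

embeddings = np.load("wikipedia_embeddings.npy")
helpers.bulk(client, ({"_index": "wikipedia-en", "_id": i, "embedding": v.tolist()}
                      for i, v in enumerate(embeddings)), chunk_size=500)
```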
- Calculate the ANN vs. exact kNN metrics. We take the WikiQA questions and compute the exact nearest neighbours against the embeddings from step 1. We then run the questions against the OpenSearch index and compute the recall of the returned approximate nearest neighbours.
python calculate_ann_metrics.py
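The computation boils down to comparing brute-force neighbours with what the index returns. A minimal sketch follows; how calculate_ann_metrics.py loads the questions, names the index and field, and aggregates recall may differ.

```python
# Sketch of recall@k for a single question; index/field names are assumptions.
import numpy as np
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenSearch("https://localhost:9200", http_auth=("admin", "admin"), verify_certs=False)

corpus = np.load("wikipedia_embeddings.npy")  # embeddings from corpus_embed.py
corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)

def recall_at_k(question: str, k: int) -> float:
    query = model.encode(question)
    query = query / np.linalg.norm(query)

    # Exact nearest neighbours: brute-force cosine similarity over all embeddings.
    exact_ids = set(np.argsort(corpus @ query)[::-1][:k])

    # Approximate nearest neighbours from the OpenSearch Lucene HNSW index.
    body = {"size": k, "query": {"knn": {"embedding": {"vector": query.tolist(), "k": k}}}}
    hits = client.search(index="wikipedia-en", body=body)["hits"]["hits"]
    ann_ids = {int(hit["_id"]) for hit in hits}

    return len(exact_ids & ann_ids) / k

print(recall_at_k("who invented the telephone", k=10))
```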
- This example only evaluates one embedding model. Adjust corpus_embed.py to create embeddings for multiple models.
- This example only indexes into a single shard with no replicas. To try ANN against a multi-shard index, adjust corpus_indexing.py and create an index with multiple shards/replicas (see the sketch after this list). You might need to increase the available OpenSearch heap size to accommodate the additional overhead of having more than one shard.
- This example only iterates through the query-side kNN parameter k. To also try different server-side parameter values for efConstruction and m, adjust corpus_indexing.py.
- This example only uses the Lucene kNN backend for OpenSearch. To try Faiss or nmslib, again expand the indexing code.
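For reference, an index body covering the multi-shard, HNSW-parameter, and backend variants above might look like this. The shard count and parameter values are illustrative only, not values taken from this repo or recommendations.

```python
# Sketch of an alternative index body: multiple shards, a replica, and explicit
# HNSW build parameters. All values are illustrative.
index_body = {
    "settings": {"index": {"knn": True, "number_of_shards": 4, "number_of_replicas": 1}},
    "mappings": {"properties": {
        "embedding": {
            "type": "knn_vector",
            "dimension": 384,
            "method": {
                "name": "hnsw",
                "engine": "lucene",  # "faiss" or "nmslib" are alternative engines
                                     # (supported space types differ per engine)
                "space_type": "cosinesimil",
                "parameters": {"ef_construction": 128, "m": 24},
            },
        },
    }},
}
```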