An Elasticsearch plugin for exact and approximate K-nearest-neighbors search in high-dimensional vector spaces.
Item | Status |
---|---|
Github CI Build | |
Github Release Build | |
Plugin Release | |
Plugin Snapshot | |
Python Client, Release | |
Scala 2.12 Client, Release | |
Scala 2.12 Client, Snapshot | |
This project is very much a work-in-progress. I've decided to go ahead and make the repo public since some people have expressed interest through emails and LinkedIn messages.
The "Features" section below is a good picture of my high-level goals for this project. As of late January, I've completed an ingest processor, custom queries, a custom mapper to store `ElastiKnnVector`s, exact similarity queries for all five similarity functions, MinHash LSH for Jaccard similarity, and set up a fairly robust testing harness. The testing harness took a surprising amount of work.
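To give a sense of how MinHash LSH approximates Jaccard similarity, here's a minimal, illustrative Python sketch of the general technique (not the plugin's actual implementation; all function names and parameters here are made up for illustration):

```python
import random

def minhash_signature(elements, num_hashes, seed=0):
    """Compute a MinHash signature for a set of integer element ids.

    Each hash function approximates a random permutation via
    h(x) = (a*x + b) mod p. Sets with high Jaccard similarity agree
    in a correspondingly high fraction of signature positions.
    """
    rng = random.Random(seed)
    p = 4294967311  # a prime larger than a 32-bit universe
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    return [min((a * x + b) % p for x in elements) for a, b in params]

def lsh_bands(signature, bands, rows):
    """Split a signature into bands; each band acts as a hash bucket key.

    Two sets become candidate neighbors if any band matches exactly.
    """
    assert bands * rows == len(signature)
    return [tuple(signature[i * rows:(i + 1) * rows]) for i in range(bands)]

a = {1, 2, 3, 4, 5, 6, 7, 8}
b = {1, 2, 3, 4, 5, 6, 7, 9}  # true Jaccard similarity with a is 7/9
sig_a = minhash_signature(a, num_hashes=128)
sig_b = minhash_signature(b, num_hashes=128)
est = sum(x == y for x, y in zip(sig_a, sig_b)) / 128
print(f"estimated Jaccard similarity: {est:.2f}")  # typically close to 7/9

candidate = any(x == y for x, y in zip(lsh_bands(sig_a, 32, 4), lsh_bands(sig_b, 32, 4)))
print(f"candidate pair: {candidate}")
```

The banding step is what makes the search sublinear: instead of comparing a query against every stored vector, only vectors sharing at least one band bucket are scored. See MMDS chapter 3 (referenced below) for the underlying theory.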
If you want to contribute, you'll have to dig around quite a bit for now. The Makefile is a good place to start. Generally speaking, there are several Gradle projects:

- the plugin (currently on ES 7.4x);
- a "core" project containing protobuf definitions, models, and some utilities;
- a Scala client based on the Elastic4s library;
- a reference project which compares elastiknn implementations to others (e.g. Spark).

There's also a Python client, which provides both a client roughly equivalent to the Scala one and a scikit-learn-style client.
Feel free to open issues and PRs, but I don't plan to spend much time documenting or coordinating contributions until the code churn decreases. I'll do my best to keep the readme updated, and I've made a Github project board to track ongoing work.
- Exact nearest neighbors search. This should only be used for testing and on relatively small datasets.
- Approximate nearest neighbors search using Locality Sensitive Hashing (LSH) and Multiprobe LSH. This scales well for large datasets; see the Performance section below for details.
- Supports dense floating point vectors and sparse boolean vectors.
- Supports five distance functions: L1, L2, Angular, Hamming, and Jaccard.
- Supports the two most common nearest neighbors queries:
- k nearest neighbors - i.e. "give me the k nearest neighbors to some query vector"
- fixed-radius nearest neighbors - i.e. "give me all neighbors within some radius of a query vector"
- Integrates nearest neighbor queries with existing Elasticsearch queries.
- Horizontal scalability. Vectors are stored as regular Elasticsearch documents and queries are implemented using standard Elasticsearch constructs.
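For reference, the five distance functions listed above correspond to these standard definitions. This is an illustrative Python sketch, not the plugin's code; dense vectors are lists of floats, and sparse boolean vectors are represented here as sets of true indices:

```python
import math

def l1(u, v):
    """Manhattan distance between dense vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def l2(u, v):
    """Euclidean distance between dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Angular similarity is typically derived from cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def hamming(u, v):
    """For boolean vectors as sets of true indices, the Hamming distance
    is the number of differing positions, i.e. the symmetric difference."""
    return len(u.symmetric_difference(v))

def jaccard(u, v):
    """Intersection over union of two boolean vectors as sets."""
    return len(u & v) / len(u | v)

print(l2([0.0, 0.0], [3.0, 4.0]))        # 5.0
print(jaccard({1, 2, 3}, {2, 3, 4}))     # 0.5
```

Exact search simply evaluates one of these functions against every indexed vector, which is why it's only practical for small datasets; the LSH variants trade exactness for sublinear candidate selection.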
TODO
TODO
TODO
TODO
TODO
Currently working on this in a fork of the Ann-Benchmarks repo here. I plan to submit a PR once all of the approximate similarities are implemented and the Docker image can be built with a released elastiknn zip file.
TODO
Planning to implement this using one of the various word vector datasets.
TODO
Not super sure of the feasibility of this yet. There are some notes in benchmarks/billion.
There are three main artifacts produced by this project:
- The actual plugin, which is a zip file published to Github releases.
- The python client library, which gets published to PyPi.
- The scala client library, which gets published to Sonatype.
All three artifacts are built and published as "snapshots" on every PR commit and every push/merge to master. All three artifacts are released on every push/merge to master in which the version file has changed. We detect a change in the version file by checking if a release tag exists with the same name as the version. All of this is handled by Github Workflows, with all steps defined in the yaml files in .github/workflows.
In no particular order:
- Alex Reelsen has several open-source plugins which served as useful examples for the general structure of a plugin project.
- Mining of Massive Datasets (MMDS) by Leskovec et al., particularly chapter 3, is a great reference for approximate similarity search.
- The Read Only Rest Plugin served as an example for much of the Gradle and testing setup.
- The Scalable Data Science Lectures on Youtube were helpful for better understanding LSH. I think much of that content is also based on the MMDS book.