Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.

alexklibisz/elastiknn

ElastiKnn

An Elasticsearch plugin for exact and approximate K-nearest-neighbors search in high-dimensional vector spaces.

Builds and Releases

Item | Status
Github CI Build | Github CI Status
Github Release Build | Github Release Status
Plugin Release | Plugin Release Status
Plugin Snapshot | Plugin Snapshot Status
Python Client, Release | Python Client Release Status
Scala 2.12 Client, Release | Scala Client Release Status
Scala 2.12 Client, Snapshot | Scala Client Snapshot Status

Work in Progress

This project is very much a work-in-progress. I've decided to go ahead and make the repo public since some people have expressed interest through emails and LinkedIn messages.

The "Features" section below is a good picture of my high-level goals for this project. As of late January, I've completed an ingest processor, custom queries, a custom mapper to store ElastiKnnVectors, exact similarity queries for all five similarity functions, and MinHash LSH for Jaccard similarity. I've also set up a fairly robust testing harness, which took a surprising amount of work.
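For readers unfamiliar with MinHash, here is a minimal sketch of the general technique in Python. It is not the plugin's implementation. The function names, the hash family, and the representation of a sparse boolean vector as the set of its true indices are all assumptions made for illustration.

```python
import random

# Mersenne prime 2^31 - 1; indices are assumed to be smaller than this.
PRIME = 2_147_483_647


def minhash_signature(true_indices, num_hashes=64, seed=0):
    """Hypothetical helper: MinHash signature of a non-empty set of true indices.

    Each hash function is h(x) = (a * x + b) mod PRIME with random a, b.
    The signature keeps the minimum hash value under each function.
    """
    rng = random.Random(seed)  # same seed -> same hash functions for every vector
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_hashes)]
    return [min((a * x + b) % PRIME for x in true_indices) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a = {1, 5, 9, 12, 40}
    b = {1, 5, 9, 13, 41}
    print("exact:   ", exact_jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

Two vectors agree at a given signature position with probability roughly equal to their Jaccard similarity, so more hash functions give a tighter estimate at the cost of a longer signature.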

If you want to contribute, you'll have to dig around quite a bit for now. The Makefile is a good place to start. Generally speaking, there are several Gradle projects: the plugin (currently on ES 7.4x); a "core" project containing protobuf definitions, models, and some utilities; a Scala client based on the Elastic4s library; and a reference project which compares elastiknn implementations to others (e.g. Spark). There's also a Python library which provides a client roughly equivalent to the Scala one, plus a scikit-learn-style client.

Feel free to open issues and PRs, but I don't plan to spend much time documenting or coordinating contributions until the code churn decreases. I'll do my best to keep the readme updated, and I've made a GitHub project board to track ongoing work.

Features

  1. Exact nearest neighbors search. This should only be used for testing and on relatively small datasets.
  2. Approximate nearest neighbors search using Locality Sensitive Hashing (LSH) and Multiprobe LSH. This scales well for large datasets; see the Performance section below for details.
  3. Supports dense floating point vectors and sparse boolean vectors.
  4. Supports five distance functions: L1, L2, Angular, Hamming, and Jaccard.
  5. Supports the two most common nearest neighbors queries:
    • k nearest neighbors - i.e. "give me the k nearest neighbors to some query vector"
    • fixed-radius nearest neighbors - i.e. "give me all neighbors within some radius of a query vector"
  6. Integrates nearest neighbor queries with existing Elasticsearch queries.
  7. Horizontal scalability. Vectors are stored as regular Elasticsearch documents and queries are implemented using standard Elasticsearch constructs.
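To make the five distance functions and the two query types concrete, here is a brute-force sketch in plain Python. It illustrates the general definitions only and is not the plugin's code or API. All function names are hypothetical, "angular" is assumed here to mean cosine distance (the plugin's exact definition may differ), and sparse boolean vectors are assumed to be represented as sets of true indices for Jaccard.

```python
import math


def l1(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a, b):
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def angular(a, b):
    """Assumed here to be cosine distance: 1 - cosine similarity (non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def hamming(a, b):
    """Number of positions at which two equal-length boolean sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard(a, b):
    """Jaccard distance between two non-empty sets of true indices."""
    return 1.0 - len(a & b) / len(a | b)


def k_nearest(query, vectors, k, distance):
    """Indices of the k stored vectors closest to the query (exact, O(n))."""
    order = sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))
    return order[:k]


def within_radius(query, vectors, radius, distance):
    """Indices of all stored vectors within the given radius of the query."""
    return [i for i, v in enumerate(vectors) if distance(query, v) <= radius]


if __name__ == "__main__":
    stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    print(k_nearest([0.0, 0.0], stored, 2, l2))
    print(within_radius([0.0, 0.0], stored, 2.0, l2))
```

Both queries scan every stored vector, which is why exact search is only practical on relatively small datasets.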

Usage

Install ElastiKnn on an Elasticsearch cluster

TODO

Run a Docker container with ElastiKnn already installed

TODO

Exact search using the Elasticsearch REST API

TODO

Python Client

TODO

Scala Client

TODO

Performance

Ann-Benchmarks

I'm currently working on this in a fork of the Ann-Benchmarks repo here. I plan to submit a PR once all of the approximate similarities are implemented and the Docker image can be built from a released elastiknn zip file.

Million-Scale

TODO

Planning to implement this using one of the various word vector datasets.

Billion-Scale

TODO

Not super sure of the feasibility of this yet. There are some notes in benchmarks/billion.

Development

Builds and Releases

There are three main artifacts produced by this project:

  1. The actual plugin, which is a zip file published to GitHub releases.
  2. The Python client library, which gets published to PyPI.
  3. The Scala client library, which gets published to Sonatype.

All three artifacts are built and published as "snapshots" on every PR commit and every push/merge to master. All three artifacts are released on every push/merge to master in which the version file has changed. We detect a change in the version file by checking if a release tag exists with the same name as the version.

All of this is handled by GitHub Workflows, with all steps defined in the YAML files in .github/workflows.
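The release check described above can be sketched in Python as follows. This is an illustration of the idea, not the contents of the actual workflow files. The function names and the file name "version" are hypothetical, and the real workflows may query tags differently than the `git tag --list` call assumed here.

```python
import subprocess


def should_release(version, existing_tags):
    """Release only if no tag already exists with the same name as the version."""
    return version.strip() not in set(existing_tags)


def list_git_tags(repo_dir="."):
    """List tag names in a local git checkout (assumes the git CLI is installed)."""
    result = subprocess.run(
        ["git", "tag", "--list"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    # "version" is a hypothetical file name used only for this sketch.
    with open("version") as f:
        version = f.read()
    if should_release(version, list_git_tags()):
        print("No tag named", version.strip(), "exists: publish a release.")
    else:
        print("Tag", version.strip(), "already exists: publish a snapshot only.")
```

Keying the decision on the tag rather than on a diff of the version file makes the check idempotent: re-running the workflow on the same commit after a successful release does not release twice.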

References

In no particular order: