An Elasticsearch plugin for exact and approximate K-nearest-neighbors search in high-dimensional vector spaces.
Item | Status |
---|---|
Github CI Build | |
Github Release Build | |
Plugin Release | |
Plugin Snapshot | |
Python Client, Release | |
Scala 2.12 Client, Release | |
Scala 2.12 Client, Snapshot | |
This project is very much a work-in-progress. I've decided to go ahead and make the repo public since some people have expressed interest through emails and LinkedIn messages.
The "Features" section below is a good picture of my high-level goals for this project. As of late January, I've completed an ingest processor, custom queries, a custom mapper to store `ElastiKnnVector`s, exact similarity queries for all five similarity functions, MinHash LSH for Jaccard similarity, and set up a fairly robust testing harness. The testing harness took a surprising amount of work.
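To give a sense of how MinHash LSH approximates Jaccard similarity, here's a minimal, illustrative Python sketch of the general technique (not the plugin's actual implementation; all function names and parameters here are made up for illustration):

```python
import random

def minhash_signature(elements, num_hashes, seed=0):
    """Compute a MinHash signature for a set of integer element ids.

    Each hash function approximates a random permutation via
    h(x) = (a*x + b) mod p. Sets with high Jaccard similarity agree
    in a correspondingly high fraction of signature positions.
    """
    rng = random.Random(seed)
    p = 4294967311  # a prime larger than a 32-bit universe
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    return [min((a * x + b) % p for x in elements) for a, b in params]

def lsh_bands(signature, bands, rows):
    """Split a signature into bands; each band acts as a hash bucket key.

    Two sets become candidate neighbors if any band matches exactly.
    """
    assert bands * rows == len(signature)
    return [tuple(signature[i * rows:(i + 1) * rows]) for i in range(bands)]

a = {1, 2, 3, 4, 5, 6, 7, 8}
b = {1, 2, 3, 4, 5, 6, 7, 9}  # true Jaccard similarity with a is 7/9
sig_a = minhash_signature(a, num_hashes=128)
sig_b = minhash_signature(b, num_hashes=128)
est = sum(x == y for x, y in zip(sig_a, sig_b)) / 128
print(f"estimated Jaccard similarity: {est:.2f}")  # typically close to 7/9

candidate = any(x == y for x, y in zip(lsh_bands(sig_a, 32, 4), lsh_bands(sig_b, 32, 4)))
print(f"candidate pair: {candidate}")
```

The banding step is what makes the search sublinear: instead of comparing a query against every stored vector, only vectors sharing at least one band bucket are scored. See MMDS chapter 3 (referenced below) for the underlying theory.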
If you want to contribute, you'll have to dig around quite a bit for now. The Makefile is a good place to start. Generally speaking, there are several Gradle projects:

- the plugin (currently on ES 7.4x);
- a "core" project containing protobuf definitions, models, and some utilities;
- a Scala client based on the Elastic4s library;
- a reference project which compares elastiknn implementations to others (e.g. Spark).

There's also a Python client, which provides both a client roughly equivalent to the Scala one and a scikit-learn-style client.
Feel free to open issues and PRs, but I don't plan to spend much time documenting or coordinating contributions until the code churn decreases. I'll do my best to keep the readme updated, and I've made a Github project board to track ongoing work.
- Exact nearest neighbors search. This should only be used for testing and on relatively small datasets.
- Approximate nearest neighbors search using Locality Sensitive Hashing (LSH) and Multiprobe LSH. This scales well for large datasets; see the Performance section below for details.
- Supports dense floating point vectors and sparse boolean vectors.
- Supports five distance functions: L1, L2, Angular, Hamming, and Jaccard.
- Supports the two most common nearest neighbors queries:
- k nearest neighbors - i.e. "give me the k nearest neighbors to some query vector"
- fixed-radius nearest neighbors - i.e. "give me all neighbors within some radius of a query vector"
- Integrates nearest neighbor queries with existing Elasticsearch queries.
- Horizontal scalability. Vectors are stored as regular Elasticsearch documents and queries are implemented using standard Elasticsearch constructs.
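For reference, the five distance functions listed above correspond to these standard definitions. This is an illustrative Python sketch, not the plugin's code; dense vectors are lists of floats, and sparse boolean vectors are represented here as sets of true indices:

```python
import math

def l1(u, v):
    """Manhattan distance between dense vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def l2(u, v):
    """Euclidean distance between dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Angular similarity is typically derived from cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def hamming(u, v):
    """For boolean vectors as sets of true indices, the Hamming distance
    is the number of differing positions, i.e. the symmetric difference."""
    return len(u.symmetric_difference(v))

def jaccard(u, v):
    """Intersection over union of two boolean vectors as sets."""
    return len(u & v) / len(u | v)

print(l2([0.0, 0.0], [3.0, 4.0]))        # 5.0
print(jaccard({1, 2, 3}, {2, 3, 4}))     # 0.5
```

Exact search simply evaluates one of these functions against every indexed vector, which is why it's only practical for small datasets; the LSH variants trade exactness for sublinear candidate selection.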
TODO
TODO
TODO
TODO
TODO
Currently working on this in a fork of the Ann-Benchmarks repo here. I plan to submit a PR once all of the approximate similarities are implemented and the Docker image can be built with a released elastiknn zip file.
TODO
Planning to implement this using one of the various word vector datasets.
TODO
Not super sure of the feasibility of this yet. There are some notes in benchmarks/billion.
There are three main artifacts produced by this project:
- The actual plugin, which is a zip file published to Github releases.
- The python client library, which gets published to PyPi.
- The scala client library, which gets published to Sonatype.
All three artifacts are built and published as "snapshots" on every PR commit and every push/merge to master. All three artifacts are released on every push/merge to master in which the version file has changed. We detect a change in the version file by checking if a release tag exists with the same name as the version. All of this is handled by Github Workflows, with all steps defined in the yaml files in .github/workflows.
In no particular order:
- Alex Reelsen has several open-source plugins which served as useful examples for the general structure of a plugin project.
- Mining of Massive Datasets (MMDS) by Leskovec et al., particularly chapter 3, is a great reference for approximate similarity search.
- The Read Only Rest Plugin served as an example for much of the Gradle and testing setup.
- The Scalable Data Science Lectures on Youtube were helpful for better understanding LSH. I think much of that content is also based on the MMDS book.