This document includes some notes about development of Elastiknn.
You need at least the following software installed: git, Java 21, Python 3.10, SBT, docker, docker compose, and task. We assume the operating system is Linux or macOS. This list might be incomplete; if something is missing, please submit an issue or PR.
The aws directory contains a Terraform file and instructions for creating a development instance in AWS.
Once you have the prerequisites installed, clone the project and run:
task jvmRunLocal
This starts a local instance of Elasticsearch with the plugin installed. It can take about five minutes the first time you run it.
Once you see "EXECUTING", open another shell and run curl localhost:9200.
You should see the usual Elasticsearch JSON response containing the version, cluster name, etc.
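If you'd rather script that check than watch the logs for "EXECUTING", something like the following might work. It's a minimal sketch that polls the HTTP port until Elasticsearch answers with JSON. The function names, the ten-minute timeout, and the five-second polling interval are made up for illustration and aren't part of the Elastiknn tooling.

```python
import json
import time
import urllib.request


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_elasticsearch(url="http://localhost:9200", timeout_s=600.0,
                           interval_s=5.0, fetch=fetch_json, sleep=time.sleep):
    """Poll `url` until it returns JSON or `timeout_s` elapses.

    `fetch` and `sleep` are injectable (hypothetical hooks for this sketch)
    so the retry loop can be exercised without a running server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return fetch(url)
        except (OSError, ValueError):
            # OSError covers refused connections and socket timeouts;
            # ValueError covers a body that isn't valid JSON yet.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"No response from {url} after {timeout_s}s")
            sleep(interval_s)
```

Calling wait_for_elasticsearch() with the defaults returns the decoded response, so info["cluster_name"] and info["version"]["number"] give you the cluster name and version mentioned above.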
Elastiknn currently consists of several subprojects managed by Task and SBT:
- client-python - Python client.
- elastiknn-api4s - SBT project containing Scala case classes that model the Elastiknn API.
- elastiknn-client-elastic4s - SBT project containing a Scala client based on Elastic4s.
- elastiknn-lucene - SBT project containing custom Lucene queries implemented in Java.
- elastiknn-models - SBT project containing custom similarity models implemented in Java.
- elastiknn-plugin - SBT project containing the actual plugin implementation.
- elastiknn-testing - SBT project containing Scala tests for all the other SBT subprojects.
- ann-benchmarks - Python project for benchmarking based on erikbern/ann-benchmarks.
The lucene and models sub-projects are implemented in Java for a few reasons:
- It makes it easier to ask questions on the Lucene issue tracker and mailing list.
- They are the most CPU-bound parts of the codebase. While Scala's abstractions are nicer than Java's, they sometimes have a surprising performance cost (e.g., boxing).
SBT manages the plugin and all the Java and Scala subprojects.
Task is used to define command aliases with simple dependencies. This makes it relatively easy to run tests, generate docs, publish artifacts, etc. all from one file.
I recommend using IntelliJ IDEA to work on the SBT projects and PyCharm to work on the client-python project.
For IntelliJ, install the IntelliJ Scala plugin and open the elastiknn directory in IntelliJ.
IntelliJ should recognize the SBT project.
You might have to specify the JDK and Scala SDK; as of April 2024, we're using JDK 21 and Scala 3.3.3.
Since early 2023, we've also been using some experimental JDK features, which require additional settings.
Go to Settings > Build, Execution, Deployment > Java Compiler, and add --add-modules jdk.incubator.vector --add-exports java.base/jdk.internal.vm.vector=ALL-UNNAMED --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED to the "Additional command line parameters".
Then go to Settings > Build, Execution, Deployment > Scala Compiler, and add the same parameters in the "Additional compiler options".
For Python and PyCharm, you should first create a virtual environment in client-python/venv.
You can do this by running task pyCreateVenv.
Then you should configure PyCharm to use the interpreter in client-python/venv.
Elastiknn has a fairly thorough test suite.
To run it, you'll first need to run task dockerRunTestingCluster or task jvmRun to start a local Elasticsearch server.
Then, run task jvmUnitTest to run the SBT test suite, or task pyTest to run the smaller Python test suite.
You can attach IntelliJ's debugger to a local Elasticsearch process. This can be immensely helpful when dealing with bugs or just figuring out how the code is structured.
First, open your project in IntelliJ and run the Debug Elasticsearch target (usually in the upper right corner).
Then just run task jvmRunLocalDebug in your terminal.
Now you can set and hit breakpoints in IntelliJ.
To try it out, open the RestPluginsAction.java file in IntelliJ, add a breakpoint in the getTableWithHeader method, and run curl localhost:9200/_cat/plugins.
IntelliJ should stop execution at your breakpoint.
Use task dockerRunTestingCluster to run a local cluster with one master node and one data node (using docker compose).
There are a couple of parts of the codebase that deal with serializing queries for use in a distributed environment.
Running this small local cluster exercises those code paths.
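Before relying on those distributed code paths, it can be worth confirming that both nodes actually joined. The sketch below uses Elasticsearch's standard _cluster/health API, which reports number_of_nodes, status, and cluster_name. The helper names and the assumption that the cluster listens on localhost:9200 are illustrative, not taken from the Elastiknn Taskfile.

```python
import json
import urllib.request


def cluster_health(base_url="http://localhost:9200"):
    """Call the cluster health API and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def describe_cluster(health, expected_nodes=2):
    """Summarize a health response; raise if fewer nodes joined than expected.

    expected_nodes=2 matches the one-master, one-data-node setup described above.
    """
    nodes = health["number_of_nodes"]
    if nodes < expected_nodes:
        raise RuntimeError(f"expected {expected_nodes} nodes, found {nodes}")
    return f"{health['cluster_name']}: {nodes} nodes, status {health['status']}"
```

With the testing cluster running, print(describe_cluster(cluster_health())) should report two nodes; a RuntimeError suggests one of the containers didn't start or hasn't joined yet.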
For benchmarking, see ann-benchmarks/README.md
- To run Elasticsearch on Linux, you need to increase the vm.max_map_count setting. See the Elasticsearch docs.
- To run ann-benchmarks on macOS, you might need to export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES. See this Stack Overflow answer.
- If you're running on macOS 13.x (Ventura), the operating system's privacy settings might block task jvmRunLocal from starting. One solution is to go to System Settings > Privacy & Security > Developer Tools, and add and check your terminal (e.g., iTerm) to the list of developer apps. If that doesn't work, see this thread for more ideas: elastic/elasticsearch#91159.
- When running tests from IntelliJ, you might need to add --add-modules jdk.incubator.vector to the VM options.
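For the vm.max_map_count quirk, a quick check like the one below can tell you whether the setting needs raising. It's a sketch under a couple of assumptions: 262144 is the minimum long cited by the Elasticsearch docs (newer versions may recommend a higher value, so check the docs for your version), and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed minimum; taken from the value long cited in the Elasticsearch docs.
REQUIRED_MAX_MAP_COUNT = 262144


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the current vm.max_map_count from procfs (Linux only)."""
    return int(Path(path).read_text().strip())


def max_map_count_hint(current, required=REQUIRED_MAX_MAP_COUNT):
    """Return None if the setting is high enough, else a sysctl command to raise it."""
    if current >= required:
        return None
    return f"sudo sysctl -w vm.max_map_count={required}"
```

On a Linux machine, print(max_map_count_hint(read_max_map_count())) prints None when the setting is already sufficient. Note that sysctl -w only lasts until the next reboot.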
Nearest neighbors search is a large topic. Some good places to start are:
- Chapter 3 of Mining of Massive Datasets by Leskovec, et al.
- Lectures 13-20 of this lecture series from IIT Kharagpur
- Assignment 1 of Stanford's CS231n course
- This work-in-progress literature review of nearest neighbor search methods related to Elasticsearch