Basically a rewrite: use custom Lucene queries, drop Protobuf dependency for API (#46)

- Remove the usage of Protobufs at the API level and implement a more idiomatic Elasticsearch API instead. The API now uses custom case classes in Scala and data classes in Python, which is more tedious to maintain, but worth it for a more intuitive API (a rough sketch of the Scala side follows this list).
- Remove the pipelines in favor of processing/indexing vectors in the custom mapping. The model parameters are defined in the mapping and applied to any document field with type `elastiknn_sparse_bool_vector` or `elastiknn_dense_float_vector`. This eliminates the need for a pipeline/processor and the need to maintain custom mappings for the indexed vectors (see the mapping sketch after this list).
- Implement all queries using custom Lucene queries. This is tightly coupled to the custom mappings, since the mappings determine how vector hashes are stored and can be queried. For now I've been able to use very simple Lucene Term and Boolean queries (see the query sketch after this list).
- Add a "sparse indexed" mapping for jaccard and hamming similarities. This stores the indices of sparse boolean vectors as Lucene terms, allowing you to run a term query to get the intersection of the query vector against all stored vectors.
alexklibisz authored Apr 3, 2020
1 parent 679b199 commit fbda811
Showing 122 changed files with 2,741 additions and 2,856 deletions.
2 changes: 0 additions & 2 deletions .github/workflows/ci.yml
@@ -70,8 +70,6 @@ jobs:
       # Actual Build
       - name: Compile JVM
         run: make compile/gradle
-      - name: Compile Python
-        run: make compile/python
       - name: Start Testing Cluster
         run: make run/cluster
       - name: Test JVM
32 changes: 6 additions & 26 deletions Makefile
@@ -32,36 +32,18 @@ clean:
 .mk/client-python-install: .mk/client-python-venv
     cd client-python \
         && $(vpip) install -q -r requirements.txt \
-        && $(vpip) install -q grpcio-tools pytest mypy-protobuf twine
+        && $(vpip) install -q pytest twine
     touch $@
 
 .mk/gradle-compile: $(src_all)
     $(gradle) compileScala compileJava compileTestScala compileTestJava
     touch $@
 
-.mk/gradle-gen-proto: $(src_all)
-    $(gradle) generateProto
-    touch $@
-
 .mk/gradle-publish-local: version $(src_all)
     $(gradle) assemble publishToMavenLocal
     touch $@
 
-.mk/client-python-compile: .mk/client-python-install .mk/gradle-gen-proto
-    cd client-python \
-        && cp $(core)/src/main/proto/elastiknn/elastiknn.proto elastiknn \
-        && $(vpy) -m grpc_tools.protoc \
-            --proto_path=$(core)/src/main/proto \
-            --proto_path=$(core)/build/extracted-include-protos/main \
-            --python_out=. \
-            --plugin=protoc-gen-mypy=venv/bin/protoc-gen-mypy \
-            --mypy_out=. \
-            $(core)/src/main/proto/elastiknn/elastiknn.proto \
-            $(core)/build/extracted-include-protos/main/scalapb/scalapb.proto \
-        && $(vpy) -c "from elastiknn.elastiknn_pb2 import Similarity; x = Similarity.values()"
-    touch $@
-
-.mk/client-python-publish-local: version .mk/client-python-compile
+.mk/client-python-publish-local: version
     cd client-python && rm -rf dist && $(vpy) setup.py sdist bdist_wheel && ls dist
     touch $@
 
@@ -79,9 +61,7 @@ clean:
 
 compile/gradle: .mk/gradle-compile
 
-compile/python: .mk/client-python-compile
-
-compile: compile/gradle compile/python
+compile: compile/gradle
 
 run/cluster: .mk/run-cluster
 
@@ -91,19 +71,19 @@ run/gradle:
 
 run/debug:
     cd testing && $(dc) down
-    $(gradle) clean run --debug-jvm
+    $(gradle) run --debug-jvm
 
 run/kibana:
     docker run --network host -e ELASTICSEARCH_HOSTS=http://localhost:9200 -p 5601:5601 -d --rm kibana:7.4.0
     docker ps | grep kibana
 
-test/python:
+test/python: .mk/client-python-install
     cd client-python && $(vpy) -m pytest
 
 test/gradle:
     $(gradle) test
 
-test: clean compile/python run/cluster
+test: clean run/cluster
     $(MAKE) test/gradle
     $(MAKE) test/python
 
11 changes: 11 additions & 0 deletions changelog.md
@@ -1,3 +1,14 @@
+- Remove the usage of Protobufs at the API level. Instead implemented a more idiomatic Elasticsearch API. Now using
+custom case classes in scala and data classes in Python, which is more tedious, but worth it for a more intuitive API.
+- Remove the pipelines in favor of processing/indexing vectors in the custom mapping. The model parameters are defined in
+the mapping and applied to any document field with type `elastiknn_sparse_bool_vector` or `elastiknn_dense_float_vector`.
+This eliminates the need for a pipeline/processor and the need to maintain custom mappings for the indexed vectors.
+- Implement all queries using custom Lucene queries. This is tightly coupled to the custom mappings, since the mappings
+determine how vector hashes are stored and can be queried. For now I've been able to use very simple Lucene Term and
+Boolean queries.
+- Add a "sparse indexed" mapping for jaccard and hamming similarities. This stores the indices of sparse boolean vectors
+as Lucene terms, allowing you to run a term query to get the intersection of the query vector against all stored vectors.
+---
 - Removed the `num_tables` argument from `JaccardLshOptions` as it's redundant to `num_bands`.
 - Profiled and refactored the `JaccardLshModel` using the Ann-benchmarks Kosarak Jaccard dataset.
 - Added an example program that grid-searches JaccardLshOptions for best performance and plots the Pareto front.