Basically a rewrite: use custom Lucene queries, drop Protobuf dependency for API (#46)

- Remove the usage of Protobufs at the API level and implement a more idiomatic Elasticsearch API instead. The API now uses custom case classes in Scala and data classes in Python, which is more tedious to maintain, but worth it for a more intuitive API (a rough sketch of the Scala side follows this list).
- Remove the pipelines in favor of processing/indexing vectors in the custom mapping. The model parameters are defined in the mapping and applied to any document field with type `elastiknn_sparse_bool_vector` or `elastiknn_dense_float_vector`. This eliminates the need for a pipeline/processor and the need to maintain custom mappings for the indexed vectors (see the mapping sketch after this list).
- Implement all queries using custom Lucene queries. This is tightly coupled to the custom mappings, since the mappings determine how vector hashes are stored and can be queried. For now I've been able to use very simple Lucene Term and Boolean queries (see the query sketch after this list).
- Add a "sparse indexed" mapping for jaccard and hamming similarities. This stores the indices of sparse boolean vectors as Lucene terms, allowing you to run a term query to get the intersection of the query vector against all stored vectors.
alexklibisz authored Apr 3, 2020
1 parent 679b199 commit fbda811
Showing 122 changed files with 2,741 additions and 2,856 deletions.
2 changes: 0 additions & 2 deletions .github/workflows/ci.yml
@@ -70,8 +70,6 @@ jobs:
       # Actual Build
       - name: Compile JVM
         run: make compile/gradle
-      - name: Compile Python
-        run: make compile/python
       - name: Start Testing Cluster
         run: make run/cluster
       - name: Test JVM
32 changes: 6 additions & 26 deletions Makefile
@@ -32,36 +32,18 @@ clean:
 .mk/client-python-install: .mk/client-python-venv
     cd client-python \
         && $(vpip) install -q -r requirements.txt \
-        && $(vpip) install -q grpcio-tools pytest mypy-protobuf twine
+        && $(vpip) install -q pytest twine
     touch $@
 
 .mk/gradle-compile: $(src_all)
     $(gradle) compileScala compileJava compileTestScala compileTestJava
     touch $@
 
-.mk/gradle-gen-proto: $(src_all)
-    $(gradle) generateProto
-    touch $@
-
 .mk/gradle-publish-local: version $(src_all)
     $(gradle) assemble publishToMavenLocal
     touch $@
 
-.mk/client-python-compile: .mk/client-python-install .mk/gradle-gen-proto
-    cd client-python \
-        && cp $(core)/src/main/proto/elastiknn/elastiknn.proto elastiknn \
-        && $(vpy) -m grpc_tools.protoc \
-            --proto_path=$(core)/src/main/proto \
-            --proto_path=$(core)/build/extracted-include-protos/main \
-            --python_out=. \
-            --plugin=protoc-gen-mypy=venv/bin/protoc-gen-mypy \
-            --mypy_out=. \
-            $(core)/src/main/proto/elastiknn/elastiknn.proto \
-            $(core)/build/extracted-include-protos/main/scalapb/scalapb.proto \
-        && $(vpy) -c "from elastiknn.elastiknn_pb2 import Similarity; x = Similarity.values()"
-    touch $@
-
-.mk/client-python-publish-local: version .mk/client-python-compile
+.mk/client-python-publish-local: version
     cd client-python && rm -rf dist && $(vpy) setup.py sdist bdist_wheel && ls dist
     touch $@
 
@@ -79,9 +61,7 @@ clean:
 
 compile/gradle: .mk/gradle-compile
 
-compile/python: .mk/client-python-compile
-
-compile: compile/gradle compile/python
+compile: compile/gradle
 
 run/cluster: .mk/run-cluster
 
@@ -91,19 +71,19 @@ run/gradle:
 
 run/debug:
     cd testing && $(dc) down
-    $(gradle) clean run --debug-jvm
+    $(gradle) run --debug-jvm
 
 run/kibana:
     docker run --network host -e ELASTICSEARCH_HOSTS=http://localhost:9200 -p 5601:5601 -d --rm kibana:7.4.0
     docker ps | grep kibana
 
-test/python:
+test/python: .mk/client-python-install
     cd client-python && $(vpy) -m pytest
 
 test/gradle:
     $(gradle) test
 
-test: clean compile/python run/cluster
+test: clean run/cluster
     $(MAKE) test/gradle
     $(MAKE) test/python
 
11 changes: 11 additions & 0 deletions changelog.md
@@ -1,3 +1,14 @@
+- Remove the usage of Protobufs at the API level. Instead implemented a more idiomatic Elasticsearch API. Now using
+custom case classes in scala and data classes in Python, which is more tedious, but worth it for a more intuitive API.
+- Remove the pipelines in favor of processing/indexing vectors in the custom mapping. The model parameters are defined in
+the mapping and applied to any document field with type `elastiknn_sparse_bool_vector` or `elastiknn_dense_float_vector`.
+This eliminates the need for a pipeline/processor and the need to maintain custom mappings for the indexed vectors.
+- Implement all queries using custom Lucene queries. This is tightly coupled to the custom mappings, since the mappings
+determine how vector hashes are stored and can be queried. For now I've been able to use very simple Lucene Term and
+Boolean queries.
+- Add a "sparse indexed" mapping for jaccard and hamming similarities. This stores the indices of sparse boolean vectors
+as Lucene terms, allowing you to run a term query to get the intersection of the query vector against all stored vectors.
+---
 - Removed the `num_tables` argument from `JaccardLshOptions` as it's redundant to `num_bands`.
 - Profiled and refactored the `JaccardLshModel` using the Ann-benchmarks Kosarak Jaccard dataset.
 - Added an example program that grid-searches JaccardLshOptions for best performance and plots the Pareto front.