Use efficient kNN filtering, fix filtering when input value is array of string #16393

Open
wants to merge 7 commits into
base: main

Conversation


@tchoedak tchoedak commented Oct 6, 2024

Description

This PR introduces the following changes.

  1. Update the default approximate search to use kNN with efficient filtering, which is available as of OpenSearch 2.9. The current implementation supports filtering only through script scoring or Painless scripting with pre-filtering. This document describes the advantages of efficient filtering over both pre-filtering and post-filtering. Efficient filtering also opens the door to more advanced use cases, such as more efficient pagination.

  2. Fix filter construction for array-to-array membership when the input array is a list of strings. Just as equality_postfix ensures that the search targets the correct field when the input value is text, this fix ensures that the correct field is used when filtering on an array of strings.

Here is an example query built without the fix when input is an array of strings:

{'size': 100,
 'query': {'script_score': {'query': {'bool': {'filter': [{'bool': {'must': [{'terms': {'metadata.location': ['Nevada',
            'California',
            'Illinois']}}]}}]}},
   'script': {'source': "1/(1.0 + l2Squared(params.query_value, doc['embedding']))",
    'params': {'field': 'embedding', 'query_value': [0.1, 0.1, 0.1]}}}}}

With the fix, the rebuilt query looks like this:

{'size': 100,
 'query': {'script_score': {'query': {'bool': {'filter': [{'bool': {'must': [{'terms': {'metadata.location.keyword': ['Nevada',
            'California',
            'Illinois']}}]}}]}},
   'script': {'source': "1/(1.0 + l2Squared(params.query_value, doc['embedding']))",
    'params': {'field': 'embedding', 'query_value': [0.1, 0.1, 0.1]}}}}}

  3. Tests added for efficient filtering and the array-of-strings fix. The test fixture and all existing tests now use an index with lucene as the engine in addition to the default nmslib engine.
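
The array-of-strings fix can be illustrated with a minimal sketch. The helper name `terms_filter` and the `metadata.` prefix handling are assumptions made for illustration and are not the code in this PR; the sketch also assumes the default dynamic mapping, where string fields get an un-analyzed `.keyword` sub-field.

```python
from typing import Any, Dict, List


def terms_filter(key: str, values: List[Any]) -> Dict[str, Any]:
    """Build a `terms` clause for a metadata field.

    Hypothetical helper for illustration only; not the function used in the PR.
    """
    # Analyzed text fields are tokenized, so exact matching on strings has to
    # target the un-analyzed `.keyword` sub-field (assumes default mapping).
    is_text = bool(values) and all(isinstance(v, str) for v in values)
    field = f"metadata.{key}.keyword" if is_text else f"metadata.{key}"
    return {"terms": {field: values}}


# Reproduces the filter clause from the corrected query shown above.
clause = terms_filter("location", ["Nevada", "California", "Illinois"])
```

Non-string values such as integers keep the plain field name, since they have no `.keyword` sub-field.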

NOTE: By default, OpensearchVectorClient still initializes with engine=nmslib, and kNN searches with filters use either the Painless or the script-scoring method.
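
For contrast with the script-scoring queries above, here is a rough sketch of the request body that efficient filtering uses, where the filter sits inside the `knn` clause itself. The function name `knn_query_with_efficient_filter` is hypothetical, and the sketch assumes an index whose engine supports efficient filtering (lucene or faiss, per the commit message); it is not the query builder from this PR.

```python
from typing import Any, Dict, List


def knn_query_with_efficient_filter(
    query_vector: List[float],
    k: int,
    filter_clause: Dict[str, Any],
    field: str = "embedding",
) -> Dict[str, Any]:
    """Build a kNN search body with the filter nested inside the knn clause.

    Hypothetical helper for illustration only.
    """
    return {
        "size": k,
        "query": {
            "knn": {
                field: {
                    "vector": query_vector,
                    "k": k,
                    # The filter is applied during the kNN search rather than
                    # before it (pre-filtering) or after it (post-filtering).
                    "filter": filter_clause,
                }
            }
        },
    }


body = knn_query_with_efficient_filter(
    [0.1, 0.1, 0.1],
    100,
    {"bool": {"must": [{"terms": {"metadata.location.keyword": ["Nevada"]}}]}},
)
```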

There are no dependencies added for this change.

Fixes # (issue)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

Tenzin Choedak added 2 commits October 6, 2024 13:22
…ucene or faiss. fix terms search when input is a list of strings.
…ucene or faiss. fix terms search when input is a list of strings.
@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Oct 6, 2024
@tchoedak
Author

tchoedak commented Oct 7, 2024

@logan-markewich I'm confused by the coverage reporting:

  • 18% for opensearch/base.py
  • 29% for test_opensearch_client.py

I don't think I've drastically reduced coverage for base.py, and I'm not sure why test coverage on the test module itself is necessary.

@logan-markewich
Collaborator

@tchoedak all the tests for OpenSearch are marked as skipif, so there is basically zero coverage day-to-day.

Ideally there would be a properly mocked out client.

I agree that the coverage probably shouldn't check the test file (although it would be at 100% if there was a mocked out client lol)

We recently added this check, so I won't hold it against you to write out a mocked client
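
As a rough illustration of the mocked-client idea, a test can replace the OpenSearch client with a `unittest.mock.MagicMock` so that it runs without a live cluster and never needs skipif. The wrapper `search_ids` and the index name are hypothetical and stand in for whatever code under test calls the client's `search` method; this is a sketch, not the test suite of this package.

```python
from unittest.mock import MagicMock


def search_ids(client, index, body):
    """Hypothetical wrapper: run a search and return the ids of the hits."""
    response = client.search(index=index, body=body)
    return [hit["_id"] for hit in response["hits"]["hits"]]


def test_search_ids_with_mocked_client():
    # The mock stands in for a real client, so no cluster is required.
    client = MagicMock()
    client.search.return_value = {"hits": {"hits": [{"_id": "a"}, {"_id": "b"}]}}
    body = {"size": 2, "query": {"match_all": {}}}

    assert search_ids(client, "test-index", body) == ["a", "b"]
    # Verify the request that would have been sent to OpenSearch.
    client.search.assert_called_once_with(index="test-index", body=body)


test_search_ids_with_mocked_client()
```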
