Sparse Retrieval
Elias Bassani edited this page Feb 14, 2023 · 2 revisions
A Sparse Retriever is a retrieval model based on lexical matching.
Classic search engines are based on sparse retrieval models, such as BM25 (used by Elasticsearch).
retriv exposes two identical classes, SparseRetriever and SearchEngine, for using the sparse retrieval model BM25.
from retriv import SearchEngine
collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
    {"id": "doc_3", "text": "Evil minds that plot destruction"},
    {"id": "doc_4", "text": "Sorcerer of death's construction"},
]
se = SearchEngine("new-index")
se.index(collection)
se.search("witches masses")
Output:
[
    {
        "id": "doc_2",
        "text": "Just like witches at black masses",
        "score": 1.7536403
    },
    {
        "id": "doc_1",
        "text": "Generals gathered in their masses",
        "score": 0.6931472
    }
]
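To make the scores above less opaque, here is a minimal, self-contained sketch of BM25 scoring in plain Python. It is not retriv's implementation: the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, the Lucene-style IDF, and the plain whitespace tokenization (no stemming, no stopword removal) are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`.

    Hypothetical helper, not part of retriv. k1 and b are assumed values.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Lucene-style IDF, which is always non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Same toy collection as above, lowercased and split on whitespace.
docs = [
    "generals gathered in their masses".split(),
    "just like witches at black masses".split(),
    "evil minds that plot destruction".split(),
    "sorcerer of death's construction".split(),
]
scores = bm25_scores(["witches", "masses"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking[:2], scores)
```

Under these assumptions the second document ranks first and the first document second, with scores of roughly 1.75 and 0.69. Documents that share no term with the query score 0.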
You can index a document collection from a JSONl, CSV, or TSV file.
CSV and TSV files must have a header.
File kind is automatically inferred.
Use the callback parameter to pass a function that converts your documents on the fly into the format supported by retriv.
Indexes are automatically saved.
This is the preferred way of creating indexes as it has a low memory footprint.
from retriv import SearchEngine
se = SearchEngine("new-index")
se.index_file(
    path="path/to/collection",  # File kind is automatically inferred
    show_progress=True,  # Default value
    callback=lambda doc: {  # Callback defaults to None
        "id": doc["id"],
        "text": doc["title"] + "\n" + doc["body"],
    },
)
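The following sketch, which does not use retriv at all, illustrates what a callback does to each record and why reading a file record by record keeps memory usage low. The helper `stream_docs`, the named callback `to_retriv_doc`, the file name, and the sample titles are hypothetical. How retriv reads files internally is not shown here.

```python
import json
import os
import tempfile


def to_retriv_doc(doc):
    """Map a raw record with `title` and `body` fields to the id/text format."""
    return {"id": doc["id"], "text": doc["title"] + "\n" + doc["body"]}


# Made-up sample records whose fields do not match the id/text format.
raw_records = [
    {"id": "doc_1", "title": "Title one", "body": "Generals gathered in their masses"},
    {"id": "doc_2", "title": "Title two", "body": "Just like witches at black masses"},
]

# Write one JSON object per line (JSONl).
path = os.path.join(tempfile.mkdtemp(), "collection.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in raw_records:
        f.write(json.dumps(record) + "\n")


def stream_docs(path, callback=None):
    """Yield documents one at a time, so the whole file is never held in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            yield callback(doc) if callback else doc


converted = list(stream_docs(path, callback=to_retriv_doc))
print(converted[0])
```

A named function such as `to_retriv_doc` can be passed as the callback in place of the lambda, which is easier to test when the conversion grows beyond a single expression.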
se = SearchEngine("new-index")
is equivalent to:
se = SearchEngine(
    index_name="new-index",  # Default value
    min_df=1,  # Min doc-frequency. Defaults to 1.
    tokenizer="whitespace",  # Default value
    stemmer="english",  # Default value (Snowball English)
    stopwords="english",  # Default value
    spell_corrector=None,  # Default value
    do_lowercasing=True,  # Default value
    do_ampersand_normalization=True,  # Default value
    do_special_chars_normalization=True,  # Default value
    do_acronyms_normalization=True,  # Default value
    do_punctuation_removal=True,  # Default value
)
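Most of these settings control how text is preprocessed before indexing. As a rough illustration only, the sketch below mimics a few of them in plain Python: lowercasing, punctuation removal, whitespace tokenization, stopword removal, and a minimum document-frequency filter. The function names and the tiny stopword list are made up, stemming and the other normalizations are omitted, and reading `min_df` as "drop terms that occur in fewer than `min_df` documents" is an assumption. retriv's actual pipeline may differ in every detail.

```python
import string
from collections import Counter

# Tiny illustrative stopword list; real English lists are much longer.
STOPWORDS = {"in", "their", "at", "that", "of", "just", "like"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t not in stopwords]


def filter_by_min_df(tokenized_docs, min_df=1):
    """Keep only terms that occur in at least `min_df` documents."""
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    return [[t for t in doc if df[t] >= min_df] for doc in tokenized_docs]


tokens = preprocess("Just like witches at black masses!")
print(tokens)  # ['witches', 'black', 'masses']

filtered = filter_by_min_df([["a", "b"], ["a", "c"]], min_df=2)
print(filtered)  # [['a'], ['a']]
```

With `min_df=1`, the default shown above, the filter keeps every term, since each indexed term occurs in at least one document.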
You can also index a collection that is already in memory:
collection = [
    {"id": "doc_1", "title": "...", "body": "..."},
    {"id": "doc_2", "title": "...", "body": "..."},
    {"id": "doc_3", "title": "...", "body": "..."},
    {"id": "doc_4", "title": "...", "body": "..."},
]
se = SearchEngine(...)
se.index(
    collection,
    show_progress=True,  # Default value
    callback=lambda doc: {  # Callback defaults to None
        "id": doc["id"],
        "text": doc["title"] + "\n" + doc["body"],
    },
)
from retriv import SearchEngine
# Load a previously saved index by name
se = SearchEngine.load("index-name")
# Delete a saved index by name
SearchEngine.delete("index-name")