GitHub - ddmitov/reteti: Lexical search based on LLM tokenizer and partitioned index in object storage

Reteti

Reteti is a work-in-progress lexical search experiment based on an LLM tokenizer and a partitioned index in object storage.

Design Objectives

1. Fast lexical search with index data based entirely on object storage
2. Usability in serverless or scale-to-zero applications for scalability and cost control
3. Adaptability to different cloud environments or on-premise systems

Features

Reteti combines an LLM tokenizer and a partitioned Parquet dataset in object storage.
The LLM tokenizer converts any text in any supported language to a list of integers.
All token integers with their positions are saved in the dataset under predictable file names.
Language-specific stemmers are not used.
Only the Parquet files of the tokens in the search request are read during search.
Positional token search is performed using SQL and DuckDB.
Storage and compute are decoupled and Reteti can be used in serverless functions.
Text data can be stored anywhere and the Reteti index is independent of the text storage location.
Indexing and searching are completely separate processes.
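
The indexing idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in `reteti_core.py`: the toy vocabulary stands in for a real LLM tokenizer (which would normally produce subword tokens), the `tokens/<integer>.parquet` naming scheme is hypothetical, and plain in-memory lists replace the actual Parquet files.

```python
from collections import defaultdict

# Hypothetical toy vocabulary standing in for a real LLM tokenizer.
TOY_VOCABULARY = {
    "fast": 11, "lexical": 12, "search": 13,
    "in": 14, "object": 15, "storage": 16,
}


def toy_tokenize(text: str) -> list[int]:
    """Map lowercase whitespace-separated words to integers (stand-in tokenizer)."""
    return [TOY_VOCABULARY[word] for word in text.lower().split()]


def build_partitions(texts: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    """Group (text_id, position) rows under a file name derived from the token."""
    partitions = defaultdict(list)
    for text_id, text in texts.items():
        for position, token in enumerate(toy_tokenize(text)):
            # Assumed naming scheme: one predictable file name per token integer.
            partitions[f"tokens/{token}.parquet"].append((text_id, position))
    return dict(partitions)


def files_for_query(query: str) -> list[str]:
    """Return only the file names of the tokens that occur in the search request."""
    return sorted({f"tokens/{token}.parquet" for token in toy_tokenize(query)})


partitions = build_partitions({
    1: "fast lexical search",
    2: "search in object storage",
})
print(files_for_query("lexical search"))
```

Because the file name depends only on the token integer, a searcher can compute which files it needs without listing or scanning the rest of the dataset.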

A Gradio demo using one million Bulgarian and English short articles is available on Fly.io.
It is scale-to-zero capable and its object storage is managed by Tigris Data.

Search Rules

Search Criteria

Reteti selects the ID numbers of texts that match the following criteria:

1. They have token occurrences equal to or higher than the token occurrences of the search request.
2. They have the full set of unique tokens present in the search request.
3. They have one or more sequences of tokens identical to the sequence of tokens of the search request.
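
A minimal Python sketch of these three criteria follows. It works on in-memory token lists rather than on the Parquet index, and it is not the SQL that Reteti runs in DuckDB. It also assumes that criterion 1 is evaluated per token, which is one possible reading. The function name `matches` is hypothetical.

```python
from collections import Counter


def matches(document_tokens: list[int], query_tokens: list[int]) -> bool:
    """Check the three search criteria for one document (illustrative only)."""
    document_counts = Counter(document_tokens)
    query_counts = Counter(query_tokens)

    # Criteria 1 and 2: every unique query token must be present in the
    # document at least as many times as it occurs in the search request.
    for token, count in query_counts.items():
        if document_counts[token] < count:
            return False

    # Criterion 3: the query tokens must occur as one contiguous sequence.
    length = len(query_tokens)
    return any(
        document_tokens[start:start + length] == query_tokens
        for start in range(len(document_tokens) - length + 1)
    )


print(matches([5, 7, 9, 7], [7, 9]))  # sequence 7, 9 is present
print(matches([7, 5, 9], [7, 9]))     # both tokens present, but not adjacent
```

In this sketch a document that satisfies criterion 3 also satisfies the first two, so the count checks act only as an inexpensive filter before the positional comparison.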

Ranking Criterion: Matching Tokens Frequency

The matching tokens frequency is the number of search request tokens found in a document divided by the number of all tokens in the same document. Short documents with a high number of matching tokens are at the top of the results list.
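
As a worked illustration of this ratio, the sketch below counts every occurrence in a document of a token that appears in the search request and divides by the document length. Counting occurrences (rather than unique tokens) is an assumption about how the definition is meant, and the function names are hypothetical.

```python
def matching_tokens_frequency(
    document_tokens: list[int], query_tokens: list[int]
) -> float:
    """Occurrences of query tokens in the document divided by document length."""
    query_set = set(query_tokens)
    matching = sum(1 for token in document_tokens if token in query_set)
    return matching / len(document_tokens)


def rank(documents: dict[int, list[int]], query_tokens: list[int]) -> list[int]:
    """Return document IDs ordered by descending matching tokens frequency."""
    return sorted(
        documents,
        key=lambda text_id: matching_tokens_frequency(
            documents[text_id], query_tokens
        ),
        reverse=True,
    )


documents = {
    1: [7, 9, 5, 5, 5, 5, 5, 5, 5, 5],  # 2 matching tokens out of 10
    2: [7, 9, 5, 7],                    # 3 matching tokens out of 4
}
print(rank(documents, [7, 9]))
```

The shorter document 2 scores 3/4 and is listed before document 1, which scores 2/10.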

Name

Reteti was one of three giraffe calves orphaned during a severe drought around 2018 and saved thanks to the kindness and efforts of a local community in Kenya.

Today we use complex data processing technologies thanks to the knowledge, persistence and efforts of many people in a large global community. Just like the small Reteti, we owe much to this community and should always be thankful to its members for their goodwill and contributions!

Thanks and Credits

License

This program is licensed under the terms of the Apache License 2.0.

Author

Dimitar D. Mitov, 2024 - 2025

