SenTrEv📊

Concepts📚

1. Embeddings

"In natural language processing, a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning".

Source: Wikipedia - Word embedding
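
To make "closer in the vector space" concrete, here is a minimal sketch that compares made-up three-dimensional vectors with cosine similarity. The vectors and the `cosine_similarity` helper are illustrative assumptions: real embedding models produce vectors with hundreds of dimensions, and SenTrEv does not expose this function.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings", for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

# The pair with the higher score points in a more similar direction.
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```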

2. Vector database

"A Vector Database is a specialized system designed to efficiently handle high-dimensional vector data. It excels at indexing, querying, and retrieving this data, enabling advanced analysis and similarity searches".

Source: Qdrant - What is a vector database?

3. Sparse embeddings

"Sparse vectors [...] focus only on the essentials. In most sparse vectors, a large number of elements are zeros. When a feature or token is present, it’s marked—otherwise, zero. Sparse vectors, are used for exact matching and specific token-based identification".

Source: Qdrant - What is a vector database?
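
As a rough illustration of the idea, the sketch below builds sparse vectors from plain token counts and scores them with a dot product that only touches shared tokens. This is a simplification: models such as BM25 or SPLADE weight tokens very differently, and the helper names here are hypothetical.

```python
from collections import Counter


def sparse_vector(text):
    """Map each lowercase token to its count; absent tokens are implicit zeros."""
    return Counter(text.lower().split())


def sparse_dot(a, b):
    """Dot product that only visits tokens present in both vectors."""
    return sum(count * b[token] for token, count in a.items() if token in b)


query = sparse_vector("qdrant sparse vectors")
doc_match = sparse_vector("Sparse vectors in Qdrant store only non-zero entries")
doc_other = sparse_vector("Dense models capture contextual meaning")

# Only exact token overlap contributes to the score.
print(sparse_dot(query, doc_match))
print(sparse_dot(query, doc_other))
```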

4. Dense embeddings

"Dense vectors are, quite literally, dense with information. Every element in the vector contributes to the semantic meaning, relationships and nuances of the data. [...] Together, they convey the overall meaning of the sentence, and are better for identifying contextually similar items".

Source: Qdrant - What is a vector database?

5. Retrieval

"Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need"

Source: Wikipedia - Information Retrieval
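
In the embedding setting, retrieval usually comes down to scoring every stored vector against the query vector and keeping the best matches. The sketch below does this by brute force with a dot product over toy two-dimensional vectors; the `retrieve` function and the data are assumptions for illustration, and a vector database would use an index rather than a full scan.

```python
def retrieve(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents with the highest dot-product score."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(reverse=True)  # highest score first
    return [doc_id for _, doc_id in scored[:k]]


# Toy document vectors, for illustration only.
docs = {
    "a": [1.0, 0.0],
    "b": [0.7, 0.7],
    "c": [0.0, 1.0],
}

top = retrieve([1.0, 0.2], docs, k=2)
print(top)
```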

The problem🤔

Retrieval Augmented Generation (RAG) is growing in importance
As of December 2024, more than 10,000 models are available within the sentence-transformers Python library
It is not always easy to tell which embedding technique (sparse or dense) and which embedding model best fit a given use case
Evaluating embeddings on different data types is often complicated

SenTrEv - A potential solution✅

Important

SenTrEv (Sentence Transformers Evaluator) is a Python package that runs simple evaluation tests to help you choose the best embedding model for Retrieval Augmented Generation (RAG) with your text-based documents.

Installation⬇️

python3 -m pip install sentrev

Applicability💡

Highly integrated within the Qdrant environment
FastEmbed sparse encoding models
Sentence Transformers dense encoding models

Important

Supports most text-based file formats (.docx, .pptx, .pdf, .md, .html, .xml, .csv, .xlsx)
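
Before running an evaluation, it can help to check a file list against the extensions listed above. The helper below is a hypothetical convenience, not part of SenTrEv; it only inspects file names and does not open the files.

```python
from pathlib import Path

# Extensions taken from the list above.
SUPPORTED_EXTENSIONS = {".docx", ".pptx", ".pdf", ".md", ".html", ".xml", ".csv", ".xlsx"}


def split_by_support(paths):
    """Hypothetical helper: separate paths into supported and unsupported ones."""
    supported, unsupported = [], []
    for path in paths:
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


ok, skipped = split_by_support(["paper.pdf", "notes.MD", "photo.png"])
print("supported:", ok)
print("unsupported:", skipped)
```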

The workflow🔁

Metrics👑

Ranking

Success rate: the number of retrieval operations in which the correct context ranked first among all the retrieved contexts, out of the total number of retrieval operations
Mean Reciprocal Ranking (MRR): measures how high the correct context ranks among the retrieved results, as the average of the reciprocal of its rank over all retrieval operations
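
The two ranking metrics can be sketched as follows, assuming that for each query we already know the 1-based rank of the correct context (or `None` if it was not retrieved). This is the textbook formulation and may differ in detail from what SenTrEv computes internally; the function names are illustrative.

```python
def success_rate(ranks):
    """Fraction of queries whose correct context was ranked first."""
    return sum(1 for rank in ranks if rank == 1) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; a miss (None) contributes 0."""
    return sum(1.0 / rank for rank in ranks if rank is not None) / len(ranks)


# Hypothetical ranks for four queries: two hits at rank 1, one at rank 2, one miss.
ranks = [1, 2, None, 1]

print(success_rate(ranks))
print(mean_reciprocal_rank(ranks))
```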

Relevance

Precision: number of relevant documents out of the total number of retrieved documents
Non-Relevant Ratio: number of non-relevant documents out of the total number of retrieved documents

Note

Relevance is based on the "page" metadata entry: if the retrieved document comes from the same page as the query, the document is considered relevant.
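
Under the page-based notion of relevance described above, both relevance metrics reduce to simple counts. The sketch assumes each retrieved document is represented only by its "page" value; the function names and the sample data are illustrative.

```python
def precision(retrieved_pages, query_page):
    """Share of retrieved documents that come from the query's page."""
    relevant = sum(1 for page in retrieved_pages if page == query_page)
    return relevant / len(retrieved_pages)


def non_relevant_ratio(retrieved_pages, query_page):
    """Share of retrieved documents that come from any other page."""
    non_relevant = sum(1 for page in retrieved_pages if page != query_page)
    return non_relevant / len(retrieved_pages)


# Hypothetical result set: the query was drawn from page 3.
retrieved_pages = [3, 3, 7, 3, 12]

print(precision(retrieved_pages, query_page=3))
print(non_relevant_ratio(retrieved_pages, query_page=3))
```

The two values always sum to 1, since every retrieved document is either relevant or not.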

Time

Time performance: Average duration for a retrieval operation
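
One common way to obtain an average duration is to time each call with `time.perf_counter` and take the mean. The sketch below uses a stand-in function instead of a real vector search; `average_duration` and `fake_retrieval` are hypothetical names.

```python
import time


def average_duration(operation, queries):
    """Run `operation` on every query and return the mean wall-clock time in seconds."""
    durations = []
    for query in queries:
        start = time.perf_counter()
        operation(query)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def fake_retrieval(query):
    """Stand-in for a real vector search call."""
    return sorted(query)


avg = average_duration(fake_retrieval, ["first query", "second query", "third query"])
print(f"{avg:.6f} s per retrieval")
```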

Carbon emissions

Carbon emissions: Carbon emissions are calculated in gCO2eq (grams of CO2 equivalent) through the Python library codecarbon.

CO2 equivalent?

Note

This simply means that there are many other greenhouse gases (methane, chlorofluorocarbons, nitrous oxide…) that also have global warming potential: although our emissions are mainly made up of CO2, they also include these other gases, and it is easier to express everything in terms of CO2. For example, emitting 1 kg of methane is counted as emitting 25 kg of CO2e.
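
The conversion is a weighted sum: each gas mass is multiplied by its global warming potential relative to CO2. The sketch below uses the factor of 25 for methane quoted above; the dictionary layout and function name are assumptions for illustration.

```python
# Global warming potential relative to CO2. The methane factor is the one
# quoted in the note above; other gases would need their own factors.
GWP = {"co2": 1.0, "ch4": 25.0}


def to_co2e_kg(emissions_kg):
    """Convert a {gas: mass in kg} mapping to a single CO2-equivalent mass in kg."""
    return sum(mass * GWP[gas] for gas, mass in emissions_kg.items())


# 10 kg of CO2 plus 1 kg of methane.
total = to_co2e_kg({"co2": 10.0, "ch4": 1.0})
print(total)
```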

How do we measure it?

codecarbon works with a scheduler: every 15 seconds it takes a measurement and estimates the emissions of your running code from the carbon intensity of the local grid (per country, or per region when running in the cloud)

"Carbon Intensity of the consumed electricity is calculated as a weighted average of the emissions from the different energy sources that are used to generate electricity, including fossil fuels and renewables." See here

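The weighted average in the quote above can be sketched in a few lines. The energy mix and per-source intensities are invented numbers, not real grid data, and the function is an illustration of the formula rather than codecarbon's implementation.

```python
def carbon_intensity(energy_mix):
    """Weighted average of per-source intensities, in gCO2eq per kWh.

    `energy_mix` maps a source name to (share of generation, intensity).
    """
    total_share = sum(share for share, _ in energy_mix.values())
    weighted = sum(share * intensity for share, intensity in energy_mix.values())
    return weighted / total_share


# Illustrative numbers only.
mix = {
    "coal": (0.25, 1000.0),
    "gas": (0.25, 400.0),
    "hydro": (0.5, 20.0),
}

intensity = carbon_intensity(mix)   # gCO2eq per kWh
energy_kwh = 0.002                  # hypothetical energy used by a run
emissions = intensity * energy_kwh  # gCO2eq

print(f"{intensity:.1f} gCO2eq/kWh, {emissions:.3f} gCO2eq")
```

Emissions then follow from multiplying the grid's carbon intensity by the energy the code consumed.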
Example usage🔎

1. Import necessary dependencies

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from fastembed import SparseTextEmbedding
from sentrev.evaluator import evaluate_dense_retrieval, evaluate_sparse_retrieval
import os

2. Prepare the embedding models

# Load all the dense embedding models (device="cuda" assumes a GPU is available; use "cpu" otherwise)
dense_encoder1 = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', device="cuda")
dense_encoder2 = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2', device="cuda")
dense_encoder3 = SentenceTransformer('sentence-transformers/LaBSE', device="cuda")

# Create a list of the dense encoders
dense_encoders = [dense_encoder1, dense_encoder2, dense_encoder3]

# Create a dictionary that maps each encoder to its name
dense_encoder_to_names = {
    dense_encoder1: 'all-mpnet-base-v2',
    dense_encoder2: 'all-MiniLM-L12-v2',
    dense_encoder3: 'LaBSE',
}

# Load all the sparse embedding models
sparse_encoder1 = SparseTextEmbedding("Qdrant/bm25")
sparse_encoder2 = SparseTextEmbedding("prithivida/Splade_PP_en_v1")
sparse_encoder3 = SparseTextEmbedding("Qdrant/bm42-all-minilm-l6-v2-attentions")

# Create a list of the sparse encoders
sparse_encoders = [sparse_encoder1, sparse_encoder2, sparse_encoder3]

# Create a dictionary that maps each sparse encoder to its name
sparse_encoder_to_names = {
    sparse_encoder1: 'BM25',
    sparse_encoder2: 'Splade',
    sparse_encoder3: 'BM42',
}

3. Collect data

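# Collect data ("~" is not expanded automatically by Python, so expand it explicitly)
files = [
    os.path.expanduser(path)
    for path in [
        "~/data/attention_is_all_you_need.pdf",
        "~/data/generative_adversarial_nets.pdf",
        "~/data/narration.docx",
        "~/data/call-to-action.html",
        "~/data/test.xml",
    ]
]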

4. Create Qdrant Client

Pull and run Qdrant locally with Docker:

docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

Use the Python API to create the client:

# Create Qdrant client
client = QdrantClient("http://localhost:6333")

5. Evaluate retrieval!

# Define the CSV paths where the stats will be saved
csv_path_dense = os.path.expanduser("~/evals/dense_stats.csv")
csv_path_sparse = os.path.expanduser("~/evals/sparse_stats.csv")

# Run evaluation for dense retrieval
evaluate_dense_retrieval(
    files,
    dense_encoders,
    dense_encoder_to_names,
    client,
    csv_path_dense,
    chunking_size=1500,
    text_percentage=0.3,
    distance="dot",
    mrr=10,
    carbon_tracking="AUT",
    plot=True,
)

# Run evaluation for sparse retrieval
evaluate_sparse_retrieval(
    files,
    sparse_encoders,
    sparse_encoder_to_names,
    client,
    csv_path_sparse,
    chunking_size=1200,
    text_percentage=0.4,
    distance="euclid",
    mrr=10,
    carbon_tracking="AUT",
    plot=True,
)
