diff --git a/notebooks/lemone_embed_notebook_tutorial.ipynb b/notebooks/lemone_embed_notebook_tutorial.ipynb
new file mode 100644
index 0000000..cbc8074
--- /dev/null
+++ b/notebooks/lemone_embed_notebook_tutorial.ipynb
@@ -0,0 +1,397 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "toc_visible": true,
+ "authorship_tag": "ABX9TyOzdQuEjEqIX9Gcuv/hESlK",
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "\n",
+ "# Lemone-Embed: A Series of Fine-Tuned Embedding Models for French Taxation\n",
+ "[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)\n",
+ "\n",
+ "\n",
+ "This series is made up of 7 models: 3 base models of different sizes trained for 1 epoch, 3 models trained for 2 epochs making up the Boost series, and a Pro model with a non-RoBERTa architecture.\n",
+ "\n",
+ "\n",
+ "This sentence transformers model, specifically designed for French taxation, has been fine-tuned on a dataset comprising 43 million tokens, integrating a blend of semi-synthetic and fully synthetic data generated by GPT-4 Turbo and Llama 3.1 70B, which have been further refined through evol-instruction tuning and manual curation.\n",
+ "\n",
+ "The model is tailored to meet the specific demands of information retrieval across large-scale tax-related corpora, supporting the implementation of production-ready Retrieval-Augmented Generation (RAG) applications. Its primary purpose is to enhance the efficiency and accuracy of legal processes in the taxation domain, with an emphasis on delivering consistent performance in real-world settings, while also contributing to advancements in legal natural language processing research.\n",
+ "\n",
+ "This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.\n",
+ "\n",
+ "If you use this code in your research, please use the following BibTeX entry.\n",
+ "\n",
+ "```BibTeX\n",
+ "@misc{louisbrulenaudet2024,\n",
+ " author = {Louis Brulé Naudet},\n",
+ " title = {Lemone-Embed: A Series of Fine-Tuned Embedding Models for French Taxation},\n",
+ " year = {2024},\n",
+ " howpublished = {\\url{https://huggingface.co/datasets/louisbrulenaudet/lemone-embed-pro}},\n",
+ "}\n",
+ "```\n",
+ "\n",
+ "## Feedback\n",
+ "\n",
+ "If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com)."
+ ],
+ "metadata": {
+ "id": "jus7eI3ptMg_"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Collecting and installing dependencies"
+ ],
+ "metadata": {
+ "id": "X_nanITItWoB"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip3 install chromadb polars datasets sentence-transformers huggingface_hub"
+ ],
+ "metadata": {
+ "id": "RBZN_of-tZBl"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Importing packages\n",
+ "\n",
+ "## Core Database and Data Processing\n",
+ "\n",
+ "- ChromaDB: A specialized vector database that will be used to store and query our embeddings efficiently\n",
+ "- Polars: A modern, high-performance DataFrame library chosen as an alternative to pandas for data manipulation tasks\n",
+ "\n",
+ "## Machine Learning Infrastructure\n",
+ "\n",
+ "- Datasets: Integration with Hugging Face's dataset library for streamlined data handling\n",
+ "- PyTorch CUDA: Capability check for GPU acceleration to optimize model performance\n",
+ "\n",
+ "## Utility Components\n",
+ "\n",
+ "- Hashlib: Secure hash functions, used here to compute a SHA-256 digest of each document's text as a content hash\n",
+ "- Datetime: Date and time handling, used here to convert millisecond timestamps into formatted publication dates\n",
+ "- Type Hints: Comprehensive typing imports for enhanced code documentation and maintainability"
+ ],
+ "metadata": {
+ "id": "ujkbUgpZtcTn"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import hashlib\n",
+ "\n",
+ "from datetime import datetime\n",
+ "from typing import (\n",
+ " IO,\n",
+ " TYPE_CHECKING,\n",
+ " Any,\n",
+ " Dict,\n",
+ " List,\n",
+ " Type,\n",
+ " Tuple,\n",
+ " Union,\n",
+ " Mapping,\n",
+ " TypeVar,\n",
+ " Callable,\n",
+ " Optional,\n",
+ " Sequence,\n",
+ ")\n",
+ "\n",
+ "import chromadb\n",
+ "import polars as pl\n",
+ "\n",
+ "from chromadb.config import Settings\n",
+ "from chromadb.utils import embedding_functions\n",
+ "from datasets import Dataset\n",
+ "from torch.cuda import is_available"
+ ],
+ "metadata": {
+ "id": "lWVZ_-Kytr-g"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Datasets registration\n",
+ "\n",
+ "This cell loads a Parquet dataset from Hugging Face's repository (lemone-docs-embeded) using Polars' efficient lazy loading method (scan_parquet), filters out any rows with null values in the 'text' column to ensure data quality, and finally materializes the data into memory with .collect() for further processing."
+ ],
+ "metadata": {
+ "id": "JXimNwAltfOk"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "dataframe = pl.scan_parquet(\n",
+ " \"hf://datasets/louisbrulenaudet/lemone-docs-embeded/data/train-00000-of-00001.parquet\"\n",
+ ").filter(\n",
+ " pl.col(\n",
+ " \"text\"\n",
+ " ).is_not_null()\n",
+ ").collect()"
+ ],
+ "metadata": {
+ "id": "J32rtjmjt4cB"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
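+ "If you want to re-create the dataset from its sources, the following snippet shows how. It calls `sentence_transformer_ef`, which is defined in the \"Index creation\" section, so run that cell first."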
+ ],
+ "metadata": {
+ "id": "tolO_edV1Cme"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "bofip_dataframe = pl.scan_parquet(\n",
+ " \"hf://datasets/louisbrulenaudet/bofip/data/train-00000-of-00001.parquet\"\n",
+ ").with_columns(\n",
+ " [\n",
+ " (\n",
+ " pl.lit(\"Bulletin officiel des finances publiques - impôts\").alias(\n",
+ " \"title_main\"\n",
+ " )\n",
+ " ),\n",
+ " (\n",
+ " pl.col(\"debut_de_validite\")\n",
+ " .str.strptime(pl.Date, format=\"%Y-%m-%d\")\n",
+ " .dt.strftime(\"%Y-%m-%d 00:00:00\")\n",
+ " ).alias(\"date_publication\"),\n",
+ " (\n",
+ " pl.col(\"contenu\")\n",
+ " .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)\n",
+ " .alias(\"hash\")\n",
+ " )\n",
+ " ]\n",
+ ").rename(\n",
+ " {\n",
+ " \"contenu\": \"text\",\n",
+ " \"permalien\": \"url_sourcepage\",\n",
+ " \"identifiant_juridique\": \"id_sub\",\n",
+ " }\n",
+ ").select(\n",
+ " [\n",
+ " \"text\",\n",
+ " \"title_main\",\n",
+ " \"id_sub\",\n",
+ " \"url_sourcepage\",\n",
+ " \"date_publication\",\n",
+ " \"hash\"\n",
+ " ]\n",
+ ")\n",
+ "\n",
+ "books: List[str] = [\n",
+ " \"hf://datasets/louisbrulenaudet/code-douanes/data/train-00000-of-00001.parquet\",\n",
+ " \"hf://datasets/louisbrulenaudet/code-impots/data/train-00000-of-00001.parquet\",\n",
+ " \"hf://datasets/louisbrulenaudet/code-impots-annexe-i/data/train-00000-of-00001.parquet\",\n",
+ " \"hf://datasets/louisbrulenaudet/code-impots-annexe-ii/data/train-00000-of-00001.parquet\",\n",
+ " \"hf://datasets/louisbrulenaudet/code-impots-annexe-iii/data/train-00000-of-00001.parquet\",\n",
+ " \"hf://datasets/louisbrulenaudet/code-impots-annexe-iv/data/train-00000-of-00001.parquet\",\n",
+ " \"hf://datasets/louisbrulenaudet/code-impositions-biens-services/data/train-00000-of-00001.parquet\",\n",
+ " \"hf://datasets/louisbrulenaudet/livre-procedures-fiscales/data/train-00000-of-00001.parquet\"\n",
+ "]\n",
+ "\n",
+ "legi_dataframe = pl.concat(\n",
+ " [\n",
+ " pl.scan_parquet(\n",
+ " book\n",
+ " ) for book in books\n",
+ " ]\n",
+ ").with_columns(\n",
+ " [\n",
+ " (\n",
+ " pl.lit(\"https://www.legifrance.gouv.fr/codes/article_lc/\")\n",
+ " .add(pl.col(\"id\"))\n",
+ " .alias(\"url_sourcepage\")\n",
+ " ),\n",
+ " (\n",
+ " pl.col(\"dateDebut\")\n",
+ " .cast(pl.Int64)\n",
+ " .map_elements(\n",
+ " lambda x: datetime.fromtimestamp(x / 1000).strftime(\"%Y-%m-%d %H:%M:%S\"),\n",
+ " return_dtype=pl.Utf8\n",
+ " )\n",
+ " .alias(\"date_publication\")\n",
+ " ),\n",
+ " (\n",
+ " pl.col(\"texte\")\n",
+ " .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)\n",
+ " .alias(\"hash\")\n",
+ " )\n",
+ " ]\n",
+ ").rename(\n",
+ " {\n",
+ " \"texte\": \"text\",\n",
+ " \"num\": \"id_sub\",\n",
+ " }\n",
+ ").select(\n",
+ " [\n",
+ " \"text\",\n",
+ " \"title_main\",\n",
+ " \"id_sub\",\n",
+ " \"url_sourcepage\",\n",
+ " \"date_publication\",\n",
+ " \"hash\"\n",
+ " ]\n",
+ ")\n",
+ "\n",
+ "print(\"Starting embeddings production...\")\n",
+ "\n",
+ "dataframe = pl.concat(\n",
+ " [\n",
+ " bofip_dataframe,\n",
+ " legi_dataframe\n",
+ " ]\n",
+ ").filter(\n",
+ " pl.col(\n",
+ " \"text\"\n",
+ " ).is_not_null()\n",
+ ").with_columns(\n",
+ " pl.col(\"text\").map_elements(\n",
+ " lambda x: sentence_transformer_ef(\n",
+ " [x]\n",
+ " )[0].tolist(),\n",
+ " return_dtype=pl.List(pl.Float64)\n",
+ " ).alias(\"lemone_pro_embeddings\")\n",
+ ").collect()"
+ ],
+ "metadata": {
+ "id": "KkOYEOeQ1Kcn"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Index creation\n",
+ "\n",
+ "This cell initializes a ChromaDB client with telemetry disabled, sets up a SentenceTransformer embedding model (using \"lemone-embed-pro\" with GPU acceleration if available), and creates or retrieves a collection named \"tax\" that will store the document embeddings using this model configuration."
+ ],
+ "metadata": {
+ "id": "PX2NybWKthV7"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "client = chromadb.Client(\n",
+ " settings=Settings(anonymized_telemetry=False)\n",
+ ")\n",
+ "\n",
+ "sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(\n",
+ " model_name=\"louisbrulenaudet/lemone-embed-pro\",\n",
+ " device=\"cuda\" if is_available() else \"cpu\",\n",
+ " trust_remote_code=True\n",
+ ")\n",
+ "\n",
+ "collection = client.get_or_create_collection(\n",
+ " name=\"tax\",\n",
+ " embedding_function=sentence_transformer_ef\n",
+ ")"
+ ],
+ "metadata": {
+ "id": "T9OHkgaIt9Ki"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This cell populates the ChromaDB collection with the precomputed embeddings from the \"lemone_pro_embeddings\" column, the corresponding text content, all remaining columns as metadata, and automatically generated sequential IDs for each document.\n"
+ ],
+ "metadata": {
+ "id": "fGQHsmjCvuZW"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "collection.add(\n",
+ " embeddings=dataframe[\"lemone_pro_embeddings\"].to_list(),\n",
+ " documents=dataframe[\"text\"].to_list(),\n",
+ " metadatas=dataframe.drop(\n",
+ " [\n",
+ " \"lemone_pro_embeddings\",\n",
+ " \"text\"\n",
+ " ]\n",
+ " ).to_dicts(),\n",
+ " ids=[\n",
+ " str(i) for i in range(0, dataframe.shape[0])\n",
+ " ]\n",
+ ")"
+ ],
+ "metadata": {
+ "id": "VjC22bRauAk-"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Collection querying"
+ ],
+ "metadata": {
+ "id": "BVJWOhhW3vjW"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "collection.query(\n",
+ " query_texts=[\"Les personnes morales de droit public ne sont pas assujetties à la taxe sur la valeur ajoutée pour l'activité de leurs services administratifs, sociaux, éducatifs, culturels et sportifs lorsque leur non-assujettissement n'entraîne pas de distorsions dans les conditions de la concurrence.\"],\n",
+ " n_results=10,\n",
+ ")"
+ ],
+ "metadata": {
+ "id": "-xdrJPCRuBQ4"
+ },
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file