diff --git a/notebooks/lemone_embed_notebook_tutorial.ipynb b/notebooks/lemone_embed_notebook_tutorial.ipynb new file mode 100644 index 0000000..cbc8074 --- /dev/null +++ b/notebooks/lemone_embed_notebook_tutorial.ipynb @@ -0,0 +1,397 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true, + "authorship_tag": "ABX9TyOzdQuEjEqIX9Gcuv/hESlK", + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "# Lemone-Embed: A Series of Fine-Tuned Embedding Models for French Taxation\n", + "[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)\n", + "\n", + "
\n", + "

This series comprises 7 models: 3 base models of different sizes trained for 1 epoch, 3 models trained for 2 epochs making up the Boost series, and a Pro model built on a non-RoBERTa architecture.

\n", + "
\n", + "\n", + "This sentence transformers model, specifically designed for French taxation, has been fine-tuned on a dataset comprising 43 million tokens, integrating a blend of semi-synthetic and fully synthetic data generated by GPT-4 Turbo and Llama 3.1 70B, which have been further refined through evol-instruction tuning and manual curation.\n", + "\n", + "The model is tailored to meet the specific demands of information retrieval across large-scale tax-related corpora, supporting the implementation of production-ready Retrieval-Augmented Generation (RAG) applications. Its primary purpose is to enhance the efficiency and accuracy of legal processes in the taxation domain, with an emphasis on delivering consistent performance in real-world settings, while also contributing to advancements in legal natural language processing research.\n", + "\n", + "This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.\n", + "\n", + "If you use this code in your research, please use the following BibTeX entry.\n", + "\n", + "```BibTeX\n", + "@misc{louisbrulenaudet2024,\n", + " author = {Louis Brulé Naudet},\n", + " title = {Lemone-Embed: A Series of Fine-Tuned Embedding Models for French Taxation},\n", + " year = {2024}\n", + " howpublished = {\\url{https://huggingface.co/datasets/louisbrulenaudet/lemone-embed-pro}},\n", + "}\n", + "```\n", + "\n", + "## Feedback\n", + "\n", + "If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com)." 
+ ], + "metadata": { + "id": "jus7eI3ptMg_" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Collecting and installing dependencies" + ], + "metadata": { + "id": "X_nanITItWoB" + } + }, + { + "cell_type": "code", + "source": [ + "!pip3 install chromadb polars datasets sentence-transformers huggingface_hub" + ], + "metadata": { + "id": "RBZN_of-tZBl" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Importing packages\n", + "\n", + "## Core Database and Data Processing\n", + "\n", + "- ChromaDB: A specialized vector database that will be used to store and query our embeddings efficiently\n", + "- Polars: A modern, high-performance DataFrame library chosen as an alternative to pandas for data manipulation tasks\n", + "\n", + "## Machine Learning Infrastructure\n", + "\n", + "- Datasets: Integration with Hugging Face's dataset library for streamlined data handling\n", + "- PyTorch CUDA: Capability check for GPU acceleration to optimize model performance\n", + "\n", + "## Utility Components\n", + "\n", + "- Hashlib: Implementation of secure hash functions, likely used for creating unique identifiers for documents or embeddings\n", + "- Datetime: Temporal data handling for tracking embedding creation and modifications\n", + "- Type Hints: Comprehensive typing imports for enhanced code documentation and maintainability" + ], + "metadata": { + "id": "ujkbUgpZtcTn" + } + }, + { + "cell_type": "code", + "source": [ + "import hashlib\n", + "\n", + "from datetime import datetime\n", + "from typing import (\n", + " IO,\n", + " TYPE_CHECKING,\n", + " Any,\n", + " Dict,\n", + " List,\n", + " Type,\n", + " Tuple,\n", + " Union,\n", + " Mapping,\n", + " TypeVar,\n", + " Callable,\n", + " Optional,\n", + " Sequence,\n", + ")\n", + "\n", + "import chromadb\n", + "import polars as pl\n", + "\n", + "from chromadb.config import Settings\n", + "from chromadb.utils import embedding_functions\n", + "from datasets import 
Dataset\n", + "from torch.cuda import is_available" + ], + "metadata": { + "id": "lWVZ_-Kytr-g" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Datasets registration\n", + "\n", + "This cell loads a Parquet dataset from Hugging Face's repository (lemone-docs-embeded) using Polars' efficient lazy loading method (scan_parquet), filters out any rows with null values in the 'text' column to ensure data quality, and finally materializes the data into memory with .collect() for further processing." + ], + "metadata": { + "id": "JXimNwAltfOk" + } + }, + { + "cell_type": "code", + "source": [ + "dataframe = pl.scan_parquet(\n", + " \"hf://datasets/louisbrulenaudet/lemone-docs-embeded/data/train-00000-of-00001.parquet\"\n", + ").filter(\n", + " pl.col(\n", + " \"text\"\n", + " ).is_not_null()\n", + ").collect()" + ], + "metadata": { + "id": "J32rtjmjt4cB" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "If you want to re-create your dataset from the source, here is a code snippet that will help you:" + ], + "metadata": { + "id": "tolO_edV1Cme" + } + }, + { + "cell_type": "code", + "source": [ + "bofip_dataframe = pl.scan_parquet(\n", + " \"hf://datasets/louisbrulenaudet/bofip/data/train-00000-of-00001.parquet\"\n", + ").with_columns(\n", + " [\n", + " (\n", + " pl.lit(\"Bulletin officiel des finances publiques - impôts\").alias(\n", + " \"title_main\"\n", + " )\n", + " ),\n", + " (\n", + " pl.col(\"debut_de_validite\")\n", + " .str.strptime(pl.Date, format=\"%Y-%m-%d\")\n", + " .dt.strftime(\"%Y-%m-%d 00:00:00\")\n", + " ).alias(\"date_publication\"),\n", + " (\n", + " pl.col(\"contenu\")\n", + " .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)\n", + " .alias(\"hash\")\n", + " )\n", + " ]\n", + ").rename(\n", + " {\n", + " \"contenu\": \"text\",\n", + " \"permalien\": \"url_sourcepage\",\n", + " \"identifiant_juridique\": 
\"id_sub\",\n", + " }\n", + ").select(\n", + " [\n", + " \"text\",\n", + " \"title_main\",\n", + " \"id_sub\",\n", + " \"url_sourcepage\",\n", + " \"date_publication\",\n", + " \"hash\"\n", + " ]\n", + ")\n", + "\n", + "books: List[str] = [\n", + " \"hf://datasets/louisbrulenaudet/code-douanes/data/train-00000-of-00001.parquet\",\n", + " \"hf://datasets/louisbrulenaudet/code-impots/data/train-00000-of-00001.parquet\",\n", + " \"hf://datasets/louisbrulenaudet/code-impots-annexe-i/data/train-00000-of-00001.parquet\",\n", + " \"hf://datasets/louisbrulenaudet/code-impots-annexe-ii/data/train-00000-of-00001.parquet\",\n", + " \"hf://datasets/louisbrulenaudet/code-impots-annexe-iii/data/train-00000-of-00001.parquet\",\n", + " \"hf://datasets/louisbrulenaudet/code-impots-annexe-iv/data/train-00000-of-00001.parquet\",\n", + " \"hf://datasets/louisbrulenaudet/code-impositions-biens-services/data/train-00000-of-00001.parquet\",\n", + " \"hf://datasets/louisbrulenaudet/livre-procedures-fiscales/data/train-00000-of-00001.parquet\"\n", + "]\n", + "\n", + "legi_dataframe = pl.concat(\n", + " [\n", + " pl.scan_parquet(\n", + " book\n", + " ) for book in books\n", + " ]\n", + ").with_columns(\n", + " [\n", + " (\n", + " pl.lit(\"https://www.legifrance.gouv.fr/codes/article_lc/\")\n", + " .add(pl.col(\"id\"))\n", + " .alias(\"url_sourcepage\")\n", + " ),\n", + " (\n", + " pl.col(\"dateDebut\")\n", + " .cast(pl.Int64)\n", + " .map_elements(\n", + " lambda x: datetime.fromtimestamp(x / 1000).strftime(\"%Y-%m-%d %H:%M:%S\"),\n", + " return_dtype=pl.Utf8\n", + " )\n", + " .alias(\"date_publication\")\n", + " ),\n", + " (\n", + " pl.col(\"texte\")\n", + " .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)\n", + " .alias(\"hash\")\n", + " )\n", + " ]\n", + ").rename(\n", + " {\n", + " \"texte\": \"text\",\n", + " \"num\": \"id_sub\",\n", + " }\n", + ").select(\n", + " [\n", + " \"text\",\n", + " \"title_main\",\n", + " \"id_sub\",\n", + " 
\"url_sourcepage\",\n", + " \"date_publication\",\n", + " \"hash\"\n", + " ]\n", + ")\n", + "\n", + "print(\"Starting embeddings production...\")\n", + "\n", + "dataframe = pl.concat(\n", + " [\n", + " bofip_dataframe,\n", + " legi_dataframe\n", + " ]\n", + ").filter(\n", + " pl.col(\n", + " \"text\"\n", + " ).is_not_null()\n", + ").with_columns(\n", + " pl.col(\"text\").map_elements(\n", + " lambda x: sentence_transformer_ef(\n", + " [x]\n", + " )[0].tolist(),\n", + " return_dtype=pl.List(pl.Float64)\n", + " ).alias(\"lemone_pro_embeddings\")\n", + ").collect()" + ], + "metadata": { + "id": "KkOYEOeQ1Kcn" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Index creation\n", + "\n", + "This cell initializes a ChromaDB client with telemetry disabled, sets up a SentenceTransformer embedding model (using \"lemone-embed-pro\" with GPU acceleration if available), and creates or retrieves a collection named \"tax\" that will store the document embeddings using this model configuration." 
+ ], + "metadata": { + "id": "PX2NybWKthV7" + } + }, + { + "cell_type": "code", + "source": [ + "client = chromadb.Client(\n", + " settings=Settings(anonymized_telemetry=False)\n", + ")\n", + "\n", + "sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(\n", + " model_name=\"louisbrulenaudet/lemone-embed-pro\",\n", + " device=\"cuda\" if is_available() else \"cpu\",\n", + " trust_remote_code=True\n", + ")\n", + "\n", + "collection = client.get_or_create_collection(\n", + " name=\"tax\",\n", + " embedding_function=sentence_transformer_ef\n", + ")" + ], + "metadata": { + "id": "T9OHkgaIt9Ki" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "This cell populates the ChromaDB collection: the precomputed vectors from the \"lemone_pro_embeddings\" column are passed as embeddings, the corresponding text content as documents, all remaining columns as per-document metadata, and sequential string IDs are generated for each document.\n" + ], + "metadata": { + "id": "fGQHsmjCvuZW" + } + }, + { + "cell_type": "code", + "source": [ + "collection.add(\n", + " embeddings=dataframe[\"lemone_pro_embeddings\"].to_list(),\n", + " documents=dataframe[\"text\"].to_list(),\n", + " # Chroma expects metadatas as a list of dicts: drop the embedding\n", + " # and text columns, then convert each remaining row with to_dicts()\n", + " metadatas=dataframe.drop(\n", + " [\n", + " \"lemone_pro_embeddings\",\n", + " \"text\"\n", + " ]\n", + " ).to_dicts(),\n", + " ids=[\n", + " str(i) for i in range(dataframe.shape[0])\n", + " ]\n", + ")" + ], + "metadata": { + "id": "VjC22bRauAk-" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Collection querying" + ], + "metadata": { + "id": "BVJWOhhW3vjW" + } + }, + { + "cell_type": "code", + "source": [ + "collection.query(\n", + " query_texts=[\"Les personnes morales de droit public ne sont pas assujetties à la taxe sur la valeur ajoutée pour l'activité de leurs services administratifs, sociaux, éducatifs, culturels et sportifs lorsque leur non-assujettissement n'entraîne pas de distorsions dans les conditions 
de la concurrence.\"],\n", + " n_results=10,\n", + ")" + ], + "metadata": { + "id": "-xdrJPCRuBQ4" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file