Part of GPT-RAG
The diagram below provides an overview of the document ingestion pipeline, which handles various document types, preparing them for indexing and retrieval.
Workflow
- The `ragindex-indexer-chunk-documents` indexer reads new documents from the `documents` blob container.
- For each document, it calls the `document-chunking` function app to segment the content into chunks and generate embeddings using the ADA model.
- Finally, each chunk is indexed in the AI Search Index.
The `document_chunking` function breaks documents into smaller segments called chunks.
When a document is submitted, the system identifies its file type and selects the appropriate chunker to divide it into chunks suitable for that specific type.
- For `.pdf` files, the system uses the DocAnalysisChunker with the Document Intelligence API, which extracts structured elements, like tables and sections, converting them into Markdown. LangChain splitters then segment the content based on sections. When Document Intelligence API 4.0 is enabled, `.docx` and `.pptx` files are processed with this chunker as well.
- For image files such as `.bmp`, `.png`, `.jpeg`, and `.tiff`, the DocAnalysisChunker performs Optical Character Recognition (OCR) to extract text before chunking.
- For specialized formats, specific chunkers are applied:
  - `.vtt` files (video transcriptions) are handled by the TranscriptionChunker, chunking content by time codes.
  - `.xlsx` files (spreadsheets) are processed by the SpreadsheetChunker, chunking by rows or sheets.
- For text-based files like `.txt`, `.md`, `.json`, and `.csv`, the LangChainChunker uses LangChain splitters to divide the content by paragraphs or sections.
This setup ensures each document is processed by the most suitable chunker, leading to efficient and accurate chunking.
Important: The file extension determines the choice of chunker as outlined above.
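The extension-based routing described above can be pictured as a simple lookup. The following is a minimal sketch, not the repository's implementation: `select_chunker`, the `doc_int_40_enabled` flag, and the decision to raise `ValueError` for unsupported extensions are all assumptions made for illustration, and the chunkers are represented only by their names.

```python
from pathlib import Path

# Extension-to-chunker mapping, taken from the list above.
# Chunkers are represented by name only; the real classes are not shown here.
CHUNKER_BY_EXTENSION = {
    "pdf": "DocAnalysisChunker",
    "bmp": "DocAnalysisChunker",
    "png": "DocAnalysisChunker",
    "jpeg": "DocAnalysisChunker",
    "tiff": "DocAnalysisChunker",
    "vtt": "TranscriptionChunker",
    "xlsx": "SpreadsheetChunker",
    "txt": "LangChainChunker",
    "md": "LangChainChunker",
    "json": "LangChainChunker",
    "csv": "LangChainChunker",
}

# Formats handled by DocAnalysisChunker only when Document Intelligence API 4.0 is enabled.
DOC_INT_40_EXTENSIONS = {"docx", "pptx"}


def select_chunker(filename: str, doc_int_40_enabled: bool = False) -> str:
    """Return the name of the chunker for a file, based on its extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext in DOC_INT_40_EXTENSIONS and doc_int_40_enabled:
        return "DocAnalysisChunker"
    if ext in CHUNKER_BY_EXTENSION:
        return CHUNKER_BY_EXTENSION[ext]
    # Assumption: unsupported extensions are rejected rather than silently skipped.
    raise ValueError(f"No chunker available for extension {ext!r}")
```

For example, `select_chunker("minutes.vtt")` returns `"TranscriptionChunker"`, while `select_chunker("deck.pptx", doc_int_40_enabled=True)` returns `"DocAnalysisChunker"`.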
Customization
The chunking process is customizable. You can modify existing chunkers or create new ones to meet specific data processing needs, optimizing the pipeline.
This repository supports image ingestion for a multimodal RAG scenario. For an overview of how multimodality is implemented in GPT-RAG, see Multimodal RAG Overview.
To enable multimodal ingestion, set the `MULTIMODAL` environment variable to `true` before starting to index your data.
When `MULTIMODAL` is set to `true`, the data ingestion pipeline extends its capabilities to handle both text and images within your source documents, using the `MultimodalChunker`. Below is an overview of how this multimodal ingestion process works, including image extraction, captioning, and cleanup.
- Thresholded Image Extraction
  - The system uses Document Intelligence to parse each document, detecting text elements as well as embedded images. This approach extends the standard `DocAnalysisChunker` by adding image extraction steps on top of the usual text-based process.
  - To avoid clutter and maintain relevance, an area threshold is applied so that only images exceeding a certain percentage of the page size are ingested. This ensures very small or irrelevant images are skipped.
  - Any images meeting or exceeding this threshold are then extracted for further processing.
- Image Storage in Blob Container
  - Detected images are downloaded and placed in a dedicated Blob Storage container (by default `documents-images`).
  - Each image is assigned a blob name and a URL, enabling the ingestion pipeline (and later queries) to reference where the image is stored.
- Textual Content and Captions
  - Alongside normal text chunking (paragraphs, sections, etc.), each extracted image is captioned to generate a concise textual description of its contents.
  - These captions are combined with the surrounding text, allowing chunks to contain both plain text and image references (with descriptive captions).
- Unified Embeddings and Indexing
  - The ingestion pipeline produces embeddings for both text chunks and the generated image captions, storing them in the AI Search Index.
  - The index is adapted to include fields for `contentVector` (text embeddings) and `captionVector` (image caption embeddings), as well as references to any related images in the `documents-images` container.
  - This architecture allows multimodal retrieval, where queries can match either the main text or the descriptive captions.
- Image Cleanup Routine
  - A dedicated purging process periodically checks the `documents-images` container and removes any images no longer referenced in the AI Search Index.
  - This ensures storage is kept in sync with ingested content, avoiding orphaned or stale images that are no longer needed.
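The area threshold used during image extraction can be illustrated with a small filter. This is only a sketch under stated assumptions: `ImageRegion`, `filter_images_by_area`, and the default `min_area_fraction` of 0.05 are hypothetical names and values, since the actual threshold and data structures are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRegion:
    """Hypothetical container for an image detected on a page."""
    name: str
    width: float   # same unit as the page dimensions
    height: float


def filter_images_by_area(images, page_width, page_height, min_area_fraction=0.05):
    """Keep images whose area is at least `min_area_fraction` of the page area."""
    page_area = page_width * page_height
    if page_area <= 0:
        raise ValueError("Page dimensions must be positive")
    return [
        img for img in images
        if (img.width * img.height) / page_area >= min_area_fraction
    ]
```

On an 8.5 x 11 page, a 4 x 3 chart covers about 13% of the page and is kept, while a 0.5 x 0.5 icon covers well under 1% and is skipped.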
By activating `MULTIMODAL`, your ingestion process captures both text and visuals in a single workflow, providing a richer knowledge base for Retrieval Augmented Generation scenarios. Queries can match not just textual content but also relevant image captions, retrieving valuable visual context stored in `documents-images`.
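Conceptually, the image cleanup routine reduces to a set difference between the blobs stored in `documents-images` and the images still referenced by the index. The sketch below assumes both lists have already been retrieved; `find_orphaned_images` is a hypothetical helper, and the calls that list blobs or query the index are intentionally left out.

```python
def find_orphaned_images(stored_blob_names, referenced_blob_names):
    """Return blob names present in storage but no longer referenced by the index."""
    return sorted(set(stored_blob_names) - set(referenced_blob_names))
```

Only the names returned by this function would be candidates for deletion; images that are still referenced are left untouched.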
If you are using NL2SQL or Chat with Fabric Data strategies in your orchestration component, you need to index some metadata. Additionally, you can index sample query content to assist with retrieval during query generation. This indexed content helps generate SQL and DAX queries more effectively using these strategies. More details about agentic strategies can be found in the orchestrator repository.
The ingestion process indexes two types of content:
- query: Sample queries used for few-shot learning by the orchestrator (optional).
- table: Descriptions of tables and their columns, serving as a data dictionary to help the orchestrator identify relevant tables for user queries.
Each item, whether a query or a table, is represented as a JSON file containing specific information. JSON files should be stored in the `queries` and `tables` folders inside the `nl2sql` container in the solution's storage account.
The diagram below illustrates the NL2SQL data ingestion pipeline:
Here’s an example of a table metadata file:
{
  "table": "dimension_city",
  "description": "City dimension table containing details of locations associated with sales and customers.",
  "datasource": "wwi-sales-star-schema",
  "columns": [
    {
      "name": "CityKey",
      "description": "Primary key for city records."
    },
    {
      "name": "WWICityID",
      "description": "Identifier for the city in the worldwide database."
    },
    {
      "name": "City",
      "description": "Name of the city."
    },
    {
      "name": "StateProvince",
      "description": "State or province where the city is located."
    },
    {
      "name": "Country",
      "description": "Country where the city is located."
    }
  ]
}
Here’s an example of an SQL query file:
{
  "datasource": "adventureworks",
  "question": "What are the top 5 most expensive products currently available for sale?",
  "query": "SELECT TOP 5 ProductID, Name, ListPrice FROM SalesLT.Product WHERE SellEndDate IS NULL ORDER BY ListPrice DESC",
  "selected_tables": [
    "SalesLT.Product"
  ],
  "selected_columns": [
    "SalesLT.Product-ProductID",
    "SalesLT.Product-Name",
    "SalesLT.Product-ListPrice",
    "SalesLT.Product-SellEndDate"
  ],
  "reasoning": "This query retrieves the top 5 products with the highest selling prices that are currently available for sale. It uses the SalesLT.Product table, selects relevant columns, and filters out products that are no longer available by checking that SellEndDate is NULL."
}
Here’s an example of a DAX query:
{
  "datasource": "wwi-sales-aggregated-data",
  "question": "Who are the top 5 employees with the highest total sales including tax?",
  "query": "EVALUATE TOPN(5, SUMMARIZE(aggregate_sale_by_date_employee, aggregate_sale_by_date_employee[Employee], aggregate_sale_by_date_employee[SumOfTotalIncludingTax]), aggregate_sale_by_date_employee[SumOfTotalIncludingTax], DESC)",
  "selected_tables": [
    "aggregate_sale_by_date_employee"
  ],
  "selected_columns": [
    "aggregate_sale_by_date_employee[Employee]",
    "aggregate_sale_by_date_employee[SumOfTotalIncludingTax]"
  ],
  "reasoning": "This DAX query identifies the top 5 employees based on the total sales amount including tax. It leverages the aggregate_sale_by_date_employee table, aggregates the sales data by employee, and orders the results to display the highest earners first."
}
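Before uploading files to the `nl2sql` container, it can help to check that each one carries the fields used in the examples above. The following is a minimal sketch: `missing_fields` is a hypothetical helper, and treating every field that appears in the examples as required is an assumption, not a documented schema.

```python
import json

# Field names copied from the example files above.
# Assumption: all of them are treated as required for this check.
FIELDS_BY_KIND = {
    "query": {
        "datasource", "question", "query",
        "selected_tables", "selected_columns", "reasoning",
    },
    "table": {"table", "description", "datasource", "columns"},
}


def missing_fields(raw_json: str, kind: str) -> list[str]:
    """Return the expected fields that are absent from a query or table file."""
    if kind not in FIELDS_BY_KIND:
        raise ValueError(f"Unknown item kind: {kind!r}")
    item = json.loads(raw_json)
    return sorted(FIELDS_BY_KIND[kind] - set(item))
```

An empty list means the file has the same top-level shape as the examples; any returned names point at fields to add before ingestion.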
Additional examples of queries and tables can be found in the samples directory of this repository.
SQL Database examples are based on the Adventure Works sample SQL Database, which you can install in an Azure SQL Database.
Sample Adventure Works Database Tables
Fabric-based examples use the fictional Wide World Importers company Lakehouse and a semantic model generated using this tutorial.
Every JSON file, whether describing a query or a table, contains a datasource field. This field represents the datasource ID, which is an internal identifier used by GPT-RAG to manage multiple data sources.
The datasource information is stored as a JSON document in the `datasources` container within CosmosDB, used by GPT-RAG. This document contains relevant details about the specific datasource, including its type and connection details.
Example of Datasources in CosmosDB
Currently, there are three types of datasources:
- Semantic Model
- SQL Endpoint
- SQL Database
The first two are designed for Fabric, where the orchestrator connects to the datasource using a Service Principal/App Registration. For SQL Database connections, Managed Identity is used. Instructions on configuring connections for Fabric and SQL Database can be found in the administration guide in the main GPT-RAG repository.
Below are examples of different types of datasource configurations:
{
  "id": "wwi-sales-aggregated-data",
  "description": "This data source is a semantic model containing aggregated sales data. It is ideal for insights such as sales by employee or city.",
  "type": "semantic_model",
  "organization": "myorg",
  "dataset": "your_dataset_or_semantic_model_name",
  "tenant_id": "your_sp_tenant_id",
  "client_id": "your_sp_client_id"
}

{
  "id": "wwi-sales-star-schema",
  "description": "This data source is a star schema that organizes sales data. It includes a fact table for sales and dimension tables such as city, customer, and inventory items (products).",
  "type": "sql_endpoint",
  "organization": "myorg",
  "server": "your_sql_endpoint. Ex: xpto.datawarehouse.fabric.microsoft.com",
  "database": "your_lakehouse_name",
  "tenant_id": "your_sp_tenant_id",
  "client_id": "your_sp_client_id"
}

{
  "id": "adventureworks",
  "description": "AdventureWorksLT is a database featuring a schema with tables for customers, orders, products, and sales.",
  "type": "sql_database",
  "database": "adventureworkslt",
  "server": "sqlservername.database.windows.net"
}
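The three configurations above differ in which fields they carry and in how the orchestrator authenticates. The sketch below summarizes that as a lookup; `describe_datasource` and the labels `service_principal` and `managed_identity` are hypothetical, and the field sets are simply copied from the examples rather than taken from a formal schema.

```python
# Fields per datasource type, copied from the example documents above.
FIELDS_BY_TYPE = {
    "semantic_model": {
        "id", "description", "type", "organization",
        "dataset", "tenant_id", "client_id",
    },
    "sql_endpoint": {
        "id", "description", "type", "organization",
        "server", "database", "tenant_id", "client_id",
    },
    "sql_database": {"id", "description", "type", "database", "server"},
}

# Fabric types use a Service Principal; SQL Database uses Managed Identity.
# The label strings themselves are illustrative only.
AUTH_BY_TYPE = {
    "semantic_model": "service_principal",
    "sql_endpoint": "service_principal",
    "sql_database": "managed_identity",
}


def describe_datasource(doc: dict) -> dict:
    """Report the authentication method and any fields missing from a datasource document."""
    ds_type = doc.get("type")
    if ds_type not in FIELDS_BY_TYPE:
        raise ValueError(f"Unknown datasource type: {ds_type!r}")
    missing = sorted(FIELDS_BY_TYPE[ds_type] - set(doc))
    return {"auth": AUTH_BY_TYPE[ds_type], "missing": missing}
```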
Workflow
This section outlines the ingestion workflow for query elements.
Note: The workflow for tables and columns is similar; just replace queries with tables or columns in the steps below.
- The AI Search `queries-indexer` scans for new query files (each containing a single query) within the `queries` folder in the `nl2sql` storage container.

  Note: Files are stored in the `queries` folder, not in the root of the `nl2sql` container. This setup also applies to `tables` and `columns`.
- The `queries-indexer` then uses the `#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill` to create a vectorized representation of the question text using the Azure OpenAI Embeddings model.

  Note: For query items, the question itself is vectorized. For tables and columns, their descriptions are vectorized.
- Finally, the indexed content is added to the `nl2sql-queries` index.
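The rule for which text gets vectorized can be stated in a few lines. This is an illustrative sketch only: in the pipeline the embedding is produced by the indexer skill, and `text_to_vectorize` is a hypothetical function that just shows which field is selected for each kind of item.

```python
def text_to_vectorize(kind: str, item: dict) -> str:
    """Pick the field that is embedded for a given NL2SQL item."""
    if kind == "query":
        # For query items, the question itself is vectorized.
        return item["question"]
    if kind in ("table", "column"):
        # For tables and columns, the description is vectorized.
        return item["description"]
    raise ValueError(f"Unknown item kind: {kind!r}")
```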
The SharePoint connector operates through two primary processes, each running in a separate function within the Data Ingestion Function App:
- Indexing SharePoint Files: The `sharepoint_index_files` function retrieves files from SharePoint, processes them, and indexes their content into the Azure AI Search Index (`ragindex`).
- Purging Deleted Files: The `sharepoint_purge_deleted_files` function identifies and removes files that have been deleted from SharePoint to keep the search index up to date.
Both processes are managed by scheduled Azure Functions that run at regular intervals, leveraging configuration settings to determine their behavior. The diagram below illustrates the SharePoint indexing process.
Workflow
1.1. List files from a specific SharePoint site, directory, and file types configured in the settings.
1.2. Check if the document exists in the AI Search Index. If it exists, compare the `metadata_storage_last_modified` field to determine if the file has been updated.
1.3. Use the Microsoft Graph API to download the file if it is new or has been updated.
1.4. Process the file content using the regular document chunking process. For specific formats, like PDFs, use Document Intelligence.
1.5. Use Azure OpenAI to generate embeddings for the document chunks.
1.6. Upload the processed document chunks, metadata, and embeddings into the Azure AI Search Index.
2.1. Connect to the Azure AI Search Index to identify indexed documents.
2.2. Query the Microsoft Graph API to verify the existence of corresponding files in SharePoint.
2.3. Remove entries in the Azure AI Search Index for files that no longer exist.
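The two decisions at the heart of this workflow, whether a file needs (re)indexing and which index entries should be purged, can be sketched as follows. The function names are hypothetical, the Microsoft Graph and AI Search calls are omitted, and treating a file as updated only when its SharePoint timestamp is strictly newer than the indexed one is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def needs_indexing(sharepoint_last_modified: datetime,
                   indexed_last_modified: Optional[datetime]) -> bool:
    """Steps 1.2 and 1.3: index new files and files modified since they were indexed."""
    if indexed_last_modified is None:
        # The document is not in the index yet.
        return True
    return sharepoint_last_modified > indexed_last_modified


def files_to_purge(indexed_ids, sharepoint_ids):
    """Steps 2.1 to 2.3: indexed entries whose source file no longer exists in SharePoint."""
    return sorted(set(indexed_ids) - set(sharepoint_ids))
```

Both timestamps should carry the same timezone information (for example, both in UTC) so that the comparison is meaningful.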
Azure Function triggers automate the indexing and purging processes. Indexing runs at regular intervals to ingest updated SharePoint files, while purging removes deleted files to maintain an accurate search index. By default, both processes run every 10 minutes when enabled.
If you'd like to learn how to set up the SharePoint connector, check out SharePoint Connector Setup.
- Provision the infrastructure and deploy the solution using the GPT-RAG template.
- Redeployment Steps:
  - Prerequisites:
    - Azure Developer CLI
    - PowerShell (Windows only)
    - Git
    - Python 3.11
  - Redeployment commands:

        azd auth login
        azd env refresh
        azd deploy

    Note: Use the same environment name, subscription, and region as the initial deployment when running `azd env refresh`.
- For instructions on testing the data ingestion component locally in VS Code, see the Local Deployment Guide.
Follow the instructions to configure the SharePoint Connector in the Configuration Guide: SharePoint Connector.
- Refer to the GPT-RAG Admin & User Guide for instructions.
- See GPT-RAG Admin & User Guide for reindexing instructions.
Here are the formats supported by each chunker. The file extension determines which chunker is used.
DocAnalysisChunker
Extension | Doc Int API Version |
---|---|
pdf | 3.1, 4.0 |
bmp | 3.1, 4.0 |
jpeg | 3.1, 4.0 |
png | 3.1, 4.0 |
tiff | 3.1, 4.0 |
xlsx | 4.0 |
docx | 4.0 |
pptx | 4.0 |

LangChainChunker
Extension | Format |
---|---|
md | Markdown document |
txt | Plain text file |
html | HTML document |
shtml | Server-side HTML document |
htm | HTML document |
py | Python script |
json | JSON data file |
csv | Comma-separated values file |
xml | XML data file |

TranscriptionChunker
Extension | Format |
---|---|
vtt | Video transcription |

SpreadsheetChunker
Extension | Format |
---|---|
xlsx | Spreadsheet |