
RAG on PDFs with text and embedded images, with citations referencing the images that answer the user query

Workshop

Check out the workshop based on this repo and solution, offering deeper learning with self-paced activities.


Overview

In today's era of Generative AI, customers can unlock valuable insights from their unstructured or structured data to drive business value. By infusing AI into their existing or new products, customers can create powerful applications that put the power of AI into the hands of their users. For these Generative AI applications to work on customers' data, implementing an efficient RAG (retrieval-augmented generation) solution is key to ensuring the right context from the data is provided to the LLM for a given user query.

Customers have PDF documents with text and embedded figures, which could be images or diagrams holding valuable information that they would like to use as context for the LLM to answer a given user query. Parsing those PDFs to implement an efficient RAG solution is challenging, especially when the customer wants to maintain the relationship between the text and the extracted image context used to answer the user query. Referencing the image as part of the citation that answers the user query is also challenging if the images are not extracted and made retrievable. This repository addresses the challenge of extracting PDF content with text and images as part of the RAG solution, where the relationship between the searchable text context and any of its extracted images is maintained, so that the images can be retrieved as references within the citations.

Below we outline a simple architecture for building a RAG application on PDF data, where the extracted image content within the PDF is also retrievable in the LLM output via citation references.

This repository is broken down into the following sections (click the hyperlink of each section for further details):

  • Azure Function: Focuses on processing the raw PDF files by chunking the text and extracting the images, as illustrated in the reference architecture below.
  • Azure AI Search: Focuses on configuring the index, indexer, and skillset. AI Search uses the data processed by the Azure Function to populate the index.
  • Demo application: Showcases an end-to-end demo as illustrated in the reference architecture below.
  • Demo with notebook code: Illustrates a simplified end-to-end version of the client/server components.
  • Azure Bicep: Infrastructure code to deploy the solution.

Reference architecture

This section includes three diagrams:

Document data management

[Diagram: document data management architecture]

The document data management flow operates as follows:

  1. A raw PDF document file is uploaded to Azure Blob storage.
  2. An event trigger in Azure Blob invokes an Azure Function, which then splits large PDFs, extracts text chunks, and maps images to the corresponding text chunks.
  3. Once the Azure Function prepares the data, it uploads the prepared data back to Azure Blob storage.
  4. An index scheduler is then invoked to initiate the indexing process for the prepared data.
  5. The prepared data is retrieved from Azure Blob by Azure AI Search.
  6. Azure AI Search processes the text chunks in parallel, using the Azure OpenAI embedding model to vectorize the text.
  7. The Azure AI Search index is populated with the prepared data and vectorized chunks. Additionally, it maps the relevant images to their corresponding text chunks using a custom index field (a sketch of such a record follows).
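
To make step 7's text-to-image mapping concrete, here is a minimal, hypothetical sketch of what one prepared record might look like. The actual schema is defined in the Azure Function section of this repo; the field names here (e.g., related_images) and the URL layout are assumptions for illustration only.

    import json

    # Hypothetical shape of one prepared record (the real schema lives in the
    # Azure Function section). "related_images" stands in for the custom index
    # field that maps extracted images to their text chunk (step 7).
    prepared_record = {
        "id": "mydoc-page-3-chunk-1",
        "content": "AKS simplifies deploying a managed Kubernetes cluster in Azure...",
        "related_images": [
            "https://<storage-account>.blob.core.windows.net/<container>/prepared_data/mydoc/images/page-3-figure-1.png"
        ],
    }

    # The Azure Function writes records like this back to Blob storage (step 3),
    # where the Azure AI Search indexer picks them up (steps 4-7).
    print(json.dumps(prepared_record, indent=2))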

Application runtime

[Diagram: application runtime (On Your Data) architecture]

The application runtime flow operates as follows:

  1. A user makes a query request through the client-side application.
  2. The server-side AI chatbot application forwards the user's query to Azure OpenAI. Note: This step is an ideal point to implement controls such as safety measures using the Azure AI Content Safety service.
  3. Azure OpenAI, given the user's query, makes a request to Azure AI Search to retrieve relevant text and images. Notably, the responsibility for making the request to Azure AI Search shifts from the application code to the Azure OpenAI service itself (a sketch of this request pattern follows the note below).
  4. With the user's query and the relevant text retrieved from Azure AI Search, Azure OpenAI generates the response.
  5. Azure OpenAI returns the generated response and associated metadata (e.g., citation data) to the server-side AI chatbot application.
  6. The server-side AI chatbot application remaps the response data, creating a payload that includes text and image URLs. This step is another excellent point to implement additional controls before sending the payload back to the client-side application.
  7. The server-side AI chatbot application sends the response to the user's query back to the client-side application.
  8. The client-side application displays the generated response text and downloads any images from Azure Blob, rendering them in the user interface.

Note: Steps 9a and 9b are conceptual components of the reference architecture but are not currently part of the deployable artifact. We welcome your feedback and may extend the implementation to include these steps.
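
Steps 2 through 5 follow Azure OpenAI's On Your Data request pattern: the application sends a single chat completion request that names Azure AI Search as a data source, and the service performs the retrieval itself. Below is a minimal sketch of that pattern (not the demo app's actual code); the search endpoint, index name, and key are placeholders you would take from your own deployment.

    import os
    from openai import AzureOpenAI  # pip install openai

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_KEY"],
        api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    )

    # "On Your Data": Azure OpenAI queries Azure AI Search on our behalf
    # (step 3), so the request carries the search connection as a data source.
    response = client.chat.completions.create(
        model=os.environ["AZURE_OPENAI_CHATGPT_DEPLOYMENT"],
        messages=[{"role": "user", "content": "Tell me about Kubernetes."}],
        extra_body={
            "data_sources": [{
                "type": "azure_search",
                "parameters": {
                    "endpoint": "https://<search-service>.search.windows.net",
                    "index_name": "<index-name>",
                    "authentication": {"type": "api_key", "key": "<search-api-key>"},
                },
            }]
        },
    )

    message = response.choices[0].message
    print(message.content)

    # Citation metadata (steps 5-6) is returned alongside the message; the SDK
    # surfaces it as an extra "context" attribute on the message object.
    citations = getattr(message, "context", {}).get("citations", [])
    for c in citations:
        print(c.get("title"), c.get("url"))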

Azure Blob directory and file structure

[Diagram: Azure Blob directory and file structure]

The directory and file structure serve the following primary purposes:

  • Azure Function: To retrieve raw PDF files and upload the prepared data back. The event trigger is configured to receive events under the raw_data directory.
  • Azure AI Search: To download the prepared data for populating the index. The Azure AI Search data source is configured to retrieve data from the prepared_data directory.
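
As a concrete example, the upload that kicks off this flow could be done directly with the Blob SDK. This is a sketch only: the helper script (see Infrastructure deployment below) automates it, and the container name and connection string here are placeholders; only the raw_data directory comes from the structure above.

    from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

    # Connection string and container name are placeholders; check the Bicep
    # outputs of your deployment for the actual values.
    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    blob = service.get_blob_client(container="<container>", blob="raw_data/myfile.pdf")

    # Uploading under raw_data/ fires the Azure Function's event trigger.
    with open("sample-documents/myfile.pdf", "rb") as f:
        blob.upload_blob(f, overwrite=True)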

Infrastructure deployment

Prerequisites

  • Azure subscription
  • Azure CLI
    Note: Ensure the az bicep extension is installed. You can install it by running az bicep install
  • Permission to create and access resources in Azure
  • Docker
  • If you're on Windows, use WSL with the Ubuntu distro; Azure CLI and Docker also need to be installed inside Ubuntu
  • Azure OpenAI chat and embedding models deployed
    Note: If you don't have the models deployed, you can follow the create and deploy an Azure OpenAI Service resource guide to do so.
    Note: This solution was developed and tested using gpt-4o as the chat model and text-embedding-ada-002 as the embedding model. Alternative models are likely to work too, but for the best experience, we recommend using the same models whenever possible.

Log in to your Azure tenant

az login --tenant "your-tenant-id-here"

Git clone

Clone or download this repo and cd into the project's root directory.

Creating the config file

For Azure AI Search to be configured correctly, and for the demo app to work, we first need to create a configuration file with the required information about your deployed Azure OpenAI chat and embedding models.

Create a .env_aoai file in the root directory of this repository. The following variables need to be set, with example values.

You can refer to the demo application section for guidance on where to obtain each of the values.

AZURE_OPENAI_ENDPOINT=https://my-domain-name.openai.azure.com/
AZURE_OPENAI_KEY=my-azure-open-ai-key
AZURE_OPENAI_CHATGPT_DEPLOYMENT=my-gpt-deployment-name
AZURE_OPENAI_API_VERSION=2024-04-01-preview
AZURE_OPENAI_CHATGPT_EMBEDDING_DEPLOYMENT=my-gpt-deployment-embedding-model-name
AZURE_OPENAI_CHATGPT_EMBEDDING_MODEL_NAME=text-embedding-ada-002
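
Optionally, you can sanity-check these values with a short script (not part of the repo's helpers) that loads .env_aoai and makes a one-off embedding call:

    import os
    from dotenv import load_dotenv  # pip install python-dotenv
    from openai import AzureOpenAI  # pip install openai

    load_dotenv(".env_aoai")
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_KEY"],
        api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    )

    # A tiny embedding call verifies the endpoint, key, and deployment name.
    vector = client.embeddings.create(
        model=os.environ["AZURE_OPENAI_CHATGPT_EMBEDDING_DEPLOYMENT"],
        input="ping",
    ).data[0].embedding
    print(f"OK - embedding dimension: {len(vector)}")  # 1536 for text-embedding-ada-002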

Deployment

There are two options for running through the deployment:

  • Option 1: You want the entire deployment to happen seamlessly in the background, and to go straight to testing the solution using the demo app.
  • Option 2 (Recommended): You want to go step-by-step to gain a better understanding of what's involved in setting up the solution, and only then run the demo app.

Note: A helper bash script will be used to deploy all parts. You can use the -x bash option if you'd like to see more details of what's being executed. Example: bash -x ./helper.sh test

Option 1

  1. Build the Docker image, run the container, and exec into it
    bash ./helper.sh docker-up

  2. Deploy the solution (might take ~10 minutes to complete)
    bash ./helper.sh deploy

  3. Try the demo app
    Open the demo app in your browser at http://localhost:8501. In the chat window, type Tell me about Kubernetes. You should see a response and an overall demo app UI view similar to the image below.

[Screenshot: demo app chat view]

Option 2

  1. Build the Docker image with all required dependencies
    bash ./helper.sh docker-build

  2. Run the Docker container
    bash ./helper.sh docker-run

The container will:

  • Mount a volume with the Azure CLI directory (i.e., ~/.azure), so your Azure credentials can be used for Azure resource deployment
  • Mount a volume of this repository
  • Bind port 8501 to access the demo app
  3. Attach bash to the container in interactive mode
    bash ./helper.sh docker-exec

  4. Create the Azure resource group
    bash ./helper.sh create-resource-group

  5. Deploy the infrastructure
    bash ./helper.sh deploy-bicep

  6. Create the .env file using Bicep outputs
    bash ./helper.sh create-dot-env

  7. Configure the deployed Azure AI Search service
    Creates the data source, index, skillset, and indexer (a sketch of the kind of index this sets up appears after this list).
    bash ./helper.sh setup-ai-search

  8. Deploy the Azure Function code
    bash ./helper.sh deploy-function

  9. Upload the sample PDF document
    The sample document located in the ./sample-documents directory is used. It's a few-page excerpt from the Azure AKS documentation.
    bash ./helper.sh upload-pdf

Note: Before executing the next command, please wait about 60 seconds for the Azure Function to prepare the uploaded PDF document, so it's ready to be indexed by the Azure AI Search indexer.

  10. Run the Azure AI Search indexer to populate the index
    bash ./helper.sh run-indexer

  11. Create the .env file for the demo app
    bash ./helper.sh create-dot-env-demo-app

  12. Install the demo app Python dependencies
    bash ./helper.sh install-demo-app-dependencies

  13. Run the demo app
    bash ./helper.sh run-demo-app

  14. Try the demo app
    Open the demo app in your browser at http://localhost:8501. In the chat window, type Tell me about Kubernetes. You should see a response and an overall demo app UI view similar to the image below.

[Screenshot: demo app chat view]
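
For reference, the sketch below shows the kind of index definition that step 7 (setup-ai-search) configures, expressed with the azure-search-documents Python SDK. It's a hypothetical illustration, not the repo's actual definition (which lives in the Azure AI Search section); the field names, vector dimensions, and service names are assumptions.

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents.indexes import SearchIndexClient  # pip install azure-search-documents>=11.4
    from azure.search.documents.indexes.models import (
        HnswAlgorithmConfiguration, SearchField, SearchFieldDataType,
        SearchIndex, SearchableField, SimpleField, VectorSearch, VectorSearchProfile,
    )

    client = SearchIndexClient("https://<search-service>.search.windows.net",
                               AzureKeyCredential("<search-admin-key>"))

    index = SearchIndex(
        name="<index-name>",
        fields=[
            SimpleField(name="id", type=SearchFieldDataType.String, key=True),
            SearchableField(name="content", type=SearchFieldDataType.String),
            # Vector field populated via the embedding model during indexing
            # (1536 dimensions for text-embedding-ada-002).
            SearchField(name="content_vector",
                        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                        searchable=True, vector_search_dimensions=1536,
                        vector_search_profile_name="vector-profile"),
            # Hypothetical custom field mapping each chunk to its extracted
            # image URLs, so citations can reference the images.
            SimpleField(name="related_images",
                        type=SearchFieldDataType.Collection(SearchFieldDataType.String)),
        ],
        vector_search=VectorSearch(
            algorithms=[HnswAlgorithmConfiguration(name="hnsw")],
            profiles=[VectorSearchProfile(name="vector-profile",
                                          algorithm_configuration_name="hnsw")],
        ),
    )
    client.create_or_update_index(index)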

Cleanup

  1. Delete all deployed Azure resources (Note: to be executed from within the container)
    bash ./helper.sh cleanup

  2. Stop and remove the Docker container
    First, exit the Docker container by typing exit and hitting return in the terminal, then run the command below.
    bash ./helper.sh docker-container-stop-remove

  3. Delete the Docker image
    bash ./helper.sh docker-remove-image

Extending deployment with your own documents

You can easily extend this solution to test it on your own documents in just a few steps.

  1. Prepare the document
    Copy your document into the ./sample-documents directory.

  2. Upload the document to the Azure Blob storage that was provisioned as part of the infrastructure deployment
    file_name="myfile.pdf" bash ./helper.sh upload-pdf

  3. Run the Azure AI Search indexer to index your document (the SDK equivalent is sketched below)
    bash ./helper.sh run-indexer
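
For reference, a minimal SDK equivalent of the run-indexer helper might look like this; the service and indexer names are placeholders from your own deployment.

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents.indexes import SearchIndexerClient  # pip install azure-search-documents

    client = SearchIndexerClient("https://<search-service>.search.windows.net",
                                 AzureKeyCredential("<search-admin-key>"))
    client.run_indexer("<indexer-name>")  # re-runs indexing over the newly prepared data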

Congratulations! You can now use the demo app to ask questions about your own document.
