Check out the workshop based on this repo and solution, which offers in-depth, self-paced learning activities.
In today's era of Generative AI, customers can unlock valuable insights from their unstructured or structured data to drive business value. By infusing AI into their existing or new products, customers can build powerful applications that put the power of AI into the hands of their users. For these Generative AI applications to work on customer data, implementing an efficient Retrieval-Augmented Generation (RAG) solution is key to ensuring the right data context is provided to the LLM for a given user query.
Customers have PDF documents containing text and embedded figures (images or diagrams) that hold valuable information they would like to use as context for the LLM when answering a user query. Parsing those PDFs to implement an efficient RAG solution is challenging, especially when the relationship between the text and the extracted image context used to answer the query must be maintained. Referencing an image as part of the citation that answers the query is also difficult if the images are not extracted and retrievable. This blog post addresses the challenge of extracting PDF content with text and images as part of a RAG solution, where the relationship between the searchable text and its extracted images is maintained so that the images can be retrieved as references within citations.
Below we outline a simple architecture for building a RAG application on PDF data, where image content extracted from the PDF is also retrievable as citation references within the LLM output.
This repository is broken down into the following sections (click the hyperlink of each section for further details):
| Section | Description |
|---|---|
| Azure Function | Processes the raw PDF files by chunking the text and extracting the images, as illustrated in the reference architecture below. |
| Azure AI Search | Configures the index, indexer, and skillset. AI Search uses the data processed by the Azure Function to populate the index. |
| Demo application | Showcases an end-to-end demo as illustrated in the reference architecture below. |
| Demo with notebook code | Illustrates a simplified end-to-end version representing the client/server components. |
| Azure Bicep | Infrastructure code to deploy the solution. |
This section includes three diagrams:
- Document Data Management: This diagram illustrates the process from PDF upload and vectorization to data being indexed in Azure AI Search, making it ready to handle query requests.
- Application Runtime: This diagram outlines the complete flow of user requests and responses.
- Azure Blob Directory and File Structure: This diagram shows how data is organized in Azure Blob.
The document data management flow operates as follows:
- A raw PDF document file is uploaded to Azure Blob storage.
- An event trigger in Azure Blob invokes an Azure Function, which then splits large PDFs, extracts text chunks, and maps images to the corresponding text chunks.
- Once the Azure Function prepares the data, it uploads the prepared data back to Azure Blob storage.
- An index scheduler is then invoked to initiate the indexing process for the prepared data.
- The prepared data is retrieved from Azure Blob by Azure AI Search.
- Azure AI Search processes the text chunks in parallel, using the Azure OpenAI embedding model to vectorize the text.
- The Azure AI Search index is populated with the prepared data and vectorized chunks. Additionally, it maps the relevant images to their corresponding text chunks using a custom index field.
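To make the mapping concrete, here is a hedged sketch of what a single indexed chunk produced by this flow might look like. The field names (for example `related_images`) are hypothetical; the actual schema is defined by the Azure AI Search configuration in this repository.

```python
# Illustrative only: a hedged sketch of one indexed chunk produced by the flow
# above. Field names such as "related_images" are hypothetical; the real schema
# is defined by the index/skillset configuration in this repository.
indexed_chunk = {
    "chunk_id": "mydoc-0003",
    "content": "Azure Kubernetes Service (AKS) simplifies deploying a managed Kubernetes cluster...",
    "content_vector": [0.012, -0.034, 0.101],  # truncated; produced by the Azure OpenAI embedding model
    "source_document": "raw_data/mydoc.pdf",
    # Custom field mapping extracted images to this text chunk, so they can be
    # returned as retrievable references within citations.
    "related_images": ["prepared_data/mydoc/images/page_03_img_01.png"],
}
```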
The application runtime flow operates as follows:
- The user makes a query request through the client-side application.
- The server-side AI chatbot application forwards the user's query to Azure OpenAI. Note: This step is an ideal point to implement controls such as safety measures using the Azure AI Content Safety service.
- Azure OpenAI, given the user's query, makes a request to Azure AI Search to retrieve relevant text and images. Notably, the responsibility for making the request to Azure AI Search shifts from the application code to the Azure OpenAI service itself.
- With the user's query and the relevant text retrieved from Azure AI Search, Azure OpenAI generates the response.
- Azure OpenAI returns the generated response and associated metadata (e.g., citation data) to the server-side AI chatbot application.
- The server-side AI chatbot application remaps the response data, creating a payload that includes text and image URLs. This step is another excellent point to implement additional controls before sending the payload back to the client-side application.
- The server-side AI chatbot application sends the response to the user's query back to the client-side application.
- The client-side application displays the generated response text and downloads any images from Azure Blob, rendering them in the user interface.
Note: Steps 9a and 9b are conceptual components of the reference architecture but are not currently part of the deployable artifact. We welcome your feedback and may potentially extend the implementation to include these steps.
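As an illustration of the retrieval hand-off described above, the following is a minimal sketch of the server-side call, assuming the Azure OpenAI "on your data" extension is used so that Azure OpenAI queries Azure AI Search on the application's behalf. The `AZURE_SEARCH_*` variable names are assumptions for illustration only; the deployed demo app reads its values from its own `.env` file.

```python
# A hedged sketch of the server-side request, assuming the Azure OpenAI
# "on your data" extension (openai Python SDK v1+). The AZURE_SEARCH_*
# environment variable names below are illustrative assumptions.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)

response = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_CHATGPT_DEPLOYMENT"],
    messages=[{"role": "user", "content": "Tell me about Kubernetes."}],
    # Azure OpenAI, not the application code, makes the retrieval request to Azure AI Search.
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": os.environ["AZURE_SEARCH_ENDPOINT"],  # assumed variable name
                    "index_name": os.environ["AZURE_SEARCH_INDEX"],   # assumed variable name
                    "authentication": {
                        "type": "api_key",
                        "key": os.environ["AZURE_SEARCH_KEY"],        # assumed variable name
                    },
                },
            }
        ]
    },
)

# The generated answer and citation metadata come back to the server-side app,
# which can remap citations to text and image URLs before replying to the client.
print(response.choices[0].message.content)
```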
The directory and file structure serve the following primary purposes:
- Azure Function: To retrieve raw PDF files and upload the prepared data back. The event trigger is configured to receive events under the `raw_data` directory.
- Azure AI Search: To download the prepared data for populating the index. The Azure AI Search data source is configured to retrieve data from the `prepared_data` directory.
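For a rough sense of how this layout is used, the sketch below writes one prepared chunk under the `prepared_data` directory with the azure-storage-blob SDK. The container name, blob paths, and record fields are illustrative assumptions; the deployed Azure Function's actual output layout may differ.

```python
# A hedged sketch only: writing prepared output under prepared_data/ using the
# azure-storage-blob SDK. Container name, blob paths, and record fields are
# illustrative assumptions, not the deployed Azure Function's exact layout.
import json
import os
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
container = blob_service.get_container_client("documents")  # assumed container name

prepared_chunk = {
    "chunk_id": "mydoc-0001",
    "content": "First text chunk extracted from the PDF...",
    "related_images": ["prepared_data/mydoc/images/page_01_img_01.png"],
}

# The blob event trigger fires on uploads under raw_data/; prepared output is
# written back under prepared_data/, where the AI Search data source reads it.
container.upload_blob(
    name="prepared_data/mydoc/chunk_0001.json",
    data=json.dumps(prepared_chunk),
    overwrite=True,
)
```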
- Azure subscription
- Azure CLI
  Note: Ensure the `az bicep` extension is installed. You can install it by running `az bicep install`.
- Permission to create and access resources in Azure
- Docker
- If you're on Windows, use WSL with an Ubuntu distro, with the Azure CLI and Docker also installed inside Ubuntu
- Azure OpenAI chat and embedding models deployed

  Note: If you don't have the models deployed, you can follow the create and deploy an Azure OpenAI Service resource guide to do so.

  Note: This solution was developed and tested using `gpt-4o` as the chat model and `text-embedding-ada-002` as the embedding model. Alternative models are likely to work too, but for the best experience, we recommend using the same models whenever possible.
az login --tenant "your-tenant-id-here"
Clone or download this repo and `cd` into the project's root directory.
For Azure AI Search to be configured correctly and the demo app to work, we first need to create a configuration file with the required information about your deployed Azure OpenAI chat and embedding models.
Create a `.env_aoai` file in the root directory of this repository. The following variables need to be set, with example values shown.
You can refer to the demo application section for guidance on where to obtain each of the values.
AZURE_OPENAI_ENDPOINT=https://my-domain-name.openai.azure.com/
AZURE_OPENAI_KEY=my-azure-open-ai-key
AZURE_OPENAI_CHATGPT_DEPLOYMENT=my-gpt-deployment-name
AZURE_OPENAI_API_VERSION=2024-04-01-preview
AZURE_OPENAI_CHATGPT_EMBEDDING_DEPLOYMENT=my-gpt-deployment-embedding-model-name
AZURE_OPENAI_CHATGPT_EMBEDDING_MODEL_NAME=text-embedding-ada-002
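Optionally, you can sanity-check these values before continuing. The following sketch is not part of the helper script; it assumes the python-dotenv and openai Python packages are installed and simply calls the embedding deployment once.

```python
# Optional sanity check (not part of the helper script): load .env_aoai and
# call the embedding deployment once to confirm the values are correct.
# Assumes the python-dotenv and openai packages are installed.
import os
from dotenv import load_dotenv
from openai import AzureOpenAI

load_dotenv(".env_aoai")

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)

result = client.embeddings.create(
    model=os.environ["AZURE_OPENAI_CHATGPT_EMBEDDING_DEPLOYMENT"],
    input="connection test",
)
print(len(result.data[0].embedding))  # e.g. 1536 for text-embedding-ada-002
```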
There are two options for running through the deployment:
- Option 1: You want the whole deployment to happen seamlessly in the background and go straight to testing the solution using the demo app.
- Option 2 (Recommended): You want to go step-by-step to gain a better understanding of what's involved in setting up the solution, and only then run the demo app.

Note: A helper bash script will be used to deploy all parts. You can use the `-x` bash option if you'd like to see more details of what's being executed. Example: `bash -x ./helper.sh test`
- Build the docker image, run the container, and exec into it

  bash ./helper.sh docker-up

- Deploy the solution (might take ~10 min to complete)

  bash ./helper.sh deploy

- Try the demo app

  Open the demo app in your browser at http://localhost:8501. In the chat window, type `Tell me about Kubernetes.` You should see a response and an overall demo app UI view similar to the image below.
- Build the docker image with all required dependencies

  bash ./helper.sh docker-build

- Run the docker container

  bash ./helper.sh docker-run

  The container will:
  - Mount a volume with the Azure CLI directory (i.e., `~/.azure`) so your Azure credentials can be used for resource deployment
  - Mount a volume of this repository
  - Bind port 8501 to access the demo app

- Attach bash to the container in interactive mode

  bash ./helper.sh docker-exec
- Create the Azure resource group

  bash ./helper.sh create-resource-group

- Deploy the infrastructure

  bash ./helper.sh deploy-bicep

- Create the .env file using the bicep outputs

  bash ./helper.sh create-dot-env
- Configure the deployed Azure AI Search service

  Create the data source, index, skillset, and indexer (a hedged sketch of what the index definition might look like is shown after these steps).

  bash ./helper.sh setup-ai-search
- Deploy the Azure Function code

  bash ./helper.sh deploy-function
- Upload the sample PDF document

  The sample document located in the ./sample-documents directory is used. It's a few-page document from the Azure AKS documentation.

  bash ./helper.sh upload-pdf

  Note: Before executing the next command, please wait about 60 seconds for the Azure Function to prepare the uploaded PDF document, so it's ready to be indexed by the Azure AI Search indexer.

- Run the Azure AI Search indexer to populate the index

  bash ./helper.sh run-indexer
- Create the .env file for the demo app

  bash ./helper.sh create-dot-env-demo-app

- Install the demo app Python dependencies

  bash ./helper.sh install-demo-app-dependencies

- Run the demo app

  bash ./helper.sh run-demo-app
- Try the demo app

  Open the demo app in your browser at http://localhost:8501. In the chat window, type `Tell me about Kubernetes.` You should see a response and an overall demo app UI view similar to the image below.
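For reference, the setup-ai-search step above creates the data source, index, skillset, and indexer. The sketch below shows roughly what the index definition could look like using the azure-search-documents SDK; the index name, field names, vector profile, and `AZURE_SEARCH_*` variables are illustrative assumptions, not the repository's exact definitions.

```python
# A hedged sketch of an index definition similar to what "setup-ai-search"
# configures, using the azure-search-documents SDK. All names below are
# illustrative assumptions; the deployed solution defines the exact schema.
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SimpleField,
    VectorSearch,
    VectorSearchProfile,
)

client = SearchIndexClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],                 # assumed variable name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),  # assumed variable name
)

index = SearchIndex(
    name="pdf-rag-index",  # illustrative name
    fields=[
        SimpleField(name="chunk_id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(
            name="content_vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,  # text-embedding-ada-002 dimensionality
            vector_search_profile_name="default-profile",
        ),
        # Custom field mapping extracted images to their text chunk, so that
        # citations can reference the images.
        SimpleField(
            name="related_images",
            type=SearchFieldDataType.Collection(SearchFieldDataType.String),
        ),
    ],
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="default-hnsw")],
        profiles=[
            VectorSearchProfile(
                name="default-profile",
                algorithm_configuration_name="default-hnsw",
            )
        ],
    ),
)

client.create_or_update_index(index)
```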
- Delete all deployed Azure resources (Note: to be executed from within the container)

  bash ./helper.sh cleanup

- Stop and remove the docker container

  First, exit the docker container by typing exit and hitting return in the terminal, then run the command below.

  bash ./helper.sh docker-container-stop-remove

- Delete the docker image

  bash ./helper.sh docker-remove-image
You can easily extend this solution to test it on your own documents in just a few steps.
- Prepare the document

  Copy your document into the ./sample-documents directory.

- Upload the document to your Azure Blob storage that was provisioned as part of the infrastructure deployment

  file_name="myfile.pdf" bash ./helper.sh upload-pdf

- Run the Azure AI Search indexer to index your document

  bash ./helper.sh run-indexer
Congratulations! You can now use the demo app to ask questions about your own document.