Skip to content

A lightweight Python library for metadata-rich document chunking in Retrieval-Augmented Generation (RAG) workflows. It leverages Azure AI Document Intelligence to enhance chunking by retaining hierarchical structure, page numbers, and bounding boxes for seamless integration with PDF viewers.

Notifications You must be signed in to change notification settings

DavidMoserAI/AzureDocumentIntelligenceChunker

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Overview 📚

This library provides a production-grade solution for document chunking in Retrieval Augmented Generation (RAG) workflows. By leveraging the hierarchical structure of documents with Azure AI Document Intelligence, the library enhances the quality of chunking beyond fixed-size methods. The resulting chunks retain essential metadata such as page numbers and bounding boxes, allowing for integration with PDF viewers for accurate text highlighting.

Unlike other solutions, this library relies solely on native Python packages, making it an excellent alternative for developers who, for any reason, cannot or prefer not to use libraries like Langchain. This library is designed to produce output that closely mirrors the markdown text format typically generated by the Azure AI Document Intelligence API.

Additionally, to optimize recall and improve the relevance of retrieved information, the library automatically adds the chunks corresponding headers at the beginning of each chunk. This ensures that important keywords are included in the chunked text, boosting retrieval performance.

This library is especially suited for developers building RAG pipelines who need lightweight, traceable, and metadata-rich document processing. Future plans for integration with Azure AI Search will make this tool an even more powerful addition to your RAG stack.

Purpose 🎯

In production settings, Retrieval Augmented Generation (RAG) often requires a sophisticated document chunking strategy to ensure optimal retrieval performance. For unstructured data, fixed-size chunking is typically insufficient. Fortunately, most documents naturally follow a hierarchical structure, which can be leveraged using tools like Azure AI Document Intelligence.

While libraries like Langchain provide a simple interface to interact with the Azure AI Document Intelligence API, they rely on the markdown text output from Document Intelligence and use a markdown header text splitter for chunking. Although effective, this approach excludes crucial metadata, such as page numbers and bounding boxes.

In RAG applications, verifying the correctness of the output is essential. This can be achieved by referencing the information source directly—either by opening the document to a specific page or highlighting the bounding boxes of the text. The bounding boxes generated by this library can be seamlessly integrated with your preferred React PDF viewer for accurate text highlighting.

Local Testing 🧪

Use the test script to test the functionality of the Azure AI Document Intelligence Chunker. This library contains an example JSON file that can be used to test the chunking functionality. The test script will output the chunks in the console.

Integration ⚙️

To integrate the library into your project, copy the code from common into your codebase and use the code provided in test.py to generate the chunks. The test script uses a sample JSON from the library, therefore you need to use the Azure AI Document Intelligence API to generate the JSON for your documents.

Roadmap 🚀

An integration with Azure AI Search is planned for the future. This will allow for a direct integration into the RAG pipeline.

Support 🙋

For issues, questions, or contributions, please open an issue in this repository.

License 📄

MIT License

Copyright (c)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

About

A lightweight Python library for metadata-rich document chunking in Retrieval-Augmented Generation (RAG) workflows. It leverages Azure AI Document Intelligence to enhance chunking by retaining hierarchical structure, page numbers, and bounding boxes for seamless integration with PDF viewers.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages