This repository demonstrates how to extract text, images, and structured content from PDF documents using pymupdf4llm
in Google Colab. It also shows how to prepare the extracted data for LlamaIndex,
so it can be used for further document analysis and information extraction.
The project involves:
- Converting PDFs to Markdown format.
- Saving extracted content to files.
- Extracting specific pages.
- Preparing data for LlamaIndex.
- Extracting images with specified DPI.
- Chunking content for metadata-rich extraction.
- Detailed word-by-word extraction for comprehensive analysis.
Features:
- Markdown Conversion: Convert PDF files to Markdown format.
- Save Extracted Content: Save extracted text to a file.
- Page-Specific Extraction: Extract content from specific pages.
- LlamaIndex Compatibility: Prepare extracted data for LlamaIndex processing.
- Image Extraction: Extract images with options for resolution and format.
- Chunked Data Extraction: Extract data in chunks with metadata for context.
- Word-by-Word Extraction: Extract detailed word-level content from PDFs.
Install the required packages:
!pip install pymupdf4llm llama_index
import pymupdf4llm
# Convert the whole PDF to a single Markdown string
md_text = pymupdf4llm.to_markdown("test.pdf")
print(md_text)
# Save the extracted content to a Markdown file
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())
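# Extract specific pages only; page numbers are 0-based, so [1, 2] selects the second and third pages
md_text_pages = pymupdf4llm.to_markdown("test.pdf", pages=[1, 2])
print(md_text_pages)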
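Because the `pages` argument takes 0-based page numbers, it is easy to be off by one when working from the page numbers shown in a PDF viewer. A minimal sketch of a conversion helper follows; `to_zero_based` is a hypothetical name used for illustration and is not part of pymupdf4llm.

```python
def to_zero_based(page_numbers):
    """Convert 1-based page numbers (as shown in a PDF viewer) to 0-based indices."""
    if any(n < 1 for n in page_numbers):
        raise ValueError("page numbers must start at 1")
    return [n - 1 for n in page_numbers]


# Pages 2 and 3 in a viewer correspond to pages=[1, 2]
pages = to_zero_based([2, 3])
print(pages)
```

The resulting list can then be passed as the `pages` argument of `pymupdf4llm.to_markdown`.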
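# Load the PDF as a list of LlamaIndex Document objects
llama_reader = pymupdf4llm.LlamaMarkdownReader()
llama_docs = llama_reader.load_data("test.pdf")
# Report how many documents were loaded and preview the first one
print(f"LlamaIndex Compatible Data: {len(llama_docs)}")
print(llama_docs[0].text[:500])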
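# Extract the first and third pages (indices 0 and 2) and save their images as 300 DPI PNG files in "images"
md_text_images = pymupdf4llm.to_markdown(
    doc="test.pdf",
    pages=[0, 2],
    page_chunks=True,
    write_images=True,
    image_path="images",
    image_format="png",
    dpi=300
)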
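After an extraction with `write_images=True`, it can be useful to check which files were actually written. The sketch below lists the files in a folder by extension using only the standard library. `list_images` is a hypothetical helper, and the sketch makes no assumption about how pymupdf4llm names the image files.

```python
from pathlib import Path


def list_images(folder, suffix=".png"):
    """Return the sorted names of files in `folder` that have the given suffix."""
    folder = Path(folder)
    if not folder.is_dir():
        # Nothing was written (or the folder name is wrong)
        return []
    return sorted(p.name for p in folder.iterdir() if p.suffix.lower() == suffix)


# "images" matches the image_path used in the extraction call
print(list_images("images"))
```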
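# Extract the first three pages as chunks: a list with one dictionary (text plus metadata) per page
md_text_chunks = pymupdf4llm.to_markdown(
    doc="test.pdf",
    pages=[0, 1, 2],
    page_chunks=True
)
print(md_text_chunks[0])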
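With `page_chunks=True`, the result is a list of per-page dictionaries rather than a single string. The sketch below flattens such a list into small records. It runs on hand-written sample data and assumes each chunk has a `text` key and a `metadata` dictionary containing `page` and `file_path`; inspect `md_text_chunks[0].keys()` to confirm the layout for your installed version. `summarize_chunks` is a hypothetical helper name.

```python
def summarize_chunks(chunks):
    """Return one small record per page chunk: source file, page number, text length."""
    records = []
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        records.append({
            "file": meta.get("file_path"),
            "page": meta.get("page"),
            "chars": len(chunk.get("text", "")),
        })
    return records


# Hand-written sample shaped like the assumed output; not real extraction results
sample_chunks = [
    {"metadata": {"file_path": "test.pdf", "page": 1}, "text": "# Title\n\nIntro text."},
    {"metadata": {"file_path": "test.pdf", "page": 2}, "text": "## Section\n\nMore text."},
]
records = summarize_chunks(sample_chunks)
for record in records:
    print(record)
```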
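# Also extract word-level data: with extract_words=True, each page chunk includes a "words" list
md_text_words = pymupdf4llm.to_markdown(
    doc="test.pdf",
    pages=[0, 1, 2],
    page_chunks=True,
    write_images=True,
    image_path="images",
    image_format="png",
    dpi=300,
    extract_words=True
)
# Show the first five word entries of the first page
print(md_text_words[0]['words'][:5])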
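The entries in `words` are expected to be tuples in the style of PyMuPDF's `page.get_text("words")`, that is `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. Assuming that layout, the sketch below regroups words into lines of text. The sample tuples are made up for illustration, and `words_to_lines` is a hypothetical helper.

```python
from collections import defaultdict


def words_to_lines(words):
    """Group word tuples into text lines keyed by (block_no, line_no)."""
    lines = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, text))
    # Sort lines by block/line number, and words within a line by word number
    return [
        " ".join(text for _, text in sorted(items))
        for _, items in sorted(lines.items())
    ]


# Made-up sample in the assumed tuple layout; not real extraction results
sample_words = [
    (72.0, 72.0, 100.0, 84.0, "Hello", 0, 0, 0),
    (104.0, 72.0, 140.0, 84.0, "world", 0, 0, 1),
    (72.0, 90.0, 110.0, 102.0, "Second", 0, 1, 0),
    (114.0, 90.0, 135.0, 102.0, "line", 0, 1, 1),
]
lines = words_to_lines(sample_words)
print(lines)
```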
The extracted content can be used for further data analysis, NLP applications, or preparing training data for machine learning models.
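As one illustration of downstream use, the Markdown string can be split into heading-delimited sections before it is passed to an NLP pipeline. This is a minimal sketch using only the standard library; `split_by_headings` is a hypothetical helper, and the sample string stands in for real `to_markdown` output.

```python
import re


def split_by_headings(markdown_text):
    """Split Markdown into (heading, body) pairs at ATX headings such as '# Title'."""
    sections = []
    heading, body = None, []
    for line in markdown_text.splitlines():
        if re.match(r"#{1,6}\s", line):
            # A new heading starts: store the section collected so far
            if heading is not None or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading is not None or body:
        sections.append((heading, "\n".join(body).strip()))
    return sections


# Stand-in for the md_text string returned by the extraction
sample_md = "# Intro\n\nFirst paragraph.\n\n## Details\n\nSecond paragraph."
sections = split_by_headings(sample_md)
for title, text in sections:
    print(title, "->", text)
```

Text that appears before the first heading is returned with `None` as its heading, so no content is dropped.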
This project is licensed under the MIT License.