paperless-gpt OCR results into selectable PDF text layer? #135

victorhooi · 2025-01-19T07:07:53Z

I looked at the video linked from the paperless-gpt README, and it seems like the OCR results are simply shown onscreen, right?

However, they aren't actually merged in as a selectable text layer on the PDF, right?

Or has it changed since the video? =)

gardar · 2025-02-04T16:14:16Z

The OCR result does not currently get embedded in the PDF; it simply replaces the "Content" section of the document.
The output of the default OCR prompt is plain text and contains no information about where to position it on the page.

I have been trying to customize the prompt to get it to output valid hOCR, a format that includes the OCR text as well as its positioning, but I'm getting mixed results. If someone figures out a reliable way to get valid hOCR output, then merging it with the PDF to create a selectable text layer would be trivial.
Instead of using OpenAI, I'm currently leaning towards something like Google Document AI, which has official support for hOCR: https://cloud.google.com/document-ai/docs/samples/documentai-toolbox-document-to-hocr
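
To make the "text plus positioning" point concrete, here is a minimal sketch that pulls each word and its bounding box out of an hOCR fragment. The sample markup and its coordinates are made up for illustration; real Tesseract output also carries an XML declaration, a DOCTYPE and more attributes.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified hOCR fragment (not taken from a real document).
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title="bbox 0 0 1000 1400">
   <span class="ocr_line" title="bbox 100 200 520 240">
    <span class="ocrx_word" title="bbox 100 200 260 240; x_wconf 91">Invoice</span>
    <span class="ocrx_word" title="bbox 280 200 520 240; x_wconf 88">2025-001</span>
   </span>
  </div>
 </body>
</html>"""

# hOCR stores the position in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def extract_words(hocr: str) -> list:
    """Return (text, (x0, y0, x1, y1)) for every ocrx_word element."""
    words = []
    for el in ET.fromstring(hocr).iter():
        if el.get("class") != "ocrx_word":
            continue
        match = BBOX_RE.search(el.get("title", ""))
        if match:
            bbox = tuple(int(v) for v in match.groups())
            words.append(("".join(el.itertext()).strip(), bbox))
    return words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

The plain-text output of the default prompt has nothing equivalent to the `bbox` values, which is why it cannot be placed on the page as a text layer.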

4 replies

icereed Feb 7, 2025
Maintainer

What if we add a second cycle where we ask the LLM to compare the old hOCR with the newly scanned OCR text? It could take all the positioning data from the old hOCR and merge it somehow with the new content. Caveat: I'm completely new to hOCR and might be underestimating the complexity.

gardar Feb 7, 2025

Yeah, that could work! hOCR isn't a difficult format; it's just XML containing positional data and the OCR text. I suspect the AI training datasets contain only a limited number of hOCR samples, which would explain my mixed results, so providing the existing hOCR as input would probably help.

It could either be a second cycle, or perhaps uploading the document plus the hocr in a single cycle could work.

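If you want to give it a quick spin, you can extract and merge the hOCR using the Python ocrmypdf library. Note that the functions below start with an underscore, which by Python convention marks them as private, so they may change between releases.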

import ocrmypdf
from pathlib import Path
ocrmypdf.api._pdf_to_hocr(input_pdf=Path("in.pdf"), output_folder=Path("./output"))

import ocrmypdf
from pathlib import Path
ocrmypdf.api._hocr_to_ocr_pdf(work_folder=Path("./output/"), output_file=Path("out.pdf"))

Btw, multiple cycles is an interesting concept for other things as well, such as first identifying the type/format of a document and then running the actual ocr, that way you could have different prompt templates for different docs, etc.
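
A rough sketch of that "classify first, then OCR" idea, just to show the shape of it. Everything here is hypothetical: the document types, the template texts and the `classify` callable, which stands in for a first LLM call that returns a type label for the image.

```python
from typing import Callable

# Hypothetical prompt templates keyed by document type.
PROMPT_TEMPLATES = {
    "invoice": "Transcribe this invoice. Keep line items in table rows.",
    "letter": "Transcribe this letter. Preserve paragraphs.",
}
DEFAULT_PROMPT = "Transcribe all text in this image and preserve the layout."


def choose_prompt(image_path: str, classify: Callable[[str], str]) -> str:
    """First cycle: classify the document, then pick the matching OCR prompt.

    `classify` is an assumed callable wrapping whatever model call
    identifies the document type. Unknown types fall back to the default.
    """
    doc_type = classify(image_path).strip().lower()
    return PROMPT_TEMPLATES.get(doc_type, DEFAULT_PROMPT)


if __name__ == "__main__":
    # Stub classifier used only to demonstrate the control flow.
    print(choose_prompt("page.png", lambda path: "Invoice"))
```

Normalising the label before the lookup matters because model replies often come back with stray whitespace or different capitalisation.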

icereed Feb 7, 2025
Maintainer

This would eventually lead to a more agentic approach to document analysis and OCR 🤔 it would be interesting to run a PoC to verify it’s worth the effort

gardar Feb 8, 2025

I threw together a quick and dirty python script to test both single and multi stage ocr/hocr with openai.

It seems to work pretty well: the bad Tesseract OCR text is all replaced correctly with the OpenAI one.
Some text positions come out a little short because the new OCR text is longer than the old one. In my prompt I specifically asked for the positions to be left untouched, so that issue might be fixable.
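
One way to find out whether the model really left the positions untouched is to diff the bounding boxes of the original and the refined hOCR. This is only a sketch under two assumptions: elements carry `id` attributes (as in Tesseract's `word_1_1` style output), and the model keeps those ids in its reply.

```python
import re
import xml.etree.ElementTree as ET

BBOX_RE = re.compile(r"bbox \d+ \d+ \d+ \d+")


def bbox_by_id(hocr: str) -> dict:
    """Map each element id to its 'bbox x0 y0 x1 y1' string."""
    boxes = {}
    for el in ET.fromstring(hocr).iter():
        el_id = el.get("id")
        match = BBOX_RE.search(el.get("title", ""))
        if el_id and match:
            boxes[el_id] = match.group(0)
    return boxes


def changed_boxes(original: str, refined: str) -> list:
    """Return ids whose bbox differs or that are missing from the refined hOCR."""
    before = bbox_by_id(original)
    after = bbox_by_id(refined)
    return sorted(k for k in before if after.get(k) != before[k])


if __name__ == "__main__":
    # Tiny made-up fragments; real input would be the two .hocr files.
    old = ('<div><span id="word_1_1" title="bbox 1 2 3 4">teh</span>'
           '<span id="word_1_2" title="bbox 5 2 9 4">cat</span></div>')
    new = ('<div><span id="word_1_1" title="bbox 1 2 3 4">the</span>'
           '<span id="word_1_2" title="bbox 5 2 10 4">cat</span></div>')
    print(changed_boxes(old, new))
```

If this returns an empty list, the boxes are intact and the mismatch more likely comes from longer text being fitted into unchanged boxes. The script below keeps a backup of the original hOCR (`000001_ocr_hocr_original.hocr`), which could serve as the `original` input.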

import os
import openai
import ocrmypdf
from pathlib import Path
import base64
from PIL import Image
import re
import shutil

def extract_pdf(input_pdf: str, output_folder: str):
    """Run ocrmypdf on the PDF and write per-page images and hOCR files to output_folder"""
    ocrmypdf.api._pdf_to_hocr(input_pdf=Path(input_pdf), output_folder=Path(output_folder))

def read_prompt(file_path: str) -> str:
    """Read a prompt from a file."""
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

def clean_hocr_content(hocr_content: str) -> str:
    """Remove markdown formatting and extract pure hOCR"""
    # Remove markdown code fences (```xml, ```html, or bare ```)
    hocr_content = re.sub(r'```[a-zA-Z]*\n?', '', hocr_content).strip()
    
    # Extract only the valid XML content
    match = re.search(r'(<!DOCTYPE html.*?</html>)', hocr_content, re.DOTALL)
    return match.group(1) if match else hocr_content

def read_hocr_file(hocr_path: str) -> str:
    if os.path.exists(hocr_path):
        with open(hocr_path, 'r', encoding='utf-8') as file:
            content = file.read()
        return clean_hocr_content(content)
    return ""

def is_valid_image(image_path: str) -> bool:
    try:
        with Image.open(image_path) as img:
            img.verify()  # Verify that it's an image
        return True
    except Exception:
        return False

def encode_image_to_base64(image_path: str) -> str:
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def send_to_openai(image_path: str, prompt: str, hocr_text: str = None) -> str:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OpenAI API key not found in environment variables.")
    
    client = openai.OpenAI(api_key=api_key)
    
    messages = [{"role": "system", "content": prompt}]
    
    if hocr_text:
        messages.append({"role": "user", "content": [{"type": "text", "text": "Here is the extracted hOCR XML data:"},
                                                         {"type": "text", "text": hocr_text}]})
    
    if is_valid_image(image_path):
        image_base64 = encode_image_to_base64(image_path)
        image_format = "jpeg" if image_path.lower().endswith((".jpg", ".jpeg")) else "png"
        
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": "Please refine the OCR text based on this image and hOCR data."},
                {"type": "image_url", "image_url": {"url": f"data:image/{image_format};base64,{image_base64}"}}
            ]
        })
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=8192  # Increase max token limit to prevent truncation
    )
    
    return clean_hocr_content(response.choices[0].message.content)

def process_ocr_cycle(input_pdf: str, output_folder: str, cycle_type: str):
    work_dir = f"{output_folder}/{cycle_type}"
    os.makedirs(work_dir, exist_ok=True)
    extract_pdf(input_pdf, work_dir)
    
    ocr_image = f"{work_dir}/000001_ocr.png"
    ocr_hocr = f"{work_dir}/000001_ocr_hocr.hocr"
    backup_hocr = f"{work_dir}/000001_ocr_hocr_original.hocr"
    
    if os.path.exists(ocr_hocr):
        shutil.copy(ocr_hocr, backup_hocr)  # Backup original hOCR file
    
    if cycle_type == "two_cycle":
        first_prompt = read_prompt("ocr_prompt.txt")
        first_response = send_to_openai(ocr_image, first_prompt)
        
        # Save first response for debugging
        with open(f"{work_dir}/000001_ocr_first_reply.txt", "w", encoding="utf-8") as file:
            file.write(first_response)
        
        second_prompt = read_prompt("hocr_prompt.txt")
        hocr_text = read_hocr_file(ocr_hocr)
        second_response = send_to_openai(ocr_image, second_prompt + "\n" + first_response, hocr_text)
        
        with open(ocr_hocr, "w", encoding="utf-8") as file:
            file.write(second_response)
    else:
        prompt = read_prompt("single_cycle_prompt.txt")
        hocr_text = read_hocr_file(ocr_hocr)
        response = send_to_openai(ocr_image, prompt, hocr_text)
        
        with open(ocr_hocr, "w", encoding="utf-8") as file:
            file.write(response)

def assemble_pdf(output_folder: str, cycle_type: str):
    """Assemble a PDF using the updated hOCR data."""
    work_dir = f"{output_folder}/{cycle_type}"
    output_pdf = f"{output_folder}/out_{cycle_type}.pdf"
    ocrmypdf.api._hocr_to_ocr_pdf(work_folder=Path(work_dir), output_file=Path(output_pdf))
    print(f"Assembled PDF saved to {output_pdf}")

if __name__ == "__main__":
    input_pdf = "in.pdf"
    output_folder = "./output"
    
    process_ocr_cycle(input_pdf, output_folder, "two_cycle")
    process_ocr_cycle(input_pdf, output_folder, "single_cycle")
    
    assemble_pdf(output_folder, "two_cycle")
    assemble_pdf(output_folder, "single_cycle")
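
The script above only touches the first page (the hard-coded `000001_...` files). A sketch of how the per-page files could be discovered for multi-page PDFs follows; it assumes further pages follow the same naming pattern as the first one (`NNNNNN_ocr.png` and `NNNNNN_ocr_hocr.hocr`), which I have not verified beyond page one. `find_pages` is a hypothetical helper name.

```python
from pathlib import Path

HOCR_SUFFIX = "_ocr_hocr.hocr"


def find_pages(work_dir: str) -> list:
    """Return (image_path, hocr_path) pairs for every page in work_dir.

    Backups such as 000001_ocr_hocr_original.hocr are not matched because
    they do not end in '_ocr_hocr.hocr'. Pages without an image are skipped.
    """
    pages = []
    for hocr in sorted(Path(work_dir).glob(f"*{HOCR_SUFFIX}")):
        prefix = hocr.name[: -len(HOCR_SUFFIX)]
        image = hocr.with_name(f"{prefix}_ocr.png")
        if image.exists():
            pages.append((image, hocr))
    return pages


if __name__ == "__main__":
    for image, hocr in find_pages("./output/two_cycle"):
        print(image.name, hocr.name)
```

Each pair could then be passed through the same prompt cycle that the script currently runs for page one.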

You need to provide:

an OpenAI API key in the OPENAI_API_KEY environment variable
a single-page PDF named in.pdf
a file named single_cycle_prompt.txt containing the prompt for the single-cycle process, which sends the image to OCR along with the hOCR
a file named ocr_prompt.txt containing the OCR prompt for the two-cycle process
a file named hocr_prompt.txt containing the hOCR prompt for the two-cycle process

For example:

single_cycle_prompt.txt

Please read my instructions carefully and be sure to follow them through the
end. I am sending you an image which was processed with ocr software to
create a hocr, the ocr is not great and I need you to fix the hocr file, your
job is to transcribe the text in this image and preserve the formatting and layout (high quality OCR). Do that for ALL the text in the image. Be thorough and pay attention. This is very important. The image is from a text document so be sure to continue until the bottom of the page. Thanks a lot! You tend to forget about some text in the image so please focus! Please output the
fixed hocr, without a markdown code block.

ocr_prompt.txt

Just transcribe the text in this image and preserve the formatting and layout (high quality OCR). Do that for ALL the text in the image. Be thorough and pay attention. This is very important. The image is from a text document so be sure to continue until the bottom of the page. Thanks a lot! You tend to forget about some text in the image so please focus! Use markdown format but without a code block.

hocr_prompt.txt

Please read my instructions carefully and be sure to follow them through the
end. I have a hocr file which contains low quality ocr text but accurate
ocr positions. Your job is to just process the improved ocr I'm sending you and use it
to update the ocr text in the HOCR file, please make sure to retain the
positions and only update the ocr text. Be thorough and pay attention. This is
very important. Please output the fixed hocr.
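
Since the model does not always return valid hOCR, it may be worth checking the reply before overwriting the `.hocr` file. This is a minimal sketch, not part of the script above: `looks_like_hocr` is a hypothetical helper that only checks that the reply parses as XML and contains an `ocr_page` element.

```python
import xml.etree.ElementTree as ET


def looks_like_hocr(reply: str) -> bool:
    """Return True if reply is well-formed XML with at least one ocr_page."""
    try:
        root = ET.fromstring(reply)
    except ET.ParseError:
        # Covers truncated replies, leftover prose and stray markdown.
        return False
    return any(el.get("class") == "ocr_page" for el in root.iter())


if __name__ == "__main__":
    good = '<html><body><div class="ocr_page" title="bbox 0 0 10 10"/></body></html>'
    print(looks_like_hocr(good))
    print(looks_like_hocr("Sure! Here is the fixed hOCR:"))
```

If the check fails, the original hOCR could be kept or the request retried. Note that this says nothing about whether the text or positions are correct, only that the file is structurally usable.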
