How to extract text content from a PDF and write it to a webpage with a unified font format? #116

zhuzhangwei · 2023-09-27T11:46:35Z

Hello Professor, I would like to extract both text and images from a PDF and output them to a webpage. While coding this, I ran into a problem: Google Chrome's translation feature fails to recognize some of the text because the font formats are inconsistent, which leads to translation errors and omissions. My code can already extract text and images from the PDF and display them on a webpage, but I would like to standardize the text formatting during extraction. My PDF uses various font styles, and in some cases I cannot even identify the font.

I have browsed your scripts repl-font.py and repl-fontnames.py and asked you some questions about them, including the need to set up a mapping of old to new fonts. Now I would like to start a new discussion: is it possible to standardize the font during text extraction and write the text to the webpage, regardless of the original font style? Essentially, I want to convert all text to Arial or Times New Roman, because I do not need to preserve the original formatting in my use case.

The following is the code I have written.
import fitz  # PyMuPDF

# Open the PDF file
pdf_document = "myPDF.pdf"
pdf_file = fitz.open(pdf_document)

# Create an HTML document and initialize paragraph variables
html_document = "<html><body>"
current_paragraph = ""
merged = False  # Initialize merge flag

# Extract text and images
for page_num in range(pdf_file.page_count):
    page = pdf_file.load_page(page_num)

    # Extract paragraph text
    paragraphs = page.get_text("blocks")

    for index, paragraph in enumerate(paragraphs):
        text = paragraph[4].strip()  # Remove leading and trailing spaces

        # Skip a block that was already merged into the previous one
        if merged:
            merged = False
            continue

        # If the text is a single character and this is not the last block
        if len(text) == 1 and index < len(paragraphs) - 1:
            next_paragraph = paragraphs[index + 1]
            next_text = next_paragraph[4].strip()
            # Merge the current block and the next block
            text += next_text
            merged = True

        if text and text[-1] not in [".", "!", "?"]:
            # If it doesn't end with punctuation, merge it into the current paragraph
            if current_paragraph and current_paragraph[-1] == "-":
                # Rejoin a word that was hyphenated across blocks
                current_paragraph = current_paragraph[:-1] + text
            else:
                current_paragraph += " " + text
        else:
            # If it ends with punctuation, add it to the HTML document and reset the current paragraph
            current_paragraph += " " + text
            html_document += f'<p>{current_paragraph}</p>'
            current_paragraph = ""  # Reset the current paragraph

    image_list = page.get_images(full=True)
    for i, img in enumerate(image_list):
        xref = img[0]
        base_image = pdf_file.extract_image(xref)
        image_data = base_image["image"]
        image_format = base_image["ext"]

        # Save the image to a file
        img_filename = f"image_{page_num}_{i}.{image_format}"
        with open(img_filename, "wb") as img_file:
            img_file.write(image_data)

        # Insert the image link into the HTML
        html_document += f'<img src="{img_filename}" alt="Image {i}" />'

# Write out any text that never reached closing punctuation
if current_paragraph:
    html_document += f'<p>{current_paragraph}</p>'

# Close the HTML document
html_document += "</body></html>"

# Save the HTML document to a file or publish it to the web
with open("output.html", "w", encoding="utf-8") as html_file:
    html_file.write(html_document)

JorjMcKie · 2023-09-27T12:10:41Z

Maintainer

Ah, I see the connection to your other post! Thanks for opening a new post.

Yes, it is no problem to use a single font of your choice to replace all text in the PDF. A standard font such as Arial or Helvetica is a good choice.
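
For the webpage side of the question, here is a minimal sketch rather than a definitive implementation. The text returned by `page.get_text("blocks")` is plain text with no font information attached, so the font shown in the browser is whatever the HTML itself specifies. The helper name `build_html` and its parameters are hypothetical. It takes already extracted paragraph strings and image file names, escapes them, and sets one CSS `font-family` for the whole page.

```python
import html


def build_html(paragraphs, image_files=(), font_family="Arial, Helvetica, sans-serif"):
    """Wrap plain-text paragraphs and image links in one HTML page with a single font."""
    parts = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        # One CSS rule decides the font for every paragraph on the page.
        f"<style>body {{ font-family: {font_family}; }}</style>",
        "</head><body>",
    ]
    for text in paragraphs:
        # Collapse line breaks and repeated spaces left over from extraction.
        text = " ".join(text.split())
        if text:
            # Escape <, > and & so PDF text cannot break the markup.
            parts.append(f"<p>{html.escape(text)}</p>")
    for name in image_files:
        parts.append(f'<img src="{html.escape(name, quote=True)}" alt="">')
    parts.append("</body></html>")
    return "\n".join(parts)


# Example input standing in for text blocks and saved image files.
page = build_html(["First  line\nof text.", "a < b & c"], ["image_0_0.png"])
```

With the extraction code from the question, the paragraph strings assembled in `current_paragraph` and the saved `img_filename` values could be collected into two lists and passed to such a helper, in place of concatenating tags by hand.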

3 replies

zhuzhangwei Sep 30, 2023
Author

Hello Professor, I have a question about lines 483 to 497 of your script repl-font.py in the PyMuPDF-Utilities/font-replacement repository.
In this code block, tw seems to be overwritten every time:
if color in textwriters.keys():  # already have a textwriter?
    tw = textwriters[color]  # re-use it
else:  # make new
    tw = fitz.TextWriter(page.rect)  # make text writer
    textwriters[color] = tw  # store it for later use
Therefore, the tw.append() operation seems to be meaningless, because tw appears to be overwritten every time.

I am a programming novice and have been studying your code for 4 days. This is my last point of confusion, and I hope you can help me. Thank you very much.

JorjMcKie Oct 1, 2023
Maintainer

This admittedly is complex code, and maybe not the most suitable if you are new to programming.
Please look at the documentation of TextWriter. It is a container for styled text including positioning.
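
To illustrate the container idea, here is a small sketch that runs without PyMuPDF. `StubWriter` is a hypothetical stand-in for `fitz.TextWriter` that only collects the appended strings, and the colour keys are plain strings chosen for the example. It shows that the `if`/`else` from the question only rebinds the name `tw`: the object stored in the dictionary is the same one on every lookup, so text appended earlier is still there.

```python
class StubWriter:
    """Hypothetical stand-in for fitz.TextWriter: it only collects appended text."""

    def __init__(self):
        self.spans = []

    def append(self, text):
        self.spans.append(text)


textwriters = {}  # one writer per colour, as in the snippet from the question


def get_writer(color):
    if color in textwriters:      # already have a writer for this colour?
        tw = textwriters[color]   # re-use the stored object
    else:                         # make a new one
        tw = StubWriter()
        textwriters[color] = tw   # store it for later use
    return tw


for color, text in [("black", "Hello"), ("red", "warning"), ("black", "world")]:
    tw = get_writer(color)  # the name tw is rebound on every pass ...
    tw.append(text)         # ... but the append lands in the stored object

print(textwriters["black"].spans)
```

Assigning to `tw` again does not empty or replace the writer that the dictionary already holds, so both strings appended under `"black"` end up in one writer.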

JorjMcKie Oct 1, 2023
Maintainer

By the way: to show code correctly formatted here, enclose it between two lines of three backtick (`) characters, or click the code button in the comment editor to help you with it.
