How to extract text content from a PDF and write it to a webpage with a unified font format? #116
zhuzhangwei
started this conversation in
Show and tell
Ah, I see the connection to your other post! Thanks for opening a new one. Yes, there is no problem with using a single font of your choice to replace all text in the PDF. A standard font such as Arial or Helvetica is a good choice.
Hello Professor, I would like to extract both text and images from a PDF and output them to a webpage. However, during the coding process, I encountered an issue where the Google Chrome translation feature couldn't recognize some text due to inconsistent font formats, leading to translation errors and omissions. My code is already capable of extracting text and images from the PDF and displaying them on a webpage, but I wish to standardize the text formatting during the extraction process. I've noticed various font styles in my PDF, and in some cases, I can't even identify the font used. I've browsed through your scripts, repl-font.py and repl-fontnames.py, and asked you some questions, including the need to set up a mapping of old and new fonts. Now, I'd like to initiate a new discussion. Is it possible to standardize the font format during text extraction from the PDF and write it to the webpage, regardless of the original font style? Essentially, I want to convert all text to Arial or Times New Roman because I don't need to preserve the original formatting in my specific use case.
The following is the code I have written.
import fitz  # PyMuPDF

# Open the PDF file
pdf_document = "myPDF.pdf"
pdf_file = fitz.open(pdf_document)

# Create an HTML document and initialize paragraph variables
html_document = ""
current_paragraph = ""
merged = False  # Initialize merge flag

# Extract text and images
for page_num in range(pdf_file.page_count):
    page = pdf_file.load_page(page_num)

# Close the HTML document
html_document += ""

# Save the HTML document to a file or publish it to the web
with open("output.html", "w", encoding="utf-8") as html_file:
    html_file.write(html_document)
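For anyone reading along, here is a minimal sketch of one way to get uniform text on the webpage. It is not the approach used by repl-font.py. It assumes that it is acceptable to discard the PDF's font information entirely, take only the plain text of each block via `page.get_text("blocks")`, and let a single CSS `font-family` rule style everything. The helper names `paragraphs_to_html`, `extract_paragraphs`, and `convert`, as well as the font stack, are hypothetical choices for illustration. Images are not handled here.

```python
import html


def paragraphs_to_html(paragraphs, font_family="Arial, Helvetica, sans-serif"):
    """Wrap plain-text paragraphs in a page that uses one font for all text."""
    body = "\n".join(
        # Collapse line breaks inside a block and escape HTML special characters.
        f"<p>{html.escape(' '.join(p.split()))}</p>"
        for p in paragraphs
        if p.strip()  # skip blocks that contain only whitespace
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">\n'
        f"<style>body {{ font-family: {font_family}; }}</style>\n"
        "</head><body>\n"
        f"{body}\n"
        "</body></html>\n"
    )


def extract_paragraphs(pdf_path):
    """Return the text of every text block in the PDF, ignoring its fonts."""
    import fitz  # PyMuPDF; imported here so the HTML helper works without it

    paragraphs = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Each block is (x0, y0, x1, y1, text, block_no, block_type).
            for block in page.get_text("blocks"):
                if block[6] == 0:  # 0 = text block, 1 = image block
                    paragraphs.append(block[4])
    return paragraphs


def convert(pdf_path, html_path):
    """Extract the text from pdf_path and write a single-font page to html_path."""
    page_html = paragraphs_to_html(extract_paragraphs(pdf_path))
    with open(html_path, "w", encoding="utf-8") as html_file:
        html_file.write(page_html)
```

Calling `convert("myPDF.pdf", "output.html")` would then produce a page in which every paragraph inherits the same font from the `body` rule, so the browser sees plain text with no per-span font changes. Whether this fully resolves the Chrome translation issue depends on how the original HTML was built, so treat it as a starting point.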