How to extract all the images from a PDF page while ignoring the header and footer logo images? #136
Unanswered
thangarajdeivasikamani
asked this question in
Q&A
Replies: 0 comments
Hello Team,
Consider a PDF datasheet such as https://www.st.com/resource/en/datasheet/stm32f205rb.pdf.
I used the code below to extract the images, but I am not getting the images that actually appear in the PDF.
    import pymupdf  # import the bindings

    fname = "stm32f103c8.pdf"  # input file name (hard-coded here)
    doc = pymupdf.open(fname)  # open document
    # iterate over the pages
    for page in doc:
        img_number = 0  # for enumerating images per page
        # iterate over the image blocks
        for block in page.get_text("dict")["blocks"]:
            # skip if no image block
            if block["type"] != 1:
                continue
            # build filename, like 'img17-3.jpg'
            name = f"img{page.number}-{img_number}.{block['ext']}"
            with open(name, "wb") as out:
                out.write(block["image"])  # write the binary content
            img_number += 1  # increase image counter
Sometimes only the repeated footer logo and the side logo are treated as images and extracted, while the actual images are missed.
![image](https://private-user-images.githubusercontent.com/46878296/336181552-95c5c69c-1008-42db-a8a9-c52ee84a8e98.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk1ODg1NjIsIm5iZiI6MTczOTU4ODI2MiwicGF0aCI6Ii80Njg3ODI5Ni8zMzYxODE1NTItOTVjNWM2OWMtMTAwOC00MmRiLWE4YTktYzUyZWU4NGE4ZTk4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTUlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE1VDAyNTc0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTk3ODZjYTdjYjI3M2Y3YzkzZWM3NTJjOWRiMjU4NmY1MWUxYjA5Y2M0NzM2N2ExMDhhNTU3YzU5ODI1ZjhjYTImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.7xmDrAsAK90ED6ztB6uSpAf-VCSeHABp8W4_3REJUyM)
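One possible way to skip header and footer logos with the block-based approach is to filter image blocks by position. The sketch below is only an illustration: the helper names `in_margin`, `body_image_blocks`, and `page_body_images`, and the 60-point margin, are assumptions and not part of PyMuPDF. It relies on each block in `page.get_text("dict")["blocks"]` carrying a `bbox` entry and on `page.rect.height` giving the page height. It does not handle logos placed along the side of the page.

```python
def in_margin(bbox, page_height, margin=60.0):
    """Return True if the box lies entirely in the top or bottom margin band.

    `margin` is a hypothetical band height in points; tune it per document.
    """
    x0, y0, x1, y1 = bbox
    return y1 <= margin or y0 >= page_height - margin


def body_image_blocks(blocks, page_height, margin=60.0):
    """Keep image blocks (type 1) that are not inside the header/footer bands."""
    return [
        b for b in blocks
        if b["type"] == 1 and not in_margin(b["bbox"], page_height, margin)
    ]


def page_body_images(fname, margin=60.0):
    """Yield (page_number, block) for image blocks outside the margin bands."""
    import pymupdf  # imported here so the helpers above work without it

    doc = pymupdf.open(fname)
    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        for block in body_image_blocks(blocks, page.rect.height, margin):
            yield page.number, block
    doc.close()
```

Each yielded block can then be written out with `block["image"]` and `block["ext"]` as in the code above.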
I also tried the script below, but it does not extract the required images either:
https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/extract-images/extract-from-xref.py
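With an xref-based approach, a logo that repeats on every page is usually stored once and referenced from many pages, so one possible filter is to drop any image whose xref appears on a large share of the pages. This is a minimal sketch under that assumption, not a tested solution for this datasheet. The names `repeated_xrefs` and `extract_unique_images` and the `min_fraction` threshold are hypothetical; the PyMuPDF calls used are `page.get_images(full=True)`, whose tuples start with the xref, and `doc.extract_image(xref)`.

```python
from collections import Counter


def repeated_xrefs(xrefs_per_page, min_fraction=0.5):
    """Return xrefs that occur on at least `min_fraction` of the pages.

    An image must appear on at least two pages to count as repeated.
    """
    counts = Counter()
    for xrefs in xrefs_per_page:
        counts.update(set(xrefs))  # count each xref once per page
    threshold = max(2, int(len(xrefs_per_page) * min_fraction))
    return {xref for xref, n in counts.items() if n >= threshold}


def extract_unique_images(fname, min_fraction=0.5):
    """Save every image that is not a repeated (logo-like) image."""
    import pymupdf  # imported here so repeated_xrefs works without it

    doc = pymupdf.open(fname)
    per_page = [[img[0] for img in page.get_images(full=True)] for page in doc]
    skip = repeated_xrefs(per_page, min_fraction)
    saved = set()
    for page_no, xrefs in enumerate(per_page):
        for xref in xrefs:
            if xref in skip or xref in saved:
                continue  # repeated logo, or already written
            info = doc.extract_image(xref)
            if not info:
                continue  # nothing extractable for this xref
            name = f"img{page_no}-{xref}.{info['ext']}"
            with open(name, "wb") as out:
                out.write(info["image"])
            saved.add(xref)
    doc.close()
    return sorted(saved)
```

If the expected figures are still missing after this, they may be vector drawings rather than embedded raster images, which image extraction does not return.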