How to extract all the images from a PDF page and ignore the header & footer logo images? #136

thangarajdeivasikamani · 2024-06-03T18:15:08Z

Hello Team,

Take this PDF datasheet as an example: https://www.st.com/resource/en/datasheet/stm32f205rb.pdf

I used the code below to extract its images, but the output does not contain the images that are actually in the PDF.

    import pymupdf  # import the bindings

    fname = "stm32f103c8.pdf"  # name of the PDF file to process
    doc = pymupdf.open(fname)  # open document

    # iterate over the pages
    for page in doc:
        img_number = 0  # for enumerating images per page
        # iterate over the image blocks
        for block in page.get_text("dict")["blocks"]:
            # skip if no image block
            if block["type"] != 1:
                continue
            # build filename, like 'img17-3.jpg'
            name = f"img{page.number}-{img_number}.{block['ext']}"
            with open(name, "wb") as out:
                out.write(block["image"])  # write the binary content
            img_number += 1  # increase image counter

Sometimes only the repetitive footer logo and the side logo are detected as images and extracted, while the actual images on the page are missing.
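One way to skip repeated logos could be sketched as follows. This is a minimal sketch under two assumptions that are not confirmed for this datasheet: a logo is stored once and reused, so the same xref shows up on most pages, and header/footer images sit entirely inside a fixed band at the top or bottom of the page. The helper names (`repeated_xrefs`, `in_margin`, `extract_images`) and the thresholds (`min_fraction`, `min_pages`, `margin`) are hypothetical and would need tuning.

```python
from collections import Counter


def repeated_xrefs(xrefs_per_page, min_fraction=0.5, min_pages=3):
    """Return xrefs that occur on more than min_fraction of the pages.

    Assumption: an image reused on most pages is a logo, not content.
    Documents with fewer than min_pages pages are never filtered.
    """
    if len(xrefs_per_page) < min_pages:
        return set()
    counts = Counter()
    for xrefs in xrefs_per_page:
        counts.update(set(xrefs))  # count each image once per page
    threshold = min_fraction * len(xrefs_per_page)
    return {xref for xref, n in counts.items() if n > threshold}


def in_margin(bbox, page_height, margin=60):
    """True if the box lies entirely in the top or bottom margin band."""
    _x0, y0, _x1, y1 = bbox
    return y1 <= margin or y0 >= page_height - margin


def extract_images(fname, min_fraction=0.5, margin=60):
    """Save embedded images, skipping likely header/footer logos."""
    import pymupdf  # imported here so the helpers above work without it

    doc = pymupdf.open(fname)
    # One list of image dicts per page; xrefs=True adds the "xref" key.
    infos = [page.get_image_info(xrefs=True) for page in doc]
    logos = repeated_xrefs([[i["xref"] for i in info] for info in infos],
                           min_fraction)
    saved = []
    for page, info in zip(doc, infos):
        for count, item in enumerate(info):
            xref = item["xref"]
            # xref 0 marks an inline image with no separate object to extract
            if xref == 0 or xref in logos:
                continue
            if in_margin(item["bbox"], page.rect.height, margin):
                continue
            data = doc.extract_image(xref)
            name = f"img{page.number}-{count}.{data['ext']}"
            with open(name, "wb") as out:
                out.write(data["image"])
            saved.append(name)
    return saved
```

Calling `extract_images("stm32f103c8.pdf")` would write the remaining images to the current directory and return their filenames. The two filters are independent, so either one can be dropped if it removes too much.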

I also tried the script below, but it does not extract the required images either:

https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/extract-images/extract-from-xref.py
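Both approaches above only see embedded raster images. One possible explanation for the missing figures, which is an assumption here and not verified against this datasheet, is that the diagrams are vector drawings rather than images. The sketch below counts images and vector paths per page to check this, and renders a page to PNG as a fallback. The function names and the `min_drawings` threshold are hypothetical; tables and rules also count as drawings, so the result is only a hint.

```python
def pages_with_many_drawings(stats, min_drawings=50):
    """Return page numbers whose vector path count reaches min_drawings.

    stats is a list of (page_number, image_count, drawing_count) tuples.
    """
    return [number for number, _images, drawings in stats
            if drawings >= min_drawings]


def collect_stats(fname):
    """Count displayed raster images and vector paths on every page."""
    import pymupdf  # imported here so the helper above works without it

    doc = pymupdf.open(fname)
    return [
        (page.number, len(page.get_image_info()), len(page.get_drawings()))
        for page in doc
    ]


def render_page(fname, page_number, dpi=150):
    """Render a whole page to a PNG file, vector graphics included."""
    import pymupdf

    doc = pymupdf.open(fname)
    pix = doc[page_number].get_pixmap(dpi=dpi)
    name = f"page{page_number}.png"
    pix.save(name)
    return name
```

A typical use would be `pages_with_many_drawings(collect_stats("stm32f103c8.pdf"))` followed by `render_page` for the pages of interest. Rendering saves the whole page, so cropping to a single figure would need an additional clip rectangle.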
