How to extract text from image #34

Larbo53 · 2022-01-21T14:14:48Z

Larbo53
Jan 21, 2022

Hello,
I would like to extract the supplier's information (top-left block: Bourgeois Frères, ...; see attached image) from the attached PDF.
What is the solution?
OS: macOS Big Sur 11.6
Python: 3.9
PyMuPDF: 1.19.0

Thank you.

Sincerely

BOURGEOISFacture 21053886.pdf

JorjMcKie · 2022-01-21T15:05:44Z

JorjMcKie
Jan 21, 2022
Maintainer

The first problem is that it is not an image.

>>> import fitz
>>> doc=fitz.open("BOURGEOISFacture.21053886.pdf")
>>> page=doc[0]
>>> from pprint import pprint
>>> # no standard images there:
>>> pprint(page.get_images())
[]
>>> # no embedded (inline) images either:
>>> pprint(page.get_image_info())
[]

And you will have noticed that what looks like text is not text either.

>>> print(page.get_text(sort=True))  # "sort" ensures that text visible at the top really comes first
BENOIT CASTEL MENILMONTANT 
Boulangerie Pâtisserie 
150 rue de Ménilmontant 
75020 PARIS 
06 21 08 29 68
Tél. Client :
FACTURE
/
1
... (more data) ...

So the apparent text must be encoded as drawing primitives, like a capital "B" drawn as a vertical line | followed by two little semi-circles above each other, etc.
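The checks made so far can be bundled into one small helper. This is only a sketch under assumptions: `summarize_page_content` and `looks_like_drawn_text` are hypothetical names, the argument is assumed to be a PyMuPDF `Page` (or any object offering the same methods), and counting vector paths with `get_drawings()` is an extra check that was not part of the session above.

```python
def summarize_page_content(page):
    """Count what kinds of content a PyMuPDF page-like object reports.

    Hypothetical helper: it relies only on the Page methods
    get_images(), get_image_info(), get_drawings() and get_text().
    """
    return {
        "images": len(page.get_images()),          # images referenced by the page
        "image_info": len(page.get_image_info()),  # images actually shown on the page
        "drawings": len(page.get_drawings()),      # vector graphics paths
        "text_chars": len(page.get_text()),        # characters of real, extractable text
    }


def looks_like_drawn_text(summary, phrase_found):
    """Heuristic (an assumption, not a PyMuPDF feature): no images,
    some vector drawings, and the visible phrase is absent from the text."""
    return (
        summary["images"] == 0
        and summary["image_info"] == 0
        and summary["drawings"] > 0
        and not phrase_found
    )
```

With a real page this could be called as `looks_like_drawn_text(summarize_page_content(page), "Bourgeois" in page.get_text())`.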
The only way is therefore to OCR this area. Let's find a suitable sub-rectangle: left of "BENOIT CASTEL MENILMONTANT" and above the first "FACTURE":

>>> rl1 = page.search_for("BENOIT CASTEL MENILMONTANT")
>>> len(rl1)
3
>>> rl2 = page.search_for("FACTURE")
>>> len(rl2)
2
>>> # sort both rectangle lists to be sure: vertical, then horizontal
>>> rl1.sort(key=lambda r: (r.y1, r.x0))
>>> rl2.sort(key=lambda r: (r.y1, r.x0))
>>> # right border is the left edge of the first rl1 rect:
>>> rborder = rl1[0].x0
>>> # bottom is top coord of first rect of rl2:
>>> bottom = rl2[0].y0
>>> # define sub rect to OCR:
>>> clip = fitz.Rect(0,0,rborder,bottom)
>>> clip
Rect(0.0, 0.0, 339.3900146484375, 220.2386474609375)
>>> # make a pixmap of that rect:
>>> pix = page.get_pixmap(dpi=300,clip=clip)
>>> # make a new 1 page PDF with OCRed text
>>> pdfbytes = pix.pdfocr_tobytes()
>>> ocrpdf = fitz.open("pdf", pdfbytes)
>>> ocrpage = ocrpdf[0]
>>> print(ocrpage.get_text())
Bourgeois Fréres S.A.S au Capital de 330 000 Euro
77510 VERDELOT
TEL
: 01 64 04 81 04
FAX
: 01 64 04 81 43
SITE INTERNET
: WWW.MOULINS-BOURGEOIS.COM
N° SIRET 746 050 087 00012
R.C. MEAUXB 746 050 087
CODE APE: 1061A
N° de TVA INTRACOMMUNAUTAIRE
: FR15746050087

>>>
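The clip computation in the session above does not depend on PyMuPDF itself, so it can be illustrated with plain tuples. A minimal sketch, assuming rectangles are `(x0, y0, x1, y1)` tuples (for example `tuple(r)` for each search hit); `clip_left_of_and_above` is a hypothetical helper name, and the sample numbers in the usage note are made up.

```python
def clip_left_of_and_above(right_hits, bottom_hits, origin=(0.0, 0.0)):
    """Return (x0, y0, x1, y1) of the area that lies left of the topmost
    rectangle in right_hits and above the topmost rectangle in bottom_hits.

    "Topmost" uses the same ordering as the session: by y1, then by x0.
    """
    if not right_hits or not bottom_hits:
        raise ValueError("both search phrases must occur on the page")

    def reading_order(rect):
        return (rect[3], rect[0])  # bottom edge first, then left edge

    first_right = min(right_hits, key=reading_order)
    first_bottom = min(bottom_hits, key=reading_order)
    # right border: left edge of first_right; bottom: top edge of first_bottom
    return (origin[0], origin[1], first_right[0], first_bottom[1])
```

For example, `clip_left_of_and_above([(339.39, 40, 500, 52)], [(300, 220.24, 360, 232)])` gives `(0.0, 0.0, 339.39, 220.24)`, which could then be passed to `fitz.Rect(...)`.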

JorjMcKie · 2022-01-21T15:23:49Z

JorjMcKie
Jan 21, 2022
Maintainer

Of course, the OCRed text could also have been extracted together with its coordinates, e.g. using ocrpage.get_text("dict").
Those coordinates are relative to the dimensions of ocrpage. If required, translate them back to positions on the original page using the matrix mat = ocrpage.rect.torect(clip).
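To make the coordinate translation concrete: a matrix that maps one rectangle onto another amounts to a scale plus a shift. The sketch below reproduces that arithmetic with plain `(x0, y0, x1, y1)` tuples so that it runs without PyMuPDF; `map_rect` is a hypothetical helper, not a PyMuPDF function.

```python
def map_rect(box, src, dst):
    """Map box from the coordinate system of rectangle src to that of dst.

    All rectangles are (x0, y0, x1, y1) tuples; src must have
    non-zero width and height.
    """
    sx = (dst[2] - dst[0]) / (src[2] - src[0])  # horizontal scale factor
    sy = (dst[3] - dst[1]) / (src[3] - src[1])  # vertical scale factor
    return (
        dst[0] + (box[0] - src[0]) * sx,
        dst[1] + (box[1] - src[1]) * sy,
        dst[0] + (box[2] - src[0]) * sx,
        dst[1] + (box[3] - src[1]) * sy,
    )
```

Here `src` plays the role of `ocrpage.rect` and `dst` the role of `clip`. With PyMuPDF itself the same result should be obtainable by multiplying a rectangle with the matrix, e.g. `fitz.Rect(span["bbox"]) * mat` for each span in the "dict" output.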

Larbo53 · 2022-01-21T15:26:09Z

Larbo53
Jan 21, 2022
Author

Thanks a lot.
Which package must be imported for this to work?

Sincerely,
Yves Larbodiere

JorjMcKie Jan 21, 2022
Maintainer

To install:

python -m pip install -U pip
python -m pip install pymupdf

Do not install the package fitz in the same environment!
It is a completely different, unrelated thing. Its author (named FitzXYZ) was unfortunate enough to use part of their last name as the package name. That package is still in its very early stages.

The reason is that PyMuPDF uses fitz as its top-level package name, and that is the name you must import: import fitz.
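One way to check an environment for this clash is to ask the standard library which distributions are installed. A minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the function names and the status strings are made up for illustration.

```python
from importlib import metadata


def is_installed(dist_name):
    """Return True if a distribution with this name is installed."""
    try:
        metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return False
    return True


def check_fitz_conflict(is_installed_fn=is_installed):
    """Report whether `import fitz` can be expected to load PyMuPDF.

    Returns "conflict" if the unrelated distribution named "fitz" is
    present, "missing" if PyMuPDF is not installed, otherwise "ok".
    """
    if is_installed_fn("fitz"):
        return "conflict"  # uninstall it: python -m pip uninstall fitz
    if not is_installed_fn("PyMuPDF"):
        return "missing"   # install it: python -m pip install pymupdf
    return "ok"
```

Calling `check_fitz_conflict()` in the affected environment shows which of the three cases applies.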

Larbo53 · 2022-01-23T14:57:52Z

Larbo53
Jan 23, 2022
Author

Hello,

I'm trying to reproduce your code with the same file, but I get an error.
Configuration: macOS Big Sur version 11.6
Thank you for your feedback.

Sincerely

test_mupdf_ocr_error.txt
test_mupdf_ocr.txt

JorjMcKie Jan 23, 2022
Maintainer

Please read the documentation, in particular the installation chapter.
In order to use PyMuPDF's OCR function, Tesseract-OCR must be installed and the environment variable named there must be set ...
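A quick pre-flight check can catch the most common cause of such errors. The reply above does not name the variable; the sketch below assumes it is `TESSDATA_PREFIX`, pointing to the folder that holds Tesseract's language files, and that English (`eng.traineddata`) is the language wanted. `check_ocr_prerequisites` is a hypothetical helper.

```python
import os


def check_ocr_prerequisites(env=None, language="eng"):
    """Return a list of problems that would likely prevent OCR from working.

    Assumptions: the variable is TESSDATA_PREFIX, and the language data
    lives in a file called <language>.traineddata inside that folder.
    """
    env = os.environ if env is None else env
    problems = []
    tessdata = env.get("TESSDATA_PREFIX")
    if not tessdata:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(tessdata):
        problems.append(f"TESSDATA_PREFIX is not a directory: {tessdata}")
    elif not os.path.isfile(os.path.join(tessdata, f"{language}.traineddata")):
        problems.append(f"{language}.traineddata not found in {tessdata}")
    return problems
```

An empty list from `check_ocr_prerequisites()` means the checks passed; it does not prove that OCR will succeed.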

Answer selected by JorjMcKie
