-
Hello, Thank you. Sincerely |
Beta Was this translation helpful? Give feedback.
Replies: 4 comments 2 replies
-
First problem is, that it is no image. >>> import fitz
>>> doc=fitz.open("BOURGEOISFacture.21053886.pdf")
>>> page=doc[0]
>>> from pprint import pprint
>>> # no standard images there:
>>> pprint(page.get_images())
[]
>>> # neither embedded images:
>>> pprint(page.get_image_info())
[] And you will have determined that what seems to be text neither is text. >>> page.get_text(sort=True) # "sort" ensures text visible at the top indeed comes first
BENOIT CASTEL MENILMONTANT
Boulangerie Pâtisserie
150 rue de Ménilmontant
75020 PARIS
06 21 08 29 68
Tél. Client :
FACTURE
/
1
... (more data) ... So the apparent text must be encoded as drawing primitives, like a capital "B" drawn as a vertical line >>> rl1 = page.search_for("BENOIT CASTEL MENILMONTANT")
>>> len(rl1)
3
>>> rl2 = page.search_for("FACTURE")
>>> len(rl2)
2
>>> # sort both rectangle lists to be sure: vertical, then horizontal
>>> rl1.sort(key=lambda r: (r.y1, r.x0))
>>> rl2.sort(key=lambda r: (r.y1, r.x0))
>>> # right border is left or first rl1 rect:
>>> rborder = rl1[0].x0
>>> # bottom is top coord of first rect of rl2:
>>> bottom = rl2[0].y0
>>> # define sub rect to OCR:
>>> clip = fitz.Rect(0,0,rborder,bottom)
>>> clip
Rect(0.0, 0.0, 339.3900146484375, 220.2386474609375)
>>> # make a pixmap of that rect:
>>> pix = page.get_pixmap(dpi=300,clip=clip)
>>> # make a new 1 page PDF with OCRed text
>>> pdfbytes = pix.pdfocr_tobytes()
>>> ocrpdf = fitz.open("pdf", pdfbytes)
>>> ocrpage = ocrpdf[0]
>>> print(ocrpage.get_text())
Bourgeois Fréres S.A.S au Capital de 330 000 Euro
77510 VERDELOT
TEL
: 01 64 04 81 04
FAX
: 01 64 04 81 43
SITE INTERNET
: WWW.MOULINS-BOURGEOIS.COM
N° SIRET 746 050 087 00012
R.C. MEAUXB 746 050 087
CODE APE: 1061A
N° de TVA INTRACOMMUNAUTAIRE
: FR15746050087
>>> |
Beta Was this translation helpful? Give feedback.
-
Of course that OCRed text could also have been extracted with coordinates, e.g. using |
Beta Was this translation helpful? Give feedback.
-
Thanks a lot
Which package must be imported to works?
Sincerely
Yves Larbodiere
EVARD
3, rue des Courtes Terres
95220 HERBLAY
portable : 07 81 08 41 00
mail : ***@***.***
… Le 21 janv. 2022 à 16:24, Jorj X. McKie ***@***.***> a écrit :
Of course that OCRed text could also have been extracted with coordinates, e.g. using ocrpage.get_text("dict").
Those coordinates obviously are relative to that ocrpage's dimensions. If required translate them back to the original page's positions using the matrix mat = ocrpage.rect.torect(clip).
—
Reply to this email directly, view it on GitHub <#32 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AU2RHDFZP4VARUR5INAOVT3UXF3BBANCNFSM5MPUEXAQ>.
Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you authored the thread.
|
Beta Was this translation helpful? Give feedback.
-
Hello, I'm trying to reproduce your code, with the same file, I have an error code. Sincerel |
Beta Was this translation helpful? Give feedback.
Please read the documentation - installation chapter.
In order to use PyMuPDF's OCR function, Tesseract-OCR mst be installed and that said environment variable must be set ...