Bug
When converting a PDF to Markdown using docling, the output includes unexpected sequences such as /89, /81, /.notdef, and other numeric or font-related codes interspersed throughout the text. These sequences appear to be PDF internal operators or font references and are not part of the original document's content. This corrupts the extracted text and makes the output unusable.
Steps to reproduce
Docling version
docling 2.17.0
Python version
Python 3.12.0
Additional notes
The issue likely stems from the PDF parser incorrectly handling font mappings, operator codes, or undefined glyphs (e.g., /.notdef). These artifacts suggest the parser is either failing to filter out non-text elements or misinterpreting font/encoding data during extraction.
Therefore, this might be related to DS4SD/docling-parse#92.
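To gauge how badly a given output is affected, a rough tally of the suspicious tokens can help. The snippet below is only a sketch: it assumes the Markdown has already been exported to a string, and `count_glyph_artifacts` is a hypothetical helper, not part of docling. The pattern is a heuristic based on the sequences reported above (`/.notdef` and a slash followed by digits), so it will also match legitimate text such as fractions or dates.

```python
import re
from collections import Counter

# Heuristic only: matches the literal glyph name "/.notdef" or a slash
# followed by digits (e.g. "/89"). This is an assumption about what the
# artifacts look like, taken from the examples in this report.
GLYPH_ARTIFACT = re.compile(r"/\.notdef|/\d+")


def count_glyph_artifacts(markdown: str) -> Counter:
    """Count occurrences of each suspected glyph-name artifact.

    Hypothetical helper for triage; it does not repair the text and may
    report false positives such as "1/2" or "01/02".
    """
    return Counter(GLYPH_ARTIFACT.findall(markdown))


if __name__ == "__main__":
    sample = "Intro/89duc/81tion /.notdef text /89"
    for token, count in count_glyph_artifacts(sample).most_common():
        print(token, count)
```

A high count of `/.notdef` in particular would be consistent with the undefined-glyph explanation, although it does not prove where in the parser the problem lies.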