Invalid Sequences (e.g., /89, /.notdef) in Extracted Text #861

zvictor · 2025-02-01T01:40:50Z

Bug

When converting a PDF to Markdown using docling, the output includes unexpected sequences such as /89, /81, /.notdef, and other numeric or font-related codes interspersed throughout the text. These sequences look like raw PDF names (glyph or font references, with /.notdef being the standard name for an undefined glyph) rather than decoded text, and they are not part of the original document’s content. They corrupt the extracted text and make the output unusable.

Steps to reproduce

Run the command:

docling --from pdf --to md --image-export-mode placeholder https://venda-imoveis.caixa.gov.br/editais/matricula/SC/1444409002480.pdf

Observe the generated Markdown output, which starts with the following invalid sequences:

/89

/81

/89

/88

/118

/87

/82

/79

/34

/.notdef

/.notdef

/83

/83

/82

/86

/85

/81

/4
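For triage, here is a minimal sketch of how the sequences above could be flagged in generated Markdown. It is not part of docling. The function name `find_glyph_artifacts` and the pattern are assumptions based only on the output shown in this report (a slash followed by digits or `.notdef`, alone on a line), so other artifact shapes may slip through.

```python
import re

# Assumption: an artifact is a line consisting only of "/<digits>" or "/.notdef",
# matching the sequences listed in this report.
GLYPH_ARTIFACT = re.compile(r"^/(?:\d+|\.notdef)$")


def find_glyph_artifacts(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, token) for each line that is only a raw glyph name."""
    hits = []
    for number, line in enumerate(markdown.splitlines(), start=1):
        token = line.strip()
        if GLYPH_ARTIFACT.match(token):
            hits.append((number, token))
    return hits


if __name__ == "__main__":
    sample = "/89\n\n/.notdef\n\nSome real text with a/b ratio\n\n/4\n"
    for number, token in find_glyph_artifacts(sample):
        print(f"line {number}: {token}")
```

Matching whole lines only is deliberate: it avoids flagging legitimate slashes inside ordinary text, at the cost of missing artifacts that appear mid-line.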

Docling version

docling 2.17.0

Python version

Python 3.12.0

Additional notes

The issue likely stems from the PDF parser incorrectly handling font mappings, operator codes, or undefined glyphs (e.g., /.notdef). These artifacts suggest the parser is either failing to filter out non-text elements or misinterpreting font/encoding data during extraction.

This may be related to DS4SD/docling-parse#92.

zvictor added the "bug" (Something isn't working) label on Feb 1, 2025
