Invalid Sequences (e.g., /89, /.notdef) in Extracted Text #861

Open
zvictor opened this issue Feb 1, 2025 · 0 comments
Labels
bug Something isn't working

Bug

When converting a PDF to Markdown with docling, the output contains unexpected sequences such as /89, /81, and /.notdef interspersed throughout the text. These look like PDF glyph names or font references rather than document content (/.notdef is the standard name for an undefined glyph), and they are not part of the original document. They corrupt the extracted text and make the output unusable.

Steps to reproduce

  1. Run the command:
    docling --from pdf --to md --image-export-mode placeholder https://venda-imoveis.caixa.gov.br/editais/matricula/SC/1444409002480.pdf  
  2. Observe the generated Markdown output, which starts with the following invalid sequences:
/89

/81

/89

/88

/118

/87

/82

/79

/34

/.notdef

/.notdef

/83

/83

/82

/86

/85

/81

/4

Docling version

docling 2.17.0

Python version

Python 3.12.0

Additional notes

The issue likely stems from the PDF parser mishandling font mappings or undefined glyphs (e.g., /.notdef). The artifacts suggest the parser is either failing to filter out non-text elements or misinterpreting font/encoding data during extraction, so that raw glyph names are emitted in place of characters.

This might be related to DS4SD/docling-parse#92.
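
Until the parser is fixed, a rough way to measure how badly a given output is affected (and to strip the noise) is sketched below. This is only a stopgap under stated assumptions: it assumes the artifacts are whitespace-separated tokens of the form /<digits> or /.notdef, as in the output above, and the helper names (is_artifact_line, artifact_ratio, strip_artifact_lines) are hypothetical, not part of docling. It does not recover the text that the glyph names replaced.

```python
import re

# Assumption: an artifact token is "/" followed by digits, or the literal "/.notdef",
# matching the sequences shown in this report.
GLYPH_NAME_TOKEN = re.compile(r"^/(?:\d+|\.notdef)$")


def is_artifact_line(line: str) -> bool:
    """Return True if every token on a non-empty line looks like a glyph name."""
    tokens = line.split()
    return bool(tokens) and all(GLYPH_NAME_TOKEN.match(t) for t in tokens)


def artifact_ratio(markdown: str) -> float:
    """Fraction of non-empty lines that consist only of glyph-name tokens."""
    lines = [ln for ln in markdown.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(is_artifact_line(ln) for ln in lines) / len(lines)


def strip_artifact_lines(markdown: str) -> str:
    """Drop artifact-only lines; all other lines are kept unchanged."""
    return "\n".join(ln for ln in markdown.splitlines() if not is_artifact_line(ln))


if __name__ == "__main__":
    sample = "/89\n\n/81\n\n/.notdef\n\nReal text 1/2 here\n"
    print(f"artifact ratio: {artifact_ratio(sample):.2f}")
    print(strip_artifact_lines(sample).strip())
```

A high ratio on the generated Markdown would confirm that the page was extracted as glyph names rather than text. Lines that mix real text with a stray token are deliberately left alone, since a fraction such as 1/2 could otherwise be damaged.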

@zvictor zvictor added the bug Something isn't working label Feb 1, 2025