Tesseract CLI OCR Fails with "Can only use .str accessor with string values!" Error on Some PDF Files #877

Open
noodleclaus opened this issue Feb 3, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@noodleclaus

Description:
When processing some PDF files using Docling’s Tesseract CLI OCR engine (via the tesseract_ocr_cli_model), an error is raised during the document conversion pipeline. The error message is:

AttributeError: Can only use .str accessor with string values!. Did you mean: 'std'?

This error occurs in the method _run_tesseract when it attempts to filter the “text” column of a DataFrame using the pandas .str accessor. The error is intermittent—occurring on about 25% of files—and prevents output from being generated for those documents.
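
For anyone who wants to reproduce the failure without the affected PDFs, here is a minimal sketch of one way the "text" column can end up non-string. It assumes the Tesseract TSV output is loaded with pandas.read_csv using default dtype inference, which has not been confirmed against the Docling source here. The two TSV rows are made up for illustration; only the column layout follows Tesseract's TSV format.

```python
import io

import pandas as pd

# Hypothetical two-row TSV: a page-level row with empty text and one word
# whose text happens to be purely numeric.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t800\t600\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t40\t12\t96.5\t2025\n"
)

# With default inference, an empty cell becomes NaN and "2025" becomes a
# number, so the whole column is typed as float64 instead of object/str.
df = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")
print(df["text"].dtype)

error_message = None
try:
    # Same filter expression as in the traceback.
    df[df["text"].notnull() & (df["text"].str.strip() != "")]
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

If this is what happens, it would fit the intermittent pattern: only pages whose recognized text is entirely numeric or empty would produce a non-string column.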

Steps to Reproduce:

Use Docling (version 2.15.1) on Windows 11 with Python 3.10 and pandas 2.3.x.
Configure the PDF pipeline with OCR enabled (using Tesseract CLI OCR, e.g., via TesseractCliOcrOptions).
Process a set of fully rasterized PDF documents (we have observed the error on some files, e.g., "FAKE, NAME3.pdf", but not on most files). These files OCR well when other OCR engines are used.

Actual Behavior:
For a subset of PDF files, Docling fails during the OCR processing stage with the following traceback (excerpt):

...
File "docling\models\tesseract_ocr_cli_model.py", line 106, in _run_tesseract
df_filtered = df[df["text"].notnull() & (df["text"].str.strip() != "")]
^^^^^^^^^^^^^^
File "...pandas\core\accessor.py", line 245, in _validate
raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!. Did you mean: 'std'?

Troubleshooting Attempts:
We have attempted several monkey patches to address this error, for example:

Filling NaN and Applying str() Conversion:
We tried:

df["text"] = df["text"].fillna("").apply(lambda x: str(x))
df_filtered = df[df["text"].str.strip() != ""]

Using List Comprehension and Forcing a Pandas Series with dtype="string":
We also attempted:

df["text"] = pd.Series([("" if pd.isna(x) else str(x)) for x in df["text"]], index=df.index, dtype="string")
df_filtered = df[df["text"].str.strip() != ""]

Row-wise Filtering Without .str:
Finally, we tried:

df_filtered = df[df["text"].apply(lambda x: isinstance(x, str) and x.strip() != "")]

Despite these approaches, the error persists on some files. We even attempted reloading the module with importlib.reload to ensure that our patch was in effect, but the failure still occurs.
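
For comparison, the attempts above can be folded into one defensive filter, sketched below. The helper name filter_nonempty_text is hypothetical and not part of Docling. It casts the column to the pandas "string" dtype before touching .str, so it should not raise on a numeric or all-NaN column.

```python
import pandas as pd


def filter_nonempty_text(df: pd.DataFrame, column: str = "text") -> pd.DataFrame:
    """Keep rows whose text is present and not just whitespace.

    Hypothetical helper for illustration; not Docling's implementation.
    """
    # Cast first: NaN becomes a missing string and numbers become their string
    # form (a float such as 2025.0 becomes "2025.0", not "2025").
    text = df[column].astype("string")
    mask = text.notna() & (text.str.strip() != "")
    # Treat any remaining missing mask entries as "drop this row".
    return df[mask.fillna(False)]


# Usage on a column that pandas inferred as float64:
frame = pd.DataFrame({"conf": [-1.0, 96.5, 90.0], "text": [float("nan"), 2025.0, 7.0]})
print(filter_nonempty_text(frame))
```

The cast keeps the filter from failing, but numeric text picks up a float suffix, so fixing the dtype where the TSV is parsed is probably the cleaner place. Note also that the third attempt above never touches .str, so if the same traceback still appears it may be worth checking that the patched function is the one the pipeline actually calls.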

Environment Details:

OS: Windows 11
Python Version: 3.10
Pandas Version: 2.3.x
Docling Version: 2.15.1
Tesseract CLI OCR Options: Configured via TesseractCliOcrOptions with force_full_page_ocr=True
PDF Backend: Tested with both DLPARSE_V2 and pypdfium

Additional Context:
We are processing fully rasterized PDF patient charts (approximately 50 pages each). Our other OCR engines (EasyOCR, PyTesseract) do not exhibit this error, but the Tesseract CLI branch fails on some files. The error appears to occur early in the Docling pipeline during table and layout processing.

Questions:

Has anyone encountered this issue with non-string values in the “text” column when using Docling’s Tesseract CLI OCR?
Is there a recommended way to pre-process or intercept the DataFrame before filtering so that the .str accessor can be used reliably?
Could this be a bug in the Docling implementation of _run_tesseract, and if so, is there a workaround or patch we should apply upstream?
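
On the pre-processing question, one possible direction is to pin the dtype of the "text" column when the TSV is parsed, rather than patching the filter afterwards. This is only a sketch under the assumption that the TSV is read with pandas.read_csv; read_tesseract_tsv is a hypothetical name, and the sample rows are invented.

```python
import csv
import io

import pandas as pd


def read_tesseract_tsv(tsv_text: str) -> pd.DataFrame:
    """Parse Tesseract TSV output, forcing the text column to strings.

    Hypothetical helper for illustration; not Docling's implementation.
    """
    return pd.read_csv(
        io.StringIO(tsv_text),
        sep="\t",
        quoting=csv.QUOTE_NONE,  # a stray double quote in OCR text is kept literally
        dtype={"text": str},     # numeric-looking words stay as text, e.g. "2025"
    )


sample = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t800\t600\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t40\t12\t96.5\t2025\n"
)
parsed = read_tesseract_tsv(sample)

# The original filter expression from the traceback now runs without error.
kept = parsed[parsed["text"].notnull() & (parsed["text"].str.strip() != "")]
print(kept[["conf", "text"]])
```

Empty cells still come back as missing values, which is why the notnull() check in the original filter is kept.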
...
This is my first GitHub report, so let me know if I did anything wrong. AI outlined it for me, but it's largely human generated.

@noodleclaus added the bug label Feb 3, 2025
@josippavicic

same here
