Description:
When processing some PDF files using Docling’s Tesseract CLI OCR engine (via the tesseract_ocr_cli_model), an error is raised during the document conversion pipeline. The error message is:
AttributeError: Can only use .str accessor with string values!. Did you mean: 'std'?
This error occurs in the method _run_tesseract when it attempts to filter the “text” column of a DataFrame using the pandas .str accessor. The error is intermittent, occurring on about 25% of files, and prevents output from being generated for those documents.
Steps to Reproduce:
Use Docling (version 2.15.1) on Windows 11 with Python 3.10 and pandas 2.3.x.
Configure the PDF pipeline with OCR enabled (using Tesseract CLI OCR, e.g., via TesseractCliOcrOptions).
Process a set of fully rasterized PDF documents (we have observed the error on some files, e.g., "FAKE, NAME3.pdf", but not on most files). These files OCR well when other OCR engines are used.
Actual Behavior:
For a subset of PDF files, Docling fails during the OCR processing stage with the following traceback (excerpt):
...
File "docling\models\tesseract_ocr_cli_model.py", line 106, in _run_tesseract
df_filtered = df[df["text"].notnull() & (df["text"].str.strip() != "")]
^^^^^^^^^^^^^^
File "...pandas\core\accessor.py", line 245, in _validate
raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!. Did you mean: 'std'?
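For illustration, the same exception can be triggered outside Docling. The sketch below rests on an unconfirmed assumption: that the DataFrame is built by parsing Tesseract's TSV output with pandas, and that on the failing pages every recognized token looks numeric, so pandas infers a float column instead of strings. The TSV content and the reduced column set are made up for the example.

```python
import io

import pandas as pd

# Hypothetical TSV resembling OCR output for a page whose only recognized
# tokens are numbers; the last row has an empty text field.
tsv = "level\tconf\ttext\n5\t96\t2024\n5\t91\t17\n5\t-1\t\n"

# With default type inference, a column of numbers plus a missing value
# is read as float64 rather than as strings.
inferred = pd.read_csv(io.StringIO(tsv), sep="\t")
print(inferred["text"].dtype)

error_message = None
try:
    # Same filter expression as in the traceback above.
    inferred[inferred["text"].notnull() & (inferred["text"].str.strip() != "")]
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

If this is what happens on the failing files, it would also explain why only some documents are affected: a single non-numeric token on the page is enough for pandas to read the column as strings.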
Troubleshooting Attempts:
We have attempted several monkey patches to address this error, for example:
Filling NaN and Applying str() Conversion:
We tried:
df["text"] = df["text"].fillna("").apply(lambda x: str(x))
df_filtered = df[df["text"].str.strip() != ""]
Using List Comprehension and Forcing a Pandas Series with dtype="string":
We also attempted:
df["text"] = pd.Series([("" if pd.isna(x) else str(x)) for x in df["text"]], index=df.index, dtype="string")
df_filtered = df[df["text"].str.strip() != ""]
Row-wise Filtering Without .str:
Finally, we tried:
df_filtered = df[df["text"].apply(lambda x: isinstance(x, str) and x.strip() != "")]
Despite these approaches, the error persists on some files. We even attempted reloading the module with importlib.reload to ensure that our patch was in effect, but the failure still occurs.
Environment Details:
OS: Windows 11
Python Version: 3.10
Pandas Version: 2.3.x
Docling Version: 2.15.1
Tesseract CLI OCR Options: Configured via TesseractCliOcrOptions with force_full_page_ocr=True
PDF Backend: Tested with both DLPARSE_V2 and pypdfium
Additional Context:
We are processing fully rasterized PDF patient charts (approximately 50 pages each). Our other OCR engines (EasyOCR, PyTesseract) do not exhibit this error, but the Tesseract CLI branch fails on some files. The error appears to occur early in the Docling pipeline during table and layout processing.
Questions:
Has anyone encountered this issue with non-string values in the “text” column when using Docling’s Tesseract CLI OCR?
Is there a recommended way to pre-process or intercept the DataFrame before filtering so that the .str accessor can be used reliably?
Could this be a bug in the Docling implementation of _run_tesseract, and if so, is there a workaround or patch we should apply upstream?
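Regarding the second question, here is a minimal sketch of two possible ways to make the filter independent of the inferred dtype. It again assumes the frame comes from pandas.read_csv on TSV output, which we have not confirmed in the Docling source. The helper name filter_nonempty_text and the sample TSV are hypothetical and not part of Docling.

```python
import io

import pandas as pd


def filter_nonempty_text(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose 'text' is present and not blank, whatever the column dtype."""
    text = df["text"]
    # Cast to str only for the blank check; missing values are excluded by notna().
    mask = text.notna() & (text.astype(str).str.strip() != "")
    return df[mask]


# Hypothetical TSV with numeric-looking tokens and one empty text field.
tsv = "level\tconf\ttext\n5\t96\t2024\n5\t91\t17\n5\t-1\t\n"

# Option 1: leave type inference alone and filter defensively.
inferred = pd.read_csv(io.StringIO(tsv), sep="\t")
kept_inferred = filter_nonempty_text(inferred)

# Option 2: ask pandas to read the text column as strings up front,
# so the original .str-based filter would also work.
typed = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"text": str})
kept_typed = typed[typed["text"].notnull() & (typed["text"].str.strip() != "")]

print(len(kept_inferred), kept_typed["text"].tolist())
```

Option 2 keeps tokens such as "2024" as text instead of turning them into 2024.0, which seems preferable for OCR output, but it only helps if the change is made where the TSV is actually read.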
...
This is my first GitHub report; if I did it wrong, let me know. AI outlined it for me, but it's largely human generated.