AttributeError: 'DictionaryObject' object has no attribute 'get_data' #2995

neeraj9 · 2024-12-07T17:42:26Z

Trying to extract text from one of the PDF led to an error in extracting text.

Additional info when error happened (see traceback later)

s.get_object() = {'/Filter': '/FlateDecode', '/Length': 629}
dir(s.get_object()) = ['__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__ior__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__or__', '__orig_bases__', '__parameters__', '__protocol_attrs__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__ror__', '__setattr__', '__setitem__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_clone', '_is_protocol', '_is_runtime_protocol', '_reference_clone', 'clear', 'clone', 'copy', 'fromkeys', 'get', 'get_inherited', 'get_object', 'hash_bin', 'hash_func', 'hash_value', 'hash_value_data', 'indirect_reference', 'items', 'keys', 'pop', 'popitem', 'raw_get', 'read_from_stream', 'replicate', 'setdefault', 'update', 'values', 'write_to_stream', 'xmp_metadata']
type(s.get_object()) = <class 'pypdf.generic._data_structures.DictionaryObject'>

Added traces to pypdf/generic/_data_structures.py

if isinstance(stream, ArrayObject):
                data = b""
                for s in stream:
                    print(f"s.get_object() = {s.get_object()}")
                    print(f"dir(s.get_object()) = {dir(s.get_object())}")
                    print(f"type(s.get_object()) = {type(s.get_object())}")
                    data += s.get_object().get_data()
                    if len(data) == 0 or data[-1] != b"\n":
                        data += b"\n"
                super().set_data(bytes(data))
            else:
                stream_data = stream.get_data()
                assert stream_data is not None
                super().set_data(stream_data)

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-11-10.0.22631-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

for page_num in range_of_pages:
        page = pdf_reader.pages[page_num]
        page_text = page.extract_text()
        page_text = page_text.strip()
        if not page_text:
            page_num_without_text.append(page_num + 1)
        page_texts.append(page_text)

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

Traceback

This is the complete traceback I see:

  File "venv\Lib\site-packages\pypdf\_page.py", line 2397, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_page.py", line 1882, in _extract_text
    content = ContentStream(content, pdf, "bytes")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\generic\_data_structures.py", line 1181, in __init__
    data += s.get_object().get_data()
            ^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DictionaryObject' object has no attribute 'get_data'

6fa5fd46-5f98-4a67-800d-5e2362b0164f.pdf

Workaround to get past issue

This is NOT a FIX, but a temporary workaround to get past the text extraction error.

diff --git a/pypdf/_page.py b/pypdf/_page.py
index 5cbbc49..e5e9ee0 100644
--- a/pypdf/_page.py
+++ b/pypdf/_page.py
@@ -2108,15 +2108,16 @@ class PageObject(DictionaryObject):
             elif operator == b"TJ":
                 # The space width may be smaller than the font width, so the width should be 95%.
                 _confirm_space_width = _space_width * 0.95
-                for op in operands[0]:
-                    if isinstance(op, (str, bytes)):
-                        process_operation(b"Tj", [op])
-                    if isinstance(op, (int, float, NumberObject, FloatObject)) and (
-                        (abs(float(op)) >= _confirm_space_width)
-                        and (len(text) > 0)
-                        and (text[-1] != " ")
-                    ):
-                        process_operation(b"Tj", [" "])
+                if len(operands) > 0:
+                    for op in operands[0]:
+                        if isinstance(op, (str, bytes)):
+                            process_operation(b"Tj", [op])
+                        if isinstance(op, (int, float, NumberObject, FloatObject)) and (
+                            (abs(float(op)) >= _confirm_space_width)
+                            and (len(text) > 0)
+                            and (text[-1] != " ")
+                        ):
+                            process_operation(b"Tj", [" "])
             elif operator == b"Do":
                 output += text
                 if visitor_text is not None:
diff --git a/pypdf/generic/_data_structures.py b/pypdf/generic/_data_structures.py
index e525b87..08a2d84 100644
--- a/pypdf/generic/_data_structures.py
+++ b/pypdf/generic/_data_structures.py
@@ -1178,7 +1178,10 @@ class ContentStream(DecodedStreamObject):
             if isinstance(stream, ArrayObject):
                 data = b""
                 for s in stream:
-                    data += s.get_object().get_data()
+                    try:
+                        data += s.get_object().get_data()
+                    except AttributeError:
+                        pass
                     if len(data) == 0 or data[-1] != b"\n":
                         data += b"\n"
                 super().set_data(bytes(data))

The text was updated successfully, but these errors were encountered:

stefan6419846 · 2024-12-07T18:04:11Z

Thanks for the report.

I am not completely sure about this - there already are lots of issues reported beforehand. The underlying issue is an invalid content stream specifier:

^M65 0 obj^M<</Filter/FlateDecode/Length 629^Xck["^Sª<82>ÑK^

A valid stream specifier would look like this:

^M64 0 obj^M<</Filter/FlateDecode/Length 544>>stream^M
H<89><9c>TKo^SA^L¾ï¯ðq^Gu<9c>y?<8e>

stefan6419846 · 2024-12-07T20:15:41Z

By the way: The proposed workaround seems to include fixes unrelated to this issue.

neeraj9 · 2024-12-08T05:00:23Z

By the way: The proposed workaround seems to include fixes unrelated to this issue.

Yes. There was another issue wherein operands was empty and this diff also went in with this bug when it should have been another one.

Closes py-pdf#2995.

stefan6419846 added the is-uncaught-exception Use this label only for issues caused by broken PDF documents that cannot be recovered. label Dec 7, 2024

stefan6419846 added a commit to stefan6419846/pypdf that referenced this issue Dec 19, 2024

ROB: Deal with content streams not containing streams

18b92bf

Closes py-pdf#2995.

stefan6419846 linked a pull request Dec 19, 2024 that will close this issue

ROB: Deal with content streams not containing streams #3005

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AttributeError: 'DictionaryObject' object has no attribute 'get_data' #2995

AttributeError: 'DictionaryObject' object has no attribute 'get_data' #2995

neeraj9 commented Dec 7, 2024 •

edited

Loading

stefan6419846 commented Dec 7, 2024

stefan6419846 commented Dec 7, 2024

neeraj9 commented Dec 8, 2024

AttributeError: 'DictionaryObject' object has no attribute 'get_data' #2995

AttributeError: 'DictionaryObject' object has no attribute 'get_data' #2995

Comments

neeraj9 commented Dec 7, 2024 • edited Loading

Environment

Code + PDF

Traceback

Workaround to get past issue

stefan6419846 commented Dec 7, 2024

stefan6419846 commented Dec 7, 2024

neeraj9 commented Dec 8, 2024

neeraj9 commented Dec 7, 2024 •

edited

Loading