Fix document loading bug #10

VikParuchuri · 2024-10-08T13:12:56Z

There was a bug where it was assumed that input paths would always be of str type - this is not always the case

mara004 · 2024-10-08T18:53:02Z

I guess the reason @hector-sherpas did that is the many issues in marker's bug tracker with some caller(s) in pdftext apparently passing a PdfDocument instance into _load_pdf(): VikParuchuri/marker#235, VikParuchuri/marker#183, VikParuchuri/marker#137

However, this attempt allowed only string input. The right fix would be to check just for isinstance(..., PdfDocument), and pass the input to PdfDocument only if False.
Or even better, fix the callers that already have a PdfDocument object to not call _load_pdf() on that object again.

VikParuchuri · 2024-10-08T22:25:02Z

I think the issues there were with mismatched pdftext/marker versions, which got cached in some regional pypi caches. Not sure why the caches didn't clear on update

VikParuchuri · 2024-10-08T22:26:40Z

Also @mara004 , thanks for all the comments! If you're interested in doing some paid maintainer work on these repos, please email me :)

hector-sherpas · 2024-10-09T07:45:12Z

I did it this way, as says @mara004, so that it could be used from Marker, where the PdfDocument object is instantiated and then the dictionary_output function of pdftext is called, passing it the path of the document where the PdfDocument object is instantiated again. Therefore, what would have been done with the first instance of PdfDocument in Marker (e.g. preprocessing) is lost.

Also, I only put the str case because of the type of the pdftext argument in question.

def main():
    parser = argparse.ArgumentParser(description="Extract plain text from PDF.  Not guaranteed to be in order.")
    parser.add_argument("pdf_path", type=str, help="Path to the PDF file")
    parser.add_argument("--out_path", type=str, help="Path to the output text file, defaults to stdout", default=None)
    parser.add_argument("--json", action="store_true", help="Output json instead of plain text", default=False)
    parser.add_argument("--sort", action="store_true", help="Attempt to sort the text by reading order", default=False)
    parser.add_argument("--keep_hyphens", action="store_true", help="Keep hyphens in words", default=False)
    parser.add_argument("--pages", type=str, help="Comma separated pages to extract, like 1,2,3", default=None)
    parser.add_argument("--keep_chars", action="store_true", help="Keep character level information", default=False)
    parser.add_argument("--workers", type=int, help="Number of workers to use for parallel processing", default=None)
    args = parser.parse_args()

That's why I considered those two cases, argument of type str (already there) and I added that it could be a PdfDocument object. With the current change it can't be used from Marker as I said.

The options given by @mara004 allow you to expand the cases and use it from Marker in this way.

mara004 · 2024-10-09T16:30:46Z

Also @mara004 , thanks for all the comments! If you're interested in doing some paid maintainer work on these repos, please email me :)

Thanks a lot for the offer, but I think pypdfium2, its ctypesgen fork, and some other projects I'm trying to get off ground are more than enough work for me.¹ Also, layout analysis / machine learning is not my field. pdftext/marker's internals are beyond my understanding and I could never have created what you have done here. All I can do is try to help a bit with pypdfium2 integration. Out of interest, I occasionally search for pypdfium2 issues in the bug trackers of popular embedders, and leave comments where I see fit.

That said, I'm glad you seem to perceive my comments as helpful rather than annoying. :)
Apologies in case I'm too noisy or my choice of words tends to sound rude at times.

TBH, it has all grown a bit above my head, and I'm not sure how long I can continue maintaining the stuff. I'm sort of looking for a successor... ↩

Fix bug

4c61c85

VikParuchuri merged commit 56af2c1 into master Oct 8, 2024
1 check passed

mara004 mentioned this pull request Oct 8, 2024

TypeError: Invalid input type 'PdfDocument' VikParuchuri/marker#235

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix document loading bug #10

Fix document loading bug #10

VikParuchuri commented Oct 8, 2024

mara004 commented Oct 8, 2024 •

edited

Loading

VikParuchuri commented Oct 8, 2024 •

edited

Loading

VikParuchuri commented Oct 8, 2024

hector-sherpas commented Oct 9, 2024 •

edited

Loading

mara004 commented Oct 9, 2024 •

edited

Loading

Fix document loading bug #10

Fix document loading bug #10

Conversation

VikParuchuri commented Oct 8, 2024

mara004 commented Oct 8, 2024 • edited Loading

VikParuchuri commented Oct 8, 2024 • edited Loading

VikParuchuri commented Oct 8, 2024

hector-sherpas commented Oct 9, 2024 • edited Loading

mara004 commented Oct 9, 2024 • edited Loading

Footnotes

mara004 commented Oct 8, 2024 •

edited

Loading

VikParuchuri commented Oct 8, 2024 •

edited

Loading

hector-sherpas commented Oct 9, 2024 •

edited

Loading

mara004 commented Oct 9, 2024 •

edited

Loading