Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix document loading bug #10

Merged
merged 1 commit into from
Oct 8, 2024
Merged

Fix document loading bug #10

merged 1 commit into from
Oct 8, 2024

Conversation

VikParuchuri
Copy link
Owner

  • There was a bug where it was assumed that input paths would always be of str type - this is not always the case

@VikParuchuri VikParuchuri merged commit 56af2c1 into master Oct 8, 2024
1 check passed
@mara004
Copy link
Contributor

mara004 commented Oct 8, 2024

I guess the reason @hector-sherpas did that is the many issues in marker's bug tracker with some caller(s) in pdftext apparently passing a PdfDocument instance into _load_pdf(): VikParuchuri/marker#235, VikParuchuri/marker#183, VikParuchuri/marker#137

However, this attempt allowed only string input. The right fix would be to check just for isinstance(..., PdfDocument), and pass the input to PdfDocument only if False.
Or even better, fix the callers that already have a PdfDocument object to not call _load_pdf() on that object again.

@VikParuchuri
Copy link
Owner Author

VikParuchuri commented Oct 8, 2024

I think the issues there were with mismatched pdftext/marker versions, which got cached in some regional pypi caches. Not sure why the caches didn't clear on update

@VikParuchuri
Copy link
Owner Author

Also @mara004 , thanks for all the comments! If you're interested in doing some paid maintainer work on these repos, please email me :)

@hector-sherpas
Copy link

hector-sherpas commented Oct 9, 2024

I did it this way, as says @mara004, so that it could be used from Marker, where the PdfDocument object is instantiated and then the dictionary_output function of pdftext is called, passing it the path of the document where the PdfDocument object is instantiated again. Therefore, what would have been done with the first instance of PdfDocument in Marker (e.g. preprocessing) is lost.

Also, I only put the str case because of the type of the pdftext argument in question.

def main():
    parser = argparse.ArgumentParser(description="Extract plain text from PDF.  Not guaranteed to be in order.")
    parser.add_argument("pdf_path", type=str, help="Path to the PDF file")
    parser.add_argument("--out_path", type=str, help="Path to the output text file, defaults to stdout", default=None)
    parser.add_argument("--json", action="store_true", help="Output json instead of plain text", default=False)
    parser.add_argument("--sort", action="store_true", help="Attempt to sort the text by reading order", default=False)
    parser.add_argument("--keep_hyphens", action="store_true", help="Keep hyphens in words", default=False)
    parser.add_argument("--pages", type=str, help="Comma separated pages to extract, like 1,2,3", default=None)
    parser.add_argument("--keep_chars", action="store_true", help="Keep character level information", default=False)
    parser.add_argument("--workers", type=int, help="Number of workers to use for parallel processing", default=None)
    args = parser.parse_args()

That's why I considered those two cases, argument of type str (already there) and I added that it could be a PdfDocument object. With the current change it can't be used from Marker as I said.

The options given by @mara004 allow you to expand the cases and use it from Marker in this way.

@mara004
Copy link
Contributor

mara004 commented Oct 9, 2024

Also @mara004 , thanks for all the comments! If you're interested in doing some paid maintainer work on these repos, please email me :)

Thanks a lot for the offer, but I think pypdfium2, its ctypesgen fork, and some other projects I'm trying to get off ground are more than enough work for me.1 Also, layout analysis / machine learning is not my field. pdftext/marker's internals are beyond my understanding and I could never have created what you have done here. All I can do is try to help a bit with pypdfium2 integration. Out of interest, I occasionally search for pypdfium2 issues in the bug trackers of popular embedders, and leave comments where I see fit.

That said, I'm glad you seem to perceive my comments as helpful rather than annoying. :)
Apologies in case I'm too noisy or my choice of words tends to sound rude at times.

Footnotes

  1. TBH, it has all grown a bit above my head, and I'm not sure how long I can continue maintaining the stuff. I'm sort of looking for a successor...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants