Fix document loading bug #10
VikParuchuri commented on Oct 8, 2024
- Fixes a bug where input paths were assumed to always be of type str, which is not always the case.
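As an illustration of the kind of assumption described above (a minimal sketch, not the actual patch in this PR), a loader can normalize any path-like value instead of assuming `str`. The helper name `normalize_pdf_path` is hypothetical and is not part of pdftext.

```python
import os


def normalize_pdf_path(pdf_path):
    """Return pdf_path as a str, accepting str, bytes, or os.PathLike input.

    Hypothetical helper for illustration only; not part of pdftext.
    """
    # os.fspath returns str/bytes unchanged, calls __fspath__ on path-like
    # objects (e.g. pathlib.Path), and raises TypeError for anything else.
    path = os.fspath(pdf_path)
    if isinstance(path, bytes):
        # Decode byte paths using the filesystem encoding.
        path = os.fsdecode(path)
    return path
```

With this approach, callers passing a `pathlib.Path` and callers passing a plain string end up on the same code path, and unsupported types fail early with a `TypeError`.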
I guess the reason @hector-sherpas did that is the many issues in marker's bug tracker with some caller(s) in pdftext apparently passing a However, this attempt allowed only string input. The right fix would be to check just for
I think the issues there were with mismatched pdftext/marker versions, which got cached in some regional PyPI caches. Not sure why the caches didn't clear on update.
Also @mara004, thanks for all the comments! If you're interested in doing some paid maintainer work on these repos, please email me :)
As @mara004 says, I did it this way so that it could be used from Marker. There, a PdfDocument object is instantiated and then pdftext's dictionary_output function is called with the path of the document, which instantiates a PdfDocument object again. As a result, anything that was done with the first PdfDocument instance in Marker (e.g. preprocessing) is lost. Also, I only handled the str case because of the type of the pdftext argument in question:

    import argparse

    def main():
        parser = argparse.ArgumentParser(description="Extract plain text from PDF. Not guaranteed to be in order.")
        # The input path is declared as a plain string here.
        parser.add_argument("pdf_path", type=str, help="Path to the PDF file")
        parser.add_argument("--out_path", type=str, help="Path to the output text file, defaults to stdout", default=None)
        parser.add_argument("--json", action="store_true", help="Output json instead of plain text", default=False)
        parser.add_argument("--sort", action="store_true", help="Attempt to sort the text by reading order", default=False)
        parser.add_argument("--keep_hyphens", action="store_true", help="Keep hyphens in words", default=False)
        parser.add_argument("--pages", type=str, help="Comma separated pages to extract, like 1,2,3", default=None)
        parser.add_argument("--keep_chars", action="store_true", help="Keep character level information", default=False)
        parser.add_argument("--workers", type=int, help="Number of workers to use for parallel processing", default=None)
        args = parser.parse_args()

That's why I considered those two cases: an argument of type str (already supported), plus a PdfDocument object, which I added. With the current change it can't be used from Marker, as I said. The options given by @mara004 would cover more cases and allow it to be used from Marker in this way.
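The two cases discussed above (a path, or an already-opened document that should be reused) can be sketched as follows. This is only an illustration under stated assumptions: `StubDocument` is a hypothetical stand-in for a real document class such as pypdfium2's PdfDocument, and `load_document` is a hypothetical name, not a pdftext function.

```python
import os


class StubDocument:
    """Hypothetical stand-in for a PDF document class (e.g. PdfDocument)."""

    def __init__(self, path):
        self.path = path


def load_document(pdf_input, document_cls=StubDocument):
    """Return a document object, reusing pdf_input if it already is one."""
    if isinstance(pdf_input, document_cls):
        # The caller already opened (and possibly preprocessed) the document,
        # so return the same instance instead of opening the file again.
        return pdf_input
    if isinstance(pdf_input, (str, os.PathLike)):
        # A path was given: open a new document from it.
        return document_cls(os.fspath(pdf_input))
    raise TypeError(f"Unsupported input type: {type(pdf_input).__name__}")
```

Returning the existing instance unchanged is what would preserve any work the caller did on it before handing it over, which is the concern raised for the Marker use case.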
Thanks a lot for the offer, but I think pypdfium2, its ctypesgen fork, and some other projects I'm trying to get off the ground are more than enough work for me. Also, layout analysis / machine learning is not my field. pdftext/marker's internals are beyond my understanding and I could never have created what you have done here. All I can do is try to help a bit with pypdfium2 integration. Out of interest, I occasionally search for pypdfium2 issues in the bug trackers of popular embedders, and leave comments where I see fit. That said, I'm glad you seem to perceive my comments as helpful rather than annoying. :)