Problem detecting columns / VRAM Requirements GPU acceleration #110

tilllt · 2024-05-04T21:01:06Z

Hey there,

I am trying to convert election position papers of various parties for the upcoming European elections, so that I can process them further in an LLM for an educational project.

After several unsuccessful attempts using ChatGPT and other tools to convert the PDF documents into text only - or markdown - I found your tool and the result is pretty good.

Unfortunately it gets confused by the column reading order occasionally, so paragraphs pop up in completely unrelated sections of the document. Is there anything I can do to optimize the process? The document is really too long to detect and fix all errors manually...

This is the example I am talking about:
https://cms.gruene.de/uploads/assets/20240306_Reader_EU-Wahlprogramm2024_A4.pdf

Thanks

VikParuchuri · 2024-05-05T21:16:15Z

I'm working on a new version (still WIP) that should fix this. Try the dev branch and see how it does.

VikParuchuri · 2024-05-09T20:37:53Z

Should be fixed by this - #116

tilllt · 2024-05-11T07:15:37Z

I tried the dev version after you mentioned it, and it went through the linked example without errors. Thanks for the great work.

Additionally, I fiddled a bit with the CPU / CUDA torch settings, but it seems my ancient GTX 1060 with 6 GB of VRAM cannot be used to accelerate your tool. Torch kept complaining about a lack of memory. The models needed by marker seem to add up to a 5.6 GB VRAM requirement, which is apparently more than my GTX 1060 can provide in reality.
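For anyone hitting the same limit, here is a minimal sketch of checking free VRAM before choosing a device. The 5.6 GB figure is the one reported above; the 15% headroom and the helper names `pick_device` and `free_vram_bytes` are assumptions for illustration, not part of marker. How the chosen device is then passed to marker is left out, since this thread does not say.

```python
from typing import Optional

GIB = 1024 ** 3


def pick_device(free_bytes: Optional[int], required_bytes: int,
                headroom: float = 0.15) -> str:
    """Return "cuda" only if free VRAM covers the requirement plus headroom.

    headroom is an assumed safety margin for activations and allocator
    overhead; it is not a value taken from marker.
    """
    if free_bytes is None:
        return "cpu"
    needed = required_bytes * (1 + headroom)
    return "cuda" if free_bytes >= needed else "cpu"


def free_vram_bytes() -> Optional[int]:
    """Free memory on the current CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free, _total = torch.cuda.mem_get_info()
    return free


if __name__ == "__main__":
    required = int(5.6 * GIB)  # figure reported in this thread
    print(pick_device(free_vram_bytes(), required))
```

With these assumed numbers, a card that reports 6 GiB free would still fall back to CPU, which would be consistent with the out-of-memory complaints described above. A card's free memory is usually below its nominal size because the driver and any attached display take a share.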

I went back to CPU processing and later ran into problems with some documents: the analysis started, got to about 75%, and then ended abruptly before the document was finished. There was no error message explaining why it stopped.

As I said, that was still on the dev branch. I will try your new v2 version today to see if it works better and report back.

VikParuchuri · 2024-05-12T02:12:21Z

Let me know if you see the "things silently failing" issue again, and please send the PDFs if possible. I think there is a memory leak with certain kinds of PDFs, but I haven't been able to track it down. OOM would match what you're describing.
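One way to tell an out-of-memory kill apart from an ordinary crash is to run the conversion as a child process and inspect its return code. The sketch below assumes a POSIX system, where `subprocess` reports a child killed by a signal as a negative return code (the Linux OOM killer uses SIGKILL, so -9). The helper names are hypothetical and the child command is a stand-in, not the real marker invocation.

```python
import signal
import subprocess
import sys
from typing import List


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable reason."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # Negative codes mean the child was terminated by a signal (POSIX).
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"failed with exit code {returncode}"


def run_and_report(cmd: List[str]) -> str:
    """Run cmd to completion and describe how it ended."""
    proc = subprocess.run(cmd)
    return describe_exit(proc.returncode)


if __name__ == "__main__":
    # Stand-in child process; replace with the actual conversion command.
    print(run_and_report([sys.executable, "-c", "print('converting...')"]))
```

A "killed by SIGKILL" result with no Python traceback would be consistent with the OOM explanation, although other things can send SIGKILL too, so the system log is still worth checking.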

tilllt · 2024-05-13T05:32:42Z

> Let me know if you see the "things silently failing" issue again

Unfortunately, I did see it again after switching from dev back to the main v2 branch, while also switching back to CPU processing and disabling CUDA.

The document I was running on was this:

https://voltdeutschland.org/storage/assets-de/pdf/europawahl_2024/volt-wahlprogramm-europawahl-2024.pdf

VikParuchuri · 2024-05-16T21:46:23Z

I think I fixed this, but it needs to be merged and tested end to end - VikParuchuri/surya#103

VikParuchuri closed this as completed May 9, 2024

tilllt changed the title ~~Problem detecting columns~~ Problem detecting columns / VRAM Requirements GPU acceleration May 11, 2024

wciq1208 mentioned this issue Jul 9, 2024

Crashed in a multi-threaded environment #225

Closed
