Merge pull request #503 from VikParuchuri/dev
Table improvements
VikParuchuri authored Jan 24, 2025
2 parents 0569b1a + 989c697 commit 8a2a845
Showing 102 changed files with 142,634 additions and 44,921 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -3,9 +3,9 @@
 Marker converts PDFs and images to markdown, JSON, and HTML quickly and accurately.

 - Supports a range of documents in all languages
-- Removes headers/footers/other artifacts
-- Formats tables, forms, equations, links, and code blocks
+- Formats tables, forms, equations, links, references, and code blocks
 - Extracts and saves images along with the markdown
+- Removes headers/footers/other artifacts
 - Easily extensible with your own formatting and logic
 - Optionally boost accuracy with an LLM
 - Works on GPU, CPU, or MPS
@@ -16,7 +16,7 @@ Marker is a pipeline of deep learning models:

 - Extract text, OCR if necessary (heuristics, [surya](https://github.com/VikParuchuri/surya))
 - Detect page layout and find reading order ([surya](https://github.com/VikParuchuri/surya))
-- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify). [tabled](https://github.com/VikParuchuri/tabled))
+- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify), [surya](https://github.com/VikParuchuri/surya))
 - Optionally use an LLM to improve quality
 - Combine blocks and postprocess complete text

1 change: 1 addition & 0 deletions benchmarks/table/table.py
@@ -175,6 +175,7 @@ def main(
 for th_tag in marker_table_soup.find_all('th'):
     th_tag.name = 'td'
 marker_table_html = str(marker_table_soup)
+marker_table_html = marker_table_html.replace("<br>", " ") # Fintabnet uses spaces instead of newlines
 marker_table_html = marker_table_html.replace("\n", " ") # Fintabnet uses spaces instead of newlines
 gemini_table_html = gemini_table.replace("\n", " ") # Fintabnet uses spaces instead of newlines
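
The normalization in this hunk can be tried in isolation. The following is a minimal sketch, assuming plain string and regex handling in place of the BeautifulSoup calls used in `table.py`; `normalize_table_html` is a hypothetical name and not part of the benchmark. It also matches the `<br/>` and `<br />` spellings, on the assumption that an HTML serializer may emit those forms rather than a bare `<br>`.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
# Matches the start of <th ...> and </th>, but not <thead> or </thead>.
_TH_TAG = re.compile(r"<(/?)th\b", flags=re.IGNORECASE)


def normalize_table_html(table_html: str) -> str:
    """Hypothetical helper: flatten table HTML for comparison with FinTabNet-style targets."""
    # Treat header cells as ordinary cells, keeping any attributes.
    table_html = _TH_TAG.sub(r"<\1td", table_html)
    # Replace line-break tags with a single space.
    table_html = _BR_TAG.sub(" ", table_html)
    # Replace literal newlines with a single space.
    return table_html.replace("\n", " ")


if __name__ == "__main__":
    sample = "<table><tr><th>Net<br>income</th></tr>\n<tr><td>1<br/>2</td></tr></table>"
    print(normalize_table_html(sample))
```

Applying the same function to both the predicted and the reference table keeps the comparison symmetric, so differences in line-break conventions do not count as errors.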
