
Benchmarks #531

Merged (31 commits) on Feb 11, 2025
95c06c8
Update overall benchmark
VikParuchuri Jan 29, 2025
e6e2d7d
Clean up benchmarks
VikParuchuri Jan 30, 2025
70c0b0e
Additional benchmark cleanup
VikParuchuri Jan 30, 2025
bbf4161
Refactor benchmarks
VikParuchuri Jan 30, 2025
9a8da13
Additional fixes
VikParuchuri Jan 30, 2025
720f09a
Bump surya
VikParuchuri Jan 30, 2025
d487f46
Improve bench
VikParuchuri Jan 31, 2025
cfde6d6
add llm text support for references, superscripts etc
iammosespaulr Feb 1, 2025
225ff44
fix typo [skip ci]
iammosespaulr Feb 1, 2025
93deddd
refine prompt [skip ci]
iammosespaulr Feb 1, 2025
4e0fadc
fix llm table merging error
iammosespaulr Feb 3, 2025
277f2db
Add order processor
VikParuchuri Feb 3, 2025
f1f93aa
Add pandoc
VikParuchuri Feb 3, 2025
805d200
Clean up benchmark, make more pluggable
VikParuchuri Feb 4, 2025
75633ca
Finalize dataset uploading
VikParuchuri Feb 4, 2025
d49df6c
Benchmark fixes
VikParuchuri Feb 6, 2025
de9651e
Cleanup texify integration
VikParuchuri Feb 6, 2025
dac0f79
Update property name
VikParuchuri Feb 6, 2025
4ceab29
Bump to newer google client lib
VikParuchuri Feb 6, 2025
364525d
Remove old import
VikParuchuri Feb 6, 2025
e875e5e
Test fixes
VikParuchuri Feb 6, 2025
8534671
Merge pull request #523 from VikParuchuri/dev-mose/add-llmtext-suppor…
VikParuchuri Feb 7, 2025
3d4807a
Add back llm text processor
VikParuchuri Feb 7, 2025
0dab2ce
Merge dev
VikParuchuri Feb 7, 2025
84240c4
Merge pull request #515 from VikParuchuri/vik_bench
VikParuchuri Feb 7, 2025
358b163
More bench options
VikParuchuri Feb 9, 2025
cd98de1
Additional benchmark types
VikParuchuri Feb 10, 2025
cc88b74
Add elo ratings
VikParuchuri Feb 10, 2025
9e42477
README updates
VikParuchuri Feb 11, 2025
7dfafef
Update benchmarks
VikParuchuri Feb 11, 2025
264ed41
Update README
VikParuchuri Feb 11, 2025
33 changes: 0 additions & 33 deletions .github/workflows/benchmark.yml

This file was deleted.

32 changes: 32 additions & 0 deletions .github/workflows/benchmarks.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
name: Integration test

on: [push]

env:
PYTHONIOENCODING: "utf-8"

jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python 3.11
uses: actions/setup-python@v4
with:
python-version: 3.11
- name: Install apt dependencies
run: |
sudo apt-get update
sudo apt-get install -y pandoc
- name: Install python dependencies
run: |
pip install poetry
poetry install
- name: Run benchmark test
run: |
poetry run python benchmarks/overall/overall.py --max_rows 5
poetry run python benchmarks/verify_scores.py conversion_results/benchmark/overall/result.json --type marker
- name: Run table benchmark
run: |
poetry run python benchmarks/table/table.py --max_rows 5
poetry run python benchmarks/verify_scores.py conversion_results/benchmark/table/table.json --type table
4 changes: 0 additions & 4 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -2,10 +2,6 @@ name: CI tests

on: [push]

env:
TORCH_DEVICE: "cpu"
OCR_ENGINE: "surya"

jobs:
tests:
runs-on: ubuntu-latest
4 changes: 0 additions & 4 deletions .github/workflows/scripts.yml
Original file line number Diff line number Diff line change
@@ -2,10 +2,6 @@ name: Test CLI scripts

on: [push]

env:
TORCH_DEVICE: "cpu"
OCR_ENGINE: "surya"

jobs:
tests:
runs-on: ubuntu-latest
141 changes: 91 additions & 50 deletions README.md
Original file line number Diff line number Diff line change
@@ -10,17 +10,25 @@ Marker converts PDFs and images to markdown, JSON, and HTML quickly and accurately.
- Optionally boost accuracy with an LLM
- Works on GPU, CPU, or MPS

## How it works
## Performance

Marker is a pipeline of deep learning models:
<img src="data/images/overall.png" width="800px"/>

- Extract text, OCR if necessary (heuristics, [surya](https://github.com/VikParuchuri/surya))
- Detect page layout and find reading order ([surya](https://github.com/VikParuchuri/surya))
- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify), [surya](https://github.com/VikParuchuri/surya))
- Optionally use an LLM to improve quality
- Combine blocks and postprocess complete text
Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix, as well as other open source tools.

It only uses models where necessary, which improves speed and accuracy.
The above results are from running single PDF pages serially. Marker is significantly faster in batch mode, with a projected throughput of 122 pages/second on an H100 (0.18 seconds per page across 22 processes).

See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

## Hybrid Mode

For the highest accuracy, pass the `--use_llm` flag to use an LLM alongside marker. This will do things like merge tables across pages, format tables properly, and extract values from forms. It uses `gemini-flash-2.0`, which is cheap and fast.

Here is a table benchmark comparing marker, gemini flash alone, and marker with use_llm:

<img src="data/images/table.png" width="400px"/>

As you can see, the use_llm mode offers higher accuracy than marker or gemini alone.

## Examples

@@ -30,14 +38,6 @@ It only uses models where necessary, which improves speed and accuracy.
| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/switch_transformers/switch_trans.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/switch_trans.json) |
| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/multicolcnn/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/multicolcnn.json) |

## Performance

![Benchmark overall](data/images/overall.png)

The above results are with marker setup so it takes ~7GB of VRAM on an A10.

See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

# Commercial usage

I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.
@@ -56,17 +56,6 @@ There's a hosted API for marker available [here](https://www.datalab.to/):

[Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.

# Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

- Marker will only convert block equations
- Tables are not always formatted 100% correctly
- Forms are not converted optimally
- Very complex layouts, with nested tables and forms, may not work

Note: Passing the `--use_llm` flag will mostly solve these issues.

# Installation

You'll need python 3.10+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See [here](https://pytorch.org/get-started/locally/) for more details.
@@ -82,7 +71,7 @@ pip install marker-pdf
First, some configuration:

- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
- Some PDFs, even digital ones, have bad text in them. Set the `force_ocr` flag on the CLI or via configuration to ensure your PDF runs through OCR, or the `strip_existing_ocr` to keep all digital text, and only strip out any existing OCR text.
- Some PDFs, even digital ones, have bad text in them. Set the `force_ocr` flag to ensure your PDF runs through OCR, or the `strip_existing_ocr` flag to keep all digital text and strip out any existing OCR text.

## Interactive App

@@ -219,11 +208,11 @@ rendered = converter("FILEPATH")
text, _, images = text_from_rendered(rendered)
```

This takes all the same configuration as the PdfConverter. You can specify the configuration `--force_layout_block=Table` to avoid layout detection and instead assume every page is a table.
This takes all the same configuration as the PdfConverter. You can specify the configuration `force_layout_block=Table` to avoid layout detection and instead assume every page is a table. Set `output_format=json` to also get cell bounding boxes.

You can also run this via the CLI with
```shell
python convert_single.py FILENAME --use_llm --force_layout_block Table --converter_cls marker.converters.table.TableConverter
marker_single FILENAME --use_llm --force_layout_block Table --converter_cls marker.converters.table.TableConverter --output_format json
```

# Output Formats
@@ -377,36 +366,55 @@ There are some settings that you may find useful if things aren't working the way you expect
Pass the `debug` option to activate debug mode. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.

# Benchmarks

## Overall PDF Conversion
Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.

**Speed**
We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from common crawl. We scored based on a heuristic that aligns text with ground truth text segments, and an LLM as a judge scoring method.
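The heuristic scorer itself isn't shown in this README. As a rough, non-authoritative sketch of the idea (the function name and method here are illustrative, not marker's actual benchmark code), one could align each ground-truth segment against the extracted text and average the best-match coverage:

```python
# Illustrative sketch only: NOT marker's benchmark implementation.
# Scores how well each ground-truth text segment survives extraction,
# using the longest contiguous match as a crude alignment signal.
from difflib import SequenceMatcher

def heuristic_score(extracted: str, reference_segments: list[str]) -> float:
    """Return 0-100: mean fraction of each reference segment recovered."""
    if not reference_segments:
        return 0.0
    fractions = []
    for segment in reference_segments:
        matcher = SequenceMatcher(None, segment, extracted, autojunk=False)
        match = matcher.find_longest_match(0, len(segment), 0, len(extracted))
        fractions.append(match.size / max(len(segment), 1))
    return 100.0 * sum(fractions) / len(fractions)
```

An LLM-as-judge score, by contrast, prompts a model to grade each extraction directly, which is where the separate 1-5 "LLM Score" column comes from.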

| Method | Avg Time | Heuristic Score | LLM Score |
|------------|----------|-----------------|-----------|
| marker | 2.83837 | 95.6709 | 4.23916 |
| llamaparse | 23.348 | 84.2442 | 3.97619 |
| mathpix | 6.36223 | 86.4281 | 4.15626 |
| docling | 3.69949 | 86.7073 | 3.70429 |

| Method | Average Score | Time per page | Time per document |
|---------|----------------|---------------|------------------|
| marker | 0.625115 | 0.234184 | 21.545 |
Benchmarks were run on an H100 for marker and docling; llamaparse and mathpix used their cloud services. We can also look at it by document type:

**Accuracy**
<img src="data/images/per_doc.png" width="1000px"/>

| Method | thinkpython.pdf | switch_trans.pdf | thinkdsp.pdf | crowd.pdf | thinkos.pdf | multicolcnn.pdf |
|---------|----------------|-----------------|--------------|------------|-------------|----------------|
| marker | 0.720347 | 0.592002 | 0.70468 | 0.515082 | 0.701394 | 0.517184 |
| Document Type | Marker heuristic | Marker LLM | Llamaparse Heuristic | Llamaparse LLM | Mathpix Heuristic | Mathpix LLM | Docling Heuristic | Docling LLM |
|----------------------|------------------|------------|----------------------|----------------|-------------------|-------------|-------------------|-------------|
| Scientific paper | 96.6737 | 4.34899 | 87.1651 | 3.96421 | 91.2267 | 4.46861 | 92.135 | 3.72422 |
| Book page | 97.1846 | 4.16168 | 90.9532 | 4.07186 | 93.8886 | 4.35329 | 90.0556 | 3.64671 |
| Other | 95.1632 | 4.25076 | 81.1385 | 4.01835 | 79.6231 | 4.00306 | 83.8223 | 3.76147 |
| Form | 88.0147 | 3.84663 | 66.3081 | 3.68712 | 64.7512 | 3.33129 | 68.3857 | 3.40491 |
| Presentation | 95.1562 | 4.13669 | 81.2261 | 4 | 83.6737 | 3.95683 | 84.8405 | 3.86331 |
| Financial document | 95.3697 | 4.39106 | 82.5812 | 4.16111 | 81.3115 | 4.05556 | 86.3882 | 3.8 |
| Letter | 98.4021 | 4.5 | 93.4477 | 4.28125 | 96.0383 | 4.45312 | 92.0952 | 4.09375 |
| Engineering document | 93.9244 | 4.04412 | 77.4854 | 3.72059 | 80.3319 | 3.88235 | 79.6807 | 3.42647 |
| Legal document | 96.689 | 4.27759 | 86.9769 | 3.87584 | 91.601 | 4.20805 | 87.8383 | 3.65552 |
| Newspaper page | 98.8733 | 4.25806 | 84.7492 | 3.90323 | 96.9963 | 4.45161 | 92.6496 | 3.51613 |
| Magazine page | 98.2145 | 4.38776 | 87.2902 | 3.97959 | 93.5934 | 4.16327 | 93.0892 | 4.02041 |

Peak GPU memory usage during the benchmark is `6GB` for marker. Benchmarks were run on an A10.
## Throughput

**Throughput**
We benchmarked throughput using a [single long PDF](https://www.greenteapress.com/thinkpython/thinkpython.pdf).

Marker takes about 6GB of VRAM on average per task, so you can convert 8 documents in parallel on an A6000.
| Method | Time per page | Time per document | VRAM used |
|---------|---------------|-------------------|-----------|
| marker | 0.18 | 43.42 | 3.17GB |

![Benchmark results](data/images/per_doc.png)
The projected throughput is 122 pages per second on an H100, since we can run 22 individual processes given the VRAM used.
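The projection follows from simple arithmetic over the measured numbers above (0.18 s/page and ~3.17 GB VRAM per process are from the table; the 80 GB H100 capacity is an assumption):

```python
# Back-of-the-envelope projection from the measured per-process numbers.
# Assumes an 80 GB H100; the benchmark uses 22 processes, which leaves
# some VRAM headroom below the naive 80 / 3.17 ceiling.
time_per_page = 0.18      # seconds per page, single process (measured)
vram_per_process = 3.17   # GB per process (measured)
h100_vram = 80            # GB, assumed card capacity

max_processes = int(h100_vram // vram_per_process)  # naive ceiling
processes = 22                                      # value used above
pages_per_second = processes / time_per_page

print(max_processes, round(pages_per_second, 1))
```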

## Table Conversion

Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a tree edit distance based metric to judge both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:
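The benchmark's tree-edit-distance comparison isn't reproduced here. As a hedged illustration of the underlying idea (a toy metric with unit costs, not the benchmark's code; real table scoring such as TEDS also weighs cell text and attributes), an ordered tree edit distance can be written recursively over the rightmost roots:

```python
# Toy ordered-tree edit distance, illustrating the kind of tree-comparison
# metric described above. Trees are (label, children) tuples, children a
# tuple. NOT the benchmark's implementation.
from functools import lru_cache

def tree_size(t):
    label, children = t
    return 1 + sum(tree_size(c) for c in children)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(tree_size(t) for t in f2)
    if not f2:
        return sum(tree_size(t) for t in f1)
    a, b = f1[-1], f2[-1]
    return min(
        forest_dist(f1[:-1] + a[1], f2) + 1,   # delete rightmost root of f1
        forest_dist(f1, f2[:-1] + b[1]) + 1,   # insert rightmost root of f2
        forest_dist(a[1], b[1])                # match roots: compare subtrees,
        + forest_dist(f1[:-1], f2[:-1])        # then the remaining forests,
        + (1 if a[0] != b[0] else 0),          # plus a relabel cost
    )

def tree_similarity(t1, t2):
    """1.0 for identical trees, scaled by the larger tree's node count."""
    dist = forest_dist((t1,), (t2,))
    return 1.0 - dist / max(tree_size(t1), tree_size(t2))
```

For example, a `<table><tr><td></td><td></td></tr></table>` structure compared against the same table with one cell dropped differs by a single node, so the similarity is scaled down accordingly.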

| Avg score | Total tables | use_llm |
|-----------|--------------|---------|
| 0.822 | 54 | False |
| 0.887 | 54 | True |
| Method | Avg score | Total tables |
|------------------|-----------|--------------|
| marker | 0.816 | 99 |
| marker w/use_llm | 0.907 | 99 |
| gemini | 0.829 | 99 |

As the results show, the `--use_llm` flag significantly improves table recognition performance.

@@ -426,16 +434,49 @@ poetry install
Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run the overall benchmark like this:

```shell
python benchmarks/overall.py data/pdfs data/references report.json
python benchmarks/overall.py --methods marker --scores heuristic,llm
```

Options:

- `--use_llm` use an llm to improve the marker results.
- `--max_rows` how many rows to process for the benchmark.
- `--methods` can be `llamaparse`, `mathpix`, `docling`, `marker`. Comma separated.
- `--scores` which scoring functions to use, can be `llm`, `heuristic`. Comma separated.

### Table Conversion
The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:

```shell
python benchmarks/table/table.py table_report.json --max_rows 1000
python benchmarks/table/table.py --max_rows 100
```

Options:

- `--use_llm` uses an llm with marker to improve accuracy.
- `--use_gemini` also benchmarks gemini 2.0 flash.

# How it works

Marker is a pipeline of deep learning models:

- Extract text, OCR if necessary (heuristics, [surya](https://github.com/VikParuchuri/surya))
- Detect page layout and find reading order ([surya](https://github.com/VikParuchuri/surya))
- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify), [surya](https://github.com/VikParuchuri/surya))
- Optionally use an LLM to improve quality
- Combine blocks and postprocess complete text

It only uses models where necessary, which improves speed and accuracy.

# Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

- Marker will only convert block equations
- Very complex layouts, with nested tables and forms, may not work

Note: Passing the `--use_llm` flag will mostly solve these issues.

# Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):
@@ -445,4 +486,4 @@ This work would not have been possible without amazing open source models and datasets
- Pypdfium2/pdfium
- DocLayNet from IBM

Thank you to the authors of these models and datasets for making them available to the community!
Thank you to the authors of these models and datasets for making them available to the community!
Empty file added benchmarks/__init__.py
Empty file.