
Benchmarks #531

Merged (31 commits) on Feb 11, 2025
95c06c8
Update overall benchmark
VikParuchuri Jan 29, 2025
e6e2d7d
Clean up benchmarks
VikParuchuri Jan 30, 2025
70c0b0e
Additional benchmark cleanup
VikParuchuri Jan 30, 2025
bbf4161
Refactor benchmarks
VikParuchuri Jan 30, 2025
9a8da13
Additional fixes
VikParuchuri Jan 30, 2025
720f09a
Bump surya
VikParuchuri Jan 30, 2025
d487f46
Improve bench
VikParuchuri Jan 31, 2025
cfde6d6
add llm text support for references, superscripts etc
iammosespaulr Feb 1, 2025
225ff44
fix typo [skip ci]
iammosespaulr Feb 1, 2025
93deddd
refine prompt [skip ci]
iammosespaulr Feb 1, 2025
4e0fadc
fix llm table merging error
iammosespaulr Feb 3, 2025
277f2db
Add order processor
VikParuchuri Feb 3, 2025
f1f93aa
Add pandoc
VikParuchuri Feb 3, 2025
805d200
Clean up benchmark, make more pluggable
VikParuchuri Feb 4, 2025
75633ca
Finalize dataset uploading
VikParuchuri Feb 4, 2025
d49df6c
Benchmark fixes
VikParuchuri Feb 6, 2025
de9651e
Cleanup texify integration
VikParuchuri Feb 6, 2025
dac0f79
Update property name
VikParuchuri Feb 6, 2025
4ceab29
Bump to newer google client lib
VikParuchuri Feb 6, 2025
364525d
Remove old import
VikParuchuri Feb 6, 2025
e875e5e
Test fixes
VikParuchuri Feb 6, 2025
8534671
Merge pull request #523 from VikParuchuri/dev-mose/add-llmtext-suppor…
VikParuchuri Feb 7, 2025
3d4807a
Add back llm text processor
VikParuchuri Feb 7, 2025
0dab2ce
Merge dev
VikParuchuri Feb 7, 2025
84240c4
Merge pull request #515 from VikParuchuri/vik_bench
VikParuchuri Feb 7, 2025
358b163
More bench options
VikParuchuri Feb 9, 2025
cd98de1
Additional benchmark types
VikParuchuri Feb 10, 2025
cc88b74
Add elo ratings
VikParuchuri Feb 10, 2025
9e42477
README updates
VikParuchuri Feb 11, 2025
7dfafef
Update benchmarks
VikParuchuri Feb 11, 2025
264ed41
Update README
VikParuchuri Feb 11, 2025
33 changes: 0 additions & 33 deletions .github/workflows/benchmark.yml

This file was deleted.

32 changes: 32 additions & 0 deletions .github/workflows/benchmarks.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
name: Integration test

on: [push]

env:
PYTHONIOENCODING: "utf-8"

jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python 3.11
uses: actions/setup-python@v4
with:
python-version: 3.11
- name: Install apt dependencies
run: |
sudo apt-get update
sudo apt-get install -y pandoc
- name: Install python dependencies
run: |
pip install poetry
poetry install
- name: Run benchmark test
run: |
poetry run python benchmarks/overall/overall.py --max_rows 5
poetry run python benchmarks/verify_scores.py conversion_results/benchmark/overall/result.json --type marker
- name: Run table benchmark
run: |
poetry run python benchmarks/table/table.py --max_rows 5
poetry run python benchmarks/verify_scores.py conversion_results/benchmark/table/table.json --type table
4 changes: 0 additions & 4 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -2,10 +2,6 @@ name: CI tests

on: [push]

env:
TORCH_DEVICE: "cpu"
OCR_ENGINE: "surya"

jobs:
tests:
runs-on: ubuntu-latest
4 changes: 0 additions & 4 deletions .github/workflows/scripts.yml
Original file line number Diff line number Diff line change
@@ -2,10 +2,6 @@ name: Test CLI scripts

on: [push]

env:
TORCH_DEVICE: "cpu"
OCR_ENGINE: "surya"

jobs:
tests:
runs-on: ubuntu-latest
141 changes: 91 additions & 50 deletions README.md
Original file line number Diff line number Diff line change
@@ -10,17 +10,25 @@ Marker converts PDFs and images to markdown, JSON, and HTML quickly and accurately.
- Optionally boost accuracy with an LLM
- Works on GPU, CPU, or MPS

## How it works
## Performance

Marker is a pipeline of deep learning models:
<img src="data/images/overall.png" width="800px"/>

- Extract text, OCR if necessary (heuristics, [surya](https://github.com/VikParuchuri/surya))
- Detect page layout and find reading order ([surya](https://github.com/VikParuchuri/surya))
- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify), [surya](https://github.com/VikParuchuri/surya))
- Optionally use an LLM to improve quality
- Combine blocks and postprocess complete text
Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix, as well as other open source tools.

It only uses models where necessary, which improves speed and accuracy.
The above results are from running single PDF pages serially. Marker is significantly faster in batch mode, with a projected throughput of 122 pages/second on an H100 (0.18 seconds per page across 22 processes).

See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

## Hybrid Mode

For the highest accuracy, pass the `--use_llm` flag to use an LLM alongside marker. This will do things like merge tables across pages, format tables properly, and extract values from forms. It uses `gemini-flash-2.0`, which is cheap and fast.

Here is a table benchmark comparing marker, gemini flash alone, and marker with use_llm:

<img src="data/images/table.png" width="400px"/>

As you can see, the use_llm mode offers higher accuracy than marker or gemini alone.

## Examples

@@ -30,14 +38,6 @@ It only uses models where necessary, which improves speed and accuracy.
| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/switch_transformers/switch_trans.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/switch_trans.json) |
| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/multicolcnn/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/multicolcnn.json) |

## Performance

![Benchmark overall](data/images/overall.png)

The above results are with marker setup so it takes ~7GB of VRAM on an A10.

See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

# Commercial usage

I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.
@@ -56,17 +56,6 @@ There's a hosted API for marker available [here](https://www.datalab.to/):

[Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.

# Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

- Marker will only convert block equations
- Tables are not always formatted 100% correctly
- Forms are not converted optimally
- Very complex layouts, with nested tables and forms, may not work

Note: Passing the `--use_llm` flag will mostly solve these issues.

# Installation

You'll need python 3.10+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See [here](https://pytorch.org/get-started/locally/) for more details.
@@ -82,7 +71,7 @@ pip install marker-pdf
First, some configuration:

- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
- Some PDFs, even digital ones, have bad text in them. Set the `force_ocr` flag on the CLI or via configuration to ensure your PDF runs through OCR, or the `strip_existing_ocr` to keep all digital text, and only strip out any existing OCR text.
- Some PDFs, even digital ones, have bad text in them. Set the `force_ocr` flag to ensure your PDF runs through OCR, or the `strip_existing_ocr` flag to keep all digital text and strip out any existing OCR text.

## Interactive App

@@ -219,11 +208,11 @@ rendered = converter("FILEPATH")
text, _, images = text_from_rendered(rendered)
```

This takes all the same configuration as the PdfConverter. You can specify the configuration `--force_layout_block=Table` to avoid layout detection and instead assume every page is a table.
This takes all the same configuration as the PdfConverter. You can specify the configuration `force_layout_block=Table` to avoid layout detection and instead assume every page is a table. Set `output_format=json` to also get cell bounding boxes.

You can also run this via the CLI with
```shell
python convert_single.py FILENAME --use_llm --force_layout_block Table --converter_cls marker.converters.table.TableConverter
marker_single FILENAME --use_llm --force_layout_block Table --converter_cls marker.converters.table.TableConverter --output_format json
```

# Output Formats
@@ -377,36 +366,55 @@ There are some settings that you may find useful if things aren't working the way you expect
Pass the `debug` option to activate debug mode. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.

# Benchmarks

## Overall PDF Conversion
Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.

**Speed**
We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from common crawl. We scored based on a heuristic that aligns text with ground truth text segments, and an LLM as a judge scoring method.
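The heuristic scorer itself isn't shown in this README. As a rough, non-authoritative sketch of the idea (the function name and method here are illustrative, not marker's actual benchmark code), one could align each ground-truth segment against the extracted text and average the best-match coverage:

```python
# Illustrative sketch only: NOT marker's benchmark implementation.
# Scores how well each ground-truth text segment survives extraction,
# using the longest contiguous match as a crude alignment signal.
from difflib import SequenceMatcher

def heuristic_score(extracted: str, reference_segments: list[str]) -> float:
    """Return 0-100: mean fraction of each reference segment recovered."""
    if not reference_segments:
        return 0.0
    fractions = []
    for segment in reference_segments:
        matcher = SequenceMatcher(None, segment, extracted, autojunk=False)
        match = matcher.find_longest_match(0, len(segment), 0, len(extracted))
        fractions.append(match.size / max(len(segment), 1))
    return 100.0 * sum(fractions) / len(fractions)
```

An LLM-as-judge score, by contrast, prompts a model to grade each extraction directly, which is where the separate 1-5 "LLM Score" column comes from.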

| Method | Avg Time | Heuristic Score | LLM Score |
|------------|----------|-----------------|-----------|
| marker | 2.83837 | 95.6709 | 4.23916 |
| llamaparse | 23.348 | 84.2442 | 3.97619 |
| mathpix | 6.36223 | 86.4281 | 4.15626 |
| docling | 3.69949 | 86.7073 | 3.70429 |

| Method | Average Score | Time per page | Time per document |
|---------|----------------|---------------|------------------|
| marker | 0.625115 | 0.234184 | 21.545 |
Benchmarks were run on an H100 for marker and docling; llamaparse and mathpix used their cloud services. We can also look at it by document type:

**Accuracy**
<img src="data/images/per_doc.png" width="1000px"/>

| Method | thinkpython.pdf | switch_trans.pdf | thinkdsp.pdf | crowd.pdf | thinkos.pdf | multicolcnn.pdf |
|---------|----------------|-----------------|--------------|------------|-------------|----------------|
| marker | 0.720347 | 0.592002 | 0.70468 | 0.515082 | 0.701394 | 0.517184 |
| Document Type | Marker heuristic | Marker LLM | Llamaparse Heuristic | Llamaparse LLM | Mathpix Heuristic | Mathpix LLM | Docling Heuristic | Docling LLM |
|----------------------|------------------|------------|----------------------|----------------|-------------------|-------------|-------------------|-------------|
| Scientific paper | 96.6737 | 4.34899 | 87.1651 | 3.96421 | 91.2267 | 4.46861 | 92.135 | 3.72422 |
| Book page | 97.1846 | 4.16168 | 90.9532 | 4.07186 | 93.8886 | 4.35329 | 90.0556 | 3.64671 |
| Other | 95.1632 | 4.25076 | 81.1385 | 4.01835 | 79.6231 | 4.00306 | 83.8223 | 3.76147 |
| Form | 88.0147 | 3.84663 | 66.3081 | 3.68712 | 64.7512 | 3.33129 | 68.3857 | 3.40491 |
| Presentation | 95.1562 | 4.13669 | 81.2261 | 4 | 83.6737 | 3.95683 | 84.8405 | 3.86331 |
| Financial document | 95.3697 | 4.39106 | 82.5812 | 4.16111 | 81.3115 | 4.05556 | 86.3882 | 3.8 |
| Letter | 98.4021 | 4.5 | 93.4477 | 4.28125 | 96.0383 | 4.45312 | 92.0952 | 4.09375 |
| Engineering document | 93.9244 | 4.04412 | 77.4854 | 3.72059 | 80.3319 | 3.88235 | 79.6807 | 3.42647 |
| Legal document | 96.689 | 4.27759 | 86.9769 | 3.87584 | 91.601 | 4.20805 | 87.8383 | 3.65552 |
| Newspaper page | 98.8733 | 4.25806 | 84.7492 | 3.90323 | 96.9963 | 4.45161 | 92.6496 | 3.51613 |
| Magazine page | 98.2145 | 4.38776 | 87.2902 | 3.97959 | 93.5934 | 4.16327 | 93.0892 | 4.02041 |

Peak GPU memory usage during the benchmark is `6GB` for marker. Benchmarks were run on an A10.
## Throughput

**Throughput**
We benchmarked throughput using a [single long PDF](https://www.greenteapress.com/thinkpython/thinkpython.pdf).

Marker takes about 6GB of VRAM on average per task, so you can convert 8 documents in parallel on an A6000.
| Method | Time per page | Time per document | VRAM used |
|---------|---------------|-------------------|-----------|
| marker | 0.18 | 43.42 | 3.17GB |

![Benchmark results](data/images/per_doc.png)
The projected throughput is 122 pages per second on an H100, since we can run 22 individual processes given the VRAM used.
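The projection follows from simple arithmetic over the measured numbers above (0.18 s/page and ~3.17 GB VRAM per process are from the table; the 80 GB H100 capacity is an assumption):

```python
# Back-of-the-envelope projection from the measured per-process numbers.
# Assumes an 80 GB H100; the benchmark uses 22 processes, which leaves
# some VRAM headroom below the naive 80 / 3.17 ceiling.
time_per_page = 0.18      # seconds per page, single process (measured)
vram_per_process = 3.17   # GB per process (measured)
h100_vram = 80            # GB, assumed card capacity

max_processes = int(h100_vram // vram_per_process)  # naive ceiling
processes = 22                                      # value used above
pages_per_second = processes / time_per_page

print(max_processes, round(pages_per_second, 1))
```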

## Table Conversion

Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a tree edit distance based metric to judge both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:
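The benchmark's tree-edit-distance comparison isn't reproduced here. As a hedged illustration of the underlying idea (a toy metric with unit costs, not the benchmark's code; real table scoring such as TEDS also weighs cell text and attributes), an ordered tree edit distance can be written recursively over the rightmost roots:

```python
# Toy ordered-tree edit distance, illustrating the kind of tree-comparison
# metric described above. Trees are (label, children) tuples, children a
# tuple. NOT the benchmark's implementation.
from functools import lru_cache

def tree_size(t):
    label, children = t
    return 1 + sum(tree_size(c) for c in children)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(tree_size(t) for t in f2)
    if not f2:
        return sum(tree_size(t) for t in f1)
    a, b = f1[-1], f2[-1]
    return min(
        forest_dist(f1[:-1] + a[1], f2) + 1,   # delete rightmost root of f1
        forest_dist(f1, f2[:-1] + b[1]) + 1,   # insert rightmost root of f2
        forest_dist(a[1], b[1])                # match roots: compare subtrees,
        + forest_dist(f1[:-1], f2[:-1])        # then the remaining forests,
        + (1 if a[0] != b[0] else 0),          # plus a relabel cost
    )

def tree_similarity(t1, t2):
    """1.0 for identical trees, scaled by the larger tree's node count."""
    dist = forest_dist((t1,), (t2,))
    return 1.0 - dist / max(tree_size(t1), tree_size(t2))
```

For example, a `<table><tr><td></td><td></td></tr></table>` structure compared against the same table with one cell dropped differs by a single node, so the similarity is scaled down accordingly.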

| Avg score | Total tables | use_llm |
|-----------|--------------|---------|
| 0.822 | 54 | False |
| 0.887 | 54 | True |
| Method | Avg score | Total tables |
|------------------|-----------|--------------|
| marker | 0.816 | 99 |
| marker w/use_llm | 0.907 | 99 |
| gemini | 0.829 | 99 |

As the results show, the `--use_llm` flag significantly improves table recognition performance.

@@ -426,16 +434,49 @@ poetry install
Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run the overall benchmark like this:

```shell
python benchmarks/overall.py data/pdfs data/references report.json
python benchmarks/overall.py --methods marker --scores heuristic,llm
```

Options:

- `--use_llm` use an llm to improve the marker results.
- `--max_rows` how many rows to process for the benchmark.
- `--methods` can be `llamaparse`, `mathpix`, `docling`, `marker`. Comma separated.
- `--scores` which scoring functions to use, can be `llm`, `heuristic`. Comma separated.

### Table Conversion
The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:

```shell
python benchmarks/table/table.py table_report.json --max_rows 1000
python benchmarks/table/table.py --max_rows 100
```

Options:

- `--use_llm` uses an llm with marker to improve accuracy.
- `--use_gemini` also benchmarks gemini 2.0 flash.

# How it works

Marker is a pipeline of deep learning models:

- Extract text, OCR if necessary (heuristics, [surya](https://github.com/VikParuchuri/surya))
- Detect page layout and find reading order ([surya](https://github.com/VikParuchuri/surya))
- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify), [surya](https://github.com/VikParuchuri/surya))
- Optionally use an LLM to improve quality
- Combine blocks and postprocess complete text

It only uses models where necessary, which improves speed and accuracy.

# Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

- Marker will only convert block equations
- Very complex layouts, with nested tables and forms, may not work

Note: Passing the `--use_llm` flag will mostly solve these issues.

# Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):
@@ -445,4 +486,4 @@ This work would not have been possible without amazing open source models and datasets
- Pypdfium2/pdfium
- DocLayNet from IBM

Thank you to the authors of these models and datasets for making them available to the community!
Thank you to the authors of these models and datasets for making them available to the community!
Empty file added benchmarks/__init__.py
Empty file.