Merge branch 'master' into block-display-tool-pr
jazzido authored Jan 27, 2025
2 parents 01626db + 6d58e82 commit a5865b4
Showing 194 changed files with 146,544 additions and 46,420 deletions.
31 changes: 31 additions & 0 deletions .github/workflows/scripts.yml
@@ -0,0 +1,31 @@
name: Test CLI scripts

on: [push]

env:
  TORCH_DEVICE: "cpu"
  OCR_ENGINE: "surya"

jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.11
        uses: actions/setup-python@v4
        with:
          python-version: 3.11
      - name: Install python dependencies
        run: |
          pip install poetry
          poetry install
      - name: Download benchmark data
        run: |
          wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=1NHrdYatR1rtqs2gPVfdvO0BAvocH8CJi"
          unzip -o benchmark_data.zip
      - name: Test single script
        run: poetry run marker_single benchmark_data/pdfs/switch_trans.pdf --page_range 0
      - name: Test convert script
        run: poetry run marker benchmark_data/pdfs --max_files 1 --workers 1 --page_range 0
      - name: Test convert script multiple workers
        run: poetry run marker benchmark_data/pdfs --max_files 2 --workers 2 --page_range 0-5
78 changes: 64 additions & 14 deletions README.md
@@ -1,13 +1,11 @@
# Marker

Marker converts PDFs to markdown, JSON, and HTML quickly and accurately.
Marker converts PDFs and images to markdown, JSON, and HTML quickly and accurately.

- Supports a wide range of documents
- Supports all languages
- Removes headers/footers/other artifacts
- Formats tables, forms, and code blocks
- Supports a range of documents in all languages
- Formats tables, forms, equations, links, references, and code blocks
- Extracts and saves images along with the markdown
- Converts equations to latex
- Removes headers/footers/other artifacts
- Easily extensible with your own formatting and logic
- Optionally boost accuracy with an LLM
- Works on GPU, CPU, or MPS
@@ -18,7 +16,7 @@ Marker is a pipeline of deep learning models:

- Extract text, OCR if necessary (heuristics, [surya](https://github.com/VikParuchuri/surya))
- Detect page layout and find reading order ([surya](https://github.com/VikParuchuri/surya))
- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify). [tabled](https://github.com/VikParuchuri/tabled))
- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify), [surya](https://github.com/VikParuchuri/surya))
- Optionally use an LLM to improve quality
- Combine blocks and postprocess complete text

@@ -63,11 +61,11 @@ There's a hosted API for marker available [here](https://www.datalab.to/):
PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

- Marker will only convert block equations
- Tables are not always formatted 100% correctly - multiline cells are sometimes split into multiple rows.
- Tables are not always formatted 100% correctly
- Forms are not converted optimally
- Very complex layouts, with nested tables and forms, may not work

Note: Passing the `--use_llm` flag will mostly solve all of these issues.
Note: Passing the `--use_llm` flag will mostly solve these issues.

# Installation

@@ -84,7 +82,7 @@ pip install marker-pdf
First, some configuration:

- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
- Some PDFs, even digital ones, have bad text in them. Set the `force_ocr` flag on the CLI or via configuration to ensure your PDF runs through OCR.
- Some PDFs, even digital ones, have bad text in them. Set the `force_ocr` flag on the CLI or via configuration to ensure your PDF runs through OCR, or set the `strip_existing_ocr` flag to keep all digital text and strip out only existing OCR text.

## Interactive App

@@ -101,9 +99,12 @@ marker_gui
marker_single /path/to/file.pdf
```

You can pass in PDFs or images.

Options:
- `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
- `--output_format [markdown|json|html]`: Specify the format for the output results.
- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
- `--use_llm`: Uses an LLM to improve accuracy. You must set your Gemini API key using the `GOOGLE_API_KEY` env var.
- `--disable_image_extraction`: Don't extract images from the PDF. If you also specify `--use_llm`, then images will be replaced with a description.
- `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
@@ -114,8 +115,9 @@ Options:
- `--config_json PATH`: Path to a JSON configuration file containing additional settings.
- `--languages TEXT`: Optionally specify which languages to use for OCR processing. Accepts a comma-separated list. Example: `--languages "en,fr,de"` for English, French, and German.
- `config --help`: List all available builders, processors, and converters, and their associated configuration. These values can be used to build a JSON configuration file for additional tweaking of marker defaults.
- `--converter_cls`: One of `marker.converters.pdf.PdfConverter` (default) or `marker.converters.table.TableConverter`. The `PdfConverter` will convert the whole PDF, the `TableConverter` will only extract and convert tables.
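
To make the `--page_range` syntax concrete, here is a minimal sketch of how a string such as `"0,5-10,20"` expands into page indices. `parse_page_range` is a hypothetical helper written for this illustration, not marker's own parser. It assumes 0-indexed pages and inclusive ranges, as in the example above.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec like "0,5-10,20" into a sorted list of page indices.

    Hypothetical helper for illustration only; marker's real parsing
    may differ in validation and edge-case handling.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))
```

Collecting the indices in a set means overlapping entries (for example `"0-3,2-5"`) collapse to each page once.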

The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py). If you don't need OCR, marker can work with any language.
The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/recognition/languages.py). If you don't need OCR, marker can work with any language.

## Convert multiple files

@@ -179,7 +181,7 @@ rendered = converter("FILEPATH")

### Extract blocks

Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programatically manipulate these blocks.
Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programmatically manipulate these blocks.

Here's an example of extracting all forms from a document:

@@ -197,6 +199,33 @@ forms = document.contained_blocks((BlockTypes.Form,))

Look at the processors for more examples of extracting and manipulating blocks.

## Other converters

You can also use other converters that define different conversion pipelines:

### Extract tables

The `TableConverter` will only convert and extract tables:

```python
from marker.converters.table import TableConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

converter = TableConverter(
    artifact_dict=create_model_dict(),
)
rendered = converter("FILEPATH")
text, _, images = text_from_rendered(rendered)
```

This takes all the same configuration as the PdfConverter. You can specify the configuration `--force_layout_block=Table` to avoid layout detection and instead assume every page is a table.

You can also run this via the CLI with
```shell
python convert_single.py FILENAME --use_llm --force_layout_block Table --converter_cls marker.converters.table.TableConverter
```

# Output Formats

## Markdown
@@ -348,7 +377,7 @@ There are some settings that you may find useful if things aren't working the wa
Pass the `debug` option to activate debug mode. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.

# Benchmarks

## Overall PDF Conversion
Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.

**Speed**
@@ -371,6 +400,18 @@ Marker takes about 6GB of VRAM on average per task, so you can convert 8 documen

![Benchmark results](data/images/per_doc.png)

## Table Conversion
Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a tree edit distance based metric to judge both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:

| Avg score | Total tables | use_llm |
|-----------|--------------|---------|
| 0.822 | 54 | False |
| 0.887 | 54 | True |

The `--use_llm` flag can significantly improve table recognition performance, as you can see.

We filter out tables that we cannot align with the ground truth, since fintabnet and our layout model have slightly different detection methods (this results in some tables being split/merged).
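
For intuition, the score is the tree edit distance between the predicted and reference tables, normalized by the node count of the larger tree and subtracted from 1. The sketch below assumes the edit distance has already been computed. `teds_score` and the numbers passed to it are illustrative only, not benchmark values, and the zero-node guard is an assumption added for safety.

```python
def teds_score(edit_distance: float, n_nodes_pred: int, n_nodes_true: int) -> float:
    """Turn a tree edit distance into a similarity score (1.0 = identical)."""
    n_nodes = max(n_nodes_pred, n_nodes_true)
    if n_nodes == 0:
        # Assumption for this sketch: treat two empty trees as a non-match.
        return 0.0
    return 1.0 - float(edit_distance) / n_nodes


# Illustrative numbers: 2 edits against trees of 10 and 8 nodes.
print(teds_score(2.0, 10, 8))
```

Normalizing by the larger tree keeps the score comparable across tables of different sizes, so a single wrong cell costs more in a small table than in a large one.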

## Running your own benchmarks

You can benchmark the performance of marker on your machine. Install marker manually with:
@@ -380,12 +421,21 @@ git clone https://github.com/VikParuchuri/marker.git
poetry install
```

### Overall PDF Conversion

Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run the overall benchmark like this:

```shell
python benchmarks/overall.py data/pdfs data/references report.json
```

### Table Conversion
The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:

```shell
python benchmarks/table/table.py table_report.json --max_rows 1000
```

# Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):
@@ -395,4 +445,4 @@ This work would not have been possible without amazing open source models and da
- Pypdfium2/pdfium
- DocLayNet from IBM

Thank you to the authors of these models and datasets for making them available to the community!
Thank you to the authors of these models and datasets for making them available to the community!
49 changes: 49 additions & 0 deletions benchmarks/table/gemini.py
@@ -0,0 +1,49 @@
import json
from PIL import Image
import google.generativeai as genai
from google.ai.generativelanguage_v1beta.types import content
from marker.settings import settings

prompt = """
You're an expert document analyst who is good at turning tables in documents into HTML. Analyze the provided image, and convert it to a faithful HTML representation.
Guidelines:
- Keep the HTML simple and concise.
- Only include the <table> tag and contents.
- Only use <table>, <tr>, and <td> tags. Only use the colspan and rowspan attributes if necessary. Do not use <tbody>, <thead>, or <th> tags.
- Make sure the table is as faithful to the image as possible with the given tags.
**Instructions**
1. Analyze the image, and determine the table structure.
2. Convert the table image to HTML, following the guidelines above.
3. Output only the HTML for the table, starting with the <table> tag and ending with the </table> tag.
""".strip()

genai.configure(api_key=settings.GOOGLE_API_KEY)

def gemini_table_rec(image: Image.Image):
    schema = content.Schema(
        type=content.Type.OBJECT,
        required=["table_html"],
        properties={
            "table_html": content.Schema(
                type=content.Type.STRING,
            )
        }
    )

    model = genai.GenerativeModel("gemini-1.5-flash")

    responses = model.generate_content(
        [image, prompt],  # According to gemini docs, it performs better if the image is the first element
        stream=False,
        generation_config={
            "temperature": 0,
            "response_schema": schema,
            "response_mime_type": "application/json",
        },
        request_options={'timeout': 60}
    )

    output = responses.candidates[0].content.parts[0].text
    return json.loads(output)["table_html"]
109 changes: 109 additions & 0 deletions benchmarks/table/scoring.py
@@ -0,0 +1,109 @@
"""
TEDS Code Adapted from https://github.com/ibm-aur-nlp/EDD
"""

import distance
from apted import APTED, Config
from apted.helpers import Tree
from lxml import html
from collections import deque

def wrap_table_html(table_html:str)->str:
    return f'<html><body>{table_html}</body></html>'

class TableTree(Tree):
    def __init__(self, tag, colspan=None, rowspan=None, content=None, *children):
        self.tag = tag
        self.colspan = colspan
        self.rowspan = rowspan
        self.content = content

        # Sets self.name and self.children
        super().__init__(tag, *children)

    def bracket(self):
        """Show tree using brackets notation"""
        if self.tag == 'td':
            result = '"tag": %s, "colspan": %d, "rowspan": %d, "text": %s' % \
                     (self.tag, self.colspan, self.rowspan, self.content)
        else:
            result = '"tag": %s' % self.tag
        for child in self.children:
            result += child.bracket()
        return "{{{}}}".format(result)

class CustomConfig(Config):
    @staticmethod
    def maximum(*sequences):
        return max(map(len, sequences))

    def normalized_distance(self, *sequences):
        return float(distance.levenshtein(*sequences)) / self.maximum(*sequences)

    def rename(self, node1, node2):
        if (node1.tag != node2.tag) or (node1.colspan != node2.colspan) or (node1.rowspan != node2.rowspan):
            return 1.
        if node1.tag == 'td':
            if node1.content or node2.content:
                return self.normalized_distance(node1.content, node2.content)
        return 0.

def tokenize(node):
    """
    Tokenizes table cells
    """
    global __tokens__
    __tokens__.append('<%s>' % node.tag)
    if node.text is not None:
        __tokens__ += list(node.text)
    for n in node.getchildren():
        tokenize(n)
    if node.tag != 'unk':
        __tokens__.append('</%s>' % node.tag)
    if node.tag != 'td' and node.tail is not None:
        __tokens__ += list(node.tail)

def tree_convert_html(node, convert_cell=False, parent=None):
    """
    Converts HTML tree to the format required by apted
    """
    global __tokens__
    if node.tag == 'td':
        if convert_cell:
            __tokens__ = []
            tokenize(node)
            cell = __tokens__[1:-1].copy()
        else:
            cell = []
        new_node = TableTree(node.tag,
                             int(node.attrib.get('colspan', '1')),
                             int(node.attrib.get('rowspan', '1')),
                             cell, *deque())
    else:
        new_node = TableTree(node.tag, None, None, None, *deque())
    if parent is not None:
        parent.children.append(new_node)
    if node.tag != 'td':
        for n in node.getchildren():
            tree_convert_html(n, convert_cell, new_node)
    if parent is None:
        return new_node

def similarity_eval_html(pred, true, structure_only=False):
    """
    Computes TEDS score between the prediction and the ground truth of a given sample
    """
    pred, true = html.fromstring(pred), html.fromstring(true)
    if pred.xpath('body/table') and true.xpath('body/table'):
        pred = pred.xpath('body/table')[0]
        true = true.xpath('body/table')[0]
        n_nodes_pred = len(pred.xpath(".//*"))
        n_nodes_true = len(true.xpath(".//*"))
        tree_pred = tree_convert_html(pred, convert_cell=not structure_only)
        tree_true = tree_convert_html(true, convert_cell=not structure_only)
        n_nodes = max(n_nodes_pred, n_nodes_true)
        distance = APTED(tree_pred, tree_true, CustomConfig()).compute_edit_distance()
        return 1.0 - (float(distance) / n_nodes)
    else:
        return 0.0
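
The cell-level cost used by `rename` in `CustomConfig` is a Levenshtein distance divided by the length of the longer token sequence. The sketch below is a dependency-free stand-in for illustration: it does not call the `distance` package, and `levenshtein` and `normalized_distance` here are standalone functions written for this example. The empty-input guard is an assumption; the code above avoids that case by only comparing cells when at least one has content.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence, b: Sequence) -> float:
    """Edit distance scaled to [0, 1] by the longer sequence's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty cells are identical
    return levenshtein(a, b) / longest


# Cell contents are token lists, here one token per character.
print(normalized_distance(list("Revenue"), list("Revenues")))
```

A cost of 0 means the two cells match exactly and 1 means nothing aligns, which keeps a cell mismatch on the same scale as the flat cost of 1 charged for a tag or span mismatch.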
