Merge branch 'master' into block-display-tool-pr

VikParuchuri · Jan 27, 2025 · a5865b4
2 parents 01626db + 6d58e82

Showing 194 changed files with 146,544 additions and 46,420 deletions.
diff --git a/.github/workflows/scripts.yml b/.github/workflows/scripts.yml
@@ -0,0 +1,31 @@
+name: Test CLI scripts
+
+on: [push]
+
+env:
+  TORCH_DEVICE: "cpu"
+  OCR_ENGINE: "surya"
+
+jobs:
+  tests:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python 3.11
+        uses: actions/setup-python@v4
+        with:
+          python-version: 3.11
+      - name: Install python dependencies
+        run: |
+          pip install poetry
+          poetry install
+      - name: Download benchmark data
+        run: |
+          wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=1NHrdYatR1rtqs2gPVfdvO0BAvocH8CJi"
+          unzip -o benchmark_data.zip
+      - name: Test single script
+        run: poetry run marker_single benchmark_data/pdfs/switch_trans.pdf --page_range 0
+      - name: Test convert script
+        run: poetry run marker benchmark_data/pdfs --max_files 1 --workers 1 --page_range 0
+      - name: Test convert script multiple workers
+        run: poetry run marker benchmark_data/pdfs --max_files 2 --workers 2 --page_range 0-5
diff --git a/README.md b/README.md
@@ -1,13 +1,11 @@
 # Marker
 
-Marker converts PDFs to markdown, JSON, and HTML quickly and accurately.
+Marker converts PDFs and images to markdown, JSON, and HTML quickly and accurately.
 
-- Supports a wide range of documents
-- Supports all languages
-- Removes headers/footers/other artifacts
-- Formats tables, forms, and code blocks
+- Supports a range of documents in all languages
+- Formats tables, forms, equations, links, references, and code blocks
 - Extracts and saves images along with the markdown
-- Converts equations to latex
+- Removes headers/footers/other artifacts
 - Easily extensible with your own formatting and logic
 - Optionally boost accuracy with an LLM
 - Works on GPU, CPU, or MPS
@@ -18,7 +16,7 @@ Marker is a pipeline of deep learning models:
 
 - Extract text, OCR if necessary (heuristics, [surya](https://github.com/VikParuchuri/surya))
 - Detect page layout and find reading order ([surya](https://github.com/VikParuchuri/surya))
-- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify). [tabled](https://github.com/VikParuchuri/tabled))
+- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify), [surya](https://github.com/VikParuchuri/surya))
 - Optionally use an LLM to improve quality
 - Combine blocks and postprocess complete text
 
@@ -63,11 +61,11 @@ There's a hosted API for marker available [here](https://www.datalab.to/):
 PDF is a tricky format, so marker will not always work perfectly.  Here are some known limitations that are on the roadmap to address:
 
 - Marker will only convert block equations
-- Tables are not always formatted 100% correctly - multiline cells are sometimes split into multiple rows.
+- Tables are not always formatted 100% correctly
 - Forms are not converted optimally
 - Very complex layouts, with nested tables and forms, may not work
 
-Note: Passing the `--use_llm` flag will mostly solve all of these issues.
+Note: Passing the `--use_llm` flag will mostly solve these issues.
 
 # Installation
 
@@ -84,7 +82,7 @@ pip install marker-pdf
 First, some configuration:
 
 - Your torch device will be automatically detected, but you can override this.  For example, `TORCH_DEVICE=cuda`.
-- Some PDFs, even digital ones, have bad text in them.  Set the `force_ocr` flag on the CLI or via configuration to ensure your PDF runs through OCR.
+- Some PDFs, even digital ones, have bad text in them.  Set the `force_ocr` flag on the CLI or via configuration to ensure your PDF runs through OCR, or set the `strip_existing_ocr` flag to keep all digital text and strip out only existing OCR text.
 
 ## Interactive App
 
@@ -101,9 +99,12 @@ marker_gui
 marker_single /path/to/file.pdf
 ```
 
+You can pass in PDFs or images.
+
 Options:
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
 - `--output_format [markdown|json|html]`: Specify the format for the output results.
+- `--paginate_output`: Paginates the output, separating pages with `\n\n{PAGE_NUMBER}`, followed by `-` * 48, then `\n\n`.
 - `--use_llm`: Uses an LLM to improve accuracy.  You must set your Gemini API key using the `GOOGLE_API_KEY` env var.
 - `--disable_image_extraction`: Don't extract images from the PDF.  If you also specify `--use_llm`, then images will be replaced with a description.
 - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
@@ -114,8 +115,9 @@ Options:
 - `--config_json PATH`: Path to a JSON configuration file containing additional settings.
 - `--languages TEXT`: Optionally specify which languages to use for OCR processing. Accepts a comma-separated list. Example: `--languages "en,fr,de"` for English, French, and German.
 - `config --help`: List all available builders, processors, and converters, and their associated configuration.  These values can be used to build a JSON configuration file for additional tweaking of marker defaults.
+- `--converter_cls`: One of `marker.converters.pdf.PdfConverter` (default) or `marker.converters.table.TableConverter`.  The `PdfConverter` converts the whole PDF; the `TableConverter` only extracts and converts tables.
 
-The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py).  If you don't need OCR, marker can work with any language.
+The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/recognition/languages.py).  If you don't need OCR, marker can work with any language.
 
 ## Convert multiple files
 
@@ -179,7 +181,7 @@ rendered = converter("FILEPATH")
 
 ### Extract blocks
 
-Each document consists of one or more pages.  Pages contain blocks, which can themselves contain other blocks.  It's possible to programatically manipulate these blocks.  
+Each document consists of one or more pages.  Pages contain blocks, which can themselves contain other blocks.  It's possible to programmatically manipulate these blocks.  
 
 Here's an example of extracting all forms from a document:
 
@@ -197,6 +199,33 @@ forms = document.contained_blocks((BlockTypes.Form,))
 
 Look at the processors for more examples of extracting and manipulating blocks.
 
+## Other converters
+
+You can also use other converters that define different conversion pipelines:
+
+### Extract tables
+
+The `TableConverter` will only convert and extract tables:
+
+```python
+from marker.converters.table import TableConverter
+from marker.models import create_model_dict
+from marker.output import text_from_rendered
+
+converter = TableConverter(
+    artifact_dict=create_model_dict(),
+)
+rendered = converter("FILEPATH")
+text, _, images = text_from_rendered(rendered)
+```
+
+This takes all the same configuration as the PdfConverter.  You can specify the configuration `--force_layout_block=Table` to avoid layout detection and instead assume every page is a table.
+
+You can also run this via the CLI with:
+```shell
+python convert_single.py FILENAME --use_llm --force_layout_block Table --converter_cls marker.converters.table.TableConverter
+```
+
 # Output Formats
 
 ## Markdown
@@ -348,7 +377,7 @@ There are some settings that you may find useful if things aren't working the wa
 Pass the `debug` option to activate debug mode.  This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.
 
 # Benchmarks
-
+## Overall PDF Conversion
 Benchmarking PDF extraction quality is hard.  I've created a test set by finding books and scientific papers that have a pdf version and a latex source.  I convert the latex to text, and compare the reference to the output of text extraction methods.  It's noisy, but at least directionally correct.
 
 **Speed**
@@ -371,6 +400,18 @@ Marker takes about 6GB of VRAM on average per task, so you can convert 8 documen
 
 ![Benchmark results](data/images/per_doc.png)
 
+## Table Conversion
+Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. Table extraction performance is measured by comparing the extracted HTML representation of each table against the original HTML representation, using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared with a metric based on tree edit distance, which judges both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:
+
+| Avg score | Total tables | use_llm |
+|-----------|--------------|---------|
+| 0.822     | 54           | False   |
+| 0.887     | 54           | True    |
+
+The `--use_llm` flag can significantly improve table recognition performance: on these 54 tables, it raises the average score from 0.822 to 0.887.
+
+We filter out tables that we cannot align with the ground truth, since FinTabNet and our layout model have slightly different detection methods (this results in some tables being split or merged).
+
 ## Running your own benchmarks
 
 You can benchmark the performance of marker on your machine. Install marker manually with:
@@ -380,12 +421,21 @@ git clone https://github.com/VikParuchuri/marker.git
 poetry install
 ```
 
+### Overall PDF Conversion
+
 Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run the overall benchmark like this:
 
 ```shell
 python benchmarks/overall.py data/pdfs data/references report.json
 ```
 
+### Table Conversion
+The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
+
+```shell
+python benchmarks/table/table.py table_report.json --max_rows 1000
+```
+
 # Thanks
 
 This work would not have been possible without amazing open source models and datasets, including (but not limited to):
@@ -395,4 +445,4 @@ This work would not have been possible without amazing open source models and da
 - Pypdfium2/pdfium
 - DocLayNet from IBM
 
-Thank you to the authors of these models and datasets for making them available to the community!
+Thank you to the authors of these models and datasets for making them available to the community!
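The README hunk above documents `--page_range` as accepting comma-separated page numbers and ranges, for example `"0,5-10,20"`. The following is a minimal sketch of how such a string could be expanded into page indices. The helper name `parse_page_range` is hypothetical and is not taken from marker; the project's own parser may differ in validation and ordering.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec like "0,5-10,20" into a sorted list of page indices.

    Hypothetical helper for illustration only; not marker's implementation.
    Ranges are treated as inclusive, matching the README's "5 through 10".
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))
```

Under these assumptions, `"0,5-10,20"` expands to pages 0, 5 through 10, and 20, which matches the behavior the README describes.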
diff --git a/benchmarks/table/gemini.py b/benchmarks/table/gemini.py
@@ -0,0 +1,49 @@
+import json
+from PIL import Image
+import google.generativeai as genai
+from google.ai.generativelanguage_v1beta.types import content
+from marker.settings import settings
+
+prompt = """
+You're an expert document analyst who is good at turning tables in documents into HTML.  Analyze the provided image, and convert it to a faithful HTML representation.
+ 
+Guidelines:
+- Keep the HTML simple and concise.
+- Only include the <table> tag and contents.
+- Only use <table>, <tr>, and <td> tags.  Only use the colspan and rowspan attributes if necessary.  Do not use <tbody>, <thead>, or <th> tags.
+- Make sure the table is as faithful to the image as possible with the given tags.
+
+**Instructions**
+1. Analyze the image, and determine the table structure.
+2. Convert the table image to HTML, following the guidelines above.
+3. Output only the HTML for the table, starting with the <table> tag and ending with the </table> tag.
+""".strip()
+
+genai.configure(api_key=settings.GOOGLE_API_KEY)
+
+def gemini_table_rec(image: Image.Image):
+    schema = content.Schema(
+        type=content.Type.OBJECT,
+        required=["table_html"],
+        properties={
+            "table_html": content.Schema(
+                type=content.Type.STRING,
+            )
+        }
+    )
+
+    model = genai.GenerativeModel("gemini-1.5-flash")
+
+    responses = model.generate_content(
+        [image, prompt],  # According to gemini docs, it performs better if the image is the first element
+        stream=False,
+        generation_config={
+            "temperature": 0,
+            "response_schema": schema,
+            "response_mime_type": "application/json",
+        },
+        request_options={'timeout': 60}
+    )
+
+    output = responses.candidates[0].content.parts[0].text
+    return json.loads(output)["table_html"]
diff --git a/benchmarks/table/scoring.py b/benchmarks/table/scoring.py
@@ -0,0 +1,109 @@
+"""
+TEDS Code Adapted from https://github.com/ibm-aur-nlp/EDD
+"""
+
+import distance
+from apted import APTED, Config
+from apted.helpers import Tree
+from lxml import html
+from collections import deque
+
+def wrap_table_html(table_html:str)->str:
+    return f'<html><body>{table_html}</body></html>'
+
+class TableTree(Tree):
+    def __init__(self, tag, colspan=None, rowspan=None, content=None, *children):
+        self.tag = tag
+        self.colspan = colspan
+        self.rowspan = rowspan
+        self.content = content
+
+        # Sets self.name and self.children
+        super().__init__(tag, *children)
+
+    def bracket(self):
+        """Show tree using brackets notation"""
+        if self.tag == 'td':
+            result = '"tag": %s, "colspan": %d, "rowspan": %d, "text": %s' % \
+                     (self.tag, self.colspan, self.rowspan, self.content)
+        else:
+            result = '"tag": %s' % self.tag
+        for child in self.children:
+            result += child.bracket()
+        return "{{{}}}".format(result)
+
+class CustomConfig(Config):
+    @staticmethod
+    def maximum(*sequences):
+        return max(map(len, sequences))
+
+    def normalized_distance(self, *sequences):
+        return float(distance.levenshtein(*sequences)) / self.maximum(*sequences)
+
+    def rename(self, node1, node2):
+        if (node1.tag != node2.tag) or (node1.colspan != node2.colspan) or (node1.rowspan != node2.rowspan):
+            return 1.
+        if node1.tag == 'td':
+            if node1.content or node2.content:
+                return self.normalized_distance(node1.content, node2.content)
+        return 0.
+
+def tokenize(node):
+    """
+    Tokenizes table cells
+    """
+    global __tokens__
+    __tokens__.append('<%s>' % node.tag)
+    if node.text is not None:
+        __tokens__ += list(node.text)
+    for n in node.getchildren():
+        tokenize(n)
+    if node.tag != 'unk':
+        __tokens__.append('</%s>' % node.tag)
+    if node.tag != 'td' and node.tail is not None:
+        __tokens__ += list(node.tail)
+
+def tree_convert_html(node, convert_cell=False, parent=None):
+    """
+    Converts HTML tree to the format required by apted
+    """
+    global __tokens__
+    if node.tag == 'td':
+        if convert_cell:
+            __tokens__ = []
+            tokenize(node)
+            cell = __tokens__[1:-1].copy()
+        else:
+            cell = []
+        new_node = TableTree(node.tag,
+                             int(node.attrib.get('colspan', '1')),
+                             int(node.attrib.get('rowspan', '1')),
+                             cell, *deque())
+    else:
+        new_node = TableTree(node.tag, None, None, None, *deque())
+    if parent is not None:
+        parent.children.append(new_node)
+    if node.tag != 'td':
+        for n in node.getchildren():
+            tree_convert_html(n, convert_cell, new_node)
+    if parent is None:
+        return new_node
+
+def similarity_eval_html(pred, true, structure_only=False):
+    """
+    Computes TEDS score between the prediction and the ground truth of a given sample
+    """
+    pred, true = html.fromstring(pred), html.fromstring(true)
+    if pred.xpath('body/table') and true.xpath('body/table'):
+        pred = pred.xpath('body/table')[0]
+        true = true.xpath('body/table')[0]
+        n_nodes_pred = len(pred.xpath(".//*"))
+        n_nodes_true = len(true.xpath(".//*"))
+        tree_pred = tree_convert_html(pred, convert_cell=not structure_only)
+        tree_true = tree_convert_html(true, convert_cell=not structure_only)
+        n_nodes = max(n_nodes_pred, n_nodes_true)
+        edit_distance = APTED(tree_pred, tree_true, CustomConfig()).compute_edit_distance()
+        return 1.0 - (float(edit_distance) / n_nodes)
+    else:
+        return 0.0
+
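To make the scoring in `benchmarks/table/scoring.py` easier to follow, here is a small standalone sketch of its two numeric pieces: the per-node rename cost used by `CustomConfig.rename`, and the final normalization in `similarity_eval_html`. It is an illustration under assumptions, not the benchmark code: the `distance` and `apted` packages are replaced by a plain-Python Levenshtein function, the tree edit distance is passed in as a number instead of being computed, and the names `levenshtein`, `rename_cost`, and `teds_score` are hypothetical.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (stand-in for distance.levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def rename_cost(tag1, spans1, content1, tag2, spans2, content2) -> float:
    """Cost of mapping one node onto another, mirroring CustomConfig.rename.

    spans1/spans2 are (colspan, rowspan) tuples; content is a token list.
    """
    if tag1 != tag2 or spans1 != spans2:
        return 1.0  # different tag or span: full cost
    if tag1 == "td" and (content1 or content2):
        # Same cell shape: cost is the normalized edit distance of the content.
        return levenshtein(content1, content2) / max(len(content1), len(content2))
    return 0.0


def teds_score(edit_distance: float, n_nodes_pred: int, n_nodes_true: int) -> float:
    """Normalize a tree edit distance by the larger tree's node count."""
    return 1.0 - edit_distance / max(n_nodes_pred, n_nodes_true)


# Two cells with identical spans whose text differs in one of three characters.
print(rename_cost("td", (1, 1), list("abc"), "td", (1, 1), list("abd")))
# A tree edit distance of 2 between trees with 8 and 6 nodes.
print(teds_score(2.0, 8, 6))
```

A score of 1.0 means the predicted and reference trees are identical, and the score falls as the edit distance grows relative to the larger tree.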