VikParuchuri · May 9, 2024 · May 8, 2024
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -16,8 +16,6 @@ jobs:
         run: |
           pip install poetry
           poetry install
-          poetry remove torch
-          poetry run pip install torch --index-url https://download.pytorch.org/whl/cpu
       - name: Build package
         run: |
           poetry build

diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -23,8 +23,8 @@ jobs:
           poetry run pip install torch --index-url https://download.pytorch.org/whl/cpu
       - name: Download benchmark data
         run: |
-          wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=1ktVDYPEeyHlKLaF56FnHjI5VjVnYa1xL"
-          unzip benchmark_data.zip
+          wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=1NHrdYatR1rtqs2gPVfdvO0BAvocH8CJi"
+          unzip -o benchmark_data.zip
       - name: Run benchmark test
         run: |
           poetry run python benchmark.py benchmark_data/pdfs benchmark_data/references report.json

diff --git a/README.md b/README.md
@@ -1,12 +1,13 @@
 # Marker
 
-Marker converts PDF to markdown.  It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.
+Marker converts PDF to markdown quickly and accurately.
 
-- Support for a range of documents (optimized for books and scientific papers)
+- Supports a wide range of documents (optimized for books and scientific papers)
+- Supports all languages
 - Removes headers/footers/other artifacts
-- Converts most equations to latex
 - Formats tables and code blocks
-- Support for all languages (although most testing is done in English).
+- Extracts and saves images along with the markdown
+- Converts most equations to latex
 - Works on GPU, CPU, or MPS
 
 ## How it works
@@ -34,7 +35,7 @@ It only uses models where necessary, which improves speed and accuracy.
 
 ![Benchmark overall](data/images/overall.png)
 
-The above results are with marker and nougat setup so they each take ~3GB of VRAM on an A6000.
+The above results are with marker and nougat set up so they each take ~4GB of VRAM on an A6000.
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
@@ -46,47 +47,35 @@ See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instruc
 
 PDF is a tricky format, so marker will not always work perfectly.  Here are some known limitations that are on the roadmap to address:
 
-- Marker will not convert 100% of equations to LaTeX.  This is because it has to first detect equations, then convert them.
+- Marker will not convert 100% of equations to LaTeX.  This is because it has to detect equations first, then convert them.
 - Whitespace and indentations are not always respected.
 - Not all lines/spans will be joined properly.
 - This works best on digital PDFs that won't require a lot of OCR.  It's optimized for speed, and limited OCR is used to fix errors.
 
 # Installation
 
-This has been tested on Mac and Linux (Ubuntu and Debian).  You'll need python 3.9+ and [poetry](https://python-poetry.org/docs/#installing-with-the-official-installer).
-
-First, clone the repo:
+This has been tested on Mac and Linux (Ubuntu and Debian).  You'll need python 3.9+ and PyTorch.  You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine.  See [here](https://pytorch.org/get-started/locally/) for more details.
 
-- `git clone https://github.com/VikParuchuri/marker.git`
-- `cd marker`
+Install with:
 
-## Linux
-
-- Install python requirements
-  - `poetry install`
-  - `poetry shell` to activate your poetry venv
-- Update pytorch since poetry doesn't play nicely with it
-  - GPU only: run `pip install torch` to install other torch dependencies.
-  - CPU only: Uninstall torch with `poetry remove torch`, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.
+```shell
+pip install marker-pdf
+```
 
-**Optional**
+## Optional
 
 Only needed if using `ocrmypdf` as the ocr backend.
 
+**Linux**
+
 - Run `pip install ocrmypdf`
 - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `scripts/install/ghostscript_install.sh`.
 - Install other requirements with `cat scripts/install/tess-apt-requirements.txt | xargs sudo apt-get install -y`
 - Set the tesseract data folder path
   - Find the tesseract data folder `tessdata` with `find / -name tessdata`.  Make sure to use the one corresponding to the latest tesseract version if you have multiple.
   - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
 
-## Mac
-
-- Install python requirements
-  - `poetry install`
-  - `poetry shell` to activate your poetry venv
-
-**Optional**
+**Mac**
 
 Only needed if using `ocrmypdf` as the ocr backend.
 
@@ -98,35 +87,30 @@ Only needed if using `ocrmypdf` as the ocr backend.
 
 # Usage
 
-First, some configuration.  Note that settings can be overridden with env vars, or in a `local.env` file in the root `marker` folder.
+First, some configuration.  Note that settings can be overridden with env vars.
 
-- Your torch device will be automatically detected, but you can manually set it also.  For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
+- Inspect the settings in `marker/settings.py`.  You can override any settings with environment variables.
+- Your torch device will be automatically detected, but you can override this.  For example, `TORCH_DEVICE=cuda`.
   - If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU).  For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
   - Depending on your document types, marker's average memory usage per task can vary slightly.  You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
-- By default, marker will use `surya` for OCR.  Surya is slower on CPU, but more accurate than tesseract.  If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above).
-- Inspect the other settings in `marker/settings.py`.  You can override any settings in the `local.env` file, or by setting environment variables.
-
+- By default, marker will use `surya` for OCR.  Surya is slower on CPU, but more accurate than tesseract.  If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above).  If you don't want OCR at all, set `OCR_ENGINE` to `None`.
 
 ## Convert a single file
 
-Run `convert_single.py`, like this:
-
-```
-python convert_single.py /path/to/file.pdf /path/to/output/folder --parallel_factor 2 --max_pages 10 --langs English
+```shell
+marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10 --langs English
 ```
 
-- `--parallel_factor` is how much to increase batch size and parallel OCR workers by.  Higher numbers will take more VRAM and CPU, but process faster.  Set to 1 by default.
+- `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM.  Higher numbers will take more VRAM, but process faster.  Set to 2 by default.  The default batch sizes will take ~3GB of VRAM.
 - `--max_pages` is the maximum number of pages to process.  Omit this to convert the entire document.
 - `--langs` is a comma separated list of the languages in the document, for OCR
 
 Make sure the `DEFAULT_LANG` setting is set appropriately for your document.  The list of supported languages for OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py).  If you need more languages, you can use any language supported by [Tesseract](https://tesseract-ocr.github.io/tessdoc/Data-Files#data-files-for-version-400-november-29-2016) if you set `OCR_ENGINE` to `ocrmypdf`.  If you don't need OCR, marker can work with any language.
 
 ## Convert multiple files
 
-Run `convert.py`, like this:
-
-```
-python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
+```shell
+marker /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
 ```
 
 - `--workers` is the number of pdfs to convert at once.  This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
@@ -146,10 +130,8 @@ You can use language names or codes.  The exact codes depend on the OCR engine.
 
 ## Convert multiple files on multiple GPUs
 
-Run `chunk_convert.sh`, like this:
-
-```
-MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out
+```shell
+MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
 ```
 
 - `METADATA_FILE` is an optional path to a json file with metadata about the pdfs.  See above for the format.
@@ -159,45 +141,59 @@ MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bas
 
 Note that the env variables above are specific to this script, and cannot be set in `local.env`.
 
+# Important settings/Troubleshooting
+
+There are some settings that you may find especially useful if things aren't working the way you expect:
+
+- `OCR_ALL_PAGES` - set this to true to force OCR on all pages.  This can be very useful if the table layouts aren't recognized properly by default, or if there is garbled text.
+- `TORCH_DEVICE` - set this to force marker to use a given torch device for inference.
+- `OCR_ENGINE` - set this to `surya` or `ocrmypdf`.
+- `DEBUG` - setting this to `True` shows ray logs when converting multiple pdfs.
+
+In general, if output is not what you expect, trying to OCR the PDF is a good first step.
+
 # Benchmarks
 
-Benchmarking PDF extraction quality is hard.  I've created a test set by finding books and scientific papers that have a pdf version and a latex source.  I convert the latex to text, and compare the reference to the output of text extraction methods.
+Benchmarking PDF extraction quality is hard.  I've created a test set by finding books and scientific papers that have a pdf version and a latex source.  I convert the latex to text, and compare the reference to the output of text extraction methods.  It's noisy, but at least directionally correct.
 
-Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).  We show naive text extraction (pulling text out of the pdf with no processing) for comparison.
+Benchmarks show that marker is 4x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).  We show naive text extraction (pulling text out of the pdf with no processing) for comparison.
 
 **Speed**
 
 | Method | Average Score | Time per page | Time per document |
 |--------|---------------|---------------|-------------------|
-| naive  | 0.350727      | 0.00152378    | 0.326524          |
-| marker | 0.641062      | 0.360622      | 77.2762           |
-| nougat | 0.629211      | 3.77259       | 808.413           |
+| marker | 0.613721      | 0.631991      | 58.1432           |
+| nougat | 0.406603      | 2.59702       | 238.926           |
 
 **Accuracy**
 
 First 3 are non-arXiv books, last 3 are arXiv papers.
 
-| Method | switch_trans.pdf | crowd.pdf | multicolcnn.pdf | thinkos.pdf | thinkdsp.pdf | thinkpython.pdf |
-|--------|------------------|-----------|-----------------|-------------|--------------|-----------------|
-| naive  | 0.244114         | 0.140669  | 0.0868221       | 0.366856    | 0.412521     | 0.468281        |
-| marker | 0.482091         | 0.466882  | 0.537062        | 0.754347    | 0.78825      | 0.779536        |
-| nougat | 0.696458         | 0.552337  | 0.735099        | 0.655002    | 0.645704     | 0.650282        |
+| Method | multicolcnn.pdf | switch_trans.pdf | thinkpython.pdf | thinkos.pdf | thinkdsp.pdf | crowd.pdf |
+|--------|-----------------|------------------|-----------------|-------------|--------------|-----------|
+| marker | 0.536176        | 0.516833         | 0.70515         | 0.710657    | 0.690042     | 0.523467  |
+| nougat | 0.44009         | 0.588973         | 0.322706        | 0.401342    | 0.160842     | 0.525663  |
 
-Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker.  Benchmarks were run on an A6000.
+Peak GPU memory usage during the benchmark is `4.2GB` for nougat, and `4.1GB` for marker.  Benchmarks were run on an A6000 Ada.
 
 **Throughput**
 
-Marker takes about 2GB of VRAM on average per task, so you can convert 24 documents in parallel on an A6000.
+Marker takes about 4.5GB of VRAM on average per task, so you can convert 10 documents in parallel on an A6000.
 
 ![Benchmark results](data/images/per_doc.png)
 
 ## Running your own benchmarks
 
-You can benchmark the performance of marker on your machine.  First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip.
-
-Then run `benchmark.py` like this:
+You can benchmark the performance of marker on your machine. Install marker manually with:
 
+```shell
+git clone https://github.com/VikParuchuri/marker.git
+cd marker
+poetry install
 ```
+
+Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run `benchmark.py` like this:
+
+```shell
 python benchmark.py data/pdfs data/references report.json --nougat
 ```
 
@@ -217,7 +213,8 @@ Note that the `ocrmypdf` OCR option will use ocrmypdf, which includes Ghostscrip
 
 This work would not have been possible without amazing open source models and datasets, including (but not limited to):
 
-- Nougat from Meta
+- Surya
+- Texify
 - Pypdfium2/pdfium
 - DocLayNet from IBM
 - ByT5 from Google
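
The throughput note in the README hunk above (about 4.5GB of VRAM per task, 10 documents in parallel on an A6000) and the `INFERENCE_RAM / VRAM_PER_TASK` cap on `--workers` both come down to one floor division. A minimal sketch, assuming the A6000's 48 GB of VRAM; `max_parallel_tasks` and `effective_workers` are hypothetical helpers for illustration, not part of marker:

```python
import math


def max_parallel_tasks(inference_ram_gb: float, vram_per_task_gb: float) -> int:
    """Upper bound on concurrent conversions that fit in GPU memory."""
    if vram_per_task_gb <= 0:
        raise ValueError("vram_per_task_gb must be positive")
    return math.floor(inference_ram_gb / vram_per_task_gb)


def effective_workers(requested: int, inference_ram_gb: float, vram_per_task_gb: float) -> int:
    # Hypothetical helper: clamp the requested worker count to the memory
    # bound, but never go below one worker.
    return max(1, min(requested, max_parallel_tasks(inference_ram_gb, vram_per_task_gb)))


print(max_parallel_tasks(48, 4.5))  # 10
```

With a 16 GB card and the same per-task estimate, requesting 15 workers would be clamped to 3 under this sketch.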

diff --git a/benchmark.py b/benchmark.py
@@ -16,10 +16,27 @@
 import subprocess
 import shutil
 from tabulate import tabulate
+import torch
 
 configure_logging()
 
 
+def start_memory_profiling():
+    torch.cuda.memory._record_memory_history(
+        max_entries=100000
+    )
+
+
+def stop_memory_profiling(memory_file):
+    try:
+        torch.cuda.memory._dump_snapshot(memory_file)
+    except Exception as e:
+        logger.error(f"Failed to capture memory snapshot: {e}")
+
+    # Stop recording memory snapshot history.
+    torch.cuda.memory._record_memory_history(enabled=None)
+
+
 def nougat_prediction(pdf_filename, batch_size=1):
     out_dir = tempfile.mkdtemp()
     subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--recompute", "--batchsize", str(batch_size)], check=True)
@@ -37,28 +54,36 @@ def main():
     parser.add_argument("out_file", help="Output filename")
     parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
     # Nougat batch size 1 uses about as much VRAM as default marker settings
+    parser.add_argument("--marker_batch_multiplier", type=int, default=1, help="Batch size multiplier to use for marker when making predictions.")
     parser.add_argument("--nougat_batch_size", type=int, default=1, help="Batch size to use for nougat when making predictions.")
-    parser.add_argument("--marker_parallel_factor", type=int, default=1, help="How much to multiply default parallel OCR workers and model batch sizes by.")
     parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
+    parser.add_argument("--profile_memory", action="store_true", help="Profile memory usage", default=False)
+
     args = parser.parse_args()
 
-    methods = ["naive", "marker"]
+    methods = ["marker"]
     if args.nougat:
         methods.append("nougat")
 
+    if args.profile_memory:
+        start_memory_profiling()
+
     model_lst = load_all_models()
 
+    if args.profile_memory:
+        stop_memory_profiling("model_load.pickle")
+
     scores = defaultdict(dict)
     benchmark_files = os.listdir(args.in_folder)
     benchmark_files = [b for b in benchmark_files if b.endswith(".pdf")]
     times = defaultdict(dict)
     pages = defaultdict(int)
 
-    for fname in tqdm(benchmark_files):
+    for idx, fname in enumerate(tqdm(benchmark_files)):
         md_filename = fname.rsplit(".", 1)[0] + ".md"
 
         reference_filename = os.path.join(args.reference_folder, md_filename)
-        with open(reference_filename, "r") as f:
+        with open(reference_filename, "r", encoding="utf-8") as f:
             reference = f.read()
 
         pdf_filename = os.path.join(args.in_folder, fname)
@@ -68,7 +93,11 @@ def main():
         for method in methods:
             start = time.time()
             if method == "marker":
-                full_text, _, out_meta = convert_single_pdf(pdf_filename, model_lst, parallel_factor=args.marker_parallel_factor)
+                if args.profile_memory:
+                    start_memory_profiling()
+                full_text, _, out_meta = convert_single_pdf(pdf_filename, model_lst, batch_multiplier=args.marker_batch_multiplier)
+                if args.profile_memory:
+                    stop_memory_profiling(f"marker_memory_{idx}.pickle")
             elif method == "nougat":
                 full_text = nougat_prediction(pdf_filename, batch_size=args.nougat_batch_size)
             elif method == "naive":

diff --git a/chunk_convert.py b/chunk_convert.py
@@ -1,5 +1,6 @@
 import argparse
 import subprocess
+import pkg_resources
 
 
 def main():
@@ -8,8 +9,10 @@ def main():
     parser.add_argument("out_folder", help="Output folder")
     args = parser.parse_args()
 
+    script_path = pkg_resources.resource_filename(__name__, 'chunk_convert.sh')
+
     # Construct the command
-    cmd = f"./chunk_convert.sh {args.in_folder} {args.out_folder}"
+    cmd = f"{script_path} {args.in_folder} {args.out_folder}"
 
     # Execute the shell script
     subprocess.run(cmd, shell=True, check=True)
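
`chunk_convert.py` interpolates the folder arguments into a string and runs it with `shell=True`, so a path containing spaces or shell metacharacters would be split or interpreted by the shell. A hedged sketch of an alternative that passes an argument list instead; `build_command` and `run_chunk_convert` are hypothetical helpers, and invoking the script through `bash` is an assumption about how it should be run:

```python
import subprocess
from typing import List


def build_command(script_path: str, in_folder: str, out_folder: str) -> List[str]:
    # Each argument is its own list element, so no shell quoting is needed.
    return ["bash", script_path, in_folder, out_folder]


def run_chunk_convert(script_path: str, in_folder: str, out_folder: str) -> None:
    # check=True raises CalledProcessError on a non-zero exit status.
    # The child inherits the parent's environment by default, so variables
    # such as NUM_DEVICES and NUM_WORKERS are still visible to the script.
    subprocess.run(build_command(script_path, in_folder, out_folder), check=True)


print(build_command("chunk_convert.sh", "my pdfs", "md out"))
```

Because no shell parses the command, folder names such as `my pdfs` reach the script as single arguments.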

diff --git a/chunk_convert.sh b/chunk_convert.sh
@@ -35,7 +35,7 @@ for (( i=0; i<$NUM_DEVICES; i++ )); do
     export NUM_DEVICES
     export NUM_WORKERS
     echo "Running convert.py on GPU $DEVICE_NUM"
-    cmd="CUDA_VISIBLE_DEVICES=$DEVICE_NUM python convert.py $INPUT_FOLDER $OUTPUT_FOLDER --num_chunks $NUM_DEVICES --chunk_idx $DEVICE_NUM --workers $NUM_WORKERS"
+    cmd="CUDA_VISIBLE_DEVICES=$DEVICE_NUM marker $INPUT_FOLDER $OUTPUT_FOLDER --num_chunks $NUM_DEVICES --chunk_idx $DEVICE_NUM --workers $NUM_WORKERS"
     [[ -n "$METADATA_FILE" ]] && cmd="$cmd --metadata_file $METADATA_FILE"
     [[ -n "$MIN_LENGTH" ]] && cmd="$cmd --min_length $MIN_LENGTH"
     eval $cmd &
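
The loop in `chunk_convert.sh` launches one `marker` process per GPU and passes `--num_chunks` and `--chunk_idx`. This diff does not show how `marker` uses those flags. One plausible scheme, sketched here purely as an assumption, is to split the file list into contiguous chunks of equal size and give each process one chunk; `select_chunk` is a hypothetical name:

```python
import math
from typing import List


def select_chunk(files: List[str], num_chunks: int, chunk_idx: int) -> List[str]:
    """Return the contiguous slice of files assigned to chunk_idx (assumed scheme)."""
    if num_chunks < 1 or not 0 <= chunk_idx < num_chunks:
        raise ValueError("chunk_idx must be in [0, num_chunks)")
    # Round up so that every file lands in some chunk; the last chunk may be short.
    chunk_size = math.ceil(len(files) / num_chunks)
    start = chunk_idx * chunk_size
    return files[start:start + chunk_size]


files = [f"doc{i}.pdf" for i in range(10)]
for idx in range(4):
    print(idx, select_chunk(files, num_chunks=4, chunk_idx=idx))
```

Under this scheme, 10 files across 4 devices gives three chunks of 3 files and a final chunk of 1, with no file processed twice.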