Code formatting, update batch sizes #114

Merged 10 commits on May 9, 2024

2 changes: 0 additions & 2 deletions .github/workflows/publish.yml
@@ -16,8 +16,6 @@ jobs:
run: |
pip install poetry
poetry install
poetry remove torch
poetry run pip install torch --index-url https://download.pytorch.org/whl/cpu
- name: Build package
run: |
poetry build
4 changes: 2 additions & 2 deletions .github/workflows/tests.yml
@@ -23,8 +23,8 @@ jobs:
poetry run pip install torch --index-url https://download.pytorch.org/whl/cpu
- name: Download benchmark data
run: |
wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=1ktVDYPEeyHlKLaF56FnHjI5VjVnYa1xL"
unzip benchmark_data.zip
wget -O benchmark_data.zip "https://drive.google.com/uc?export=download&id=1NHrdYatR1rtqs2gPVfdvO0BAvocH8CJi"
unzip -o benchmark_data.zip
- name: Run benchmark test
run: |
poetry run python benchmark.py benchmark_data/pdfs benchmark_data/references report.json
119 changes: 58 additions & 61 deletions README.md
@@ -1,12 +1,13 @@
# Marker

Marker converts PDF to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.
Marker converts PDF to markdown quickly and accurately.

- Support for a range of documents (optimized for books and scientific papers)
- Supports a wide range of documents (optimized for books and scientific papers)
- Supports all languages
- Removes headers/footers/other artifacts
- Converts most equations to latex
- Formats tables and code blocks
- Support for all languages (although most testing is done in English).
- Extracts and saves images along with the markdown
- Converts most equations to latex
- Works on GPU, CPU, or MPS

## How it works
@@ -34,7 +35,7 @@ It only uses models where necessary, which improves speed and accuracy.

![Benchmark overall](data/images/overall.png)

The above results are with marker and nougat setup so they each take ~3GB of VRAM on an A6000.
The above results are with marker and nougat setup so they each take ~4GB of VRAM on an A6000.

See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

@@ -46,47 +47,35 @@ See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instruc

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

- Marker will not convert 100% of equations to LaTeX. This is because it has to first detect equations, then convert them.
- Marker will not convert 100% of equations to LaTeX. This is because it has to detect then convert.
- Whitespace and indentations are not always respected.
- Not all lines/spans will be joined properly.
- This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.

# Installation

This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and [poetry](https://python-poetry.org/docs/#installing-with-the-official-installer).

First, clone the repo:
This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See [here](https://pytorch.org/get-started/locally/) for more details.

- `git clone https://github.com/VikParuchuri/marker.git`
- `cd marker`
Install with:

## Linux

- Install python requirements
- `poetry install`
- `poetry shell` to activate your poetry venv
- Update pytorch since poetry doesn't play nicely with it
- GPU only: run `pip install torch` to install other torch dependencies.
- CPU only: Uninstall torch with `poetry remove torch`, then follow the [CPU install](https://pytorch.org/get-started/locally/) instructions.
```shell
pip install marker-pdf
```

**Optional**
## Optional

Only needed if using `ocrmypdf` as the ocr backend.

**Linux**

- Run `pip install ocrmypdf`
- Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `scripts/install/ghostscript_install.sh`.
- Install other requirements with `cat scripts/install/tess-apt-requirements.txt | xargs sudo apt-get install -y`
- Set the tesseract data folder path
- Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
- Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it

## Mac

- Install python requirements
- `poetry install`
- `poetry shell` to activate your poetry venv

**Optional**
**Mac**

Only needed if using `ocrmypdf` as the ocr backend.

@@ -98,35 +87,30 @@ Only needed if using `ocrmypdf` as the ocr backend.

# Usage

First, some configuration. Note that settings can be overridden with env vars, or in a `local.env` file in the root `marker` folder.
First, some configuration. Note that settings can be overridden with env vars.

- Your torch device will be automatically detected, but you can manually set it also. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
- Inspect the settings in `marker/settings.py`. You can override any settings with environment variables.
- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
- If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
- Depending on your document types, marker's average memory usage per task can vary slightly. You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
- By default, marker will use `surya` for OCR. Surya is slower on CPU, but more accurate than tesseract. If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above).
- Inspect the other settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.

- By default, marker will use `surya` for OCR. Surya is slower on CPU, but more accurate than tesseract. If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above). If you don't want OCR at all, set `OCR_ENGINE` to `None`.
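
As a rough illustration of the environment-variable overrides described above, the following Python sketch builds an environment with settings such as `TORCH_DEVICE` and `OCR_ENGINE` and assembles a `marker_single` invocation. The helper names (`build_marker_env`, `build_single_command`, `run_single`) are hypothetical and not part of marker, and the sketch assumes `marker_single` is on `PATH` after `pip install marker-pdf`.

```python
import os
import subprocess


def build_marker_env(overrides, base_env=None):
    """Return a copy of the environment with marker settings overridden."""
    env = dict(os.environ if base_env is None else base_env)
    # Environment variables are strings, so coerce values such as 16 -> "16".
    env.update({key: str(value) for key, value in overrides.items()})
    return env


def build_single_command(pdf_path, out_folder, max_pages=None, langs=None):
    """Assemble the marker_single argument list shown in the README."""
    cmd = ["marker_single", pdf_path, out_folder]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    if langs:
        # --langs takes a comma separated list of languages.
        cmd += ["--langs", ",".join(langs)]
    return cmd


def run_single(pdf_path, out_folder, overrides, **options):
    """Run the conversion with the overrides applied (hypothetical wrapper)."""
    cmd = build_single_command(pdf_path, out_folder, **options)
    return subprocess.run(cmd, env=build_marker_env(overrides), check=True)
```

For example, `run_single("/path/to/file.pdf", "/path/to/output/folder", {"TORCH_DEVICE": "cuda", "INFERENCE_RAM": 16}, max_pages=10, langs=["English"])` would run the same command as the README example with the device and VRAM settings overridden for that one process only.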

## Convert a single file

Run `convert_single.py`, like this:

```
python convert_single.py /path/to/file.pdf /path/to/output/folder --parallel_factor 2 --max_pages 10 --langs English
```shell
marker_single /path/to/file.pdf /path/to/output/folder --parallel_factor 2 --max_pages 10 --langs English
```

- `--parallel_factor` is how much to increase batch size and parallel OCR workers by. Higher numbers will take more VRAM and CPU, but process faster. Set to 1 by default.
- `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM. Higher numbers will take more VRAM, but process faster. Set to 2 by default. The default batch sizes will take ~3GB of VRAM.
- `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.
- `--langs` is a comma separated list of the languages in the document, for OCR

Make sure the `DEFAULT_LANG` setting is set appropriately for your document. The list of supported languages for OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py). If you need more languages, you can use any language supported by [Tesseract](https://tesseract-ocr.github.io/tessdoc/Data-Files#data-files-for-version-400-november-29-2016) if you set `OCR_ENGINE` to `ocrmypdf`. If you don't need OCR, marker can work with any language.

## Convert multiple files

Run `convert.py`, like this:

```
python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
```shell
marker /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
```

- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
@@ -146,10 +130,8 @@ You can use language names or codes. The exact codes depend on the OCR engine.

## Convert multiple files on multiple GPUs

Run `chunk_convert.sh`, like this:

```
MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out
```shell
MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
```

- `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
@@ -159,45 +141,59 @@ MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bas

Note that the env variables above are specific to this script, and cannot be set in `local.env`.

# Important settings/Troubleshooting

There are some settings that you may find especially useful if things aren't working the way you expect:

- `OCR_ALL_PAGES` - set this to true to force OCR all pages. This can be very useful if the table layouts aren't recognized properly by default, or if there is garbled text.
- `TORCH_DEVICE` - set this to force marker to use a given torch device for inference.
- `OCR_ENGINE` - can set this to `surya` or `ocrmypdf`.
- `DEBUG` - setting this to `True` shows ray logs when converting multiple pdfs

In general, if output is not what you expect, trying to OCR the PDF is a good first step.

# Benchmarks

Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.
Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.

Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.
Benchmarks show that marker is 4x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.

**Speed**

| Method | Average Score | Time per page | Time per document |
|--------|---------------|---------------|-------------------|
| naive | 0.350727 | 0.00152378 | 0.326524 |
| marker | 0.641062 | 0.360622 | 77.2762 |
| nougat | 0.629211 | 3.77259 | 808.413 |
| marker | 0.613721 | 0.631991 | 58.1432 |
| nougat | 0.406603 | 2.59702 | 238.926 |

**Accuracy**

First 3 are non-arXiv books, last 3 are arXiv papers.

| Method | switch_trans.pdf | crowd.pdf | multicolcnn.pdf | thinkos.pdf | thinkdsp.pdf | thinkpython.pdf |
|--------|------------------|-----------|-----------------|-------------|--------------|-----------------|
| naive | 0.244114 | 0.140669 | 0.0868221 | 0.366856 | 0.412521 | 0.468281 |
| marker | 0.482091 | 0.466882 | 0.537062 | 0.754347 | 0.78825 | 0.779536 |
| nougat | 0.696458 | 0.552337 | 0.735099 | 0.655002 | 0.645704 | 0.650282 |
| Method | multicolcnn.pdf | switch_trans.pdf | thinkpython.pdf | thinkos.pdf | thinkdsp.pdf | crowd.pdf |
|--------|-----------------|------------------|-----------------|-------------|--------------|-----------|
| marker | 0.536176 | 0.516833 | 0.70515 | 0.710657 | 0.690042 | 0.523467 |
| nougat | 0.44009 | 0.588973 | 0.322706 | 0.401342 | 0.160842 | 0.525663 |

Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker. Benchmarks were run on an A6000.
Peak GPU memory usage during the benchmark is `4.2GB` for nougat, and `4.1GB` for marker. Benchmarks were run on an A6000 Ada.

**Throughput**

Marker takes about 2GB of VRAM on average per task, so you can convert 24 documents in parallel on an A6000.
Marker takes about 4.5GB of VRAM on average per task, so you can convert 10 documents in parallel on an A6000.
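
The parallelism figures follow from dividing the available VRAM by the per-task VRAM and rounding down, which is also the `INFERENCE_RAM / VRAM_PER_TASK` cap mentioned for `--workers`. A minimal sketch of that arithmetic, assuming a 48 GB A6000 (the card's memory size is not stated in the README) and a hypothetical helper name `max_parallel_tasks`:

```python
import math


def max_parallel_tasks(inference_ram_gb, vram_per_task_gb, requested_workers=None):
    """Upper bound on concurrent conversions: floor(INFERENCE_RAM / VRAM_PER_TASK)."""
    if vram_per_task_gb <= 0:
        raise ValueError("vram_per_task_gb must be positive")
    limit = math.floor(inference_ram_gb / vram_per_task_gb)
    if requested_workers is None:
        return limit
    # --workers cannot push parallelism past the VRAM-derived limit.
    return min(requested_workers, limit)


print(max_parallel_tasks(48, 4.5))  # 10, the updated figure
print(max_parallel_tasks(48, 2))    # 24, the previous figure
print(max_parallel_tasks(16, 4.5, requested_workers=10))  # 3
```

The last call shows the effect on a smaller card: with `INFERENCE_RAM=16`, asking for 10 workers would still leave only 3 running in parallel.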

![Benchmark results](data/images/per_doc.png)

## Running your own benchmarks

You can benchmark the performance of marker on your machine. First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip.

Then run `benchmark.py` like this:
You can benchmark the performance of marker on your machine. Install marker manually with:

```shell
git clone https://github.com/VikParuchuri/marker.git
poetry install
```

Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run `benchmark.py` like this:

```shell
python benchmark.py data/pdfs data/references report.json --nougat
```

@@ -217,7 +213,8 @@ Note that the `ocrmypdf` OCR option will use ocrmypdf, which includes Ghostscrip

This work would not have been possible without amazing open source models and datasets, including (but not limited to):

- Nougat from Meta
- Surya
- Texify
- Pypdfium2/pdfium
- DocLayNet from IBM
- ByT5 from Google
39 changes: 34 additions & 5 deletions benchmark.py
@@ -16,10 +16,27 @@
import subprocess
import shutil
from tabulate import tabulate
import torch

configure_logging()


def start_memory_profiling():
    torch.cuda.memory._record_memory_history(
        max_entries=100000
    )


def stop_memory_profiling(memory_file):
    try:
        torch.cuda.memory._dump_snapshot(memory_file)
    except Exception as e:
        logger.error(f"Failed to capture memory snapshot {e}")

    # Stop recording memory snapshot history.
    torch.cuda.memory._record_memory_history(enabled=None)


def nougat_prediction(pdf_filename, batch_size=1):
    out_dir = tempfile.mkdtemp()
    subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--recompute", "--batchsize", str(batch_size)], check=True)
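
The start/stop pair above has to be called in matching order around each profiled region. One way this could be packaged, shown only as a sketch and not as what the PR does, is a context manager that always stops recording even if the profiled code raises. The name `profile_cuda_memory` and its `enabled` flag are hypothetical; the sketch relies on the same private, underscore-prefixed PyTorch functions the PR uses, which may change between releases.

```python
import contextlib
import logging

logger = logging.getLogger(__name__)


@contextlib.contextmanager
def profile_cuda_memory(snapshot_path, enabled=True, max_entries=100000):
    """Record CUDA memory history for the enclosed block and dump a snapshot."""
    if not enabled:
        # Profiling switched off: run the block without touching torch at all.
        yield
        return

    import torch  # imported lazily so the disabled path has no torch dependency

    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        yield
    finally:
        try:
            torch.cuda.memory._dump_snapshot(snapshot_path)
        except Exception as e:
            logger.error(f"Failed to capture memory snapshot {e}")
        # Stop recording memory snapshot history.
        torch.cuda.memory._record_memory_history(enabled=None)
```

With such a helper, a call site would look like `with profile_cuda_memory("model_load.pickle", enabled=args.profile_memory): model_lst = load_all_models()`, replacing the paired `if args.profile_memory:` checks.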
@@ -37,28 +54,36 @@ def main():
    parser.add_argument("out_file", help="Output filename")
    parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
    # Nougat batch size 1 uses about as much VRAM as default marker settings
    parser.add_argument("--marker_batch_multiplier", type=int, default=1, help="Batch size multiplier to use for marker when making predictions.")
    parser.add_argument("--nougat_batch_size", type=int, default=1, help="Batch size to use for nougat when making predictions.")
    parser.add_argument("--marker_parallel_factor", type=int, default=1, help="How much to multiply default parallel OCR workers and model batch sizes by.")
    parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
    parser.add_argument("--profile_memory", action="store_true", help="Profile memory usage", default=False)

    args = parser.parse_args()

    methods = ["naive", "marker"]
    methods = ["marker"]
    if args.nougat:
        methods.append("nougat")

    if args.profile_memory:
        start_memory_profiling()

    model_lst = load_all_models()

    if args.profile_memory:
        stop_memory_profiling("model_load.pickle")

    scores = defaultdict(dict)
    benchmark_files = os.listdir(args.in_folder)
    benchmark_files = [b for b in benchmark_files if b.endswith(".pdf")]
    times = defaultdict(dict)
    pages = defaultdict(int)

    for fname in tqdm(benchmark_files):
    for idx, fname in tqdm(enumerate(benchmark_files)):
        md_filename = fname.rsplit(".", 1)[0] + ".md"

        reference_filename = os.path.join(args.reference_folder, md_filename)
        with open(reference_filename, "r") as f:
        with open(reference_filename, "r", encoding="utf-8") as f:
            reference = f.read()

        pdf_filename = os.path.join(args.in_folder, fname)
@@ -68,7 +93,11 @@ def main():
        for method in methods:
            start = time.time()
            if method == "marker":
                full_text, _, out_meta = convert_single_pdf(pdf_filename, model_lst, parallel_factor=args.marker_parallel_factor)
                if args.profile_memory:
                    start_memory_profiling()
                full_text, _, out_meta = convert_single_pdf(pdf_filename, model_lst, batch_multiplier=args.marker_batch_multiplier)
                if args.profile_memory:
                    stop_memory_profiling(f"marker_memory_{idx}.pickle")
            elif method == "nougat":
                full_text = nougat_prediction(pdf_filename, batch_size=args.nougat_batch_size)
            elif method == "naive":
5 changes: 4 additions & 1 deletion chunk_convert.py
@@ -1,5 +1,6 @@
import argparse
import subprocess
import pkg_resources


def main():
@@ -8,8 +9,10 @@ def main():
    parser.add_argument("out_folder", help="Output folder")
    args = parser.parse_args()

    script_path = pkg_resources.resource_filename(__name__, 'chunk_convert.sh')

    # Construct the command
    cmd = f"./chunk_convert.sh {args.in_folder} {args.out_folder}"
    cmd = f"{script_path} {args.in_folder} {args.out_folder}"

    # Execute the shell script
    subprocess.run(cmd, shell=True, check=True)
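
Because the command above is interpolated into a shell string, input or output folders containing spaces or shell metacharacters would be split or interpreted by the shell. One possible alternative, shown here only as a sketch and not as what the PR does, is to quote each argument with `shlex.quote`, or to pass an argument list and drop `shell=True`. The function names `build_chunk_command` and `run_chunk_convert` are hypothetical.

```python
import shlex
import subprocess


def build_chunk_command(script_path, in_folder, out_folder):
    """Build a shell-safe command string for chunk_convert.sh."""
    return " ".join(shlex.quote(part) for part in (script_path, in_folder, out_folder))


def run_chunk_convert(script_path, in_folder, out_folder):
    """Run the script without a shell, so no quoting is needed at all."""
    # Invoking through bash is an assumption made here so the script does not
    # need its executable bit set. NUM_DEVICES and the other environment
    # variables are inherited from the caller's environment.
    return subprocess.run(["bash", script_path, in_folder, out_folder], check=True)


print(build_chunk_command("chunk_convert.sh", "my pdfs", "md_out"))
# chunk_convert.sh 'my pdfs' md_out
```

The list form in `run_chunk_convert` is usually the simpler choice; the quoted string form is only needed if `shell=True` has to stay.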
2 changes: 1 addition & 1 deletion chunk_convert.sh
@@ -35,7 +35,7 @@ for (( i=0; i<$NUM_DEVICES; i++ )); do
export NUM_DEVICES
export NUM_WORKERS
echo "Running convert.py on GPU $DEVICE_NUM"
cmd="CUDA_VISIBLE_DEVICES=$DEVICE_NUM python convert.py $INPUT_FOLDER $OUTPUT_FOLDER --num_chunks $NUM_DEVICES --chunk_idx $DEVICE_NUM --workers $NUM_WORKERS"
cmd="CUDA_VISIBLE_DEVICES=$DEVICE_NUM marker $INPUT_FOLDER $OUTPUT_FOLDER --num_chunks $NUM_DEVICES --chunk_idx $DEVICE_NUM --workers $NUM_WORKERS"
[[ -n "$METADATA_FILE" ]] && cmd="$cmd --metadata_file $METADATA_FILE"
[[ -n "$MIN_LENGTH" ]] && cmd="$cmd --min_length $MIN_LENGTH"
eval $cmd &