Commit fae2334: Drop validation

VikParuchuri committed Apr 25, 2024
1 parent cc6a6e4
Showing 6 changed files with 31 additions and 18 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -20,6 +20,6 @@ jobs:
        poetry install
    - name: Run detection benchmark test
      run: |
-       poetry run python benchmark.py --max 5 --result_path results
+       poetry run python benchmark.py --max 5 --result_path results --pdftext_only
        poetry run python scripts/verify_benchmark_scores.py results/results.json
19 changes: 12 additions & 7 deletions README.md
@@ -2,6 +2,10 @@

Text extraction like [PyMuPDF](https://github.com/pymupdf/PyMuPDF), but without the AGPL license. PDFText extracts plain text or structured blocks and lines. It's built on [pypdfium2](https://github.com/pypdfium2-team/pypdfium2), so it's [fast, accurate](#benchmarks), and Apache licensed.

+## Community
+
+[Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.

# Installation

You'll need python 3.9+ first. Then run `pip install pdftext`.
@@ -77,25 +81,25 @@ If you want more customization, check out the `pdftext.extraction._get_pages` function

# Benchmarks

-I benchmarked extraction speed and accuracy of [pymupdf](https://pymupdf.readthedocs.io/en/latest/), [pdfplumber](https://github.com/jsvine/pdfplumber), and pdftext. I chose pymupdf because it extracts blocks and lines. Pdfplumber extracts words and bboxes. I did not benchmark pypdf, even though it is a great library, because it doesn't provide individual words/lines and bbox information.
+I benchmarked extraction speed and accuracy of [pymupdf](https://pymupdf.readthedocs.io/en/latest/), [pdfplumber](https://github.com/jsvine/pdfplumber), and pdftext. I chose pymupdf because it extracts blocks and lines. Pdfplumber extracts words and bboxes. I did not benchmark pypdf, even though it is a great library, because it doesn't provide individual character/line/block and bbox information.

-Here are the scores:
+Here are the scores, run on an M1 Macbook, without multiprocessing:

 | Library    | Time (s per page) | Alignment Score (% accuracy vs pymupdf) |
 |------------|-------------------|-----------------------------------------|
 | pymupdf    | 0.32              | --                                      |
-| pdftext    | 1.79              | 96.22                                   |
-| pdfplumber | 3.0               | 89.88                                   |
+| pdftext    | 1.57              | 97.66                                   |
+| pdfplumber | 3.0               | 90.3                                    |

-pdftext is approximately 2x slower than using pypdfium2 alone (if you were to extract all the same information).
+pdftext is approximately 2x slower than using pypdfium2 alone (if you were to extract all the same character information).

There are additional benchmarks for pypdfium2 and other tools [here](https://github.com/py-pdf/benchmarks).

## Methodology

I used a benchmark set of 200 pdfs extracted from [common crawl](https://huggingface.co/datasets/pixparse/pdfa-eng-wds), then processed by a team at HuggingFace.

-For each library, I used a detailed extraction method, to pull out font information, as well as just the words. This ensured we were comparing similar performance numbers.
+For each library, I used a detailed extraction method, to pull out font information, as well as just the words. This ensured we were comparing similar performance numbers. I formatted the text similarly when extracting - newlines after lines, and double newlines after blocks. For pdfplumber, I could only do the newlines after lines, since it doesn't recognize blocks.
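The newline convention the changed line describes can be sketched in a few lines of Python (the block contents here are made up for illustration):

```python
# Join lines with "\n" and blocks with "\n\n", mirroring the formatting
# convention described above. The sample blocks are invented.
blocks = [["Line one ", "Line two"], ["Next block"]]

text = ""
for block in blocks:
    for line in block:
        text += line.rstrip() + "\n"   # newline after each line
    text = text.rstrip() + "\n\n"      # double newline after each block

print(repr(text))  # 'Line one\nLine two\n\nNext block\n\n'
```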

For the alignment score, I extracted the text, then used the rapidfuzz library to find the alignment percentage. I used the text extracted by pymupdf as the pseudo-ground truth.
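A minimal sketch of the alignment measurement. To keep it self-contained, the stdlib `difflib.SequenceMatcher` stands in for rapidfuzz's normalized similarity scoring, and the two strings are invented examples rather than real extractions:

```python
from difflib import SequenceMatcher

# Pseudo-ground truth (pymupdf's output in the real benchmark) vs a
# candidate extraction; both strings are invented for illustration.
reference = "The quick brown fox jumps over the lazy dog"
candidate = "The quick brown fox jumped over the lazy dog"

# Alignment as a percentage. rapidfuzz computes a similar normalized
# similarity, just much faster than difflib.
score = SequenceMatcher(None, reference, candidate).ratio() * 100
print(round(score, 1))
```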

@@ -114,10 +118,11 @@ The benchmark script has a few options:

 - `--max` this controls the maximum number of pdfs to benchmark
 - `--result_path` a folder to save the results. A file called `results.json` will be created in the folder.
+- `--pdftext_only` skip running pdfplumber, which can be slow.

# How it works

-PDFText is a very light wrapper around pypdfium2. It first uses pypdfium2 to extract characters in order, along with font and other information. Then it uses a simple decision tree algorithm to group characters into lines and blocks. It then done some simple postprocessing to clean up the text.
+PDFText is a very light wrapper around pypdfium2. It first uses pypdfium2 to extract characters in order, along with font and other information. Then it uses a simple decision tree algorithm to group characters into lines and blocks. It does some simple postprocessing to clean up the text.
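A toy sketch of the grouping step: in place of the trained decision tree, a hand-written threshold on the vertical gap decides where lines break. The feature choice, the 0.5 cutoff, and the char-dict shape are illustrative assumptions, not pdftext's actual internals:

```python
# Simplified character -> line grouping. pdftext uses a trained decision
# tree over font and position features; here a single hand-written y-gap
# threshold stands in for the model.
def group_chars_into_lines(chars, y_gap_threshold=0.5):
    lines = []
    current = []
    prev = None
    for char in chars:  # chars arrive in reading order, as pypdfium2 yields them
        if prev is not None and abs(char["y"] - prev["y"]) > y_gap_threshold:
            # the model would predict a "new line" boundary here
            lines.append("".join(c["char"] for c in current))
            current = []
        current.append(char)
        prev = char
    if current:
        lines.append("".join(c["char"] for c in current))
    return lines

chars = [
    {"char": "H", "y": 10.0}, {"char": "i", "y": 10.0},
    {"char": "t", "y": 22.0}, {"char": "o", "y": 22.0},
]
print(group_chars_into_lines(chars))  # ['Hi', 'to']
```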

# Credits

14 changes: 10 additions & 4 deletions benchmark.py
@@ -30,8 +30,8 @@ def pymupdf_inference(pdf_path):
            for line in block["lines"]:
                for span in line["spans"]:
                    text += span["text"]
-               if not text.endswith("\n"):
-                   text += "\n\n"
+               text = text.rstrip() + "\n"
+           text = text.rstrip() + "\n\n"
        pages.append(text)
    return pages

@@ -41,8 +41,10 @@ def pdfplumber_inference(pdf_path):
        pages = []
        for i in range(len(pdf.pages)):
            page = pdf.pages[i]
-           words = page.extract_words(use_text_flow=True)
-           text = "".join([word["text"] for word in words])
+           lines = page.extract_text_lines(strip=False, return_chars=True, keep_text_flow=True)
+           text = ""
+           for line in lines:
+               text += line["text"].rstrip() + "\n"
            pages.append(text)
    return pages

@@ -55,6 +57,7 @@ def main():
    parser = argparse.ArgumentParser(description="Benchmark pdf extraction.")
    parser.add_argument("--result_path", type=str, help="Path to the output text file, defaults to stdout", default=None)
    parser.add_argument("--max", type=int, help="Maximum number of pages to process.", default=None)
+   parser.add_argument("--pdftext_only", action="store_true", help="Only run pdftext inference", default=False)
    args = parser.parse_args()

split = "train"
@@ -66,6 +69,9 @@ def main():
    alignments = defaultdict(list)
    times_tools = ["pymupdf", "pdftext", "pdfplumber"]
    alignment_tools = ["pdftext", "pdfplumber"]
+   if args.pdftext_only:
+       times_tools = ["pdftext", "pymupdf"]
+       alignment_tools = ["pdftext"]
    model = get_model()
    for i in tqdm(range(len(dataset)), desc="Benchmarking"):
        row = dataset[i]
6 changes: 5 additions & 1 deletion pdftext/inference.py
@@ -1,5 +1,7 @@
from itertools import chain

+import sklearn
+
from pdftext.pdf.utils import LINE_BREAKS, TABS, SPACES


@@ -152,7 +154,9 @@ def inference(text_chars, model):
    training_rows = [tl[1] for tl in training_list]
    training_idxs = [tl[0] for tl in training_list]

-   predictions = model.predict(training_rows)
+   # Disable nan, etc, validation for a small speedup
+   with sklearn.config_context(assume_finite=True):
+       predictions = model.predict(training_rows)
    for pred, page_idx in zip(predictions, training_idxs):
        next_prediction[page_idx] = pred
    page_blocks = sorted(page_blocks.items())
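The change above relies on scikit-learn's config context: `assume_finite=True` skips the NaN/infinity check that input validation runs on every `predict` call, which is the whole point of this commit. A minimal, self-contained demonstration of the pattern (the toy data and classifier are invented for illustration):

```python
import numpy as np
import sklearn
from sklearn.tree import DecisionTreeClassifier

# Toy, perfectly separable data; the real model predicts line/block breaks.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Inside this context, validation skips the finiteness check, trading
# safety (NaN/inf inputs would go undetected) for a small speedup.
with sklearn.config_context(assume_finite=True):
    preds = clf.predict(X)

print(preds.tolist())  # [0, 0, 1, 1]
```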
6 changes: 2 additions & 4 deletions pdftext/postprocessing.py
@@ -62,11 +62,9 @@ def merge_text(page: Dict, sort=False) -> str:
            for char in line["chars"]:
                line_text += char["char"]
            line_text = postprocess_text(line_text)
-           if line_text.endswith("\n"):
-               line_text = line_text[:-1].strip() + " "
+           line_text = line_text.rstrip() + "\n"

            block_text += line_text
-       if not block_text.endswith("\n"):
-           block_text += "\n\n"
+       block_text = block_text.rstrip() + "\n\n"
        text += block_text
    return text
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "pdftext"
-version = "0.1.1"
+version = "0.1.2"
description = "Extract structured text from pdfs quickly"
authors = ["Vik Paruchuri <vik.paruchuri@gmail.com>"]
license = "Apache-2.0"
