Commit

Fix bbox bug
VikParuchuri committed May 27, 2024
1 parent 51266d8 commit 2557089
Showing 2 changed files with 7 additions and 3 deletions.
8 changes: 6 additions & 2 deletions pdftext/extraction.py
@@ -76,10 +76,14 @@ def dictionary_output(pdf_path, sort=False, model=None, page_range=None, keep_ch
     for page in pages:
         page_width, page_height = page["width"], page["height"]
         for block in page["blocks"]:
-            block = {k: v for k, v in block.items() if k in ["lines", "bbox"]}
+            for k in list(block.keys()):
+                if k not in ["lines", "bbox"]:
+                    del block[k]
             block["bbox"] = unnormalize_bbox(block["bbox"], page_width, page_height)
             for line in block["lines"]:
-                line = {k: v for k, v in line.items() if k in ["bbox", "spans"]}
+                for k in list(line.keys()):
+                    if k not in ["spans", "bbox"]:
+                        del line[k]
                 line["bbox"] = unnormalize_bbox(line["bbox"], page_width, page_height)
                 for span in line["spans"]:
                     _process_span(span, page_width, page_height, keep_chars)
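
A note on why this hunk likely fixes the bbox bug: assigning a filtered copy to the loop variable rebinds the name to a new dict, so the later `bbox` update never reaches the dict stored in `page["blocks"]`. Deleting keys in place mutates the stored dict instead. The sketch below reproduces both behaviours on toy data. The helper `unnormalize`, the two wrapper functions, and the sample page are hypothetical stand-ins for illustration, not pdftext's actual code.

```python
def unnormalize(bbox, width, height):
    # Hypothetical stand-in for unnormalize_bbox: scale [0, 1] coords to page units.
    x0, y0, x1, y1 = bbox
    return [x0 * width, y0 * height, x1 * width, y1 * height]


def rebind_version(page):
    # Old pattern: the comprehension builds a NEW dict and rebinds `block` to it.
    for block in page["blocks"]:
        block = {k: v for k, v in block.items() if k in ["lines", "bbox"]}
        # This updates the copy only; page["blocks"] still holds the original dict.
        block["bbox"] = unnormalize(block["bbox"], page["width"], page["height"])


def in_place_version(page):
    # New pattern: delete unwanted keys from the dict that the page references.
    for block in page["blocks"]:
        for k in list(block.keys()):  # list() so we can delete while iterating
            if k not in ["lines", "bbox"]:
                del block[k]
        block["bbox"] = unnormalize(block["bbox"], page["width"], page["height"])


def make_page():
    return {
        "width": 200,
        "height": 100,
        "blocks": [{"bbox": [0.25, 0.5, 0.75, 1.0], "lines": [], "extra": 1}],
    }


old_page = make_page()
rebind_version(old_page)

new_page = make_page()
in_place_version(new_page)

print(old_page["blocks"][0])  # bbox still normalized, "extra" still present
print(new_page["blocks"][0])  # bbox scaled to page units, "extra" removed
```

The same reasoning applies to the `line` loop: the old comprehension left `line["bbox"]` normalized in the returned structure, while nested objects such as `spans` were still shared and therefore still updated.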
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "pdftext"
-version = "0.3.9"
+version = "0.3.10"
 description = "Extract structured text from pdfs quickly"
 authors = ["Vik Paruchuri <vik.paruchuri@gmail.com>"]
 license = "Apache-2.0"