Commit

Update examples, fix table cells with lines
VikParuchuri committed Jan 24, 2025
1 parent 98aaaba commit fb4da2b
Showing 78 changed files with 142,626 additions and 44,915 deletions.
8,266 changes: 7,123 additions & 1,143 deletions data/examples/json/multicolcnn.json

Large diffs are not rendered by default.

33,140 changes: 27,408 additions & 5,732 deletions data/examples/json/switch_trans.json

Large diffs are not rendered by default.

132,304 changes: 99,758 additions & 32,546 deletions data/examples/json/thinkpython.json

Large diffs are not rendered by default.

Binary file modified data/examples/markdown/multicolcnn/_page_1_Figure_0.jpeg
Binary file modified data/examples/markdown/multicolcnn/_page_2_Picture_0.jpeg
Binary file modified data/examples/markdown/multicolcnn/_page_6_Figure_0.jpeg
Binary file modified data/examples/markdown/multicolcnn/_page_7_Figure_0.jpeg
283 changes: 137 additions & 146 deletions data/examples/markdown/multicolcnn/multicolcnn.md

Large diffs are not rendered by default.

367 changes: 221 additions & 146 deletions data/examples/markdown/multicolcnn/multicolcnn_meta.json

Large diffs are not rendered by default.

Binary file modified data/examples/markdown/switch_transformers/_page_18_Figure_1.jpeg
Binary file modified data/examples/markdown/switch_transformers/_page_18_Figure_3.jpeg
Binary file modified data/examples/markdown/switch_transformers/_page_20_Figure_4.jpeg
Binary file modified data/examples/markdown/switch_transformers/_page_27_Figure_1.jpeg
Binary file modified data/examples/markdown/switch_transformers/_page_29_Figure_1.jpeg
Binary file modified data/examples/markdown/switch_transformers/_page_2_Figure_3.jpeg
Binary file modified data/examples/markdown/switch_transformers/_page_30_Figure_1.jpeg
Binary file modified data/examples/markdown/switch_transformers/_page_31_Figure_3.jpeg
Binary file modified data/examples/markdown/switch_transformers/_page_4_Figure_1.jpeg
Binary file modified data/examples/markdown/switch_transformers/_page_5_Figure_3.jpeg
963 changes: 492 additions & 471 deletions data/examples/markdown/switch_transformers/switch_trans.md

Large diffs are not rendered by default.

1,249 changes: 856 additions & 393 deletions data/examples/markdown/switch_transformers/switch_trans_meta.json

Large diffs are not rendered by default.

Binary file modified data/examples/markdown/thinkpython/_page_109_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_116_Figure_3.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_127_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_127_Figure_1.png
Binary file modified data/examples/markdown/thinkpython/_page_128_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_128_Figure_1.png
Binary file modified data/examples/markdown/thinkpython/_page_167_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_169_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_190_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_195_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_205_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_205_Figure_1.png
Binary file modified data/examples/markdown/thinkpython/_page_230_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_230_Figure_1.png
Binary file modified data/examples/markdown/thinkpython/_page_233_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_233_Figure_1.png
Binary file modified data/examples/markdown/thinkpython/_page_233_Figure_3.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_233_Figure_3.png
Binary file modified data/examples/markdown/thinkpython/_page_234_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_235_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_236_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_236_Figure_1.png
Binary file modified data/examples/markdown/thinkpython/_page_236_Figure_3.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_237_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_238_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_23_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_23_Figure_1.png
Binary file modified data/examples/markdown/thinkpython/_page_23_Figure_3.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_33_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_33_Figure_1.png
Binary file modified data/examples/markdown/thinkpython/_page_46_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_60_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_60_Figure_3.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_67_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_71_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_71_Figure_1.png
Binary file modified data/examples/markdown/thinkpython/_page_78_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_78_Figure_1.png
Binary file modified data/examples/markdown/thinkpython/_page_85_Figure_1.jpeg
Binary file modified data/examples/markdown/thinkpython/_page_85_Figure_1.png
3,743 changes: 2,004 additions & 1,739 deletions data/examples/markdown/thinkpython/thinkpython.md

Large diffs are not rendered by default.

7,190 changes: 4,600 additions & 2,590 deletions data/examples/markdown/thinkpython/thinkpython_meta.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion marker/processors/llm/llm_table.py
@@ -233,7 +233,7 @@ def parse_html_table(self, html_text: str, block: Block, page: PageGroup) -> Lis
                 cell_polygon = PolygonBox.from_bbox(cell_bbox)

                 cell_obj = TableCell(
-                    text=cell_text,
+                    text_lines=[cell_text],
                     row_id=i,
                     col_id=cur_col,
                     rowspan=rowspan,
16 changes: 11 additions & 5 deletions marker/processors/table.py
@@ -112,7 +112,7 @@ def __call__(self, document: Document):

                     cell_block = TableCell(
                         polygon=cell_polygon,
-                        text=self.finalize_cell_text(cell),
+                        text_lines=self.finalize_cell_text(cell),
                         rowspan=cell.rowspan,
                         colspan=cell.colspan,
                         row_id=cell.row_id,
@@ -135,10 +135,16 @@ def __call__(self, document: Document):
                         page.structure.remove(child.id)

     def finalize_cell_text(self, cell: SuryaTableCell):
-        text = "\n".join([t["text"].strip() for t in cell.text_lines]) if cell.text_lines else ""
-        text = re.sub(r"(\s\.){2,}", "", text)  # Replace . . .
-        text = re.sub(r"\.{2,}", "", text)  # Replace ..., like in table of contents
-        return self.normalize_spaces(fix_text(text))
+        fixed_text = []
+        text_lines = cell.text_lines if cell.text_lines else []
+        for line in text_lines:
+            text = line["text"].strip()
+            if not text or text == ".":
+                continue
+            text = re.sub(r"(\s\.){2,}", "", text)  # Replace . . .
+            text = re.sub(r"\.{2,}", "", text)  # Replace ..., like in table of contents
+            fixed_text.append(self.normalize_spaces(fix_text(text)))
+        return fixed_text

     @staticmethod
     def normalize_spaces(text):
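The per-line cleanup introduced in `finalize_cell_text` can be tried in isolation. The following is only a minimal sketch of that logic, not the project's code: `finalize_cell_lines` is a hypothetical free function, the `normalize_spaces` body is an assumed stand-in (the diff does not show the real one), and the `fix_text` call is left out so the sketch needs only the standard library.

```python
import re
from typing import Dict, List, Optional


def normalize_spaces(text: str) -> str:
    # Assumed stand-in: collapse whitespace runs. The real helper's body is not in the diff.
    return re.sub(r"\s+", " ", text).strip()


def finalize_cell_lines(text_lines: Optional[List[Dict[str, str]]]) -> List[str]:
    """Clean each detected line separately and return a list, as the new code does."""
    fixed_text = []
    for line in text_lines or []:
        text = line["text"].strip()
        # Drop empty lines and stray single dots.
        if not text or text == ".":
            continue
        text = re.sub(r"(\s\.){2,}", "", text)  # remove spaced dot leaders: " . . ."
        text = re.sub(r"\.{2,}", "", text)      # remove dense dot leaders: "....."
        fixed_text.append(normalize_spaces(text))
    return fixed_text


lines = [
    {"text": " Chapter 1 . . . . 12 "},
    {"text": "."},
    {"text": ""},
    {"text": "Intro.....3"},
]
print(finalize_cell_lines(lines))
```

Returning a list instead of one newline-joined string is what lets the renderers downstream decide how to separate lines (`<br>` in HTML, for instance).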
5 changes: 4 additions & 1 deletion marker/renderers/markdown.py
@@ -16,6 +16,9 @@ def cleanup_text(full_text):
     full_text = re.sub(r'(\n\s){3,}', '\n\n', full_text)
     return full_text.strip()

+def get_text_with_br(element):
+    return ''.join(str(content) if content.name == 'br' else content.strip() for content in element.contents)
+

 class Markdownify(MarkdownConverter):
     def __init__(self, paginate_output, page_separator, inline_math_delimiters, block_math_delimiters, **kwargs):
@@ -78,7 +81,7 @@ def convert_table(self, el, text, convert_as_inline):
                     col_idx += 1

                 # Fill in grid
-                value = cell.get_text(strip=True).replace("\n", " ").replace("|", " ")
+                value = get_text_with_br(cell).replace("\n", " ").replace("|", " ")
                 rowspan = int(cell.get('rowspan', 1))
                 colspan = int(cell.get('colspan', 1))

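To see what `get_text_with_br` changes for the Markdown table renderer, here is a rough standard-library sketch of the same idea: keep `<br>` tags verbatim and strip the text around them. It swaps BeautifulSoup for `html.parser`, so `CellTextWithBr` and `cell_text_with_br` are hypothetical names, not part of marker. Unlike the helper in the diff, this sketch also passes through text inside nested inline tags, which is a choice made here rather than something the commit shows.

```python
from html.parser import HTMLParser


class CellTextWithBr(HTMLParser):
    """Collect stripped text nodes while keeping <br> tags verbatim."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only line breaks are preserved as markup; other tags are dropped.
        if tag == "br":
            self.parts.append("<br>")

    def handle_data(self, data):
        self.parts.append(data.strip())


def cell_text_with_br(cell_html: str) -> str:
    parser = CellTextWithBr()
    parser.feed(cell_html)
    parser.close()
    return "".join(parser.parts)


# Same post-processing as in convert_table: newlines and pipes become spaces.
value = cell_text_with_br("<td> first line <br> second line </td>")
value = value.replace("\n", " ").replace("|", " ")
print(value)
```

The previous `cell.get_text(strip=True)` call discarded `<br>` tags, so multi-line cells were fused into a single run of text in the Markdown output.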
13 changes: 11 additions & 2 deletions marker/schema/blocks/tablecell.py
@@ -1,3 +1,5 @@
+from typing import List
+
 from marker.schema import BlockTypes
 from marker.schema.blocks import Block

@@ -9,14 +11,21 @@ class TableCell(Block):
     row_id: int
     col_id: int
     is_header: bool
-    text: str = ""
+    text_lines: List[str] | None = None
     block_description: str = "A cell in a table."

+    @property
+    def text(self):
+        return "\n".join(self.text_lines)
+
     def assemble_html(self, document, child_blocks, parent_structure=None):
         tag_cls = "th" if self.is_header else "td"
         tag = f"<{tag_cls}"
         if self.rowspan > 1:
             tag += f" rowspan={self.rowspan}"
         if self.colspan > 1:
             tag += f" colspan={self.colspan}"
-        return f"{tag}>{self.text}</{tag_cls}>"
+        if self.text_lines is None:
+            self.text_lines = []
+        text = "<br>".join(self.text_lines)
+        return f"{tag}>{text}</{tag_cls}>"
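The new `TableCell` shape can be exercised with a small stand-alone model. This is a sketch under stated assumptions: `CellSketch` is a hypothetical dataclass, not the real `Block` subclass, and it carries only the fields the HTML assembly reads. One deliberate difference: the `text` property below guards against `text_lines` being `None`, since `"\n".join(None)` raises a `TypeError` and `None` is the field's default in the diff.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CellSketch:
    # Hypothetical stand-in for TableCell with only the fields used below.
    is_header: bool = False
    rowspan: int = 1
    colspan: int = 1
    text_lines: Optional[List[str]] = None

    @property
    def text(self) -> str:
        # Plain-text view: lines joined by newlines (None treated as no lines).
        return "\n".join(self.text_lines or [])

    def assemble_html(self) -> str:
        tag_cls = "th" if self.is_header else "td"
        tag = f"<{tag_cls}"
        if self.rowspan > 1:
            tag += f" rowspan={self.rowspan}"
        if self.colspan > 1:
            tag += f" colspan={self.colspan}"
        # HTML view: lines joined by <br> so line breaks survive rendering.
        body = "<br>".join(self.text_lines or [])
        return f"{tag}>{body}</{tag_cls}>"


cell = CellSketch(colspan=2, text_lines=["Total", "(USD)"])
print(cell.assemble_html())
print(repr(cell.text))
```

Keeping `text` as a derived property means callers that read the old string attribute keep working while the stored representation becomes a list of lines.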
