New LaTeX OCR model; block visualizer; better links/references
Improved LaTeX OCR
We trained a new LaTeX OCR model that performs noticeably better overall. It reliably outputs KaTeX-compatible math and handles longer sequences than before.
The rendered output is on the right, and the original document is on the left:
![image](https://private-user-images.githubusercontent.com/913340/407822592-a3158fd5-a027-4798-a58e-bf8e30af8d42.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxODA5MjMsIm5iZiI6MTczOTE4MDYyMywicGF0aCI6Ii85MTMzNDAvNDA3ODIyNTkyLWEzMTU4ZmQ1LWEwMjctNDc5OC1hNThlLWJmOGUzMGFmOGQ0Mi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMFQwOTQzNDNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1jNTQwYjdmNmU0OTFjNzA4MGVjZTY3YzQzMDQyNTM3M2UxODE2NWFlNGY0ZmM4ZmVmNmQ0ZjExMGI2NzQ3OTQ5JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.fQv-C1EWwUDMP97G3L2zdJmivnx7xCPReThE2J1ZT5s)
Block visualization
You can now visualize blocks in the Streamlit app, thanks to @jazzido. Select JSON output and check "show blocks" to get a visualization of how marker parsed the page. Clicking on a block shows its HTML.
![image](https://private-user-images.githubusercontent.com/913340/407823086-04c83792-a6a8-429b-b596-5124e8a6b9c1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxODA5MjMsIm5iZiI6MTczOTE4MDYyMywicGF0aCI6Ii85MTMzNDAvNDA3ODIzMDg2LTA0YzgzNzkyLWE2YTgtNDI5Yi1iNTk2LTUxMjRlOGE2YjljMS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMFQwOTQzNDNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT02NjNjODEyNTZiMDYwYmFkYmVkMzRlYTJkZDUzMTFlZDc4OWFiZGRjYzhkZGJlOGYzMjRhZTkxODRhZTQxOWJmJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.H2RDx7N8IbHTxsBS7vT470MywvCR1CWGxDqRmVuVbS4)
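If you want to inspect the same block structure outside the app, a small tree walk over the JSON output is usually enough. The sketch below is illustrative only: it assumes each block is a dict with `block_type`, `html`, and `children` keys, and the sample page is made up for the example rather than taken from real marker output.

```python
from typing import Any, Dict, Iterator, List, Tuple

Block = Dict[str, Any]


def iter_blocks(block: Block, depth: int = 0) -> Iterator[Tuple[int, Block]]:
    """Yield (depth, block) pairs in document order, parents before children."""
    yield depth, block
    # Assumption: leaf blocks carry either no "children" key or a null value.
    for child in block.get("children") or []:
        yield from iter_blocks(child, depth + 1)


def outline(root: Block) -> List[str]:
    """Return one indented line per block, showing only its type."""
    return [("  " * depth) + b.get("block_type", "?") for depth, b in iter_blocks(root)]


# Hypothetical sample data; field names and values are assumptions.
sample_page: Block = {
    "block_type": "Page",
    "html": "",
    "children": [
        {"block_type": "SectionHeader", "html": "<h1>Title</h1>", "children": None},
        {"block_type": "Text", "html": "<p>Hello</p>", "children": None},
    ],
}

if __name__ == "__main__":
    print("\n".join(outline(sample_page)))
```

Printing each block's `html` field instead of its type gives roughly what clicking a block in the visualizer shows.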
Links and references
We fixed a bug with links and references; they now render as one block. You can see the extracted references here:
![image](https://private-user-images.githubusercontent.com/913340/407824416-109a289d-5fd2-4cbb-bf7c-903a32581d51.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxODA5MjMsIm5iZiI6MTczOTE4MDYyMywicGF0aCI6Ii85MTMzNDAvNDA3ODI0NDE2LTEwOWEyODlkLTVmZDItNGNiYi1iZjdjLTkwM2EzMjU4MWQ1MS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMFQwOTQzNDNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0xMjA5MzQzNmMxODVkMTk4NzBlZmU5NTc2OGUyNjQ1MDhhOWI1NjQ5NzA2OWExMmIzYzY5MmI3ODlkY2QyZDQ0JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.DwVdG3goNdNIOrmqiIB3BKcKhV2cz77jn_2Z5W7ViK4)
Misc bugfixes
- Fixed some bugs with tables and row splitting
- Escaped $ inside text and tables so we don't accidentally render things as equations
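To illustrate the dollar-sign fix: in Markdown rendered with a math extension, a bare `$` can open an inline equation, so literal dollar signs need a backslash. The snippet below is a minimal sketch of that idea, not marker's actual implementation; the function name `escape_dollars` is hypothetical.

```python
import re

# Match a dollar sign that is not directly preceded by a backslash.
_UNESCAPED_DOLLAR = re.compile(r"(?<!\\)\$")


def escape_dollars(text: str) -> str:
    """Backslash-escape literal dollar signs so they are not read as math delimiters.

    Simplification: a dollar sign that follows any backslash is treated as
    already escaped, including the rare case of an escaped backslash.
    """
    return _UNESCAPED_DOLLAR.sub(r"\\$", text)


if __name__ == "__main__":
    print(escape_dollars("Tickets cost $5 for members and $10 otherwise."))
```

Skipping dollar signs that are already escaped keeps the function safe to apply more than once to the same text.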
What's Changed
- [streamlit_app] Visualize extracted blocks by @jazzido in #502
- Texify by @VikParuchuri in #513
- Update texify by @VikParuchuri in #514
New Contributors
Full Changelog: v1.3.2...v1.3.3