GitHub - harshravireddy/converting_scanned_document_to_TEI_XML

This research addresses information retrieval from historical documents, specifically those from the 18th and 19th centuries. Optical Character Recognition (OCR) is the foundation of this process: it converts scanned documents into machine-readable text. Tesseract, a widely used OCR engine, was employed in this study because it recognizes a large number of languages, which made it possible to extract text and metadata from the digitized historical materials. Standardized OCR-D processors were then used to convert the data from its original PDF format into structured TEI XML. This transformation is intended to keep the historical data preserved, with controlled access, for future research and analysis. The methodology offers a systematic approach to gathering and retrieving information from historical documents, in line with broader efforts in digitization and archival preservation.
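The final step described above, turning OCR output into TEI XML, can be illustrated with a minimal sketch. It is not the code from this repository: the function name `hocr_to_tei`, the helper names, the placeholder header text, and the sample input are all hypothetical, and it assumes hOCR in the XHTML namespace with the usual `ocr_page`, `ocr_par`, `ocr_line`, and `ocrx_word` classes. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Tiny hand-written hOCR sample (an assumption, not taken from the repository).
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" id="page_1">
<p class="ocr_par">
<span class="ocr_line"><span class="ocrx_word">Erste</span> <span class="ocrx_word">Zeile</span></span>
<span class="ocr_line"><span class="ocrx_word">Zweite</span></span>
</p>
</div>
</body></html>"""


def by_class(elem, cls):
    """Return all descendants of elem whose class attribute equals cls."""
    return [e for e in elem.iter() if e.get("class") == cls]


def tei(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def hocr_to_tei(hocr_text, title="Untitled"):
    """Convert one hOCR document (as a string) into a minimal TEI XML string."""
    hocr_root = ET.fromstring(hocr_text)
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

    root = ET.Element(tei("TEI"))
    # Minimal header: titleStmt, publicationStmt, sourceDesc (placeholder text).
    file_desc = ET.SubElement(ET.SubElement(root, tei("teiHeader")), tei("fileDesc"))
    ET.SubElement(ET.SubElement(file_desc, tei("titleStmt")), tei("title")).text = title
    ET.SubElement(ET.SubElement(file_desc, tei("publicationStmt")), tei("p")).text = "Unpublished"
    ET.SubElement(ET.SubElement(file_desc, tei("sourceDesc")), tei("p")).text = "Generated from hOCR"

    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    for page_no, page in enumerate(by_class(hocr_root, "ocr_page"), start=1):
        ET.SubElement(body, tei("pb"), n=str(page_no))  # page break marker
        for par in by_class(page, "ocr_par"):
            p = ET.SubElement(body, tei("p"))
            for line in by_class(par, "ocr_line"):
                words = ["".join(w.itertext()).strip() for w in by_class(line, "ocrx_word")]
                lb = ET.SubElement(p, tei("lb"))  # line break, followed by the line text
                lb.tail = " ".join(words)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(hocr_to_tei(SAMPLE_HOCR, title="Demo"))
```

The sketch keeps page and line breaks as `pb` and `lb` milestones so the TEI text can still be aligned with the scanned pages. A real conversion would also carry over bounding boxes, confidence values, and document metadata.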

The repository contains the following:

All 24 document folders are uploaded here.
Each document folder contains the processed pages produced by the OCR-D processing steps.
The final output folder contains the combined hOCR file, the Python code for generating TEI XML, and the TEI XML output file.
Python code for splitting a TIFF document into individual pages (splitting_Tiff_pages.ipynb) and for combining hOCR files (combining_hocr.py) is also included in the main branch.
The repository also includes our report document.
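As an illustration of the hOCR combining step listed above, here is a minimal sketch that merges per-page hOCR documents into one. It is not the contents of combining_hocr.py, which may work differently. The function name `combine_hocr` and the `page_N` id scheme are assumptions, and the input is assumed to be hOCR strings in the XHTML namespace with one or more `ocr_page` elements directly under `body`.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def combine_hocr(page_docs):
    """Merge a list of hOCR documents (strings) into a single hOCR string.

    The first document provides the head and body; the ocr_page elements of
    all later documents are appended to that body in order.
    """
    if not page_docs:
        raise ValueError("need at least one hOCR document")

    ET.register_namespace("", XHTML_NS)  # keep XHTML as the default namespace
    combined = ET.fromstring(page_docs[0])
    body = combined.find(f"{{{XHTML_NS}}}body")
    if body is None:
        raise ValueError("first document has no XHTML <body>")

    for doc in page_docs[1:]:
        other_body = ET.fromstring(doc).find(f"{{{XHTML_NS}}}body")
        if other_body is None:
            continue  # skip documents without a body
        for child in list(other_body):
            if child.get("class") == "ocr_page":
                body.append(child)

    # Renumber page ids so they stay unique in the merged file (assumed scheme).
    pages = [child for child in body if child.get("class") == "ocr_page"]
    for number, page in enumerate(pages, start=1):
        page.set("id", f"page_{number}")

    return ET.tostring(combined, encoding="unicode")
```

In practice each input string would be read from one page's hOCR file, in page order, and the result written back out as the combined file.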

Folders and files:

141739
141748
141813
141903
141951
142025
142137
142207
142317
142440
142539
142640
142731
142806
142856-2-2
142856_1-2
143122
143209
Feinpolier
Muster_0002
Test_BIBB_Fraktur
deu_damenschiderin
edelmetallpruefer_1938_pruefungsanforderungen
werkgehilfin_1937_pruefungsanforderungen_Doc
Converting__scanned_documents_to_TEI_XML Report.pdf
README.md
combining_hocr.py
splitting_Tiff_pages.ipynb
