Yet Another Document 2 Text for pdf/doc/html/rtf/etc - Extract text - or - convert to simplified HTML to retain layout information

aledbetter/yadoc2text

Yet Another Document 2 Text

Extract Text or simplified HTML

This utility extracts text, or text plus some structural information, from documents
so that the content can be processed further. The general use case is NLP / NLU, where the
document's structural information is needed to add semantic context to the content.
The plain-text output is the same as the simplified HTML output, just without the HTML tags.
OCR is not supported; this project does not (currently) work with images.

Build and run locally

1. Go to the base directory of the branch; building from there builds the package with everything in it.
2. Install Maven if it is not already installed: brew install maven
3. mvn clean
4. mvn install
5. cd web
6. mvn jetty:run
   To run on a different port: mvn jetty:run -Djetty.port=8099
   old: mvn jetty:run -Dhttp.port=8099
   old: mvn jetty:run -Djetty.http.port=8099
7. The index page is a test page for conversion (it could use some additions).

Supported Document Types

Word: .doc, .docx, .dot
PDF: .pdf
HTML: .html, .htm, .mht
Text: .text, .txt
Rich text: .rtf
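
As a rough illustration of the list above, a client could check a file's extension before submitting it for conversion. This is only a sketch: the group names and the `document_type` helper are hypothetical and not part of yadoc2text; only the extensions themselves come from the list.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported-types list; the group keys are
# illustrative names, not identifiers used by yadoc2text.
SUPPORTED_EXTENSIONS = {
    "word": {".doc", ".docx", ".dot"},
    "pdf": {".pdf"},
    "html": {".html", ".htm", ".mht"},
    "text": {".text", ".txt"},
    "richtext": {".rtf"},
}


def document_type(filename: str) -> Optional[str]:
    """Return the document group for a filename, or None if unsupported."""
    suffix = Path(filename).suffix.lower()
    for group, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return group
    return None
```

Matching is done on the lowercased extension only, so an image file or a file without an extension is simply reported as unsupported.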

Converted file html tags

Title: <title>
Headings: <h1>, <h2>, <h3>...<hn>
Text: <b>, <u>, <i>
Structure: <p>, <header>, <footer>
Lists: <ol>, <ul>, <li>
Sections: <section>, <article> (TODO)
Tables: TBD
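
For downstream NLP work, the tags above can be read with any HTML parser. The sketch below uses Python's standard `html.parser` to collect the headings and to produce a plain-text rendering with one line per block element. It assumes the converted output uses the tags listed above; the `OutlineParser` class, the `outline` function, and the sample input are hypothetical and are not part of yadoc2text.

```python
import re
from html.parser import HTMLParser

# Matches h1, h2, ... hn but not "head", "header" or "html".
HEADING = re.compile(r"^h([1-9][0-9]*)$")
# Block-level tags from the list above; a line break is emitted when they close.
BLOCK_TAGS = {"title", "p", "li", "header", "footer", "section", "article"}


class OutlineParser(HTMLParser):
    """Collect (level, text) headings and the raw text of a document."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.text_parts = []
        self._level = None   # level of the heading currently open, if any
        self._buffer = []    # text collected inside that heading

    def handle_starttag(self, tag, attrs):
        match = HEADING.match(tag)
        if match:
            self._level = int(match.group(1))
            self._buffer = []

    def handle_endtag(self, tag):
        is_heading = HEADING.match(tag) is not None
        if is_heading and self._level is not None:
            self.headings.append((self._level, "".join(self._buffer).strip()))
            self._level = None
        if is_heading or tag in BLOCK_TAGS:
            self.text_parts.append("\n")

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)
        self.text_parts.append(data)


def outline(html):
    """Return (headings, plain_text) for a simplified HTML string."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.text_parts).split("\n"))
    return parser.headings, "\n".join(line for line in lines if line)


sample = "<h1>Moby <b>Dick</b></h1><p>Call me <i>Ishmael</i>.</p><h2>Loomings</h2>"
headings, text = outline(sample)
```

Inline tags such as `<b>`, `<u>` and `<i>` are dropped without adding whitespace, so emphasized words stay attached to the surrounding sentence.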

Converted meta info

Document Type:
	<meta name="doc-type" content="html">
Original Document Name:
	<meta name="doc-name" content="test.html">
Created Time:
	<meta name="doc-created" content="xxxxx">
Modified Time:
	<meta name="doc-modified" content="xxx">
Author:
	<meta name="doc-author" content="Bober Simthsonsons">
Language:
	<meta name="doc-language" content="en">
URL:
	<meta name="doc-url" content="http://www.sample.com/moby-dick.html">
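
The meta tags above can be read back into a dictionary with a few lines of standard-library Python. This is a minimal sketch, assuming the converted output carries its metadata in `<meta name="doc-..." content="...">` tags as shown; the `DocMetaParser` class and `read_doc_meta` function are hypothetical names, not part of yadoc2text.

```python
from html.parser import HTMLParser


class DocMetaParser(HTMLParser):
    """Collect <meta> tags whose name starts with "doc-"."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("doc-"):
            # A missing content attribute is stored as an empty string.
            self.meta[name] = attributes.get("content") or ""


def read_doc_meta(html):
    """Return a dict of doc-* meta names to their content values."""
    parser = DocMetaParser()
    parser.feed(html)
    parser.close()
    return parser.meta


sample = (
    '<meta name="doc-type" content="html">'
    '<meta name="doc-name" content="test.html">'
    '<meta name="doc-language" content="en">'
    '<meta name="viewport" content="width=device-width">'
)
meta = read_doc_meta(sample)
```

Meta tags outside the `doc-` namespace, such as the `viewport` tag in the sample, are ignored.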
