Home

Classification and extraction of information from ETD Documents

Long Title:

Identifying methods to accurately extract information (citations, tables, figures, etc.) from ETDs and to classify the ETDs appropriately under the ProQuest subject category classification system.

List of People (and contact information):

Bill Ingram
Bipasha Banerjee
Palakh Mignonne Jude
John Aromando
Sampanna Yashwant Kahu

Deliverables:

A diverse collection of ETDs, as well as a service built on top of it for extracting information from the ETDs and classifying them.

Approach:

Our approach to parsing citations from ETDs is to research and use state-of-the-art NLP tools that 1) already aim to accomplish the same goal, and 2) provide information that can be used as features to “define” the context of citations (dependency and semantic parsers, word embeddings).

Our approach to figure, table, and caption extraction will involve researching current state-of-the-art tools that pursue the same goal and evaluating their performance on our dataset of ETDs. We will also try to improve the current state-of-the-art model by identifying the instances where it fails.

Our approach to classification will involve dropping the topmost level of the ProQuest subject categories while keeping the next two levels. We will train a neural network using the ETDs' metadata and abstracts, and we will test whether adding full-text data improves classification.
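The label preparation described above can be sketched as follows. This is only an illustration under stated assumptions: it supposes each ETD's subject category is available as an ordered path from the topmost level downward, and the function name `target_label`, the `keep_levels` parameter, and the placeholder category names are hypothetical, not part of the ProQuest scheme or of our pipeline.

```python
from typing import List, Tuple


def target_label(path: List[str], keep_levels: int = 2) -> Tuple[str, ...]:
    """Drop the topmost level of a category path and keep the next levels.

    `path` is assumed to be ordered from the topmost category downward.
    Paths shorter than expected yield a shorter (possibly empty) label.
    """
    below_top = path[1:]  # discard the topmost level
    return tuple(below_top[:keep_levels])  # keep at most `keep_levels` levels


# Placeholder path; these are NOT real ProQuest category names.
example_path = ["Level-1 area", "Level-2 field", "Level-3 subject", "Level-4 specialty"]
print(target_label(example_path))
```

Returning a tuple keeps the two retained levels separate, so the same labels could feed either a flat classifier (by joining them) or a hierarchical one.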

Related Projects:

Big Data Text Summarization (Fall 2018):
- Team 17
- Team 10
- Team 16
Ashish Baghudana's text summarization project
Neural-ParsCit
Deepfigures-open GitHub

Description:

Many existing techniques for processing digital documents do not extend well to book-length documents such as theses and dissertations. Thus, there is a need to develop techniques capable of extracting information from book-length documents.

Our project will consist of three areas:

Citation Parsing: As part of the project, we aim to accurately extract citations from ETDs using various NLP tools. Furthermore, we aim to identify particular pieces of information within the citations, such as the author names. Ideally, we hope to use and adapt Neural-ParsCit to accomplish these tasks.
Figure and caption extraction: As part of the project, we aim to accurately extract the figures, tables, and corresponding captions from our collection of ETDs. Ideally, we hope to use and adapt DeepFigures to accomplish this task.
Categorization: As part of the project, we aim to perform multi-class classification of ETDs using the ProQuest subject categories as the target classification system.
