Home
Identifying methods to accurately extract information (citations, tables, figures, etc.) from electronic theses and dissertations (ETDs) and to classify the ETDs appropriately under the ProQuest subject category classification system.
- Bill Ingram
- Bipasha Banerjee
- Palakh Mignonne Jude
- John Aromando
- Sampanna Yashwant Kahu
A diverse collection of ETDs, together with a service built on top of it for extracting information from the ETDs and classifying them.
Our approach to parsing citations from ETDs is to research and use state-of-the-art NLP tools that either (1) already aim to accomplish the same goal, or (2) provide information that can be used as features to "define" the context of citations (dependency parsers, semantic parsers, and word embeddings).
Our approach to figure, table, and caption extraction is to research current state-of-the-art tools that pursue the same goal and to evaluate their performance on our dataset of ETDs. We will also try to improve the best-performing model by identifying the cases in which it fails.
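One possible way to find the cases where an extraction tool fails is to compare its predicted bounding boxes against hand-labeled ones. The sketch below assumes boxes given as `(x1, y1, x2, y2)` tuples and uses intersection over union (IoU) with a threshold of 0.8; the metric, the threshold, and the function names `iou` and `missed_figures` are illustrative assumptions, not something the project has settled on.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def missed_figures(truth: List[Box], predicted: List[Box],
                   threshold: float = 0.8) -> List[Box]:
    """Return the labeled boxes that no prediction overlaps well enough.

    The threshold is a hypothetical default chosen for illustration.
    """
    return [t for t in truth
            if all(iou(t, p) < threshold for p in predicted)]
```

Running `missed_figures` over each labeled page would give a list of failure cases to inspect by hand, for example figures that span a full page or that sit in scanned documents.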
Our approach to classification is to drop the topmost level of the ProQuest subject categories while keeping the next two levels. We will train a neural network on the ETD metadata and abstracts, and we will test whether adding full-text data improves classification.
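The label construction described above can be sketched as follows. The sketch assumes each ETD's subject category is available as a list of level names ordered from the topmost level down; that representation, the function name `proquest_label`, the `" / "` separator, and the choice to return `None` for paths with fewer than three levels are all illustrative assumptions.

```python
from typing import List, Optional


def proquest_label(path: List[str]) -> Optional[str]:
    """Drop the topmost category level and keep the next two as the label.

    `path` is assumed to list category levels from the top down.
    Returns None when the path has fewer than three levels, so such
    records can be handled separately (a hypothetical policy).
    """
    kept = path[1:3]  # skip level 1, keep levels 2 and 3
    if len(kept) < 2:
        return None
    return " / ".join(kept)


# Placeholder level names, not real ProQuest categories.
label = proquest_label(["Level1", "Level2", "Level3", "Level4"])
```

Here `label` is `"Level2 / Level3"`. Levels below the third are discarded, so ETDs that differ only in deeper subcategories map to the same class.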
- Big Data Text Summarization (Fall 2018):
- Ashish Baghudana's text summarization project
- Neural-ParsCit
- Deepfigures-open GitHub
Many existing techniques for processing digital documents do not extend well to book-length documents such as theses and dissertations. There is therefore a need for techniques that can extract information from book-length documents.
- Citation Parsing: As part of the project, we aim to accurately extract citations from ETDs using various NLP tools. We also aim to identify particular pieces of information within the citations, such as the author names. Ideally, we hope to use and adapt Neural-ParsCit to accomplish these tasks.
- Figure and caption extraction: As part of the project, we aim to accurately extract the figures, tables, and corresponding captions from our collection of ETDs. Ideally, we hope to use and adapt DeepFigures to accomplish this task.
- Categorization: As part of the project, we aim to perform multi-class classification of ETDs using the ProQuest subject categories as the target classification system.
Virginia Tech collection of ETDs
- Slack team: ETD_VT_DLRL
- GitHub: https://github.com/waingram/CS6604-ETD