The Unified Corpus Explorer (UCE) Search Portal and Importer.

Unified Corpus Explorer

Making UIMA-annotated corpora tangible, searchable and vivid.


We introduce the Unified Corpus Explorer (UCE), a standardized, dockerized, and dynamic Natural Language Processing (NLP) application designed for flexible and scalable corpus navigation. UCE takes NLP annotations in the UIMA format as its standardized input and builds its interfaces and features around those annotations, adapting dynamically to each corpus and the annotations extracted from it.

Quick Start

Clone this repository:

git clone https://github.com/texttechnologylab/UCE.git

Start the docker containers:

docker-compose up

By default, the web instance is reachable at http://localhost:8008. If you're looking for a small demo without setting one up yourself, please check our open demo.
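
To confirm that the containers came up, a quick check along the following lines may help. It is only a minimal sketch: it assumes the default address mentioned above, and the helper name `is_reachable` is hypothetical and not part of UCE.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP server answers at `url` (with any status code)."""
    # Bypass any configured proxy, since we are probing a local address.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


if __name__ == "__main__":
    # Default address from the Quick Start; adjust if you changed the port.
    print(is_reachable("http://localhost:8008"))
```

The containers may need some time to start, so a `False` right after `docker-compose up` does not necessarily indicate a problem.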

We are currently preparing a dedicated documentation page, which will be up soon and will explain the configuration and the customization of UCE in more detail.

About

UCE is customizable in terms of annotations imported, corporate identity used, and background information added. It allows the creation of a specific UCE instance for your project, regardless of the domain. It does so by utilizing UIMA-annotated corpora, with the primary tool for creating those being the Docker Unified UIMA Interface (DUUI). Hence, you would gather your corpus, use DUUI to annotate whatever you want to annotate, and finally import those annotations into UCE to host them.

Microservices

UCE consists of several microservices, each dockerized and built on distinct technologies, as outlined below:


Microservice | Description
A: Corpus-Importer | UCE is based on Corpus-Importer, a Java application that reads UIMA-annotated documents from a specified path, along with a corresponding corpus-configuration JSON file. The importer extracts the raw data and the configured annotations, applying its own post-processing to set up the environment, which includes text segmentation, database indexing, keyword extraction, and the creation of various embedding spaces, before finally storing each processed document in a PostgreSQL database (B).
B: Relational Database | As our primary database, we opted for a relational PostgreSQL database, as UCE requires a structured and standardized database schema that can be extended if necessary. Additionally, its compatibility with the pgvector extension enables efficient vector operations directly within the database engine. This allows us to store high-dimensional vector embeddings within relational data tables while also enabling fast vector operations and searches.
C: Graph Database | In addition to a relational database (B), UCE utilizes an Apache Jena SPARQL database to incorporate basic semantic searches in the Resource Description Framework (RDF) and Web Ontology Language (OWL) data formats. This integration enables the incorporation of domain-specific ontologies (e.g., biological taxonomy) into the UCE environment, further enriching its search capabilities.
D: Python Webserver | Within UCE, we also utilize a Python web service to provide an interface to machine learning and AI models, as these are primarily accessible through Python. In this context, the web server facilitates access to the generation of embedding vectors, their dimensionality reduction methods, such as t-SNE and PCA, and the inference of (Large) Language Models. The web server is accessible via a REST API and is utilized by services (A) and (E).
E: UCE Web Portal | The user interacts with UCE and all of its features through a web portal implemented in Java. This service communicates with all other services except for (B), providing a variety of search methods, visualization features, and different ways to interact with the underlying information units, as outlined in detail in Section 3.2.
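
The vector search that pgvector enables in service (B) can be pictured with a small, database-free sketch. pgvector's `<=>` operator computes cosine distance, and the pure-Python code below ranks toy embeddings by that same measure. The table and column names in the SQL comment, the document IDs, and the three-dimensional vectors are all hypothetical and do not reflect UCE's actual schema.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance (1 - cosine similarity), the measure behind pgvector's `<=>`."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, embeddings, k=2):
    """Return the ids of the k embeddings closest to `query`."""
    ranked = sorted(
        embeddings,
        key=lambda doc_id: cosine_distance(query, embeddings[doc_id]),
    )
    return ranked[:k]


# Roughly equivalent SQL with pgvector (hypothetical table and column names):
#   SELECT id FROM documents ORDER BY embedding <=> '[1, 0, 0]' LIMIT 2;
toy_embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
print(nearest([1.0, 0.0, 0.0], toy_embeddings))  # ['doc_a', 'doc_c']
```

Running the ranking inside the database, as pgvector does, avoids moving every embedding into application memory, which is presumably part of why it suits a setup like this one.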

In Medias Res

Some, but not all of the search and visualization features within UCE:

Annotations

Currently supported annotations within UCE are outlined in the following table:

Annotation | Description
Sentence | Divides the documents into their respective sentences.
Named-Entity | Extracts named entities from a document, categorizing them into four types: organization (ORG), person (PER), location (LOC), and miscellaneous (MISC).
Lemma and POS | Lemmatization reduces inflected words to their root form. Within UCE, searches are enhanced by considering these root forms.
Semantic Role Labels (SRL) | SRL identifies semantic relations between the lexical constituents of a sentence, assigning labels to words or phrases that indicate their semantic roles, such as agent, goal, or result.
Time | Extracts temporal expressions, including time and date formats, from a document, analogous to Named-Entity Recognition tasks.
Taxon | Recognizes taxa, i.e., unambiguous names of biological entities.
WikiLinks | Maps potential words and phrases to their corresponding Wikidata URLs, facilitating the retrieval and access of additional information.
OCR | Since much of the literature has yet to be digitized, UCE provides support for corpora containing documents that have undergone Optical Character Recognition (OCR) extraction. These annotations assist in reconstructing the physical layout of the pages within UCE.
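
To illustrate how lemma annotations can enhance search, the following minimal sketch builds an inverted index keyed on lemmas, so that a query for one inflected form also finds the others. The (token, lemma) pairs stand in for what a UIMA lemmatizer would provide. The data, the document IDs, and the `build_lemma_index` and `search` helpers are hypothetical and are not UCE's implementation.

```python
from collections import defaultdict

# Hypothetical pre-annotated documents: (surface token, lemma) pairs.
annotated_docs = {
    "doc_1": [("Birds", "bird"), ("were", "be"), ("singing", "sing")],
    "doc_2": [("A", "a"), ("bird", "bird"), ("sang", "sing")],
    "doc_3": [("Trees", "tree"), ("grow", "grow")],
}


def build_lemma_index(docs):
    """Map each lemma to the set of document ids containing it."""
    index = defaultdict(set)
    form_to_lemma = {}
    for doc_id, tokens in docs.items():
        for token, lemma in tokens:
            index[lemma].add(doc_id)
            # Remember which lemma each surface form belongs to.
            form_to_lemma[token.lower()] = lemma
    return index, form_to_lemma


def search(term, index, form_to_lemma):
    """Resolve the query term to its lemma, then look up matching documents."""
    lemma = form_to_lemma.get(term.lower(), term.lower())
    return sorted(index.get(lemma, set()))


index, form_to_lemma = build_lemma_index(annotated_docs)
print(search("sang", index, form_to_lemma))  # ['doc_1', 'doc_2']
```

Here the query "sang" also retrieves the document containing "singing", because both forms share the lemma "sing".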