add documentation for transform and visualise
marcolarosa committed Jul 15, 2024
1 parent b6e5f88 commit 41ed6f7
Showing 19 changed files with 203 additions and 2 deletions.
1 change: 1 addition & 0 deletions src/docs/guide/assistant-supported-discovery.md
@@ -1,5 +1,6 @@
---
title: Assistant enabled e-discovery
aside: false
---

::: tip Info
105 changes: 104 additions & 1 deletion src/docs/guide/transforming-content.md
@@ -1,7 +1,110 @@
---
title: Transforming Content
aside: false
---

::: tip Info

The tools in this section require a [registration with describo.cloud](/docs/guide/register) and
credits for the assistant, text extraction and named entity recognition capabilities.

:::

# Transforming Content

Coming soon.
For projects dealing with digitised content, a common requirement is the ability to perform bulk
transformations on the raw data. This can include converting images into formats suitable for
display on the web (e.g. creating thumbnails; converting TIFF images to jpeg or webp) as well as
performing content extraction (e.g. performing named entity recognition and extracting themes,
topics and subjects). The Describo Transform section provides tools to enable this work.

When you open the transform section, you will see a file browser in the left panel. Select the files
and folders that you want to process. Describo will traverse the folder structure looking for images
and text files (specifically, tiff, jpg, png, txt and html files). If any are found they will be
selected and added to the context.
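
The traversal described above can be sketched roughly as follows. This is an illustrative sketch only, not Describo's actual implementation: the function name `discover_files` and the grouping into `images` and `text` are assumptions, and only the extensions listed above are matched.

```python
from pathlib import Path

# Extensions taken from the list above. Variant spellings such as ".tif" or
# ".jpeg" are deliberately not handled here because the guide does not mention them.
IMAGE_EXTENSIONS = {".tiff", ".jpg", ".png"}
TEXT_EXTENSIONS = {".txt", ".html"}


def discover_files(root: Path) -> dict[str, list[Path]]:
    """Walk `root` recursively and group supported files by kind (hypothetical helper)."""
    found: dict[str, list[Path]] = {"images": [], "text": []}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()  # match extensions case-insensitively
        if suffix in IMAGE_EXTENSIONS:
            found["images"].append(path)
        elif suffix in TEXT_EXTENSIONS:
            found["text"].append(path)
    return found
```

Anything that is not an image or text file in the sense above (for example a PDF) is simply skipped.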

Then, controls will be shown in the right hand panel to process the different file types. In the
following image we can see the credits available for the assistant, text extraction and entity
recognition followed by controls to handle processing the different file types discovered.

<ImageComponent src="/images/guide-transform/transform1.webp"></ImageComponent>

## Image Processing tasks

### Create Image Thumbnails and Webformats

Setting this toggle to true will enable the production of thumbnails and webformats. You can also
adjust the size of the thumbnail here, with an actual-size preview displayed to the right.

::: tip Info

- Thumbnails are named as the source file and stored in a folder named `thumbnails` adjacent to
the image.
- Webformats are named as the source file (with a different extension) and stored in a folder
named `webformats` adjacent to the image.

:::

When a tiff image is encountered, both a jpg and a webp are produced. The jpg is required to send
the contents for text extraction and the webp is for display on the web. All other formats produce
webp only. Webp is supported by all modern browsers and is well suited to web content delivery.
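
As a rough illustration of the naming rules above, the following sketch computes where the derivatives of an image would land. The helper name `derived_paths` is hypothetical, and two details are assumptions rather than documented behaviour: that the thumbnail keeps the full source file name, and that the intermediate jpg is stored alongside the webp in `webformats`.

```python
from pathlib import Path


def derived_paths(source: Path) -> dict[str, Path]:
    """Return the assumed output locations for the derivatives of `source`."""
    folder = source.parent
    outputs = {
        # Assumption: the thumbnail keeps the same file name as the source.
        "thumbnail": folder / "thumbnails" / source.name,
        # Webformats keep the base name but change the extension.
        "webp": folder / "webformats" / source.with_suffix(".webp").name,
    }
    if source.suffix.lower() == ".tiff":
        # Only tiff sources also get a jpg (used for text extraction).
        # Assumption: it is written to the same `webformats` folder.
        outputs["jpg"] = folder / "webformats" / source.with_suffix(".jpg").name
    return outputs
```

For a source such as `scans/page1.tiff` this yields paths under `scans/thumbnails` and `scans/webformats`; for a png source no jpg entry is produced.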

### Extract Text

Enabling this will perform text extraction on each image.

::: tip Info

- The extracted text is written to an HTML file named as the source file and stored in a folder
named `transcriptions` adjacent to the source image.

:::

#### Perform named entity recognition

If enabled, the extracted text will also be run through the named entity recognition tools. When
named entity recognition runs, the HTML transcription file is marked up with data attributes and the
marked up entities are set as unconfirmed. In the transcribe section there are controls to review
the discovered entities and mark them as confirmed.

#### Automatically confirm recognised entities

Enabling this will automatically confirm all named entities. In addition to marking them up in the
HTML transcription file, the entities will be written into the metadata file against the original
source file.

So, for example, if a person named "Jane Doe" is discovered, an entity of type person will be
automatically created and associated with the source file via the "mentions" property.
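
To make the "Jane Doe" example concrete, here is a minimal sketch of what the resulting linked data could look like, written as RO-Crate style JSON-LD held in Python dictionaries. The identifiers (`letters/page1.tiff`, `#jane-doe`) and the helper `add_mention` are invented for illustration; the exact identifiers and structure Describo writes may differ.

```python
import json


def add_mention(graph: list[dict], file_id: str, entity: dict) -> None:
    """Add `entity` to the graph and link it from the file via `mentions`."""
    # Only add the entity once, even if it is mentioned repeatedly.
    if not any(node["@id"] == entity["@id"] for node in graph):
        graph.append(entity)
    file_node = next(node for node in graph if node["@id"] == file_id)
    mentions = file_node.setdefault("mentions", [])
    reference = {"@id": entity["@id"]}
    if reference not in mentions:
        mentions.append(reference)


# Hypothetical source file entry and a hypothetical recognised person.
graph = [{"@id": "letters/page1.tiff", "@type": "File"}]
jane = {"@id": "#jane-doe", "@type": "Person", "name": "Jane Doe"}

add_mention(graph, "letters/page1.tiff", jane)
print(json.dumps(graph, indent=2))
```

The file entry ends up with a `mentions` property that references the person by `@id`, which is the kind of link the visualise section draws as an edge.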

#### Perform topic, theme and subject extraction

If enabled, the assistant will extract the top 3 topics, themes and subjects from the extracted text
and write those into the metadata against that file.

## Text file processing tasks

### Perform named entity recognition

If enabled, the text will be run through the named entity recognition tools. As Describo does not
alter the original file, the data cannot be marked up there. Furthermore, it doesn't make sense to
produce a linked HTML version of that content, so in these cases the entities will be written into
the metadata file and linked to the source file.

### Perform topic, theme and subject extraction

If enabled, the assistant will extract the top 3 topics, themes and subjects from the text and
write those into the metadata against that file.

::: info Question

Why are PDF and Microsoft Word documents not available for processing?

By definition, these formats are multiple pages long so processing them as one large blob of text
would just result in a metadata entry with potentially thousands of entities linked from it.

In future, it is envisaged that these tools will evolve to treat those files as a set of linked
pages so that entity recognition and topic, theme and subject extraction operate at the page level,
thus producing a linked data structure flowing through the document.

:::
99 changes: 98 additions & 1 deletion src/docs/guide/visualising-the-structure.md
@@ -1,7 +1,104 @@
---
title: Visualising the Linked Data Structure
aside: false
---

# Visualising the Linked Data Structure

Coming soon.
::: tip Info

Apart from producing nice visualisations, this section of Describo goes hand in hand with the
[Discover](/docs/guide/assistant-supported-discovery) section in your digital discovery process.
That is, you can use Discover to deep dive into the data for specific information that you are
interested in or know exists. Or, you can use these tools to get an overview of what you actually
have as the first part of the process of understanding your data.

:::

## Introduction

Describo is built around the creation and management of linked data.

::: tip Info

Linked data is defined in Wikipedia
[...as structured data which is interlinked with other data..](https://en.wikipedia.org/wiki/Linked_data)
(there is a lot more to it than that but for our purposes right now, that's all we need).

:::

The following image shows what that means.

It contains two nodes (circles) that represent two interlinked entities. One is an entity of type
`CreativeWork` named `ro-crate-metadata.json`. The other is of type `Dataset` and named
`My Research Object Crate`. The two nodes (or entities; the terms are used interchangeably) are
linked to each other via the property `about`. This example tells us that the `CreativeWork` is
about the interlinked `Dataset`.

<ImageComponent src="/images/guide-visualise/visualise1.webp"></ImageComponent>

In this way, we can model complex relationships between named entities and then go on to describe
them further.
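
The two-node example above can also be expressed in code. The sketch below assumes the metadata is an RO-Crate style JSON-LD `@graph` (the `"./"` identifier for the dataset follows the usual RO-Crate convention) and uses a hypothetical helper, `extract_edges`, to turn property references into edges. It illustrates the idea only and is not how Describo builds its visualisation.

```python
# Two entities linked by the `about` property, as in the image.
graph = [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "name": "My Research Object Crate"},
]


def extract_edges(graph: list[dict]) -> list[tuple[str, str, str]]:
    """Return (source id, property, target id) for every link between entities."""
    ids = {node["@id"] for node in graph}
    edges = []
    for node in graph:
        for prop, value in node.items():
            if prop.startswith("@"):
                continue  # skip JSON-LD keywords such as @id and @type
            values = value if isinstance(value, list) else [value]
            for item in values:
                # A dict with an @id that exists in the graph is a link to another entity.
                if isinstance(item, dict) and item.get("@id") in ids:
                    edges.append((node["@id"], prop, item["@id"]))
    return edges


print(extract_edges(graph))
```

Plain values such as the dataset's `name` are not links, so the only edge found here is the `about` link from the `CreativeWork` to the `Dataset`.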

## The visualise section

In the example above we already saw the visualise section. When you first navigate to it, it looks
as follows.

<ImageComponent src="/images/guide-visualise/visualise2.webp"></ImageComponent>

The blank area to the left is where the network visualisation will be displayed and the right
sidebar has controls to interact with the visualisation.

Get started by pressing the
<span class="text-sm bg-blue-500 text-white py-1 px-2 rounded"><FontAwesomeIcon :icon="faPlay" /></span>
button at the top of the controls.

When you do, the network structure in the metadata will be displayed. Following is a visualisation
of the entities, topics, themes and subjects in the
[Taylor Swift Song Dataset](/docs/articles/taytay-sings-the-budget-blues).

<ImageComponent src="/images/guide-visualise/visualise3.webp"></ImageComponent>

There's a lot going on so let's break it down.

- The first step involves extracting the entities, topics, themes and subjects from the data. This
was done using the controls in the [Transform](/docs/guide/transforming-content) section of
Describo. As the source data (the song files) are plain text files, all of the content extracted
from them was marked up in the metadata directly.
- In the controls we see that there are 1180 nodes (entities) and 1793 edges (connections between
entities).
- Describo has assigned default colours to the main entity types. However, using the controls in
the **Styling** section at the bottom of the controls, we can choose to recolour the
visualisation. If, for example, we wished to focus on the relationship between Songs and Topic,
we might colour only those node types, viz:

<ImageComponent src="/images/guide-visualise/visualise4.webp"></ImageComponent>

Browsing around the graph we can inspect those relationships in more detail. In the following image
we can see that `Betrayal` is a topic in six songs.

<ImageComponent src="/images/guide-visualise/visualise5.webp"></ImageComponent>

## Controls

A brief overview of the controls was presented in the previous section but there is more that you
can do. Following is an overview:

- The **Settings** section provides toggles to enable / disable various features. In addition, the
node and edge text size can be changed. These controls help when navigating and exploring
very large graphs and when taking screenshots of the visualisation.

- The **Highlight node** section has a control to see the most highly linked entities in
descending order. In a discovery process, seeing the most highly linked topics, themes and
subjects can provide insight into where to look for more detail. With this control, you can
selectively highlight the most interlinked entity. In the following image we see that the Topic
`Relationships` is discussed in 38 of Taylor's songs. Not surprising really. But we can then
find that the next most discussed topic is `Romantic Relationship` in 13 songs.

<ImageComponent src="/images/guide-visualise/visualise6.webp"></ImageComponent>
<ImageComponent src="/images/guide-visualise/visualise7.webp"></ImageComponent>
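
One simple way to think about "most highly linked" is to count how many edges touch each node and sort in descending order. The sketch below does exactly that on a tiny, made-up edge list. Both the data and the assumption that the ranking is a plain edge count are illustrative; they are not taken from Describo or from the dataset shown above.

```python
from collections import Counter


def rank_by_links(edges: list[tuple[str, str, str]]) -> list[tuple[str, int]]:
    """Rank nodes by the number of edges touching them, highest first."""
    degree: Counter[str] = Counter()
    for source, _prop, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Sort by descending count, then by name so the order is stable.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical edges: (source id, property, target id).
edges = [
    ("song-a", "about", "Relationships"),
    ("song-b", "about", "Relationships"),
    ("song-c", "about", "Relationships"),
    ("song-a", "about", "Betrayal"),
]

for node, count in rank_by_links(edges):
    print(node, count)
```

In this toy data `Relationships` ranks first because three songs link to it, which mirrors how the control surfaces the most interlinked topics.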

<script setup>
import { faPlay } from "@fortawesome/free-solid-svg-icons";
</script>
Binary file added src/public/images/guide-transform/transform1.png
Binary file not shown.
Binary file added src/public/images/guide-visualise/visualise1.png
Binary file not shown.
Binary file added src/public/images/guide-visualise/visualise2.png
Binary file added src/public/images/guide-visualise/visualise2.webp
Binary file not shown.
Binary file added src/public/images/guide-visualise/visualise3.png
Binary file not shown.
Binary file added src/public/images/guide-visualise/visualise4.png
Binary file not shown.
Binary file added src/public/images/guide-visualise/visualise5.png
Binary file not shown.
Binary file added src/public/images/guide-visualise/visualise6.png
Binary file not shown.
Binary file added src/public/images/guide-visualise/visualise7.png
Binary file not shown.
