add documentation for transform and visualise
1 parent b6e5f88, commit 41ed6f7
Showing 19 changed files with 203 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,6 @@ | ||
--- | ||
title: Assistant enabled e-discovery | ||
aside: false | ||
--- | ||
|
||
::: tip Info | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,110 @@ | ||
--- | ||
title: Transforming Content | ||
aside: false | ||
--- | ||
|
||
::: tip Info | ||
|
||
The tools in this section require a [registration with describo.cloud](/docs/guide/register) and | ||
credits for the assistant, text extraction and named entity recognition capabilities. | ||
|
||
::: | ||
|
||
# Transforming Content | ||
|
||
Coming soon. | ||
For projects dealing with digitised content, a common requirement is the ability to perform bulk | ||
transformations on the raw data. This can include converting images into formats suitable for | ||
display on the web (e.g. creating thumbnails; converting TIFF images to jpeg or webp) as well as | ||
performing content extraction (e.g. performing named entity recognition and extracting themes, | ||
topics and subjects). The Describo Transform section provides tools to enable this work. | ||
|
||
When you open the transform section, you will see a file browser in the left panel. Select the files | ||
and folders that you want to process. Describo will traverse the folder structure looking for images | ||
and text files (specifically, tiff, jpg, png, txt and html files). If any are found they will be | ||
selected and added to the context. | ||
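The traversal described above can be approximated with a short Python sketch. This is not Describo's implementation; the function name `discover_files` and the extra `.tif` and `.jpeg` spellings are assumptions added for illustration.

```python
from pathlib import Path

# Extensions named in the guide. The ".tif" and ".jpeg" spellings are an
# assumption added here; the guide only lists tiff, jpg, png, txt and html.
IMAGE_EXTENSIONS = {".tiff", ".tif", ".jpg", ".jpeg", ".png"}
TEXT_EXTENSIONS = {".txt", ".html"}


def discover_files(root):
    """Walk `root` recursively and group processable files by kind."""
    found = {"images": [], "text": []}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()  # match extensions case-insensitively
        if suffix in IMAGE_EXTENSIONS:
            found["images"].append(path)
        elif suffix in TEXT_EXTENSIONS:
            found["text"].append(path)
    return found
```

Files with any other extension (for example PDF) are simply ignored, which matches the behaviour described in this guide.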
|
||
Controls for processing the different file types are then shown in the right-hand panel. In the | ||
following image we can see the credits available for the assistant, text extraction and entity | ||
recognition, followed by controls to handle processing the different file types discovered. | ||
|
||
<ImageComponent src="/images/guide-transform/transform1.webp"></ImageComponent> | ||
|
||
## Image Processing tasks | ||
|
||
### Create Image Thumbnails and Webformats | ||
|
||
Turning this toggle on enables the production of thumbnails and webformats. You can also | ||
adjust the thumbnail size here; an actual-size preview is displayed to the right. | ||
|
||
::: tip Info | ||
|
||
- Thumbnails are named as the source file and stored in a folder named `thumbnails` adjacent to | ||
the image. | ||
- Webformats are named as the source file (with a different extension) and stored in a folder | ||
named `webformats` adjacent to the image. | ||
|
||
::: | ||
|
||
When a tiff image is encountered, both a jpg and a webp are produced. The jpg is needed to send the | ||
contents for text extraction and the webp is for display on the web. All other formats | ||
produce webp only. Webp is supported by all major modern browsers and is well suited to web content | ||
delivery. | ||
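As a rough illustration of the naming rules above, the following Python sketch computes where the derived files for a given source image would be expected. It is a sketch only: the function name `derived_paths` is hypothetical, and it assumes that the thumbnail keeps the source file name unchanged and that the jpg produced from a tiff sits in the `webformats` folder next to the webp.

```python
from pathlib import Path


def derived_paths(source):
    """Return the derivative files expected for one source image."""
    source = Path(source)
    folder = source.parent
    # Assumption: the thumbnail keeps the source file name unchanged.
    outputs = {"thumbnail": folder / "thumbnails" / source.name}
    web_dir = folder / "webformats"
    web = [web_dir / f"{source.stem}.webp"]
    # Only tiff sources also get a jpg, which is used for text extraction.
    # Assumption: the jpg is stored alongside the webp.
    if source.suffix.lower() in {".tiff", ".tif"}:
        web.insert(0, web_dir / f"{source.stem}.jpg")
    outputs["webformats"] = web
    return outputs
```

For example, `derived_paths("scans/page1.tiff")` lists a jpg and a webp under `scans/webformats`, while a png source lists a webp only.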
|
||
### Extract Text | ||
|
||
Enabling this will perform text extraction on each image. | ||
|
||
::: tip Info | ||
|
||
- The extracted text is written to an HTML file named as the source file and stored in a folder | ||
`transcriptions` adjacent to the source image. | ||
|
||
::: | ||
|
||
#### Perform named entity recognition | ||
|
||
If enabled, the extracted text will also be run through the named entity recognition tools. When | ||
named entity recognition runs, the HTML transcription file is marked up with data attributes and the | ||
marked-up entities are set as unconfirmed. The transcribe section has controls to review the | ||
discovered entities and mark them as confirmed. | ||
|
||
#### Automatically confirm recognised entities | ||
|
||
Enabling this will automatically confirm all named entities. In addition to marking them up in the | ||
HTML transcription file, the entities will be written into the metadata file against the original | ||
source file. | ||
|
||
So, for example, if a person named "Jane Doe" is discovered, an entity of type person will be | ||
automatically created and associated with the source file via the "mentions" property. | ||
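To make the "Jane Doe" example concrete, here is a minimal Python sketch of what such a link could look like in a JSON-LD style metadata graph. It is not Describo's code: the helper `confirm_person`, the identifier scheme (`#Jane-Doe`), the file path and the `File` and `Person` type names are assumptions for illustration. Only the `mentions` property comes from the text above.

```python
import json


def confirm_person(graph, source_id, name):
    """Add a Person entity to `graph` and link it from the source file."""
    person_id = "#" + name.replace(" ", "-")  # hypothetical identifier scheme
    # Create the person entity once, even if it is mentioned many times.
    if not any(entity["@id"] == person_id for entity in graph):
        graph.append({"@id": person_id, "@type": "Person", "name": name})
    source = next(entity for entity in graph if entity["@id"] == source_id)
    mentions = source.setdefault("mentions", [])
    if {"@id": person_id} not in mentions:
        mentions.append({"@id": person_id})
    return graph


# Hypothetical source file entry; the path is made up for this example.
graph = [{"@id": "letters/page1.tiff", "@type": "File"}]
confirm_person(graph, "letters/page1.tiff", "Jane Doe")
print(json.dumps(graph, indent=2))
```

The source file entry ends up with a `mentions` list that points at the new person entity by `@id`, which is the kind of link the visualise section later draws as an edge.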
|
||
#### Perform topic, theme and subject extraction | ||
|
||
If enabled, the assistant will extract the top 3 topics, themes and subjects from the extracted text | ||
and write those into the metadata against that file. | ||
|
||
## Text file processing tasks | ||
|
||
### Perform named entity recognition | ||
|
||
If enabled, the text will be run through the named entity recognition tools. In these cases, as | ||
Describo does not alter the original file, the data cannot be marked up there. Furthermore, it | ||
doesn't make sense to produce a linked HTML version of that content so in these cases the entities | ||
will be written into the metadata file and linked to the source file. | ||
|
||
### Perform topic, theme and subject extraction | ||
|
||
If enabled, the assistant will extract the top 3 topics, themes and subjects from the text | ||
and write those into the metadata against that file. | ||
|
||
::: info Question | ||
|
||
Why are PDF and Microsoft Word documents not available for processing? | ||
|
||
By definition, these formats are multiple pages long so processing them as one large blob of text | ||
would just result in a metadata entry with potentially thousands of entities linked from it. | ||
|
||
In future, it is envisaged that these tools would evolve to treat those files as a set of linked | ||
pages so that entity recognition and topic, theme, subject extraction operates at the page level; | ||
thus producing a linked data structure flowing through the document. | ||
|
||
::: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,104 @@ | ||
--- | ||
title: Visualising the Linked Data Structure | ||
aside: false | ||
--- | ||
|
||
# Visualising the Linked Data Structure | ||
|
||
Coming soon. | ||
::: tip Info | ||
|
||
Apart from producing nice visualisations, this section of Describo goes hand in hand with the | ||
[Discover](/docs/guide/assistant-supported-discovery) section in your digital discovery process. | ||
That is, you can use Discover to deep dive into the data for specific information that you are | ||
interested in or know exists. Or, you can use these tools to get an overview of what you actually | ||
have as the first part of the process of understanding your data. | ||
|
||
::: | ||
|
||
## Introduction | ||
|
||
Describo is built around the creation and management of linked data. | ||
|
||
::: tip Info | ||
|
||
Linked data is defined in Wikipedia | ||
[...as structured data which is interlinked with other data..](https://en.wikipedia.org/wiki/Linked_data) | ||
(there is a lot more to it than that but for our purposes right now, that's all we need). | ||
|
||
::: | ||
|
||
The following image shows what that means. | ||
|
||
It contains two nodes (circles) that represent two interlinked entities. One is an entity of | ||
type `CreativeWork` named `ro-crate-metadata.json`. The other is of type | ||
`Dataset` and named `My Research Object Crate`. The two nodes (or entities; the terms are used | ||
interchangeably) are linked to each other via the property `about`. This example is telling us | ||
that the `CreativeWork` is about the interlinked `Dataset`. | ||
|
||
<ImageComponent src="/images/guide-visualise/visualise1.webp"></ImageComponent> | ||
|
||
In this way, we can model complex relationships between named entities and then go on to describe | ||
them further. | ||
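The two-node example can also be written down as data. The sketch below is a minimal Python rendering of it, assuming a JSON-LD style graph in which the file name is used as the `@id` of the `CreativeWork` and `./` is used as the `@id` of the `Dataset` (the usual RO-Crate convention, not something stated in this guide).

```python
# Two entities linked by the "about" property.
graph = [
    {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "about": {"@id": "./"},  # the link: points at the Dataset below
    },
    {"@id": "./", "@type": "Dataset", "name": "My Research Object Crate"},
]

# Index the entities by @id so that links can be followed.
by_id = {entity["@id"]: entity for entity in graph}
descriptor = by_id["ro-crate-metadata.json"]
target = by_id[descriptor["about"]["@id"]]
print(f'{descriptor["@type"]} is about {target["@type"]} "{target["name"]}"')
```

Each dictionary is a node and each `{"@id": ...}` reference is an edge, which is how the graph in the image can be read.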
|
||
## The visualise section | ||
|
||
In the example above we already saw the visualise section. When you first navigate to it, it looks | ||
as follows. | ||
|
||
<ImageComponent src="/images/guide-visualise/visualise2.webp"></ImageComponent> | ||
|
||
The blank area to the left is where the network visualisation will be displayed and the right | ||
sidebar has controls to interact with the visualisation. | ||
|
||
Get started by pressing the | ||
<span class="text-sm bg-blue-500 text-white py-1 px-2 rounded"><FontAwesomeIcon :icon="faPlay" /></span> | ||
button at the top of the controls. | ||
|
||
When you do, the network structure in the metadata will be displayed. Following is a visualisation | ||
of the entities, topics, themes and subjects in the | ||
[Taylor Swift Song Dataset](/docs/articles/taytay-sings-the-budget-blues). | ||
|
||
<ImageComponent src="/images/guide-visualise/visualise3.webp"></ImageComponent> | ||
|
||
There's a lot going on so let's break it down. | ||
|
||
- The first step involves extracting the entities, topics, themes and subjects from the data. This | ||
was done using the controls in the [Transform](/docs/guide/transforming-content) section of | ||
Describo. As the source data (the song files) are plain text files, all of the content extracted | ||
from them was marked up in the metadata directly. | ||
- In the controls we see that there are 1180 nodes (entities) and 1793 edges (connections between | ||
entities). | ||
- Describo has assigned default colours to the main entity types. However, using the controls in | ||
the **Styling** section at the bottom of the controls, we can choose to recolour the | ||
visualisation. If, for example, we wished to focus on the relationship between Songs and Topic, | ||
we might colour only those node types, viz: | ||
|
||
<ImageComponent src="/images/guide-visualise/visualise4.webp"></ImageComponent> | ||
|
||
Browsing around the graph we can inspect those relationships in more detail. In the following image | ||
we can see that `Betrayal` is a topic in six songs. | ||
|
||
<ImageComponent src="/images/guide-visualise/visualise5.webp"></ImageComponent> | ||
|
||
## Controls | ||
|
||
A brief overview of the controls was presented in the previous section but there is more that you | ||
can do. Following is an overview: | ||
|
||
- The **Settings** section provides toggles to enable / disable various features. In addition, the | ||
node and edge text size can be changed. These controls help when navigating and exploring | ||
very large graphs and when taking screenshots of the visualisation. | ||
|
||
- The **Highlight node** section has a control to see the most highly linked entities in | ||
descending order. In a discovery process, seeing the most highly linked topics, themes and | ||
subjects can provide insight into where to look for more detail. With this control, you can | ||
selectively highlight the most interlinked entity. In the following image we see that the Topic | ||
`Relationships` is discussed in 38 of Taylor's songs. Not surprising really. But we can then | ||
find that the next most discussed topic is `Romantic Relationship` in 13 songs. | ||
|
||
<ImageComponent src="/images/guide-visualise/visualise6.webp"></ImageComponent> | ||
<ImageComponent src="/images/guide-visualise/visualise7.webp"></ImageComponent> | ||
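The node count, the edge count and the "most highly linked" ranking can be sketched in a few lines of Python. This is an illustration under assumptions, not Describo's implementation: it treats every `{"@id": ...}` reference as an edge and ranks entities by the number of incoming links, and the function names are hypothetical.

```python
from collections import Counter


def iter_links(entity):
    """Yield the @id of every entity referenced from `entity`."""
    for key, value in entity.items():
        if key.startswith("@"):
            continue  # skip @id, @type and other keywords
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict) and "@id" in item:
                yield item["@id"]


def graph_summary(graph):
    """Count nodes and edges, and rank entities by incoming links."""
    edges = [(e["@id"], target) for e in graph for target in iter_links(e)]
    incoming = Counter(target for _, target in edges)
    # most_common() returns (entity id, link count) pairs, highest first.
    return len(graph), len(edges), incoming.most_common()
```

Given a metadata graph, `graph_summary` returns the same kind of figures that the controls display: how many nodes and edges there are, and which entities are referenced most often.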
|
||
<script setup> | ||
import { faPlay } from "@fortawesome/free-solid-svg-icons"; | ||
</script> |