add documentation for transform and visualise

describo · Jul 15, 2024 · 41ed6f7
1 parent b6e5f88
commit 41ed6f7

Showing 19 changed files with 203 additions and 2 deletions.
diff --git a/src/docs/guide/assistant-supported-discovery.md b/src/docs/guide/assistant-supported-discovery.md
@@ -1,5 +1,6 @@
 ---
 title: Assistant enabled e-discovery
+aside: false
 ---
 
 ::: tip Info

diff --git a/src/docs/guide/transforming-content.md b/src/docs/guide/transforming-content.md
@@ -1,7 +1,110 @@
 ---
 title: Transforming Content
+aside: false
 ---
 
+::: tip Info
+
+The tools in this section require [registration with describo.cloud](/docs/guide/register) and
+credits for the assistant, text extraction and named entity recognition capabilities.
+
+:::
+
 # Transforming Content
 
-Coming soon.
+For projects dealing with digitised content, a common requirement is the ability to perform bulk
+transformations on the raw data. This can include converting images into formats suitable for
+display on the web (e.g. creating thumbnails; converting TIFF images to jpeg or webp) as well as
+performing content extraction (e.g. performing named entity recognition and extracting themes,
+topics and subjects). The Describo Transform section provides tools to enable this work.
+
+When you open the transform section, you will see a file browser in the left panel. Select the files
+and folders that you want to process. Describo will traverse the folder structure looking for images
+and text files (specifically, tiff, jpg, png, txt and html files). Any that are found are selected
+and added to the context.
+
+Controls for processing the different file types are then shown in the right-hand panel. In the
+following image we can see the credits available for the assistant, text extraction and entity
+recognition, followed by the controls for processing each of the file types discovered.
+
+<ImageComponent src="/images/guide-transform/transform1.webp"></ImageComponent>
+
+## Image processing tasks
+
+### Create Image Thumbnails and Webformats
+
+Turning this toggle on enables the production of thumbnails and webformats. You can also adjust
+the size of the thumbnail here; an actual-size preview is displayed to the right.
+
+::: tip Info
+
+-   Thumbnails are named after the source file and stored in a folder named `thumbnails` adjacent
+    to the image.
+-   Webformats are named after the source file (with a different extension) and stored in a folder
+    named `webformats` adjacent to the image.
+
+:::
+
+When a tiff image is encountered, both a jpg and a webp are produced. The jpg is required to send
+the contents for text extraction and the webp is for display on the web. All other formats produce
+webp only. Webp is supported by all current major browsers and is well suited to web content
+delivery.
+
+### Extract Text
+
+Enabling this will perform text extraction on each image.
+
+::: tip Info
+
+-   The extracted text is written to an HTML file named after the source file and stored in a
+    folder named `transcriptions` adjacent to the source image.
+
+:::
+
+#### Perform named entity recognition
+
+If enabled, the extracted text will also be run through the named entity recognition tools. When
+named entity recognition runs, the HTML transcription file is marked up with data attributes and the
+marked-up entities are set as unconfirmed. In the transcribe section there are controls to review
+the discovered entities and mark them as confirmed.
+
+#### Automatically confirm recognised entities
+
+Enabling this will automatically confirm all named entities. In addition to marking them up in the
+HTML transcription file, the entities will be written into the metadata file against the original
+source file.
+
+So, for example, if a person named "Jane Doe" is discovered, an entity of type person will be
+automatically created and associated with the source file via the "mentions" property.
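+
+As a rough sketch, the resulting entries in the metadata file might look something like the
+following excerpt of the `@graph` array. This assumes the usual RO-Crate JSON-LD layout; the file
+name `page-001.tiff` and the `@id` values are hypothetical, and the identifiers Describo actually
+generates may differ.
+
+```json
+[
+    {
+        "@id": "page-001.tiff",
+        "@type": "File",
+        "mentions": [{ "@id": "#jane-doe" }]
+    },
+    {
+        "@id": "#jane-doe",
+        "@type": "Person",
+        "name": "Jane Doe"
+    }
+]
+```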
+
+#### Perform topic, theme and subject extraction
+
+If enabled, the assistant will extract the top 3 topics, themes and subjects from the extracted text
+and write those into the metadata against that file.
+
+## Text file processing tasks
+
+### Perform named entity recognition
+
+If enabled, the text will be run through the named entity recognition tools. As Describo does not
+alter the original file, the entities cannot be marked up there. Furthermore, it doesn't make sense
+to produce a linked HTML version of that content, so the entities are instead written into the
+metadata file and linked to the source file.
+
+### Perform topic, theme and subject extraction
+
+If enabled, the assistant will extract the top 3 topics, themes and subjects from the text and
+write those into the metadata against that file.
+
+::: info Question
+
+Why are PDF and Microsoft Word documents not available for processing?
+
+These formats typically run to multiple pages, so processing them as one large blob of text
+would just result in a metadata entry with potentially thousands of entities linked from it.
+
+In future, it is envisaged that these tools will evolve to treat those files as a set of linked
+pages so that entity recognition and topic, theme and subject extraction operate at the page level,
+thus producing a linked data structure flowing through the document.
+
+:::
diff --git a/src/docs/guide/visualising-the-structure.md b/src/docs/guide/visualising-the-structure.md
@@ -1,7 +1,104 @@
 ---
 title: Visualising the Linked Data Structure
+aside: false
 ---
 
 # Visualising the Linked Data Structure
 
-Coming soon.
+::: tip Info
+
+Apart from producing nice visualisations, this section of Describo goes hand in hand with the
+[Discover](/docs/guide/assistant-supported-discovery) section in your digital discovery process.
+That is, you can use Discover to deep dive into the data for specific information that you are
+interested in or know exists. Or, you can use these tools to get an overview of what you actually
+have as the first part of the process of understanding your data.
+
+:::
+
+## Introduction
+
+Describo is built around the creation and management of linked data.
+
+::: tip Info
+
+Wikipedia defines linked data as
+[...structured data which is interlinked with other data...](https://en.wikipedia.org/wiki/Linked_data)
+(there is a lot more to it than that but for our purposes right now, that's all we need).
+
+:::
+
+The following image shows what that means.
+
+It contains two nodes (circles) that represent two interlinked entities. One is an entity of type
+`CreativeWork` named `ro-crate-metadata.json`. The other is of type `Dataset` and named
+`My Research Object Crate`. The two nodes (or entities; the terms are used interchangeably) are
+linked to each other via the property `about`. This example is telling us that
+`the CreativeWork is about the interlinked Dataset`.
+
+<ImageComponent src="/images/guide-visualise/visualise1.webp"></ImageComponent>
+
+In this way, we can model complex relationships between named entities and then go on to describe
+them further.
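+
+In the metadata file itself, a pair of entities like this would typically be serialised along the
+following lines. This is a sketch that assumes the standard RO-Crate 1.1 layout, in which the root
+dataset has the `@id` `./`; the metadata file in your own crate may carry additional properties.
+
+```json
+{
+    "@context": "https://w3id.org/ro/crate/1.1/context",
+    "@graph": [
+        {
+            "@id": "ro-crate-metadata.json",
+            "@type": "CreativeWork",
+            "conformsTo": { "@id": "https://w3id.org/ro/crate/1.1" },
+            "about": { "@id": "./" }
+        },
+        {
+            "@id": "./",
+            "@type": "Dataset",
+            "name": "My Research Object Crate"
+        }
+    ]
+}
+```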
+
+## The visualise section
+
+In the example above we already saw the visualise section. When you first navigate to it, it looks
+as follows.
+
+<ImageComponent src="/images/guide-visualise/visualise2.webp"></ImageComponent>
+
+The blank area to the left is where the network visualisation will be displayed and the right
+sidebar has controls to interact with the visualisation.
+
+Get started by pressing the
+<span class="text-sm bg-blue-500 text-white py-1 px-2 rounded"><FontAwesomeIcon :icon="faPlay" /></span>
+button at the top of the controls.
+
+When you do, the network structure in the metadata will be displayed. Following is a visualisation
+of the entities, topics, themes and subjects in the
+[Taylor Swift Song Dataset](/docs/articles/taytay-sings-the-budget-blues).
+
+<ImageComponent src="/images/guide-visualise/visualise3.webp"></ImageComponent>
+
+There's a lot going on so let's break it down.
+
+-   The first step involves extracting the entities, topics, themes and subjects from the data. This
+    was done using the controls in the [Transform](/docs/guide/transforming-content) section of
+    Describo. As the source data (the song files) are plain text files, all of the content extracted
+    from them was written directly into the metadata.
+-   In the controls we see that there are 1180 nodes (entities) and 1793 edges (connections between
+    entities).
+-   Describo has assigned default colours to the main entity types. However, using the controls in
+    the **Styling** section at the bottom of the controls, we can choose to recolour the
+    visualisation. If, for example, we wished to focus on the relationship between Songs and
+    Topics, we might colour only those node types, as follows:
+
+<ImageComponent src="/images/guide-visualise/visualise4.webp"></ImageComponent>
+
+Browsing around the graph we can inspect those relationships in more detail. In the following image
+we can see that `Betrayal` is a topic in six songs.
+
+<ImageComponent src="/images/guide-visualise/visualise5.webp"></ImageComponent>
+
+## Controls
+
+A brief overview of the controls was presented in the previous section but there is more that you
+can do. Following is an overview:
+
+-   The **Settings** section provides toggles to enable / disable various features. In addition, the
+    node and edge text size can be changed. These controls help when navigating and exploring very
+    large graphs and when taking screenshots.
+
+-   The **Highlight node** section has a control to see the most highly linked entities in
+    descending order. In a discovery process, seeing the most highly linked topics, themes and
+    subjects can provide insight into where to look for more detail. With this control, you can
+    selectively highlight the most interlinked entities. In the following images we see that the
+    Topic `Relationships` is discussed in 38 of Taylor's songs. Not surprising really. But we can
+    then find that the next most discussed topic is `Romantic Relationship` in 13 songs.
+
+<ImageComponent src="/images/guide-visualise/visualise6.webp"></ImageComponent>
+<ImageComponent src="/images/guide-visualise/visualise7.webp"></ImageComponent>
+
+<script setup>
+   import { faPlay } from "@fortawesome/free-solid-svg-icons";
+</script>
diff --git a/src/public/images/guide-transform/transform1.png b/src/public/images/guide-transform/transform1.png
diff --git a/src/public/images/guide-transform/transform1.webp b/src/public/images/guide-transform/transform1.webp
diff --git a/src/public/images/guide-visualise/visualise1.png b/src/public/images/guide-visualise/visualise1.png
diff --git a/src/public/images/guide-visualise/visualise1.webp b/src/public/images/guide-visualise/visualise1.webp
diff --git a/src/public/images/guide-visualise/visualise2.png b/src/public/images/guide-visualise/visualise2.png
diff --git a/src/public/images/guide-visualise/visualise2.webp b/src/public/images/guide-visualise/visualise2.webp
diff --git a/src/public/images/guide-visualise/visualise3.png b/src/public/images/guide-visualise/visualise3.png
diff --git a/src/public/images/guide-visualise/visualise3.webp b/src/public/images/guide-visualise/visualise3.webp
diff --git a/src/public/images/guide-visualise/visualise4.png b/src/public/images/guide-visualise/visualise4.png
diff --git a/src/public/images/guide-visualise/visualise4.webp b/src/public/images/guide-visualise/visualise4.webp
diff --git a/src/public/images/guide-visualise/visualise5.png b/src/public/images/guide-visualise/visualise5.png
diff --git a/src/public/images/guide-visualise/visualise5.webp b/src/public/images/guide-visualise/visualise5.webp
diff --git a/src/public/images/guide-visualise/visualise6.png b/src/public/images/guide-visualise/visualise6.png
diff --git a/src/public/images/guide-visualise/visualise6.webp b/src/public/images/guide-visualise/visualise6.webp
diff --git a/src/public/images/guide-visualise/visualise7.png b/src/public/images/guide-visualise/visualise7.png
diff --git a/src/public/images/guide-visualise/visualise7.webp b/src/public/images/guide-visualise/visualise7.webp