diff --git a/src/docs/guide/assistant-supported-discovery.md b/src/docs/guide/assistant-supported-discovery.md index 9ac7aba..fb084f4 100644 --- a/src/docs/guide/assistant-supported-discovery.md +++ b/src/docs/guide/assistant-supported-discovery.md @@ -1,5 +1,6 @@ --- title: Assistant enabled e-discovery +aside: false --- ::: tip Info diff --git a/src/docs/guide/transforming-content.md b/src/docs/guide/transforming-content.md index 9f29eb3..2caf8e4 100644 --- a/src/docs/guide/transforming-content.md +++ b/src/docs/guide/transforming-content.md @@ -1,7 +1,110 @@ --- title: Transforming Content +aside: false --- +::: tip Info + +The tools in this section require a [registration with describo.cloud](/docs/guide/register) and +credits for the assistant, text extraction and named entity recognition capabilities. + +::: + # Transforming Content -Coming soon. +For projects dealing with digitised content, a common requirement is the ability to perform bulk +transformations on the raw data. This can include converting images into formats suitable for +display on the web (e.g. creating thumbnails; converting TIFF images to jpeg or webp) as well as +performing content extraction (e.g. performing named entity recognition and extracting themes, +topics and subjects). The Describo Transform section provides tools to enable this work. + +When you open the transform section, you will see a file browser in the left panel. Select the files +and folders that you want to process. Describo will traverse the folder structure looking for images +and text files (specifically, tiff, jpg, png, txt and html files). If any are found they will be +selected and added to the context. + +Then, controls will be shown in the right hand panel to process the different file types. In the +following image we can see the credits available for the assistant, text extraction and entity +recognition followed by controls to handle processing the different file types discovered. 
+ + + +## Image Processing tasks + +### Create Image Thumbnails and Webformats + +Setting this toggle to true will enable the production of thumbnails and webformats. You can also +adjust the size of the thumbnail here with an actual sized preview displayed to the right. + +::: tip Info + +- Thumbnails have the same name as the source file and are stored in a folder named `thumbnails` adjacent to + the image. +- Webformats have the same name as the source file (with a different extension) and are stored in a folder + named `webformats` adjacent to the image. + +::: + +When a tiff image is encountered, both a jpg and a webp are produced. The jpg is required to send the +contents for text extraction and the webp format is for display on the web. All other formats +produce webp only. Webp is supported by all modern browsers and is well suited to web content +delivery. + +### Extract Text + +Enabling this will perform text extraction on each image. + +::: tip Info + +- The extracted text is written to an HTML file with the same name as the source file and stored in a folder + `transcriptions` adjacent to the source image. + +::: + +#### Perform named entity recognition + +If enabled, the extracted text will also be run through the named entity recognition tools. When +named entity recognition runs, the HTML transcription file is marked up with data attributes and the +marked up entities are set as unconfirmed. In the transcribe section there are controls to review +the discovered entities and mark them as confirmed. + +#### Automatically confirm recognised entities + +Enabling this will automatically confirm all named entities. In addition to marking them up in the +HTML transcription file, the entities will be written into the metadata file against the original +source file. + +So, for example, if a person named "Jane Doe" is discovered, an entity of type person will be +automatically created and associated to the source file via the "mentions" property. 
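As a rough illustration of the auto-confirm behaviour described above, the following Python sketch shows how a confirmed entity might be linked from a source file in a JSON-LD style graph. The `@id` values, the `File` type, the file path and the `link_mention` helper are all hypothetical and chosen for illustration; only the person type and the `mentions` property come from the description above, and Describo's actual metadata layout may differ.

```python
import json


def link_mention(graph, source_id, entity):
    """Add `entity` to the graph (once) and reference it from the
    source file's `mentions` property (once)."""
    # Only add the entity if no entity with the same @id is present.
    if not any(e["@id"] == entity["@id"] for e in graph):
        graph.append(entity)

    # Find the source file entity; raises StopIteration if it is missing.
    source = next(e for e in graph if e["@id"] == source_id)

    # Link by reference rather than embedding the entity itself.
    mentions = source.setdefault("mentions", [])
    ref = {"@id": entity["@id"]}
    if ref not in mentions:
        mentions.append(ref)
    return graph


# Hypothetical source file and discovered person.
graph = [{"@id": "letters/page-001.tiff", "@type": "File", "name": "page-001.tiff"}]
jane = {"@id": "#person-jane-doe", "@type": "Person", "name": "Jane Doe"}

link_mention(graph, "letters/page-001.tiff", jane)
print(json.dumps(graph, indent=2))
```

Linking by `@id` reference keeps a single copy of each entity in the graph, so the same person discovered in several files ends up as one node with several incoming `mentions` links.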
+ +#### Perform topic, theme and subject extraction + +If enabled, the assistant will extract the top 3 topics, themes and subjects from the extracted text +and write those into the metadata against that file. + +## Text file processing tasks + +### Perform named entity recognition + +If enabled, the text will be run through the named entity recognition tools. In these cases, as +Describo does not alter the original file, the data cannot be marked up there. Furthermore, it +doesn't make sense to produce a linked HTML version of that content so in these cases the entities +will be written into the metadata file and linked to the source file. + +### Perform topic, theme and subject extraction + +If enabled, the assistant will extract the top 3 topics, themes and subjects from the text +and write those into the metadata against that file. + +::: info Question + +Why are PDF and Microsoft Word documents not available for processing? + +By definition, these formats are multiple pages long so processing them as one large blob of text +would just result in a metadata entry with potentially thousands of entities linked from it. + +In future, it is envisaged that these tools would evolve to treat those files as a set of linked +pages so that entity recognition and topic, theme, subject extraction operate at the page level; +thus producing a linked data structure flowing through the document. + +::: diff --git a/src/docs/guide/visualising-the-structure.md b/src/docs/guide/visualising-the-structure.md index ec1b6ef..3340efd 100644 --- a/src/docs/guide/visualising-the-structure.md +++ b/src/docs/guide/visualising-the-structure.md @@ -1,7 +1,104 @@ --- title: Visualising the Linked Data Structure +aside: false --- # Visualising the Linked Data Structure -Coming soon. 
+::: tip Info + +Apart from producing nice visualisations, this section of Describo goes hand in hand with the +[Discover](/docs/guide/assistant-supported-discovery) section in your digital discovery process. +That is, you can use Discover to deep dive into the data for specific information that you are +interested in or know exists. Or, you can use these tools to get an overview of what you actually +have as the first part of the process of understanding your data. + +::: + +## Introduction + +Describo is built around the creation and management of linked data. + +::: tip Info + +Linked data is defined in Wikipedia +[...as structured data which is interlinked with other data...](https://en.wikipedia.org/wiki/Linked_data) +(there is a lot more to it than that but for our purposes right now, that's all we need). + +::: + +In the following image, we can see what that means. + +The image shows two nodes (circles) that represent two interlinked entities. One is an entity of +type `CreativeWork` named `ro-crate-metadata.json`. The other is of type +`Dataset` and named `My Research Object Crate`. The two nodes (or entities, used interchangeably) are +linked to each other via a property `about`. This example is telling us that +`the CreativeWork is about the interlinked Dataset`. + + + +In this way, we can model complex relationships between named entities and then go on to describe +them further. + +## The visualise section + +In the example above we already saw the visualise section. When you first navigate to it, it looks as +follows. + + + +The blank area to the left is where the network visualisation will be displayed and the right +sidebar has controls to interact with the visualisation. + +Get started by pressing the + +button at the top of the controls. + +When you do, the network structure in the metadata will be displayed. 
Following is a visualisation +of the entities, topics, themes and subjects in the +[Taylor Swift Song Dataset](/docs/articles/taytay-sings-the-budget-blues). + + + +There's a lot going on so let's break it down. + +- The first step involves extracting the entities, topics, themes and subjects from the data. This + was done using the controls in the [Transform](/docs/guide/transforming-content) section of + Describo. As the source data (the song files) are plain text files, all of the content extracted + from them was written into the metadata directly. +- In the controls we see that there are 1180 nodes (entities) and 1793 edges (connections between + entities). +- Describo has assigned default colours to the main entity types. However, using the controls in + the **Styling** section at the bottom of the controls, we can choose to recolour the + visualisation. If, for example, we wished to focus on the relationship between Songs and Topic, + we might colour only those node types, viz: + + + +Browsing around the graph we can inspect those relationships in more detail. In the following image +we can see that `Betrayal` is a topic in six songs. + + + +## Controls + +A brief overview of the controls was presented in the previous section but there is more that you +can do. Following is an overview: + +- The **Settings** section provides toggles to enable / disable various features. In addition, the + node and edge text size can be changed. These controls will help you navigate and explore + very large graphs and when taking screenshots of + +- The **Highlight node** section has a control to see the most highly linked entities in + descending order. In a discovery process, seeing the most highly linked topics, themes and + subjects can provide insight into where to look for more detail. With this control, you can + selectively highlight the most interlinked entity. In the following image we see that the Topic + `Relationships` is discussed in 38 of Taylor's songs. 
Not surprising really. But we can then + find that the next most discussed topic is `Romantic Relationship` in 13 songs. + + + + + diff --git a/src/public/images/guide-transform/transform1.png b/src/public/images/guide-transform/transform1.png new file mode 100644 index 0000000..01b718f Binary files /dev/null and b/src/public/images/guide-transform/transform1.png differ diff --git a/src/public/images/guide-transform/transform1.webp b/src/public/images/guide-transform/transform1.webp new file mode 100644 index 0000000..42bb643 Binary files /dev/null and b/src/public/images/guide-transform/transform1.webp differ diff --git a/src/public/images/guide-visualise/visualise1.png b/src/public/images/guide-visualise/visualise1.png new file mode 100644 index 0000000..8e990f6 Binary files /dev/null and b/src/public/images/guide-visualise/visualise1.png differ diff --git a/src/public/images/guide-visualise/visualise1.webp b/src/public/images/guide-visualise/visualise1.webp new file mode 100644 index 0000000..344a53d Binary files /dev/null and b/src/public/images/guide-visualise/visualise1.webp differ diff --git a/src/public/images/guide-visualise/visualise2.png b/src/public/images/guide-visualise/visualise2.png new file mode 100644 index 0000000..0d52f02 Binary files /dev/null and b/src/public/images/guide-visualise/visualise2.png differ diff --git a/src/public/images/guide-visualise/visualise2.webp b/src/public/images/guide-visualise/visualise2.webp new file mode 100644 index 0000000..a723a9f Binary files /dev/null and b/src/public/images/guide-visualise/visualise2.webp differ diff --git a/src/public/images/guide-visualise/visualise3.png b/src/public/images/guide-visualise/visualise3.png new file mode 100644 index 0000000..90fe1de Binary files /dev/null and b/src/public/images/guide-visualise/visualise3.png differ diff --git a/src/public/images/guide-visualise/visualise3.webp b/src/public/images/guide-visualise/visualise3.webp new file mode 100644 index 0000000..fc3de10 Binary 
files /dev/null and b/src/public/images/guide-visualise/visualise3.webp differ diff --git a/src/public/images/guide-visualise/visualise4.png b/src/public/images/guide-visualise/visualise4.png new file mode 100644 index 0000000..5930f33 Binary files /dev/null and b/src/public/images/guide-visualise/visualise4.png differ diff --git a/src/public/images/guide-visualise/visualise4.webp b/src/public/images/guide-visualise/visualise4.webp new file mode 100644 index 0000000..83dc3e1 Binary files /dev/null and b/src/public/images/guide-visualise/visualise4.webp differ diff --git a/src/public/images/guide-visualise/visualise5.png b/src/public/images/guide-visualise/visualise5.png new file mode 100644 index 0000000..2ff9d51 Binary files /dev/null and b/src/public/images/guide-visualise/visualise5.png differ diff --git a/src/public/images/guide-visualise/visualise5.webp b/src/public/images/guide-visualise/visualise5.webp new file mode 100644 index 0000000..a7a64ed Binary files /dev/null and b/src/public/images/guide-visualise/visualise5.webp differ diff --git a/src/public/images/guide-visualise/visualise6.png b/src/public/images/guide-visualise/visualise6.png new file mode 100644 index 0000000..242bd4c Binary files /dev/null and b/src/public/images/guide-visualise/visualise6.png differ diff --git a/src/public/images/guide-visualise/visualise6.webp b/src/public/images/guide-visualise/visualise6.webp new file mode 100644 index 0000000..c50185a Binary files /dev/null and b/src/public/images/guide-visualise/visualise6.webp differ diff --git a/src/public/images/guide-visualise/visualise7.png b/src/public/images/guide-visualise/visualise7.png new file mode 100644 index 0000000..12e4d2f Binary files /dev/null and b/src/public/images/guide-visualise/visualise7.png differ diff --git a/src/public/images/guide-visualise/visualise7.webp b/src/public/images/guide-visualise/visualise7.webp new file mode 100644 index 0000000..93676c6 Binary files /dev/null and 
b/src/public/images/guide-visualise/visualise7.webp differ