fix schema
khaled196 committed Mar 15, 2024
1 parent b84d9ff commit 1df0d78
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions topics/fair/tutorials/fair-metadata/tutorial.md
@@ -78,7 +78,7 @@ Table 2.1: The 15 FAIR Guiding Principles. Principles relating directly to metad

Metadata is information that describes your data - it is data about data.

-The provision of ‘rich’ metadata is key to FAIR since it allows data to be found, and enables other researchers to interpret data appropriately. ‘Rich’ in this context refers to extensive metadata, often connecting data to other data or terms (even in other datasets), with [qualified references]("Qualified references: terms used to describe relationships to pieces of (meta)data.") specifying how they are connected.
+The provision of ‘rich’ metadata is key to FAIR since it allows data to be found, and enables other researchers to interpret data appropriately. ‘Rich’ in this context refers to extensive metadata, often connecting data to other data or terms (even in other datasets), with **qualified references**, Qualified references: terms used to describe relationships to pieces of (meta)data., specifying how they are connected.

If a researcher is given access to a dataset (a spreadsheet in CSV format, for example), the data is not usable without meaningful column headings and context. For example, what is this data about? Is it part of a larger dataset and, if so, how is it related? What are the column headings representing, and what are the rows representing? Additionally, if the values in the individual cells are device or assay measurements, what device was used, and what assay?

@@ -89,11 +89,11 @@ Though this Figure 2.1 shows the difference between metadata and data, it does
![spreadsheet](../../images/figure2-1_human-welsh-cohort.png "A spreadsheet showing the relationship between data (orange) and metadata (blue)")


-We are given only 2 pieces of [data provenance]("Data provenance: metadata describing the origin of a piece of data, including information such as version, original location of the data, and usually an audit trail up to the current version.") in this example (i) the study name “LONDON DIABETES COHORT” which could help discover other documentation about the project; (ii) the name of the person who presumably led the study. While these may help to search for further information, for a human user, this would still be fraught with issues since both names may not uniquely identify the study or its lead. For a machine agent, such as a script, interpretation is made difficult because the metadata is not clearly marked-up (represented formally with tags).
+We are given only 2 pieces of **data provenance**, Data provenance: metadata describing the origin of a piece of data, including information such as version, original location of the data, and usually an audit trail up to the current version, in this example (i) the study name “LONDON DIABETES COHORT” which could help discover other documentation about the project; (ii) the name of the person who presumably led the study. While these may help to search for further information, for a human user, this would still be fraught with issues since both names may not uniquely identify the study or its lead. For a machine agent, such as a script, interpretation is made difficult because the metadata is not clearly marked-up (represented formally with tags).

To fix this, contact details for investigators could be included. Often [ORCID IDs](https://info.orcid.org/what-is-orcid/) associated with names are used since they uniquely identify an individual and do not pose problems when an investigator moves institutions and the original (email) address changes. Coupled with this, investigators can be formally assigned a role in the provenance metadata allowing people to contact the appropriate person, for example, “data producer”, “data manager”.

-A project URL should be used where possible, ideally one that can act as a [persistent identifier]("Globally unique and persistent identifier: a permanent reference to a digital resource, usually given as a URL, that takes the user to that resource.") for the dataset. The landing page for that URL should provide more context for the dataset, and describe high level information such as how the dataset relates to other project outputs, as well as providing other provenance information (dates, times, places, people). This could include more metadata about the type of study, objectives, protocols, dates, release versioning and so on. At the end of this episode, resources are given to help define metadata elements that could be included, including a project called [Dublin Core](https://www.dublincore.org/specifications/dublin-core/) which defines [15 such elements](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/) or metadata fields.
+A project URL should be used where possible, ideally one that can act as a **persistent identifier**, Globally unique and persistent identifier: a permanent reference to a digital resource, usually given as a URL, that takes the user to that resource, for the dataset. The landing page for that URL should provide more context for the dataset, and describe high level information such as how the dataset relates to other project outputs, as well as providing other provenance information (dates, times, places, people). This could include more metadata about the type of study, objectives, protocols, dates, release versioning and so on. At the end of this episode, resources are given to help define metadata elements that could be included, including a project called [Dublin Core](https://www.dublincore.org/specifications/dublin-core/) which defines [15 such elements](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/) or metadata fields.

Terms of access and reuse are missing which could be rectified by including a [data licence](https://rdmkit.elixir-europe.org/licensing#what-licence-should-you-apply-to-your-research-data), which often appears as part of the metadata, usually at the bottom of a webpage hosting data.
There could be ambiguity around the acronym, “BPM”, used in the third column header, so this should be defined within a glossary of acronyms and or ideally hyperlinked to a definition in an existing ontology.
@@ -115,13 +115,13 @@ There are issues too with the data, as well as the metadata. The second column D

# Writing FAIR metadata

-We have discussed already how rich metadata enables a dataset to be reused and interpreted correctly. In the context of the FAIR principles, the previous exercise illustrates two of these, namely that _“(Meta)data are richly described with a plurality of accurate and relevant attributes”_ (FAIR Principle R1) and that _“(Meta)data are associated with detailed provenance”_ (FAIR Principle R1,2). Further to this, the suggested use of the published {% cite diseaseontology %} for data, illustrates a further three principles, where _“(Meta)data use vocabularies,"Vocabularies: (or controlled vocabulary) is a dictionary of terms you can use when producing (meta)data.", that follow FAIR principles”_ (FAIR Principle I2), and _“(Meta)data meet domain-relevant community standards, "Community standards: standard guidelines used to structure and exchange data, usually supported by community-developed resources and/or software.", ”_ (FAIR Principle R1.3). The use of hyperlinks specifically to terms in the ontology means that _“(Meta)data include qualified references, "Qualified references: terms used to describe relationships to pieces of (meta)data.", to other (meta)data”_ (FAIR Principle I3). From the previous exercise, the [disease ontology](https://disease-ontology.org/) provides the vocabulary for the different types of diabetes: type 1 diabetes mellits {% cite diseaseontology %} and type 2 diabetes mellitus {% cite diseaseontology %}.
+We have discussed already how rich metadata enables a dataset to be reused and interpreted correctly. In the context of the FAIR principles, the previous exercise illustrates two of these, namely that _“(Meta)data are richly described with a plurality of accurate and relevant attributes”_ (FAIR Principle R1) and that _“(Meta)data are associated with detailed provenance”_ (FAIR Principle R1,2). Further to this, the suggested use of the published {% cite diseaseontology %} for data, illustrates a further three principles, where _“(Meta)data use **vocabularies**,Vocabularies: (or controlled vocabulary) is a dictionary of terms you can use when producing (meta)data, that follow FAIR principles”_ (FAIR Principle I2), and _“(Meta)data meet domain-relevant **community standards**, Community standards: standard guidelines used to structure and exchange data, usually supported by community-developed resources and/or software, (FAIR Principle R1.3). The use of hyperlinks specifically to terms in the ontology means that Metadata include *qualified references*, Qualified references: terms used to describe relationships to pieces of (meta)data. , to other Metadata (FAIR Principle I3). From the previous exercise, the [disease ontology](https://disease-ontology.org/) provides the vocabulary for the different types of diabetes: type 1 diabetes mellits {% cite diseaseontology %} and type 2 diabetes mellitus {% cite diseaseontology %}.

The FAIR Guiding Principles also highlight the importance of providing rich metadata to enable researchers to **find** datasets such that _“Data are described with rich metadata”_ (FAIR Principle F2). More often than not, a researcher will find data through searching its metadata, usually via an online or a database search. Information on how this can be achieved is discussed in the next episode on data registration.

The use of vocabularies and cross-references is fundamental to data interoperability. Interoperable (meta)data can be linked and combined across studies, aided by consistent, compatible and machine-readable curation. [FAIRsharing](https://fairsharing.org/search?fairsharingRegistry=Standard) is a useful registry of vocabularies, and standards, while more comprehensive ontology lists are maintained by [OLS (Ontology Lookup Service)](https://www.ebi.ac.uk/ols/index) and [BioPortal](https://bioportal.bioontology.org/).

-The previous exercise also touches on machine-readability, "Machine-readable: (meta)data is supplied in a structured format that can be read by a computer.", of (meta)data through mention of using controlled terms in the “DISEASE TYPE” column to allow subsetting. The [Open Data handbook](https://opendatahandbook.org/glossary/en/terms/machine-readable/) gives a nice overview of machine-readable (meta)data but in short it is (meta)data supplied in a defined and structured format that can easily be read by an appropriate script or piece of software. If we use our example of a spreadsheet in comma-separated value form (CSV format), the (meta)data will be organised into cells, in a format that is interoperable with many software . This would not be true if the same data were made available as a screenshot, highlighting that human-readable data may not be machine-readable.
+The previous exercise also touches on **machine-readability**, Machine-readable: (meta)data is supplied in a structured format that can be read by a computer, of (meta)data through mention of using controlled terms in the “DISEASE TYPE” column to allow subsetting. The [Open Data handbook](https://opendatahandbook.org/glossary/en/terms/machine-readable/) gives a nice overview of machine-readable (meta)data but in short it is (meta)data supplied in a defined and structured format that can easily be read by an appropriate script or piece of software. If we use our example of a spreadsheet in comma-separated value form (CSV format), the (meta)data will be organised into cells, in a format that is interoperable with many software . This would not be true if the same data were made available as a screenshot, highlighting that human-readable data may not be machine-readable.
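
To make the point about controlled terms and subsetting concrete, here is a minimal Python sketch. It is an illustration only, not part of the tutorial or this commit: the sample rows, the `PATIENT ID` column, and the helper name `subset_by_term` are all hypothetical, and only the “DISEASE TYPE” and “BPM” headers and the two disease ontology terms come from the text above. It assumes each “DISEASE TYPE” cell holds exactly one controlled term, which is what makes an exact-match filter reliable.

```python
import csv
import io

# Hypothetical sample data: the rows and the PATIENT ID column are
# illustrative assumptions, not taken from the tutorial's spreadsheet.
CSV_TEXT = """PATIENT ID,DISEASE TYPE,BPM
P001,type 1 diabetes mellitus,72
P002,type 2 diabetes mellitus,80
P003,type 2 diabetes mellitus,65
"""


def subset_by_term(csv_text, column, term):
    """Return the rows whose `column` value exactly matches the controlled `term`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[column] == term]


# Exact matching only works because the column uses a controlled vocabulary;
# free-text variants such as "T2D" or "Type II" would be silently missed.
type2 = subset_by_term(CSV_TEXT, "DISEASE TYPE", "type 2 diabetes mellitus")
print([row["PATIENT ID"] for row in type2])
```

A screenshot of the same table could not be filtered this way without first transcribing it, which is the distinction between human-readable and machine-readable data drawn above.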

# Rich metadata in public data repositories
