diff --git a/topics/fair/tutorials/fair-data-registration/tutorial.md b/topics/fair/tutorials/fair-data-registration/tutorial.md index 26cf0c943e45b9..972a79891d0033 100644 --- a/topics/fair/tutorials/fair-data-registration/tutorial.md +++ b/topics/fair/tutorials/fair-data-registration/tutorial.md @@ -81,7 +81,7 @@ Table 3.1: The 15 FAIR Guiding Principles. Principles relating to data registrat Data deposition and registration refer to the process of uploading data to a searchable resource, and providing appropriate metadata to facilitate its discoverability. For example, a data repository, where data and metadata can be uploaded, may enable it to be discovered, preserved and accessed. Here we use the general term data repository to describe any online storage location that can host deposited (meta)data. -In the context of FAIR, data deposition relates to a number of the Guiding Principles. Firstly, _“(meta)data are registered or indexed [in a searchable resource](## "Indexed in a searchable resource: a resource where (meta)data are organised so that they can be queried based on defined fields.")”_ (FAIR Principle F4). Searchable (indexed) metadata enables humans and computers to query and discover data of interest, though this depends on what is indexed. Here, indexing refers to a process that occurs within the architecture of the data repository (local indexing) where metadata are organised so that they can be queried based on a defined field. It is worth noting that community resources, focused on a particular domain (for example, the human database in [Ensembl](https://www.ensembl.org/Homo_sapiens/Info/Index)) are better indexed for a particular community, rather than generic repositories (for example, [Zenodo](https://zenodo.org/)) which may not index the community specific components, and may focus on higher level metadata. Indexing by an internet search engine is another example of this. 
Google (and other search engines, such as yahoo and yandex) have an agreed vocabulary ([schema.org](https://schema.org/)), within web pages, that are ‘scraped’ and indexed. While the focus of this vocabulary was originally intended for commercial products, community specific efforts to facilitate discipline-specific indexing are under way (for example, [Bioschemas](https://faircookbook.elixir-europe.org/content/recipes/findability/seo/bioschemas-data-page.html)). +In the context of FAIR, data deposition relates to a number of the Guiding Principles. Firstly, _“(meta)data are registered or indexed [in a searchable resource]("Indexed in a searchable resource: a resource where (meta)data are organised so that they can be queried based on defined fields.")”_ (FAIR Principle F4). Searchable (indexed) metadata enables humans and computers to query and discover data of interest, though this depends on what is indexed. Here, indexing refers to a process that occurs within the architecture of the data repository (local indexing) where metadata are organised so that they can be queried based on a defined field. It is worth noting that community resources, focused on a particular domain (for example, the human database in [Ensembl](https://www.ensembl.org/Homo_sapiens/Info/Index)) are better indexed for a particular community, rather than generic repositories (for example, [Zenodo](https://zenodo.org/)) which may not index the community specific components, and may focus on higher level metadata. Indexing by an internet search engine is another example of this. Google (and other search engines, such as yahoo and yandex) have an agreed vocabulary ([schema.org](https://schema.org/)), within web pages, that are ‘scraped’ and indexed. 
While the focus of this vocabulary was originally intended for commercial products, community specific efforts to facilitate discipline-specific indexing are under way (for example, [Bioschemas](https://faircookbook.elixir-europe.org/content/recipes/findability/seo/bioschemas-data-page.html)). # Why should I upload my data to a data repository? diff --git a/topics/fair/tutorials/fair-metadata/tutorial.bib b/topics/fair/tutorials/fair-metadata/tutorial.bib index b7d32dcdb2d152..695b49fd7cd175 100644 --- a/topics/fair/tutorials/fair-metadata/tutorial.bib +++ b/topics/fair/tutorials/fair-metadata/tutorial.bib @@ -18,4 +18,11 @@ @article{Sarkans2021 year = {2021}, month = may, pages = {1418–1422} +} + +@online{diseaseontology, + author = {disease-ontology}, + title = {The Disease Ontology is a formal ontology of human disease.}, + url = {https://disease-ontology.org/?id=DOID:9352}, + urldate = {2024-03-15} } \ No newline at end of file diff --git a/topics/fair/tutorials/fair-metadata/tutorial.md b/topics/fair/tutorials/fair-metadata/tutorial.md index e676676cda552a..0cb11226bc0274 100644 --- a/topics/fair/tutorials/fair-metadata/tutorial.md +++ b/topics/fair/tutorials/fair-metadata/tutorial.md @@ -78,7 +78,7 @@ Table 2.1: The 15 FAIR Guiding Principles. Principles relating directly to metad Metadata is information that describes your data - it is data about data. -The provision of ‘rich’ metadata is key to FAIR since it allows data to be found, and enables other researchers to interpret data appropriately. ‘Rich’ in this context refers to extensive metadata, often connecting data to other data or terms (even in other datasets), with [qualified references](## "Qualified references: terms used to describe relationships to pieces of (meta)data.") specifying how they are connected. +The provision of ‘rich’ metadata is key to FAIR since it allows data to be found, and enables other researchers to interpret data appropriately. 
‘Rich’ in this context refers to extensive metadata, often connecting data to other data or terms (even in other datasets), with [qualified references]("Qualified references: terms used to describe relationships to pieces of (meta)data.") specifying how they are connected. If a researcher is given access to a dataset (a spreadsheet in CSV format, for example), the data is not usable without meaningful column headings and context. For example, what is this data about? Is it part of a larger dataset and, if so, how is it related? What are the column headings representing, and what are the rows representing? Additionally, if the values in the individual cells are device or assay measurements, what device was used, and what assay? @@ -89,16 +89,16 @@ Though this Figure 2.1 shows the difference between metadata and data, it does ![spreadsheet](../../images/figure2-1_human-welsh-cohort.png "A spreadsheet showing the relationship between data (orange) and metadata (blue)") -We are given only 2 pieces of [data provenance](## "Data provenance: metadata describing the origin of a piece of data, including information such as version, original location of the data, and usually an audit trail up to the current version.") in this example (i) the study name “LONDON DIABETES COHORT” which could help discover other documentation about the project; (ii) the name of the person who presumably led the study. While these may help to search for further information, for a human user, this would still be fraught with issues since both names may not uniquely identify the study or its lead. For a machine agent, such as a script, interpretation is made difficult because the metadata is not clearly marked-up (represented formally with tags). 
+We are given only 2 pieces of [data provenance]("Data provenance: metadata describing the origin of a piece of data, including information such as version, original location of the data, and usually an audit trail up to the current version.") in this example (i) the study name “LONDON DIABETES COHORT” which could help discover other documentation about the project; (ii) the name of the person who presumably led the study. While these may help to search for further information, for a human user, this would still be fraught with issues since both names may not uniquely identify the study or its lead. For a machine agent, such as a script, interpretation is made difficult because the metadata is not clearly marked-up (represented formally with tags). To fix this, contact details for investigators could be included. Often [ORCID IDs](https://info.orcid.org/what-is-orcid/) associated with names are used since they uniquely identify an individual and do not pose problems when an investigator moves institutions and the original (email) address changes. Coupled with this, investigators can be formally assigned a role in the provenance metadata allowing people to contact the appropriate person, for example, “data producer”, “data manager”. -A project URL should be used where possible, ideally one that can act as a [persistent identifier](## "Globally unique and persistent identifier: a permanent reference to a digital resource, usually given as a URL, that takes the user to that resource.") for the dataset. The landing page for that URL should provide more context for the dataset, and describe high level information such as how the dataset relates to other project outputs, as well as providing other provenance information (dates, times, places, people). This could include more metadata about the type of study, objectives, protocols, dates, release versioning and so on. 
At the end of this episode, resources are given to help define metadata elements that could be included, including a project called [Dublin Core](https://www.dublincore.org/specifications/dublin-core/) which defines [15 such elements](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/) or metadata fields. +A project URL should be used where possible, ideally one that can act as a [persistent identifier]("Globally unique and persistent identifier: a permanent reference to a digital resource, usually given as a URL, that takes the user to that resource.") for the dataset. The landing page for that URL should provide more context for the dataset, and describe high level information such as how the dataset relates to other project outputs, as well as providing other provenance information (dates, times, places, people). This could include more metadata about the type of study, objectives, protocols, dates, release versioning and so on. At the end of this episode, resources are given to help define metadata elements that could be included, including a project called [Dublin Core](https://www.dublincore.org/specifications/dublin-core/) which defines [15 such elements](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/) or metadata fields. Terms of access and reuse are missing which could be rectified by including a [data licence](https://rdmkit.elixir-europe.org/licensing#what-licence-should-you-apply-to-your-research-data), which often appears as part of the metadata, usually at the bottom of a webpage hosting data. There could be ambiguity around the acronym, “BPM”, used in the third column header, so this should be defined within a glossary of acronyms and or ideally hyperlinked to a definition in an existing ontology. -There are issues too with the data, as well as the metadata. The second column DISEASE TYPE could be better designed. 
Two pieces of information (data) are depicted in the same column: disease type (diabetes) and disease stage (early/late). Ideally these should be in 2 separate columns allowing researchers to subset on stage and disease type independently for downstream analysis. There are also four different terms used for diabetes (“Diabetes Mellitus II”, “Diabetes”, “Diabetes Mellitus” and “Diabetes Mellitus I”), which again does not allow a researcher to subset data efficiently. To fix this you would use defined terms within an existing vocabulary or ontology. The following [link](https://disease-ontology.org/?id=DOID:9352) accesses a disease ontology we could use, where each term (for example, “type 2 diabetes mellitus”) is described and assigned a unique ID. In the example above you would use this unique ID, or the associated descriptive term, to tag all patients with the same disease, identically. This then makes the data sub-setable and machine-readable. +There are issues too with the data, as well as the metadata. The second column DISEASE TYPE could be better designed. Two pieces of information (data) are depicted in the same column: disease type (diabetes) and disease stage (early/late). Ideally these should be in 2 separate columns allowing researchers to subset on stage and disease type independently for downstream analysis. There are also four different terms used for diabetes (“Diabetes Mellitus II”, “Diabetes”, “Diabetes Mellitus” and “Diabetes Mellitus I”), which again does not allow a researcher to subset data efficiently. To fix this you would use defined terms within an existing vocabulary or ontology. One option is the Disease Ontology {% cite diseaseontology %}, where each term (for example, “type 2 diabetes mellitus”) is described and assigned a unique ID. In the example above you would use this unique ID, or the associated descriptive term, to tag all patients with the same disease, identically. This then makes the data subsettable and machine-readable.
> @@ -115,19 +115,19 @@ There are issues too with the data, as well as the metadata. The second column D # Writing FAIR metadata -We have discussed already how rich metadata enables a dataset to be reused and interpreted correctly. In the context of the FAIR principles, the previous exercise illustrates two of these, namely that _“(Meta)data are richly described with a plurality of accurate and relevant attributes”_ (FAIR Principle R1) and that _“(Meta)data are associated with detailed provenance”_ (FAIR Principle R1,2). Further to this, the suggested use of the published [disease ontology]([https://disease-ontology.org/?id=DOID:9352) for data, illustrates a further three principles, where _“(Meta)data use [vocabularies](## "Vocabularies: (or controlled vocabulary) is a dictionary of terms you can use when producing (meta)data.") that follow FAIR principles”_ (FAIR Principle I2), and _“(Meta)data meet [domain-relevant community standards] (## "Community standards: standard guidelines used to structure and exchange data, usually supported by community-developed resources and/or software.")”_ (FAIR Principle R1.3). The use of hyperlinks specifically to terms in the ontology means that _“(Meta)data include [qualified references](## "Qualified references: terms used to describe relationships to pieces of (meta)data.")to other (meta)data”_ (FAIR Principle I3). From the previous exercise, the [disease ontology](https://disease-ontology.org/) provides the vocabulary for the different types of diabetes: [type 1 diabetes mellits](https://disease-ontology.org/?id=DOID:9352) and [type 2 diabetes mellitus](https://disease-ontology.org/?id=DOID:9352). +We have discussed already how rich metadata enables a dataset to be reused and interpreted correctly. 
In the context of the FAIR principles, the previous exercise illustrates two of these, namely that _“(Meta)data are richly described with a plurality of accurate and relevant attributes”_ (FAIR Principle R1) and that _“(Meta)data are associated with detailed provenance”_ (FAIR Principle R1.2). Further to this, the suggested use of the published Disease Ontology {% cite diseaseontology %} for data illustrates a further three principles, where _“(Meta)data use vocabularies, "Vocabularies: (or controlled vocabulary) is a dictionary of terms you can use when producing (meta)data.", that follow FAIR principles”_ (FAIR Principle I2), and _“(Meta)data meet domain-relevant community standards, "Community standards: standard guidelines used to structure and exchange data, usually supported by community-developed resources and/or software."”_ (FAIR Principle R1.3). The use of hyperlinks specifically to terms in the ontology means that _“(Meta)data include qualified references, "Qualified references: terms used to describe relationships to pieces of (meta)data.", to other (meta)data”_ (FAIR Principle I3). From the previous exercise, the [disease ontology](https://disease-ontology.org/) provides the vocabulary for the different types of diabetes: type 1 diabetes mellitus {% cite diseaseontology %} and type 2 diabetes mellitus {% cite diseaseontology %}. The FAIR Guiding Principles also highlight the importance of providing rich metadata to enable researchers to **find** datasets such that _“Data are described with rich metadata”_ (FAIR Principle F2). More often than not, a researcher will find data through searching its metadata, usually via an online or a database search. Information on how this can be achieved is discussed in the next episode on data registration. The use of vocabularies and cross-references is fundamental to data interoperability. Interoperable (meta)data can be linked and combined across studies, aided by consistent, compatible and machine-readable curation.
[FAIRsharing](https://fairsharing.org/search?fairsharingRegistry=Standard) is a useful registry of vocabularies, and standards, while more comprehensive ontology lists are maintained by [OLS (Ontology Lookup Service)](https://www.ebi.ac.uk/ols/index) and [BioPortal](https://bioportal.bioontology.org/). -The previous exercise also touches on [machine-readability](## "Machine-readable: (meta)data is supplied in a structured format that can be read by a computer.") of (meta)data through mention of using controlled terms in the “DISEASE TYPE” column to allow subsetting. The [Open Data handbook](https://opendatahandbook.org/glossary/en/terms/machine-readable/) gives a nice overview of machine-readable (meta)data but in short it is (meta)data supplied in a defined and structured format that can easily be read by an appropriate script or piece of software. If we use our example of a spreadsheet in comma-separated value form (CSV format), the (meta)data will be organised into cells, in a format that is interoperable with many software . This would not be true if the same data were made available as a screenshot, highlighting that human-readable data may not be machine-readable. +The previous exercise also touches on machine-readability, "Machine-readable: (meta)data is supplied in a structured format that can be read by a computer.", of (meta)data through mention of using controlled terms in the “DISEASE TYPE” column to allow subsetting. The [Open Data handbook](https://opendatahandbook.org/glossary/en/terms/machine-readable/) gives a nice overview of machine-readable (meta)data but in short it is (meta)data supplied in a defined and structured format that can easily be read by an appropriate script or piece of software. If we use our example of a spreadsheet in comma-separated value form (CSV format), the (meta)data will be organised into cells, in a format that is interoperable with many software tools.
This would not be true if the same data were made available as a screenshot, highlighting that human-readable data may not be machine-readable. # Rich metadata in public data repositories Help with rich metadata curation is often supported by public data repositories, and data deposition is one way you can improve its level of FAIRness. During submission, metadata is composed and linked, making it understood, accessible and searchable. -The figure below shows a screenshot of a dataset hosted by the [BioStudies](https://www.ebi.ac.uk/biostudies/) (Figure 2.2). BioStudies is a public database holding descriptions for biological studies and their (meta)data, and is often used by researchers to provide [primary identifiers](## " ") to supplementary information described in publications. +The figure below shows a screenshot of a dataset hosted by the [BioStudies](https://www.ebi.ac.uk/biostudies/) (Figure 2.2). BioStudies is a public database holding descriptions for biological studies and their (meta)data, and is often used by researchers to provide **primary identifiers** to supplementary information described in publications. ![BioStudies](../../images/figure2-2_rnaseq-database.png "A screenshot of a plant RNAseq dataset housed in BioStudies showing rich metadata {% cite RubénSchlaen %}") @@ -145,7 +145,7 @@ The figure below shows a screenshot of a dataset hosted by the [BioStudies](http # Using community standards for (meta)data -Public databases often serve communities and specific types of data, and may often use community standards for metadata curation. These standards include, usually, open-access [ontologies](## " ") that can be used by researchers to annotate their (meta)data. The [FAIRsharing](https://fairsharing.org/search?fairsharingRegistry=Standard) initiative provides a curated, searchable resource to help find many of these. 
The [disease ontology](https://disease-ontology.org/) we have mentioned already in the first exercise has its [own page](https://fairsharing.org/FAIRsharing.8b6wfq) in FAIRsharing. +Public databases often serve communities and specific types of data, and may often use community standards for metadata curation. These standards include, usually, open-access **ontologies** that can be used by researchers to annotate their (meta)data. The [FAIRsharing](https://fairsharing.org/search?fairsharingRegistry=Standard) initiative provides a curated, searchable resource to help find many of these. The [disease ontology](https://disease-ontology.org/) we have mentioned already in the first exercise has its [own page](https://fairsharing.org/FAIRsharing.8b6wfq) in FAIRsharing. Another useful resource serving the data needs of specific communities is [RDMkit](https://rdmkit.elixir-europe.org/). RDMkit is an online research data management toolkit for Life Sciences, and as part of its product hosts pages for domain-specific best practices and guidelines. [Domain pages](https://rdmkit.elixir-europe.org/your_domain) signpost detail and promote relevant considerations, tools and resources. diff --git a/topics/fair/tutorials/fair-origin/tutorial.bib b/topics/fair/tutorials/fair-origin/tutorial.bib index c86b16a3661c7f..e8c877e5ce88a3 100644 --- a/topics/fair/tutorials/fair-origin/tutorial.bib +++ b/topics/fair/tutorials/fair-origin/tutorial.bib @@ -26,4 +26,4 @@ @article{Cox2021 year = {2021}, month = jun, pages = {e1009041} -} \ No newline at end of file +}