# Tutorial 1: Adding a simple dataset
## Overview
This is the first of five tutorials on adding datasets to your `traits.build` database. It introduces the basic functions, the user input required, and the manual edits needed to complete the dataset's metadata file. The next four tutorials introduce progressively more complex datasets, functions, and decisions.
Before you begin this tutorial, ensure you have installed `traits.build`, cloned the `traits.build-template` repository, and successfully built a database from the datasets in `traits.build-template`. Instructions are available at [Tutorial: Example compilation](tutorial_compilation.html).
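If you still need to complete that setup, the block below is a rough sketch only; the linked tutorial is the authoritative reference, and the `remotes` call assumes you are installing the development version from GitHub.
```{r, eval=FALSE}
# Sketch of the setup steps; see the Example compilation tutorial for details.
# install.packages("remotes")
remotes::install_github("traitecoevo/traits.build")  # install the package from GitHub

# Clone the template compilation (run in a terminal, not in R):
# git clone https://github.com/traitecoevo/traits.build-template.git
```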
### Goals
- Learn how to [build a metadata.yml file](#build_metadata) for a dataset.
- Learn how to [merge a new dataset](#build_pipeline) into a `traits.build` database.
### New functions introduced
- `metadata_create_template`
- `metadata_add_source_doi`
- `metadata_add_locations`
- `metadata_add_traits`
- `dataset_test`
- `build_setup_pipeline`
------------------------------------------------------------------------
## Adding tutorial_dataset_1
### Ensure the dataset folder contains the correct data files
In the traits.build-template repository, there is a folder titled `tutorial_dataset_1` within the data folder.
- Ensure that this folder exists on your computer.
- Confirm that the file `data.csv` exists within the `tutorial_dataset_1` folder.
- Confirm that the folder `raw`, nested within the `tutorial_dataset_1` folder, contains one file, `notes.txt`.
### Source necessary functions
- Source the functions in the `traits.build` package:
```{r, eval=FALSE}
library(traits.build)
```
### Use functions to create a metadata.yml file {#build_metadata}
#### **Create a metadata template**
All dataset metadata is documented within a .yml file that also resides within the dataset's folder.
A function quickly creates the skeletal `metadata.yml` file.
```{r, eval=FALSE}
metadata_create_template("tutorial_dataset_1")
```
This function cycles through a series of user-input menus, querying about both the data format (long versus wide) and which columns contain which variables (taxon name, location name, individual identifiers, collection date).\
The menus are shown below, with the menu in [**blue**]{style="color:blue;"} and the appropriate user input in [**red**]{style="color:red;"}.
[Is the data long or wide format?]{style="color:blue;"}\
[1: Long]{style="color:blue;"}\
[2: Wide]{style="color:blue;"}
[Selection:]{style="color:blue;"} [**2**]{style="color:red;"}
This dataset is considered `wide`, because the data for each trait is documented in its own column.
[Select column for taxon_name]{style="color:blue;"}\
[1: Species]{style="color:blue;"}\
[2: site]{style="color:blue;"}\
[3: LMA (mg mm-2)]{style="color:blue;"}\
[4: Leaf nitrogen (mg mg-1)]{style="color:blue;"}\
[5: leaf size (mm2)]{style="color:blue;"}\
[6: latitude (deg)]{style="color:blue;"}\
[7: longitude (deg)]{style="color:blue;"}\
[8: description]{style="color:blue;"}
[Selection:]{style="color:blue;"} [**1**]{style="color:red;"}
Select [`1`]{style="color:red;"} since `taxon names` are documented in the column `Species`.
[Select column for location_name]{style="color:blue;"}
[1: NA]{style="color:blue;"}\
[2: Species]{style="color:blue;"}\
[3: site]{style="color:blue;"}\
[4: LMA (mg mm-2)]{style="color:blue;"}\
[5: Leaf nitrogen (mg mg-1)]{style="color:blue;"}\
[6: leaf size (mm2)]{style="color:blue;"}\
[7: latitude (deg)]{style="color:blue;"}\
[8: longitude (deg)]{style="color:blue;"}\
[9: description]{style="color:blue;"}
[Selection:]{style="color:blue;"} [**3**]{style="color:red;"}
Select [`3`]{style="color:red;"} since `location names` are documented in the column `site`.
[Select column for individual_id]{style="color:blue;"}
[1: NA]{style="color:blue;"}\
[2: Species]{style="color:blue;"}\
[3: site]{style="color:blue;"}\
[4: LMA (mg mm-2)]{style="color:blue;"}\
[5: Leaf nitrogen (mg mg-1)]{style="color:blue;"}\
[6: leaf size (mm2)]{style="color:blue;"}\
[7: latitude (deg)]{style="color:blue;"}\
[8: longitude (deg)]{style="color:blue;"}\
[9: description]{style="color:blue;"}\
[Selection:]{style="color:blue;"} [**1**]{style="color:red;"}
This dataset does not include a column for `individual_id`, so `1: NA` is the appropriate input.
[Select column for collection_date]{style="color:blue;"}
[1: NA]{style="color:blue;"}\
[2: Species]{style="color:blue;"}\
[3: site]{style="color:blue;"}\
[4: LMA (mg mm-2)]{style="color:blue;"}\
[5: Leaf nitrogen (mg mg-1)]{style="color:blue;"}\
[6: leaf size (mm2)]{style="color:blue;"}\
[7: latitude (deg)]{style="color:blue;"}\
[8: longitude (deg)]{style="color:blue;"}\
[9: description]{style="color:blue;"}
[Selection:]{style="color:blue;"} [**1**]{style="color:red;"}
This dataset does not include a column for `collection_date`, so `1: NA` is the appropriate input.
A follow-up question then allows you to add a fixed `collection_date` as a range. The information can be manually updated later.
[Enter collection_date range in format '2007/2009':]{style="color:blue;"} [**2002-11/2002-11**]{style="color:red;"}\
A final user prompt asks if, for any traits, a sequence of rows represents repeat observations.\
[Do all traits need `repeat_measurements_id`'s?]{style="color:blue;"}
[1: Yes]{style="color:blue;"}
[2: No]{style="color:blue;"}
This is only needed if the dataset documents response curve data (e.g. an A-ci or light response curve for plants, or a temperature response curve for animal or plant behaviour), and the answer is almost always `no`.\
[Selection:]{style="color:blue;"} [**2**]{style="color:red;"}\
*Navigate to the dataset's folder to find the metadata.yml file.*\
*Open this file in Visual Studio Code (or another text-based editor of choice; NOT Word!), so you can see how it is progressively filled in as you work through the next steps.*
------------------------------------------------------------------------
#### **Propagate source information into the metadata.yml file**
This dataset is from a published source with a `doi` and therefore the source information can be added with a single line of code:
```{r, eval=FALSE}
metadata_add_source_doi(
dataset_id = "tutorial_dataset_1",
doi = "10.1111/j.0022-0477.2005.00992.x"
)
```
The following information is automatically propagated into the source field:
```{r, eval=FALSE}
primary:
key: Test_1
bibtype: Article
year: '2005'
author: Daniel S. Falster and Mark Westoby
journal: Journal of Ecology
title: Alternative height strategies among 45 dicot rain forest species from tropical Queensland, Australia
volume: '93'
number: '3'
pages: 521--535
doi: 10.1111/j.0022-0477.2005.00992.x
```
Once you've run this line of code, look at the metadata file to confirm:
1. the authors' names are formatted as `first name last name` or `first initial last name` (`Daniel S. Falster` or `D. S. Falster` if first names weren't available)\
2. sequential authors' names are separated by `and`\
3. the article title is in sentence case\
4. the page numbers are filled in as a range, separated by a double dash (`521--535` is correct)
*Note: there is also a function `metadata_add_source_bibtex` if your source information is in BibTeX format.*
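As a minimal sketch only (the file path below is hypothetical, and you should check `?metadata_add_source_bibtex` for the exact arguments), the call might look like:
```{r, eval=FALSE}
# Hypothetical example: add source details from a BibTeX file instead of a doi.
# The path is illustrative only; check ?metadata_add_source_bibtex for the
# exact argument names before running.
metadata_add_source_bibtex(
  dataset_id = "tutorial_dataset_1",
  file = "data/tutorial_dataset_1/raw/source.bib"
)
```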
------------------------------------------------------------------------
#### **Add location details**
Location data can be automatically propagated into the metadata file if it is available in tabular format. For instance, for this study:
```{r, eval=FALSE}
# read_csv, select, distinct and the pipe come from the tidyverse (readr/dplyr/magrittr)
library(tidyverse)

locations <-
  read_csv("data/tutorial_dataset_1/data.csv") %>%
  select(site, description, `latitude (deg)`, `longitude (deg)`) %>%
  distinct()
```
You can then add this location information directly into the metadata file by running:
```{r, eval=FALSE}
metadata_add_locations(dataset_id = "tutorial_dataset_1", location_data = locations)
```
This leads to the following user prompts:
[Select column for location_name]{style="color:blue;"}
[1: site]{style="color:blue;"}\
[2: description]{style="color:blue;"}\
[3: latitude (deg)]{style="color:blue;"}\
[4: longitude (deg)]{style="color:blue;"}\
[Selection:]{style="color:blue;"} [**1**]{style="color:red;"}
Select the same column that you indicated contained `location` names when you created the metadata template.
[Indicate all columns you wish to keep as distinct location_properties in tutorial_dataset_1 (by number separated by space; e.g. '1 2 4'):]{style="color:blue;"}\
[1: description]{style="color:blue;"}\
[2: latitude (deg)]{style="color:blue;"}\
[3: longitude (deg)]{style="color:blue;"}\
[Selection:]{style="color:blue;"} [**1 2 3**]{style="color:red;"}
Select all columns that include `location properties` that should be documented within the `metadata.yml` file. In this case, it is all three columns.
[Following locations added to metadata for tutorial_dataset_1: 'Atherton', 'Cape Tribulation']{style="color:blue;"}\
[with variables 'description', 'latitude (deg)', 'longitude (deg)']{style="color:blue;"}\
[Please complete information in data/tutorial_dataset_1/metadata.yml]{style="color:blue;"}\
All available location data has now been automatically added to the `metadata.yml` file.
```{r, eval=FALSE}
locations:
Atherton:
description: Tropical rain forest vegetation
latitude (deg): -17.117
longitude (deg): 145.65
Cape Tribulation:
description: Complex mesophyll vine forest in tropical rain forest
latitude (deg): -16.1
longitude (deg): 145.45
```
------------------------------------------------------------------------
#### **Add traits**
The next step is to select which columns in the `data.csv` file have trait information you want to include in the database.
The function `metadata_add_traits` automatically adds the trait-scaffold to `metadata.yml`:
```{r, eval=FALSE}
metadata_add_traits(dataset_id = "tutorial_dataset_1")
```
The user is prompted to select the columns with trait data.
[Indicate all columns you wish to keep as distinct traits in tutorial_dataset_1 (by number separated by space; e.g. '1 2 4'):]{style="color:blue;"}\
[1: Species]{style="color:blue;"}\
[2: site]{style="color:blue;"}\
[3: LMA (mg mm-2)]{style="color:blue;"}\
[4: Leaf nitrogen (mg mg-1)]{style="color:blue;"}\
[5: leaf size (mm2)]{style="color:blue;"}\
[6: latitude (deg)]{style="color:blue;"}\
[7: longitude (deg)]{style="color:blue;"}\
[8: description]{style="color:blue;"}\
[Selection:]{style="color:blue;"} [**3 4 5**]{style="color:red;"}\
Select columns 3, 4, and 5, as these contain trait data.
[Following traits added to metadata for tutorial_dataset_1: 'LMA (mg mm-2)', 'Leaf nitrogen (mg mg-1)', 'leaf size (mm2)']{style="color:blue;"}\
[Please complete information in data/tutorial_dataset_1/metadata.yml]{style="color:blue;"}\
`metadata.yml` now includes a framework in which to manually fill in details about each trait:
```{r, eval=FALSE}
traits:
- var_in: LMA (mg mm-2)
unit_in: unknown
trait_name: unknown
entity_type: unknown
value_type: unknown
basis_of_value: unknown
replicates: unknown
methods: unknown
- var_in: Leaf nitrogen (mg mg-1)
unit_in: unknown
trait_name: unknown
entity_type: unknown
value_type: unknown
basis_of_value: unknown
replicates: unknown
methods: unknown
- var_in: leaf size (mm2)
unit_in: unknown
trait_name: unknown
entity_type: unknown
value_type: unknown
basis_of_value: unknown
replicates: unknown
methods: unknown
```
------------------------------------------------------------------------
### Manual filling in of metadata
The remaining fields within the metadata.yml file must now be filled in manually.
These include:\
* the `contributors` section\
* the `description`, `basis_of_record`, `life_stage`, `sampling_strategy`, `original_file`, and `notes` under the `dataset` section\
* details for each trait, including `unit_in`, `trait_name`, `entity_type`, `value_type`, `basis_of_value`, `replicates`, and `methods`
These are all fields that contain the word `unknown`.
#### **Adding contributors**
| Contributor field | Information to add |
|----------------------|--------------------------------------------------|
| last_name, first_name | The contributor's first and last names should be available from the source |
| ORCID | Contributors are identified by their ORCID, available for most active researchers at [orcid.org](https://orcid.org/) |
| affiliation | Available from the source or the orcid.org website. Use the same syntax for the same affiliation throughout your database. |
| additional_role | For the lead dataset contributor, add the field: `additional_role: contact` |
- You can add multiple data collectors by duplicating the relevant 4 lines of code; see the [Adding dataset vignette](adding_data.html) for protocols on who to add as a data collector.
- The line `assistants:` can be deleted if there aren't any assistants' names to add.
- Add yourself as the `dataset_curator`. (An illustrative sketch of a completed `contributors` section follows below.)\
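The sketch below shows one way a completed `contributors` section might look. The ORCID, affiliation, and curator name are placeholders; match the exact layout to the skeleton that `metadata_create_template` produced.
```{r, eval=FALSE}
contributors:
  data_collectors:
  - last_name: Falster
    first_name: Daniel
    ORCID: 0000-0000-0000-0000      # placeholder; look up the real ORCID at orcid.org
    affiliation: Example University  # placeholder; use the affiliation given in the source
    additional_role: contact
  assistants: .na
  dataset_curator: Your Name         # placeholder; add yourself here
```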
#### **Dataset fields**
- The file `data/tutorial_dataset_1/raw/tutorial_dataset_1_notes.txt` indicates how to fill in the `unknown` dataset fields for this study.
- In general, the information to fill in these fields should be available from the source (article) or obtained directly from the dataset contributor. (An illustrative sketch of a completed `dataset` section follows the table below.)\
| Dataset field | Information to add |
|---------------------|---------------------------------------------------|
| basis_of_record | See [traits.build_schema](https://github.com/traitecoevo/traits.build/blob/develop/inst/support/traits.build_schema.yml) for allowable terms. |
| life_stage | See [traits.build_schema](https://github.com/traitecoevo/traits.build/blob/develop/inst/support/traits.build_schema.yml) for allowable terms. |
| description | A 1-2 sentence summary of the dataset. This can generally be formulated from information in the abstract. |
| sampling_strategy | A description of how sites and sampling protocols were chosen. Can generally be taken verbatim from the methods section of a manuscript. |
| original_file | Name of the file submitted by the data contributor and archived in the raw folder. |
| notes | `.na` for this study; in general, any notes added by the data curator about data quality or edits made to the data during dataset curation. |
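Purely as an illustration of the shape of this section (every value below is a placeholder; the real values for this study come from the notes file in the `raw` folder and from the source publication), a completed `dataset` section might include entries such as:
```{r, eval=FALSE}
dataset:
  # fields already filled in by metadata_create_template are omitted here
  description: A 1-2 sentence summary drawn from the abstract   # placeholder
  basis_of_record: field       # placeholder; see the schema for allowable terms
  life_stage: adult            # placeholder; see the schema for allowable terms
  sampling_strategy: Verbatim description, from the methods section, of how sites and species were chosen
  original_file: data.csv, archived in the raw folder           # placeholder
  notes: .na
```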
#### **Trait details**
##### **trait_name**
The `trait_name` must match a `trait_name` within the [traits dictionary](https://github.com/traitecoevo/traits.build-template/blob/master/config/traits.yml). For this example:
| column in dataset | trait concept |
|-------------------------|---------------------|
| LMA (mg mm-2) | leaf_mass_per_area |
| Leaf nitrogen (mg mg-1) | leaf_N_per_dry_mass |
| leaf size (mm2) | leaf_area |
A dataset curator must be familiar with the likely traits in their discipline to accurately match those in a contributed dataset to traits in the dictionary, and be able to determine if a new trait definition is warranted.
##### **unit_in**
Units are formatted according to the [UCUM convention](https://ucum.org/ucum):
- units in the numerator are separated by a '.'
- units in the denominator are each preceded by a '/'
- "extra" information that is commonly included informally as part of the units for clarity can be placed in curly brackets, {}
As examples:
| units | UCUM format |
|--------------------------------------------------|----------------|
| milligram per square millimetre | mg/mm2 |
| micromole per square metre second | umol/m2/s |
| micromole carbon dioxide per square metre second | umol{CO2}/m2/s |
If the units being read in for a specific trait differ from those defined for the trait in the [traits dictionary](https://github.com/traitecoevo/traits.build-template/blob/master/config/traits.yml), the trait values are converted using the conversion rules specified in [unit_conversions.csv](https://github.com/traitecoevo/traits.build-template/blob/master/config/unit_conversions.csv).
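This conversion happens automatically during the build. If you want to sanity-check a conversion yourself, the sketch below uses the `units` package (not part of `traits.build`), and the target unit g/m2 is simply an assumed example of what a dictionary might define:
```{r, eval=FALSE}
# Standalone sanity check of a unit conversion, independent of traits.build.
# Assumes, for illustration only, a dictionary unit of g/m2 for leaf_mass_per_area.
library(units)

lma <- set_units(0.1, mg/mm^2)  # value as supplied in data.csv
set_units(lma, g/m^2)           # the same value in the assumed dictionary units
#> 100 [g/m^2]
```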
##### **entity_type, value_type, basis_of_value, replicates, methods**
| field | value for this dataset | description |
|------------------|------------------|------------------------------------|
| entity_type | population | The entity corresponding to the trait value. Uses a controlled vocabulary. See [traits.build_schema](https://github.com/traitecoevo/traits.build/blob/develop/inst/support/traits.build_schema.yml) for allowable terms. |
| value_type | mean | The statistical nature of the trait value. Uses a controlled vocabulary. See [traits.build_schema](https://github.com/traitecoevo/traits.build/blob/develop/inst/support/traits.build_schema.yml) for allowable terms. |
| basis_of_value | measurement | How the trait value was obtained. See [traits.build_schema](https://github.com/traitecoevo/traits.build/blob/develop/inst/support/traits.build_schema.yml) for allowable terms. |
| replicates | 3 | The number of replicate measurements that comprise the trait measurement recorded in the spreadsheet. |
| methods | See the study's [metadata_notes.txt](https://github.com/traitecoevo/traits.build-template/blob/master/data/tutorial_dataset_1/raw/metadata_notes.txt) file | A verbatim (free-form) text field documenting the methods used to collect the trait measurements. This is generally available from the reference or directly from the author. |
*The values for `entity_type`, `value_type`, `basis_of_value`, and `replicates` can vary by trait -- and indeed by measurement -- but for this study are identical for all traits.*
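Pulling these pieces together, the completed entry for the first trait might look like the sketch below; the `methods` text is a placeholder for the verbatim description from the study's notes file:
```{r, eval=FALSE}
traits:
- var_in: LMA (mg mm-2)
  unit_in: mg/mm2
  trait_name: leaf_mass_per_area
  entity_type: population
  value_type: mean
  basis_of_value: measurement
  replicates: 3
  methods: Verbatim description of how LMA was measured, copied from the notes file or the publication.  # placeholder
```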
#### **Final steps**
##### **Double check the metadata.yml file**
You should now have a completed `metadata.yml` file, with no `unknown` fields.
You'll notice five sections we haven't used: `contexts`, `substitutions`, `taxonomic_updates`, `exclude_observations`, and `questions`.
These should each contain an `.na` (as in `substitutions: .na`). They will be explored in future lessons.
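A quick way to confirm from R that nothing was missed is a simple text search for leftover placeholders; any matches point to fields that still need filling in:
```{r, eval=FALSE}
# Simple text search for leftover placeholders in the metadata file
metadata_lines <- readLines("data/tutorial_dataset_1/metadata.yml")
metadata_lines[grepl("unknown", metadata_lines)]
# character(0) means there is nothing left to fill in
```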
##### **Run tests on the metadata file**
Confirm there are no errors in the `metadata.yml` file:
```{r, eval=FALSE}
dataset_test("tutorial_dataset_1")
```
This *should* result in the following output:
[\[ FAIL 0 \| WARN 0 \| SKIP 0 \| PASS 79 \]]{style="color:blue;"}\
##### **Add dataset to the database** {#build_pipeline}
Next, add the dataset_id to the build file that builds the database, then rebuild the database:
```{r, eval=FALSE}
build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
```
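To confirm the new dataset has been merged, you can peek at the rebuilt database object. This is a sketch that assumes `build.R` creates an object named `traits.build_database` (the `database_name` specified above) containing a `traits` table:
```{r, eval=FALSE}
# Quick check that tutorial_dataset_1 now appears in the traits table
# (assumes build.R created `traits.build_database` in your workspace)
library(dplyr)

traits.build_database$traits %>%
  filter(dataset_id == "tutorial_dataset_1") %>%
  count(trait_name)
```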
##### **Build dataset report**
As a final step, build a report for the study:
```{r, eval=FALSE}
traits.build_database$build_info$version <- "5.0.0"
# a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_1", traits.build_database, overwrite = TRUE)
```
Have a look at the report; these reports become much more interesting once there are more datasets in the database.