# Tutorial 1: Adding a simple dataset
## Overview
This is the first of five tutorials on adding datasets to your `traits.build` database. It introduces the basic functions, the user input required, and the manual edits needed to complete the dataset's metadata file. The next four tutorials introduce progressively more complex datasets, functions, and decisions.
Before you begin this tutorial, ensure you have installed `traits.build`, cloned the `traits.build-template` repository, and successfully built a database from the datasets in `traits.build-template`. Instructions are available at [Tutorial: Example compilation](tutorial_compilation.html).
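If you still need to complete that setup, the block below is a rough sketch only; the linked tutorial is the authoritative reference, and the `remotes` call assumes you are installing the development version from GitHub.
```{r, eval=FALSE}
# Sketch of the setup steps; see the Example compilation tutorial for details.
# install.packages("remotes")
remotes::install_github("traitecoevo/traits.build")  # install the package from GitHub

# Clone the template compilation (run in a terminal, not in R):
# git clone https://github.com/traitecoevo/traits.build-template.git
```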
### Goals
- Learn how to [build a metadata.yml file](#build_metadata) for a dataset.
- Learn how to [merge a new dataset](#build_pipeline) into a `traits.build` database.
### New functions introduced
- `metadata_create_template`
- `metadata_add_source_doi`
- `metadata_add_locations`
- `metadata_add_traits`
- `dataset_test`
- `build_setup_pipeline`
------------------------------------------------------------------------
## Adding tutorial_dataset_1
### Ensure the dataset folder contains the correct data files
In the traits.build-template repository, there is a folder titled `tutorial_dataset_1` within the data folder.
- Ensure that this folder exists on your computer.
- Confirm that the file `data.csv` exists within the `tutorial_dataset_1` folder.
- Confirm that the folder `raw`, nested within the `tutorial_dataset_1` folder, contains one file, `notes.txt`.
### Source necessary functions
- Source the functions in the `traits.build` package:
```{r, eval=FALSE}
library(traits.build)
```
### Use functions to create a metadata.yml file {#build_metadata}
#### **Create a metadata template**
All dataset metadata is documented within a .yml file that also resides within the dataset's folder.
A function quickly creates the skeletal `metadata.yml` file.
```{r, eval=FALSE}
metadata_create_template("tutorial_dataset_1")
```
This function cycles through a series of user-input menus, querying about both the data format (long versus wide) and which columns contain which variables (taxon name, location name, individual identifiers, collection date).\
The menus are shown below, with the menu in [**blue**]{style="color:blue;"} and the appropriate user input in [**red**]{style="color:red;"}.
[Is the data long or wide format?]{style="color:blue;"}\
[1: Long]{style="color:blue;"}\
[2: Wide]{style="color:blue;"}
[Selection:]{style="color:blue;"} [**2**]{style="color:red;"}
This dataset is considered `wide`, because the data for each trait is documented in its own column.
[Select column for taxon_name]{style="color:blue;"}\
[1: Species]{style="color:blue;"}\
[2: site]{style="color:blue;"}\
[3: LMA (mg mm-2)]{style="color:blue;"}\
[4: Leaf nitrogen (mg mg-1)]{style="color:blue;"}\
[5: leaf size (mm2)]{style="color:blue;"}\
[6: latitude (deg)]{style="color:blue;"}\
[7: longitude (deg)]{style="color:blue;"}\
[8: description]{style="color:blue;"}
[Selection:]{style="color:blue;"} [**1**]{style="color:red;"}
Select [`1`]{style="color:red;"} since `taxon names` are documented in the column `Species`.
[Select column for location_name]{style="color:blue;"}
[1: NA]{style="color:blue;"}\
[2: Species]{style="color:blue;"}\
[3: site]{style="color:blue;"}\
[4: LMA (mg mm-2)]{style="color:blue;"}\
[5: Leaf nitrogen (mg mg-1)]{style="color:blue;"}\
[6: leaf size (mm2)]{style="color:blue;"}\
[7: latitude (deg)]{style="color:blue;"}\
[8: longitude (deg)]{style="color:blue;"}\
[9: description]{style="color:blue;"}
[Selection:]{style="color:blue;"} [**3**]{style="color:red;"}
Select [`3`]{style="color:red;"} since `location names` are documented in the column `site`.
[Select column for individual_id]{style="color:blue;"}
[1: NA]{style="color:blue;"}\
[2: Species]{style="color:blue;"}\
[3: site]{style="color:blue;"}\
[4: LMA (mg mm-2)]{style="color:blue;"}\
[5: Leaf nitrogen (mg mg-1)]{style="color:blue;"}\
[6: leaf size (mm2)]{style="color:blue;"}\
[7: latitude (deg)]{style="color:blue;"}\
[8: longitude (deg)]{style="color:blue;"}\
[9: description]{style="color:blue;"}\
[Selection:]{style="color:blue;"} [**1**]{style="color:red;"}
This dataset does not include a column for `individual_id`, so `1: NA` is the appropriate input.
[Select column for collection_date]{style="color:blue;"}
[1: NA]{style="color:blue;"}\
[2: Species]{style="color:blue;"}\
[3: site]{style="color:blue;"}\
[4: LMA (mg mm-2)]{style="color:blue;"}\
[5: Leaf nitrogen (mg mg-1)]{style="color:blue;"}\
[6: leaf size (mm2)]{style="color:blue;"}\
[7: latitude (deg)]{style="color:blue;"}\
[8: longitude (deg)]{style="color:blue;"}\
[9: description]{style="color:blue;"}
[Selection:]{style="color:blue;"} [**1**]{style="color:red;"}
This dataset does not include a column for `collection_date`, so `1: NA` is the appropriate input.
A follow-up question then allows you to add a fixed `collection_date` as a range. The information can be manually updated later.
[Enter collection_date range in format '2007/2009':]{style="color:blue;"} [**2002-11/2002-11**]{style="color:red;"}\
A final user prompt asks if, for any traits, a sequence of rows represents repeat observations.\
[Do all traits need `repeat_measurements_id`'s?]{style="color:blue;"}
[1: Yes]{style="color:blue;"}
[2: No]{style="color:blue;"}
This is only needed if the dataset documents response curve data (e.g. an A-ci or light response curve for plants, or a temperature response curve for animal or plant behaviour), and the answer is almost always `no`.\
[Selection:]{style="color:blue;"} [**2**]{style="color:red;"}\
*Navigate to the dataset's folder to find the metadata.yml file.*\
*Open this file in Visual Studio Code (or another text-based editor of choice; NOT Word!), so you can see how it is progressively filled in as you work through the next steps.*
------------------------------------------------------------------------
#### **Propagate source information into the metadata.yml file**
This dataset is from a published source with a `doi` and therefore the source information can be added with a single line of code:
```{r, eval=FALSE}
metadata_add_source_doi(
dataset_id = "tutorial_dataset_1",
doi = "10.1111/j.0022-0477.2005.00992.x"
)
```
The following information is automatically propagated into the source field:
```{r, eval=FALSE}
primary:
key: Test_1
bibtype: Article
year: '2005'
author: Daniel S. Falster and Mark Westoby
journal: Journal of Ecology
title: Alternative height strategies among 45 dicot rain forest species from tropical Queensland, Australia
volume: '93'
number: '3'
pages: 521--535
doi: 10.1111/j.0022-0477.2005.00992.x
```
Once you've run this line of code, look at the metadata file to confirm:
1. the authors' names are formatted as `first name last name` or `first initial last name` (`Daniel S. Falster` or `D. S. Falster` if first names weren't available)\
2. sequential authors' names are separated by `and`\
3. the article title is in sentence case\
4. the page numbers are filled in as a range, separated by a double dash (`521--535` is correct)
*Note: there is also a function `metadata_add_source_bibtex` if your source information is in BibTeX format.*
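As a minimal sketch only (the file path below is hypothetical, and you should check `?metadata_add_source_bibtex` for the exact arguments), the call might look like:
```{r, eval=FALSE}
# Hypothetical example: add source details from a BibTeX file instead of a doi.
# The path is illustrative only; check ?metadata_add_source_bibtex for the
# exact argument names before running.
metadata_add_source_bibtex(
  dataset_id = "tutorial_dataset_1",
  file = "data/tutorial_dataset_1/raw/source.bib"
)
```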
------------------------------------------------------------------------
#### **Add location details**
Location data can be automatically propagated into the metadata file if it is available in tabular format. For instance, for this study:
```{r, eval=FALSE}
# read_csv, select, distinct and the pipe come from the tidyverse (readr/dplyr/magrittr)
library(tidyverse)

locations <-
  read_csv("data/tutorial_dataset_1/data.csv") %>%
  select(site, description, `latitude (deg)`, `longitude (deg)`) %>%
  distinct()
```
You can then add this location information directly into the metadata file by running:
```{r, eval=FALSE}
metadata_add_locations(dataset_id = "tutorial_dataset_1", location_data = locations)
```
This leads to the following user prompts:
[Select column for location_name]{style="color:blue;"}
[1: site]{style="color:blue;"}\
[2: description]{style="color:blue;"}\
[3: latitude (deg)]{style="color:blue;"}\
[4: longitude (deg)]{style="color:blue;"}\
[Selection:]{style="color:blue;"} [**1**]{style="color:red;"}
Select the same column that you indicated contained `location` names when you created the metadata template.
[Indicate all columns you wish to keep as distinct location_properties in tutorial_dataset_1 (by number separated by space; e.g. '1 2 4'):]{style="color:blue;"}\
[1: description]{style="color:blue;"}\
[2: latitude (deg)]{style="color:blue;"}\
[3: longitude (deg)]{style="color:blue;"}\
[Selection:]{style="color:blue;"} [**1 2 3**]{style="color:red;"}
Select all columns that include `location properties` that should be documented within the `metadata.yml` file. In this case, it is all three columns.
[Following locations added to metadata for tutorial_dataset_1: 'Atherton', 'Cape Tribulation']{style="color:blue;"}\
[with variables 'description', 'latitude (deg)', 'longitude (deg)']{style="color:blue;"}\
[Please complete information in data/tutorial_dataset_1/metadata.yml]{style="color:blue;"}\
All available location data has now been automatically added to the `metadata.yml` file.
```{r, eval=FALSE}
locations:
Atherton:
description: Tropical rain forest vegetation
latitude (deg): -17.117
longitude (deg): 145.65
Cape Tribulation:
description: Complex mesophyll vine forest in tropical rain forest
latitude (deg): -16.1
longitude (deg): 145.45
```
------------------------------------------------------------------------
#### **Add traits**
The next step is to select which columns in the `data.csv` file have trait information you want to include in the database.
The function `metadata_add_traits` automatically adds the trait-scaffold to `metadata.yml`:
```{r, eval=FALSE}
metadata_add_traits(dataset_id = "tutorial_dataset_1")
```
The user is prompted to select the columns with trait data.
[Indicate all columns you wish to keep as distinct traits in tutorial_dataset_1 (by number separated by space; e.g. '1 2 4'):]{style="color:blue;"}\
[1: Species]{style="color:blue;"}\
[2: site]{style="color:blue;"}\
[3: LMA (mg mm-2)]{style="color:blue;"}\
[4: Leaf nitrogen (mg mg-1)]{style="color:blue;"}\
[5: leaf size (mm2)]{style="color:blue;"}\
[6: latitude (deg)]{style="color:blue;"}\
[7: longitude (deg)]{style="color:blue;"}\
[8: description]{style="color:blue;"}\
[Selection:]{style="color:blue;"} [**3 4 5**]{style="color:red;"}\
Select columns 3, 4, and 5, as these contain trait data.
[Following traits added to metadata for tutorial_dataset_1: 'LMA (mg mm-2)', 'Leaf nitrogen (mg mg-1)', 'leaf size (mm2)']{style="color:blue;"}\
[Please complete information in data/tutorial_dataset_1/metadata.yml]{style="color:blue;"}\
`metadata.yml` now includes a framework in which to manually fill in details about each trait:
```{r, eval=FALSE}
traits:
- var_in: LMA (mg mm-2)
unit_in: unknown
trait_name: unknown
entity_type: unknown
value_type: unknown
basis_of_value: unknown
replicates: unknown
methods: unknown
- var_in: Leaf nitrogen (mg mg-1)
unit_in: unknown
trait_name: unknown
entity_type: unknown
value_type: unknown
basis_of_value: unknown
replicates: unknown
methods: unknown
- var_in: leaf size (mm2)
unit_in: unknown
trait_name: unknown
entity_type: unknown
value_type: unknown
basis_of_value: unknown
replicates: unknown
methods: unknown
```
------------------------------------------------------------------------
### Manual filling in of metadata
The remaining fields within the metadata.yml file must now be filled in manually.
These include:\
* the `contributors` section\
* the `description`, `basis_of_record`, `life_stage`, `sampling_strategy`, `original_file`, and `notes` under the `dataset` section\
* details for each trait, including `unit_in`, `trait_name`, `entity_type`, `value_type`, `basis_of_value`, `replicates`, and `methods`
These are all fields that contain the word `unknown`.
#### **Adding contributors**
| Contributor field | Information to add |
|----------------------|--------------------------------------------------|
| last_name, first_name | The contributor's first and last names should be available from the source |
| ORCID | Contributors are identified by their ORCID, available for most active researchers at [orcid.org](https://orcid.org/) |
| affiliation | Available from the source or the orcid.org website. Use the same syntax for the same affiliation throughout your database. |
| additional_role | For the lead dataset contributor, add the field: `additional_role: contact` |
- You can add multiple data collectors by duplicating the relevant 4 lines of code; see the [Adding dataset vignette](adding_data.html) for protocols on who to add as a data collector.
- The line `assistants:` can be deleted if there aren't any assistants' names to add.
- Add yourself as the `dataset_curator`. (An illustrative sketch of a completed `contributors` section follows below.)\
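The sketch below shows one way a completed `contributors` section might look. The ORCID, affiliation, and curator name are placeholders; match the exact layout to the skeleton that `metadata_create_template` produced.
```{r, eval=FALSE}
contributors:
  data_collectors:
  - last_name: Falster
    first_name: Daniel
    ORCID: 0000-0000-0000-0000      # placeholder; look up the real ORCID at orcid.org
    affiliation: Example University  # placeholder; use the affiliation given in the source
    additional_role: contact
  assistants: .na
  dataset_curator: Your Name         # placeholder; add yourself here
```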
#### **Dataset fields**
- The file `data/tutorial_dataset_1/raw/tutorial_dataset_1_notes.txt` indicates how to fill in the `unknown` dataset fields for this study.
- In general, the information to fill in these fields should be available from the source (article) or obtained directly from the dataset contributor. (An illustrative sketch of a completed `dataset` section follows the table below.)\
| Dataset field | Information to add |
|---------------------|---------------------------------------------------|
| basis_of_record | See [traits.build_schema](https://github.com/traitecoevo/traits.build/blob/develop/inst/support/traits.build_schema.yml) for allowable terms. |
| life_stage | See [traits.build_schema](https://github.com/traitecoevo/traits.build/blob/develop/inst/support/traits.build_schema.yml) for allowable terms. |
| description | A 1-2 sentence summary of the dataset. This can generally be formulated from information in the abstract. |
| sampling_strategy | A description of how sites and sampling protocols were chosen. Can generally be taken verbatim from the methods section of a manuscript. |
| original_file | Name of the file submitted by the data contributor and archived in the raw folder. |
| notes | `.na` for this study; in general, any notes added by the data curator about data quality or edits made to the data during dataset curation. |
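Purely as an illustration of the shape of this section (every value below is a placeholder; the real values for this study come from the notes file in the `raw` folder and from the source publication), a completed `dataset` section might include entries such as:
```{r, eval=FALSE}
dataset:
  # fields already filled in by metadata_create_template are omitted here
  description: A 1-2 sentence summary drawn from the abstract   # placeholder
  basis_of_record: field       # placeholder; see the schema for allowable terms
  life_stage: adult            # placeholder; see the schema for allowable terms
  sampling_strategy: Verbatim description, from the methods section, of how sites and species were chosen
  original_file: data.csv, archived in the raw folder           # placeholder
  notes: .na
```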
#### **Trait details**
##### **trait_name**
The `trait_name` must match a `trait_name` within the [traits dictionary](https://github.com/traitecoevo/traits.build-template/blob/master/config/traits.yml). For this example:
| column in dataset | trait concept |
|-------------------------|---------------------|
| LMA (mg mm-2) | leaf_mass_per_area |
| Leaf nitrogen (mg mg-1) | leaf_N_per_dry_mass |
| leaf size (mm2) | leaf_area |
A dataset curator must be familiar with the likely traits in their discipline to accurately match those in a contributed dataset to traits in the dictionary, and be able to determine if a new trait definition is warranted.
##### **unit_in**
Units are formatted according to the [UCUM convention](https://ucum.org/ucum):
- units in the numerator are separated by a '.'
- units in the denominator are each preceded by a '/'
- "extra" information that is commonly included informally as part of the units for clarity can be placed in curly brackets, {}
As examples:
| units | UCUM format |
|--------------------------------------------------|----------------|
| milligram per square millimetre | mg/mm2 |
| micromole per square metre second | umol/m2/s |
| micromole carbon dioxide per square metre second | umol{CO2}/m2/s |
If the units being read in for a specific trait differ from those defined for the trait in the [traits dictionary](https://github.com/traitecoevo/traits.build-template/blob/master/config/traits.yml), the trait values are converted using the conversion rules specified in [unit_conversions.csv](https://github.com/traitecoevo/traits.build-template/blob/master/config/unit_conversions.csv).
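This conversion happens automatically during the build. If you want to sanity-check a conversion yourself, the sketch below uses the `units` package (not part of `traits.build`), and the target unit g/m2 is simply an assumed example of what a dictionary might define:
```{r, eval=FALSE}
# Standalone sanity check of a unit conversion, independent of traits.build.
# Assumes, for illustration only, a dictionary unit of g/m2 for leaf_mass_per_area.
library(units)

lma <- set_units(0.1, mg/mm^2)  # value as supplied in data.csv
set_units(lma, g/m^2)           # the same value in the assumed dictionary units
#> 100 [g/m^2]
```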
##### **entity_type, value_type, basis_of_value, replicates, methods**
| field | value for this dataset | description |
|------------------|------------------|------------------------------------|
| entity_type | population | The entity corresponding to the trait value. Uses a controlled vocabulary. See [traits.build_schema](https://github.com/traitecoevo/traits.build/blob/develop/inst/support/traits.build_schema.yml) for allowable terms. |
| value_type | mean | The statistical nature of the trait value. Uses a controlled vocabulary. See [traits.build_schema](https://github.com/traitecoevo/traits.build/blob/develop/inst/support/traits.build_schema.yml) for allowable terms. |
| basis_of_value | measurement | How the trait value was obtained. See [traits.build_schema](https://github.com/traitecoevo/traits.build/blob/develop/inst/support/traits.build_schema.yml) for allowable terms. |
| replicates | 3 | The number of replicate measurements that comprise the trait measurement recorded in the spreadsheet. |
| methods | See the study's [metadata_notes.txt](https://github.com/traitecoevo/traits.build-template/blob/master/data/tutorial_dataset_1/raw/metadata_notes.txt) file | A verbatim (free-form) text field documenting the methods used to collect the trait measurements. This is generally available from the reference or directly from the author. |
*The values for `entity_type`, `value_type`, `basis_of_value`, and `replicates` can vary by trait -- and indeed by measurement -- but for this study are identical for all traits.*
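Pulling these pieces together, the completed entry for the first trait might look like the sketch below; the `methods` text is a placeholder for the verbatim description from the study's notes file:
```{r, eval=FALSE}
traits:
- var_in: LMA (mg mm-2)
  unit_in: mg/mm2
  trait_name: leaf_mass_per_area
  entity_type: population
  value_type: mean
  basis_of_value: measurement
  replicates: 3
  methods: Verbatim description of how LMA was measured, copied from the notes file or the publication.  # placeholder
```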
#### **Final steps**
##### **Double check the metadata.yml file**
You should now have a completed `metadata.yml` file, with no `unknown` fields.
You'll notice five sections we haven't used: `contexts`, `substitutions`, `taxonomic_updates`, `exclude_observations`, and `questions`.
These should each contain an `.na` (as in `substitutions: .na`). They will be explored in future lessons.
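A quick way to confirm from R that nothing was missed is a simple text search for leftover placeholders; any matches point to fields that still need filling in:
```{r, eval=FALSE}
# Simple text search for leftover placeholders in the metadata file
metadata_lines <- readLines("data/tutorial_dataset_1/metadata.yml")
metadata_lines[grepl("unknown", metadata_lines)]
# character(0) means there is nothing left to fill in
```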
##### **Run tests on the metadata file**
Confirm there are no errors in the `metadata.yml` file:
```{r, eval=FALSE}
dataset_test("tutorial_dataset_1")
```
This *should* result in the following output:
[\[ FAIL 0 \| WARN 0 \| SKIP 0 \| PASS 79 \]]{style="color:blue;"}\
##### **Add dataset to the database** {#build_pipeline}
Next, add the dataset_id to the build file that builds the database, then rebuild the database:
```{r, eval=FALSE}
build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
```
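To confirm the new dataset has been merged, you can peek at the rebuilt database object. This is a sketch that assumes `build.R` creates an object named `traits.build_database` (the `database_name` specified above) containing a `traits` table:
```{r, eval=FALSE}
# Quick check that tutorial_dataset_1 now appears in the traits table
# (assumes build.R created `traits.build_database` in your workspace)
library(dplyr)

traits.build_database$traits %>%
  filter(dataset_id == "tutorial_dataset_1") %>%
  count(trait_name)
```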
##### **Build dataset report**
As a final step, build a report for the study:
```{r, eval=FALSE}
traits.build_database$build_info$version <- "5.0.0"
# a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_1", traits.build_database, overwrite = TRUE)
```
Have a look at the report; these reports become much more interesting once there are more datasets in the database.