Trait data template
On this page, we describe each column of the data template, provide an unambiguous definition as well as the range and unit of valid numerical values and the allowed factor levels. The descriptions and valid factor levels defined here will be compiled into a machine-readable glossary (ideally using xml) which serves as reference template for the R-script.
The column names are open for discussion and adjustment. They should follow a unified format (using capital letters to separate words, e.g. `userName`), and ideally take names from Darwin Core and its extensions (names already compliant with DWC are labelled (DWC)). The Darwin Core index can be found here: http://rs.tdwg.org/dwc/terms/index.htm
Generally, for the essential primary information (species name, trait name, trait value), we will keep the original naming of the data for compatibility on the provider's side and to check for inconsistencies and typos. The data provider's information is indexed by appending `_user` to the column name. The R script helps in transferring those entries into the accepted names and factor levels provided by the lookup tables.
The 'visible' template, e.g. the information printed by the R-script or a manual xls sheet, will only show a minimal set of information to the user and autocomplete the table as far as possible.
contains the verified name identifier of the species or subspecies (or higher taxon) for which this value was collected. Each entry must be precisely matched with the species reference list! No synonyms! This may be auto-generated by the R-script.
For species that are not in the accepted species/taxon list, we would assign NA here. BExIS then should install mechanisms to add these missing names to the species list and fix the entry in the dataset.
Flo: The Darwin Core uses `scientificName` to capture the name of a species as a character string and `taxonID` for a dataset-specific unique identifier. Since we use it as an identifier and we allow for higher taxon levels as well, the latter is actually more appropriate.
Diana: `taxonID` gives the impression that a numeric ID is required here. Just 'taxon' might be clearer.
A numerical ID could be added at a later stage for computational reasons, but is not required since there is a 1:1 matching with `scientificName`.
Nadja: If we want to have the mechanism of adding missing names, I guess we need to have an ID number for those names which are already in the list.
Keeps the information about the resolution of the taxon, to clarify cases where information is not given at species level. This is automatically filled with data from the species list.
Nadia: We might want to restrict the values in `taxonRank` to the most important levels, though.
This column keeps the species name (or other taxon) that the author used in their original data table. It is maintained for documentation and for use on the author's side. After matching and translating into the accepted species ID, this column will be hidden in the R preview.
Cat: Not sure it should be hidden in the preview, people might want to check if the synonyms are correct. Or they might be worried if they don't see "their" species names.
Martin: This column is not entirely clear. Do you mean the original taxon name of the trait source, which might be a synonym of the taxonID? --> Nadja: I think in the final version with the R script, the user will first only see the columns with `_user` to fill, but should be able to check later how the R script changed taxon names.
Unique identifier for the trait category and name. A value of format `A1` that maps to a trait table and the valid accepted trait name. We provide a trait table for different organism groups, which are identified by a capital letter: A for arthropods (or I for invertebrates), P for plants, V for vertebrates.
Cat: If `traitID` has letter+number, it's a string (or character; well, not a number). In this case, is it much more interesting than `traitName`? (If this was already discussed, please just delete this, and sorry!) Maybe this field has supplementary information? (E.g. `traitName` is body_mass but `traitID` also brings supplementary info on the organism group with the capital letter.)
Flo: Agree. We can just keep a numeric ID here, but it must be indicated in the metadata which trait thesaurus this refers to. In the aggregated dataset (via BExIS tool) it would be compiled into something like `BExIS:ArthropodTraits:11` or `TRY:AcceptedTraitNames:32`.
Martin: What does "value of format `A1`" mean? --> Nadja: I am afraid this might confuse the data owners if we have them give the `traitID` instead of the `traitName`. Or will the R script try to infer `traitName` and `traitID` from the information given in `traitName_user`, so that the user just has to provide the information if the matching is not possible?
The trait name is a short character string describing the trait (the values are taken from the trait table and do not contain spaces or punctuation except underscores). It matches the numeric `traitID` but is added for human reading and data handling.
Diana: It might be easier for users to provide a character traitName, rather than a numeric ID, which is also prone to mistakes.
Flo: The R script might as well take either a numeric ID or a character string.
The short or long name that the author is using for this trait. Same intention as `author_species_name`, but for traits.
The standardized value measured for this replicate (in the correct unit and factor levels). This is the mandatory entry and main content of the entire table.
The column is of mixed data type which will be validated for upload. We can assume that users know how to handle mixed columns. Proofchecking for a valid content format (numerical or categorical) or value (within range, or valid factor levels) in this column will be done by the R Script before upload. Only datasets that pass the quality check may be uploaded to BExIS.
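A minimal sketch of such a quality check for the mixed-type value column, written in Python purely for illustration (the actual R script may work differently). The trait table contents, the dictionary layout and the function name `check_measurement_value` are assumptions made up for this example.

```python
# Hypothetical trait table: names, units, ranges and levels are illustrative only.
TRAIT_TABLE = {
    "body_mass": {"type": "numeric", "unit": "mg", "min": 0.0, "max": 1e6},
    "feeding_guild": {"type": "categorical",
                      "levels": {"herbivore", "predator", "detritivore"}},
}

def check_measurement_value(trait_name, value):
    """Return a list of warnings for one traitName / measurementValue pair."""
    spec = TRAIT_TABLE.get(trait_name)
    if spec is None:
        return [f"unknown trait name: {trait_name}"]
    if spec["type"] == "numeric":
        # Numeric traits: the value must parse as a number and lie within range.
        try:
            number = float(value)
        except (TypeError, ValueError):
            return [f"{trait_name}: expected a number, got {value!r}"]
        if not spec["min"] <= number <= spec["max"]:
            return [f"{trait_name}: {number} outside [{spec['min']}, {spec['max']}]"]
        return []
    # Categorical traits: the value must be one of the accepted factor levels.
    if value not in spec["levels"]:
        return [f"{trait_name}: invalid factor level {value!r}"]
    return []
```

A dataset would pass this particular check only if every row returns an empty list.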
Gives the expected unit for the trait value.
Dani thinks it would be useful to have a unit converter
optional. but highly recommended for reasons of documentation.
The author's raw-data value. May be of a different unit or factorial classification. This must be mapped by the author into a standardised value, which is entered in `measurementValue`.
optional. but highly recommended for reasons of documentation.
Reports the unit that the author's raw data were measured in, if applicable (only for numeric values). The R script will check for a match with the unit expected according to the trait table and return a warning at mismatch. User may then update the unit or value. The warning is returned into a warning column?
Nadja: should the RScript also convert e.g. from cm to mm?
Flo: That's a hard one, since it requires users to enter the unit in a precise and unambiguous format. m^2 or mm^3 could be done, but what about surface-to-volume ratios, respiration rates etc.? I would ask them to use the correct unit and rely on them doing it right. We keep the user unit for reference, so that users would spot mistakes sooner or later.
Discussion: no automatic conversion is provided. we keep looking for solutions, check for matching units and return warning into warning column?
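The check-and-warn behaviour discussed here could look roughly like the following Python sketch (for illustration only; no conversion is attempted). The `EXPECTED_UNIT` lookup and the function name `unit_warning` are hypothetical stand-ins for the trait table and the R script's logic.

```python
# Hypothetical expected units; in practice these would come from the trait table.
EXPECTED_UNIT = {"body_mass": "mg", "body_length": "mm"}

def unit_warning(trait_name, unit_user):
    """Return a warning string on unit mismatch, or None; never converts values."""
    expected = EXPECTED_UNIT.get(trait_name)
    if expected is None or unit_user is None:
        return None  # nothing to compare (unknown trait or no user unit given)
    if unit_user.strip() != expected:
        return f"unit mismatch for {trait_name}: got '{unit_user}', expected '{expected}'"
    return None
```

The returned string could then be written into the warnings column, leaving the decision to fix the unit or the value with the user.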
optional. The life stage of the measured specimen. Since naming differs very much across taxa, the field is open for user-defined factor levels. Recommended entries are: `adult`, `egg`/`seed`, `seedling`, `sapling`, `larval_instar_1`, `larval_instar_2`, `larval_instar_3`, ... , `pupae`
Martin: Ask for lifestage before sex --> Nadja: I don't see why not, so I changed the order
optional. The age of the specimen in years.
Sebastian: Maybe specify that an age in months should be given as a fraction of a year, e.g. 0.3 years.
optional. Defines the sex of the measured specimen, to capture dimorphisms. Takes the following values: `male`, `female`, `subadult`, `unknown`
Cat: Do we also want something like `hermaphrodite` (in case we add more organism groups in future)? I guess that for plants, the flower type (e.g. dioecious) is going to be a trait itself.
Flo: That makes me think that these columns are rather animal-specific. Are there sub-species distinctions that must be made for plants? lifeStage is open in its definition. age applies as well.
Martin: How do you define subadult? Probably you mean specimens in which the sex cannot be distinguished yet. In contrast, unknown means that the sex could have been given, but has not been checked, right? --> Nadja: I think this is a good idea and we should offer those two options.
The possible morphotypes depend very much on the taxon. It might be necessary to allow for user defined factor levels.
Give example factor levels, e.g. for ants "workers", "soldiers".
This section of columns aims to identify the methodology and primary source of the data and to keep the reference to the actual specimen (e.g. for museum collections or related data analysis).
basisOfRecord
(DWC)
mandatory.
How and from which kind of specimens were the data obtained? Options are: taken by own measurement (distinguish `LivingSpecimen`, `PreservedSpecimen`, `FossilSpecimen`), taken from literature (`literatureData`), from an existing trait database (`traitDatabase`), or from expert knowledge (`expertKnowledge`).
Martin: I think it would be important to state how the specimens were preserved. Dry specimens are likely to shrink, and those preserved in ethanol might have a swollen abdomen. Additionally, I would make two columns out of it: in the first, distinguish between measurement, literature compilation (in literature, single individual measurements could also be given, e.g. in taxonomic works) and expert knowledge; in the second, give detailed information. --> Nadja: Maybe we should make `basisOfRecordDescription` mandatory as well. There, users can describe the preservation method, etc.
optional.
Adds more detail to the basisOfRecord. If live specimens were sampled, where did they come from: the Exploratories, or were they sampled elsewhere? Have they been reared in cultivation? If literature data, provide the type of literature, e.g. textbook, website, database, etc.
optional. but highly encouraged.
This should contain precise reference to the source. If literature data, this should quote the book or online database. If museum this should report the name of the collection. If expert knowledge, this should name the authority.
applies for literature and expert knowledge data. not applying for measured data.
This would give the hierarchical level to which the trait data refer. For measured data, it is mostly the taxonomic rank of the most specific name in the scientificName. Recommended best practice is to use a controlled vocabulary. The GBIF recommended controlled value vocabulary can be found at http://rs.gbif.org/vocabulary/gbif/rank.xml
Discussion: clear explanation, not confuse with taxon rank. optional information, usually covered
Martin: What is the difference to `taxonRank`?
applies to measured data.
The method used to measure a (numerical) trait value. Should be a concise and standardised reply, referring to a particular method, e.g. 'direct weighing', 'length-mass regression'. A more detailed description of, or reference to (publication, URI), the method or protocol used to determine the measurement, fact, characteristic, or assertion should be given in the metadata of the dataset, e.g. measurement at a certain temperature or humidity (any detail could be provided here).
Martin: The device used for measuring should be given somewhere --> Nadja: I think this would be added here, i.e. we need to be more specific in the description
optional. a character string encoding the person who conducted the measurement. Can be encoded by identifiers for reasons of privacy. This is kept as a co-factor for repeated measurements.
optional. the date of the measurement. Also kept as a co-factor. Use format YYYY-MM-DD hh:mm
optional.
any particularities about the individual measurement or specimen that might affect trait measurement, e.g. 'missing left front leg'.
Martin: I do not understand the example 'left front leg missing'. If this is not measured, it does not need to be mentioned?! Better 'last antenna segment missing'? --> Nadja: This is true, if the leg was missing, there should be no value for it. This raises an interesting aspect: what if people want to add missing values as NA in cases where they measured different traits on one individual? Having this information as a missing value in the dataset would make it clear to other users that this value could not be measured, rather than just forgotten.
Some data may not refer to a single specimen or measurement, but aggregate repeated measurements into an average.
Diana: This section might be confusing users. It should be clearly pointed out, that replicated measures should be entered as individual rows, not aggregated measures. The aggregated fields are only offered as a fall-back, if no individual raw data are available.
Flo: those fields are also meaningless for literature data and factorial data that refer to the genus level. They are neither individual measures nor aggregate. Thus it must be clear that those are only to be filled for measured data! The whole section might be placed further down. Or we could create separate sections for Measured data and Literature/Compiled data.
Dani: Data could be aggregated at the individual level, not only at the species level. E.g. 1 individual can have a mean value from 5 repeated measures on one single leaf or from a measure on 5 different leaves.
TRUE or FALSE, defaults to FALSE.
This is flagging aggregate data in an unambiguous way. could be auto-generated if NrIndividuals is provided.
If replicate contains an aggregate measure of multiple individuals, this is to be reported here. Usually as count, i.e. integer number. Defaults to 1.
If multiple individuals, the variance or standard deviation for the value is to be reported here. Defaults to 0. If a value is provided, report the statistical method in the field `statisticalMethod`.
Cat: Maybe it should default to 0 if `NrIndividuals` = 1, otherwise it should default to NA.
Flo: Also, it must be clear if this refers to the user's data or the standardized data that should be provided in `measurementValue`.
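The defaults discussed for the aggregation fields can be sketched as follows (Python, for illustration only). The column names `aggregateMeasure` and `variance` are assumed here, since the final names are not fixed, and `None` stands in for NA. The rule for the variance follows Cat's suggestion rather than the plain "defaults to 0".

```python
def fill_aggregate_defaults(row):
    """Fill aggregateMeasure and variance defaults for one row (a dict)."""
    filled = dict(row)
    n = filled.get("NrIndividuals")
    if n is None:
        n = 1  # documented default: a single individual
    filled["NrIndividuals"] = n
    # Flag the row as aggregate when more than one individual was measured.
    filled.setdefault("aggregateMeasure", n > 1)
    # Cat's suggestion: 0 for a single individual, otherwise unknown (None as NA).
    if filled.get("variance") is None:
        filled["variance"] = 0 if n == 1 else None
    return filled
```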
Instead of or in addition to variance, the range of values should be reported for aggregate measures.
Dani: people sometimes report 95% interval to remove outlier problems (instead of min/max)
optional. A character string reporting e.g. the weighing accuracy of the scale, the precision of the method, the error sources, or the number of decimal places recorded.
optional.
for aggregated measures, report the method for data aggregation or averaging as well as the variation or range. E.g. 'mean and standard deviation', 'median and 95% confidence interval', 'mean and variance', 'mean and range of values', 'median and 95% interquantile range'
Diana: better report this in two columns (one information per column), one for central tendency measure and one for measure of variation.
A character string of the author code or museum ID referring to the individual specimen from which the data were measured. This is important for the analysis of co-variation of morphometric data. This should couple measurements on a single specimen, which also could be a leaf or a single bone without an assignment to an individual organism.
If available, upload related dataset to describe specimens more precisely, e.g. environmental parameters or identity related information.
Diana: Very valuable information. Encourage use of this column for morphometric data and better explain which form of ID is expected here: User defined ID scheme that might be further explained in an extra dataset.
Dani: It is a bit confusing not to have this at the beginning of the table, because for plant traits the individual is often the unit, and aggregation is done at the individual level (e.g. 1 line is the mean of 5 leaves of individual x from species y). See also the aggregation section.
The sampling event or campaign. User-specified character string. Could link to another table that provides more information, e.g. environmental or temporal parameters, descriptions of methods etc. Should link to official Exploratories sampling campaigns.
year
, month
, day
or eventDate
(DWC)
Optional. Does not apply for literature data.
Represents the date at which the specimens were extracted or sampled. If providing `eventDate`, enter it in the format YYYY-MM-DD hh:mm (fall back to 12:00 if no time is available). For lower precision, use the `year`, `month` and `day` fields instead.
Note: this is not to record the date when the specimens were measured (use `measurementDeterminedDate` for this). If applicable, at least provide a year. This is particularly important if the measurements were taken from specimens from the Exploratories or from museum specimens.
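As an illustration of the date format and the 12:00 fallback, here is a small Python sketch; the function name `normalise_event_date` is hypothetical, and the R script may handle this differently.

```python
from datetime import datetime

def normalise_event_date(text):
    """Parse 'YYYY-MM-DD hh:mm' or 'YYYY-MM-DD'; a missing time becomes 12:00."""
    text = text.strip()
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M")
    except ValueError:
        pass
    # Date only: strptime raises ValueError here if the date itself is invalid.
    return datetime.strptime(text, "%Y-%m-%d").replace(hour=12, minute=0)
```

Entries with only a year or a month known would not go through this function at all, but into the separate `year` and `month` fields.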
Diana: Remove `eventDate` since it is redundant with year/month/day. --> Nadja: Maybe `day` is really not necessary. If one knows the day of sampling, they should be able to give the `eventDate`. Month or year might be broader resolutions, especially if specimens were collected in activity traps over a certain time period.
This section records location in the context of the Exploratories as well as in a broader global context. For data obtained within the Exploratories, provide `ExploratoriesPlotID` and the rest will be autocompleted by the R script. Additional spatial resolution, e.g. on subplots, may be provided in `locationID`.
Flo: in TRY, these additional information are entered as coupled rows, which are linked by the observationID field. I think this is very bad for data handling, but probably due to the fact that authors want to add many different kinds of information. Our Location or sub-species resolution information would have to be transferred into this format as well, if people wanted to upload the same data to TRY.
TRUE or FALSE. defaults to FALSE.
As unambiguous flag for data and specimens that are originating from Exploratories sites.
optional, but highly recommended for trait data obtained from specimens from the Biodiversity Exploratories.
EP plot ID (or also any valid Gridplot ID or VIP ID) where the measured specimen was extracted. Only for specimens that were extracted from the Exploratories directly (or direct offspring, if hatched in the lab). Please report it even if this was not part of your research question, and provide a date (a year at least) if available.
Cat: looking at this and the .xlsx template, I think that people will tend to fill first the mandatory columns and forget a bit about the optional ones. So for columns like this one, we could create a third category to draw their attention. E.g. "mandatory", "optional", "highly_recommended" (or something similar). But could also be confusing.
Exploratories Region (Hainich, Schorfheide, Alb) for sorting purpose and readability, or if exact Plot ID is not available. Report the Region in format: A, H, S. This is automatically extracted by the R script from `ExploratoriesPlotID`, if provided.
`W` for forest or `G` for grassland plot. This is automatically extracted by the R script from `ExploratoriesPlotID`, if provided.
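How region and plot type might be extracted from a plot ID can be sketched as below (Python, for illustration). The ID pattern used here, region letter + "E" + type letter + plot number such as "HEW12", is an assumption of this sketch; Gridplot and VIP IDs are not covered, and the function name `parse_plot_id` is hypothetical.

```python
import re

# Assumed ID pattern: region letter, "E", plot type letter, plot number (e.g. "HEW12").
PLOT_ID = re.compile(r"^([AHS])E([WG])(\d{1,2})$")
REGION = {"A": "Alb", "H": "Hainich", "S": "Schorfheide"}
PLOT_TYPE = {"W": "forest", "G": "grassland"}

def parse_plot_id(plot_id):
    """Return (region code, plot type code) or None if the ID does not match."""
    match = PLOT_ID.match(plot_id.strip().upper())
    if match is None:
        return None
    return match.group(1), match.group(2)
```

A `None` result would be a natural candidate for an entry in the warnings column rather than a hard error.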
A unique, dataset-specific identifier. Could report the subplot within the Exploratories, or the plot or replicate of a separate experimental setting. If further information is available, specify locationIDs in a separate dataset.
Dani: this might be confusing, because it could be a treatment as well (so could be a different "environment type" or location itself). Probably title should be more self-explanatory.
A unique identifier for the observation event of this particular specimen.
Andreas: Would be good to have occurrenceID, just in case we want to share data with GBIF. They want to have that ID. We (BExIS) used the following format: DatasetID-PlotID-sequentialNumber
Flo: This would allow the trait data to double as occurrence data. It would only apply to specimens collected in the Exploratories and is redundant with the specimenID and PlotID. We can create that ID in the R script, but it seems a GBIF-centered demand. Since GBIF requires some mapping anyway, this could be extracted in the process of transferring the data, right?
optional.
A character string reporting habitat type from which the specimen was sampled. E.g. 'forest', 'grassland', 'oak savanna', 'pre-cordilleran steppe'.
Martin: `habitat` and `ExploType` are confusing. By giving the EP code, it is clear from which plot of the Exploratories the specimen was sampled, and the habitat should be broader than grasslands and forests to also cover specimens from other sources, but still include grassland and forest. The extra column `ExploType` is therefore not needed, at least in my opinion. --> Nadja: `ExploType` would be hidden anyway, but maybe it is really not necessary to have it if the R script adds this information in the column `habitat` (and not in `ExploType`) when a valid plotID is given?
decimalLongitude
, decimalLatitude
, ElevationInMeters
and geodeticDatum
and coordinatePrecision
(DWC)
Optional. Defaults to NA.
Provide georeference, if available ("WGS84"). May be automatically generated if `ExploratoriesPlotID` is provided.
The geographic longitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. Positive values are east of the Greenwich Meridian, negative values are west of it. Legal values lie between -180 and 180, inclusive.
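A range check for the coordinate columns could be as simple as the following Python sketch (illustrative only). The longitude limits are those stated above; the latitude limits of -90 to 90 are the standard ones for decimal degrees. The function name `check_coordinates` is hypothetical.

```python
def check_coordinates(decimal_latitude, decimal_longitude):
    """Return a list of warnings for out-of-range decimal-degree coordinates."""
    warnings = []
    if not -90 <= decimal_latitude <= 90:
        warnings.append(f"decimalLatitude {decimal_latitude} outside [-90, 90]")
    if not -180 <= decimal_longitude <= 180:
        warnings.append(f"decimalLongitude {decimal_longitude} outside [-180, 180]")
    return warnings
```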
The `geodeticDatum` defines the ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geographic coordinates given in decimalLatitude and decimalLongitude are based. Recommended system is "WGS84". Use the EPSG code to provide an SRS. Examples: "EPSG:4326", "WGS84", "NAD27", "Campo Inchauspe", "European 1950", "Clarke 1866"
e.g. 'Germany' and 'DE'. This should be added if a more precise location is unavailable, to enable data to be used by GBIF.
open text field for information on geolocation, could be anything, e.g. 'Brasil', 'Darmstadt, Hesse, Germany', 'Obere Fischerwiese'. DWC defines: "The specific description of the place. Less specific geographic information can be provided in other geographic terms (higherGeography, continent, country, stateProvince, county, municipality, waterBody, island, islandGroup). This term may contain information modified from the original to correct perceived errors or standardize the description."
This is an auto-generated unique identifier for each entry of data. This allows a row-wise comparison of data from different versions and sources and an elimination of duplicates. This may be computed by the R-script by using the MD5 or SHA-256 hashing algorithms. Thus, any row with identical content will produce the same ID, even if it was uploaded by different users.
This might get confusing if further columns are added to the dataset in later versions of the template, which would lead to different IDs although there is identical content. Thus, measurementID always would be dependent on the template version. The R-script should facilitate a transfer between different versions.
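A content-based ID of this kind can be sketched in a few lines of Python (for illustration; the R script would use its own hashing facilities). The function name `measurement_id`, the separator and the idea of hashing over an explicit column list are assumptions of this sketch.

```python
import hashlib

def measurement_id(row, columns):
    """Hash the row content over a fixed column list with SHA-256."""
    # A fixed column order and an explicit separator keep the hash reproducible.
    parts = ["" if row.get(col) is None else str(row[col]) for col in columns]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
```

Hashing only over a column list that is pinned to the template version is one possible way to keep IDs stable when later versions add columns, at the cost of having to record which list was used.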
open text field for any remarks concerning a particular measurement, e.g. additional information on the quality and reliability, reference or acknowledgements.
warnings from the R-script will be stored here, e.g. regarding a lack of match between the provided taxonID and the ontology, or the trait names or values, a mismatch in the units provided and the unit expected according to the trait table. User defined warnings and flags can be added as well, e.g. 'NOTUSE'.
We agreed to keep the Metadata separate from the raw data. Once trait datasets can be aggregated on BExIS via a trait data tool, this will compile dataset-specific information into additional columns (a dataset ID at least). This will be the recommended way to download trait data from BExIS since it makes sure the data remain linked to the relevant metainformation. The investment into the development of a traitdata tool on BExIS1 will depend on the amount of data uploaded.
As a low-threshold alternative, an R script might be provided that merges individually downloaded datasets and their metadata files (xml). Maybe in addition, there could be one single metadata table for all traitdatasets on BExIS which provides the authors, owners, contact details, and a bibliographic reference, and which is openly accessible.
In the end, the responsibility in giving credit to original data owners and involving them into the research is with the synthesis data user.
The following metadata should be collated by the aggregation tool:
to just have the name of the person here who is creating/checking/uploading the specific dataset
TRY requires `Author_lastname` and `Author_firstname` separately. Is there a way to auto-generate that from the metadata stored in BExIS? (for compatibility with downstream databases, e.g. TRY)
authors and co-authors would be given credit in a bibliographicCitation field.
contact information required by data users to inform and involve data providers.
A literature reference, or online location including author names and version, as well as a dataset DOI, if applicable.
The DOI, ISBN, URI etc that points to the original publication of the dataset. this can be a BExIS ID, once they are related to a DOI.
This field would capture the information about how and under which conditions to reuse and republish the data and notify the author about its use. A couple of standard licenses could be suggested, like `cc-0`, `cc-int-by`, `bexis` (default Biodiversity Exploratories data sharing agreement). The `notify` field could just be a 'yes'/'no' field saying whether the author wants to be involved in projects that use the data. The default for BExIS use would be yes, but this would allow data owners to opt out.
the trait template that was used to create the dataset. Since an updated version of the template may add further columns or eliminate and redefine columns, this reference is necessary for comparing data across template schemes.
This could be a signature that the Dataset passed all mandatory checks of the R-script.
Which trait thesaurus do the entries in columns traitID and traitName refer to?
Which species ontology do the entries in column taxonID refer to?
We might provide the option to give more detailed information on where the specimen was sampled (e.g. the plant species, canopy, understorey, herbs, soil etc.). This might give us the opportunity to analyse trait variation in relation to food plant etc. in future. Therefore we should encourage people to provide this data.
Sampling method might also be useful, as some measurements might be affected by this (e.g. color)
- Would be very cool to have a tool in the R package to convert speciesxtrait tables into the template.
Caterina: I agree, maybe we can just help users to melt their tables with some example code or make a function that fills the template from speciesxtrait tables, save them and tell them which columns they still have to fill.
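The kind of helper discussed here, converting a species x trait table into the long template format, might look like this Python sketch (illustrative; the R package would more likely build on an existing melt/reshape function). The function name, the `species` column and the `_user` output column names are assumptions.

```python
def melt_species_by_trait(wide_rows, taxon_column="species"):
    """Turn one-row-per-species records into one-row-per-measurement records."""
    long_rows = []
    for record in wide_rows:
        for column, value in record.items():
            if column == taxon_column or value is None or value == "":
                continue  # skip the taxon column and empty cells
            long_rows.append({
                "taxonName_user": record[taxon_column],
                "traitName_user": column,
                "measurementValue_user": value,
            })
    return long_rows
```

For example, a wide table with columns `species`, `body_length` and `wings` yields one output row per non-empty trait cell; the remaining template columns would still have to be filled by the user.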
- Maybe add a column defining how the trait was measured. E.g. plant height can be measured in different ways, similarly to body size in animals (include head, tail, etc.).
- It is somehow confusing not to have the individual/specimenID as a basic unit.
Caterina: Maybe we need a "resolution" column to know if data are collected at the species > individual > leaf > ... level. Could we have this information in the `taxonRank` field?
We are aiming for compatibility with this database, which means that our species lists and trait lists should be matching and the trait template should be enabled to capture all information categories that the TRY database offers. The TRY data table has the following columns:
TRY Column | Comment | our column |
---|---|---|
LastName | Surname of data contributor | Metadata:authorLastname |
FirstName | First name of data contributor | Metadata:authorFirstname |
DatasetID | ID of contributed dataset | Metadata:BexisID |
Dataset | Name of contributed dataset | Metadata:DatasetName |
SpeciesName | Original name of species | TaxonName_user |
AccSpeciesID | Consolidated species name | TaxonID |
AccSpeciesName | Consolidated species name | TaxonName |
ObservationID | Identifier for different measurements of the same observation | SpecimenID |
ObsDataID | Unique identifier for each record | eventID ? |
TraitID | Identifier for traits (only if the record is a trait) | traitID |
TraitName | Name of trait (only if the record is a trait) | traitName |
DataID | Identifier for each sub-trait or context information | NA |
DataName | Name of sub-trait or context information | NA |
OriginalName | Original Name of sub-trait or context information | NA |
OrigValueStr | Original value as text string | MeasurementValue_user |
OrigUnitStr | Original unit as text string | MeasurementUnit_user |
ValueKindName | Value kind (single measurement, mean, median, etc.) | statisticalMethod ? |
OrigUncertaintyStr | Original uncertainty as text string | variation ? |
UncertaintyName | Kind of uncertainty (standard deviation, standard error,...) | |
Replicates | Count of replicates | NrOfIndividuals |
StdValue | Standardized value (not available in all cases) | MeasurementValue |
StdUnit | Standard unit: always available for standardized traits | MeasurementUnit |
RelUncertaintyPercent | Relative uncertainty in % | variation ? |
OrigObsDataID | Identifier for duplicate entries | measurementID |
ErrorRisk | Identifier for outliers: distance to mean in standard deviations | |
Reference | Reference to be cited if trait record is used in analysis | Metadata:BibliographicReference |
The sub-trait or context information in TRY (e.g. location) is recorded as coupled rows of data, linked by DataID.
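To illustrate how the column mapping in the table above could be applied, here is a Python sketch that renames TRY columns to template columns. Only the unambiguous rows of the table are included; the function name `try_row_to_template` is hypothetical, and the coupled context rows are not handled.

```python
# Subset of the table above; entries marked "?" or "NA" there are left out.
TRY_TO_TEMPLATE = {
    "SpeciesName": "TaxonName_user",
    "AccSpeciesID": "TaxonID",
    "AccSpeciesName": "TaxonName",
    "ObservationID": "SpecimenID",
    "TraitID": "traitID",
    "TraitName": "traitName",
    "OrigValueStr": "MeasurementValue_user",
    "OrigUnitStr": "MeasurementUnit_user",
    "Replicates": "NrOfIndividuals",
    "StdValue": "MeasurementValue",
    "StdUnit": "MeasurementUnit",
    "OrigObsDataID": "measurementID",
}

def try_row_to_template(try_row):
    """Rename mapped TRY columns; unmapped columns are returned separately."""
    mapped = {TRY_TO_TEMPLATE[k]: v for k, v in try_row.items() if k in TRY_TO_TEMPLATE}
    unmapped = {k: v for k, v in try_row.items() if k not in TRY_TO_TEMPLATE}
    return mapped, unmapped
```

Returning the unmapped columns separately keeps it visible which TRY information has no counterpart in the template yet.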