Closes #61: Added Rs dataset, updated README, and added pull request template #64

Merged (3 commits) on Jul 19, 2024
71 changes: 71 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,71 @@
## Thank you for contributing to ILAMB-Data. Use this template when submitting a pull request.

### 🛠 Issue reference
✨**Include a reference to the issue # here.**
Summarize the changes you are making to the repo. Are you fixing a bug? Adding a dataset? Describe what you did in a few sentences or bullet points.

### 🧪 Testing
✨**Add an `x` between the brackets to indicate basic testing was completed**
- [ ] I inspected the outputted NetCDF and checked for obvious errors
- [ ] I visualized the outputted NetCDF at a timestamp to check for obvious visual errors
- [ ] I compared my `convert` script to recent existing ones
- [ ] I attempted to create/encode the NetCDF according to CF Compliance guidelines

### 🧪 (Optional) Preview
✨**Attach an image of your dataset here**

### 🏎 (Optional) Quality Checklist
✨**Add an `x` between the brackets to indicate script quality adherence**

- [ ] There are no unused libraries imported into the code
- [ ] There are no erroneous console logs, debuggers, or leftover testing code
- [ ] There are no hard-coded paths to your local machine in the script
- [ ] Useful headings describe sections of your code
- [ ] Variable names are generalizable and easy to read

### 📏 (Optional) CF Compliance In-Depth Checklist
✨**Add an `x` between the brackets to ensure CF compliance**

#### Dimensions
- [ ] Dimensions include `time` with attributes/encoding:
  - [ ] `axis` attribute is `T`
  - [ ] `units` attribute/encoding is `days since YYYY-MM-DD HH:MM:SS`
  - [ ] `long_name` attribute is `time`
  - [ ] `calendar` encoding is `noleap`
  - [ ] `bounds` encoding is `time_bounds`
- [ ] Dimensions include `lon` with attributes:
  - [ ] `axis` attribute is `X`
  - [ ] `units` attribute is `degrees_east`
  - [ ] `long_name` attribute is `longitude`
- [ ] Dimensions include `lat` with attributes:
  - [ ] `axis` attribute is `Y`
  - [ ] `units` attribute is `degrees_north`
  - [ ] `long_name` attribute is `latitude`
- [ ] Dimensions include `nv`, the size-`2` bounds dimension used by `time_bounds` to hold the start and end date of each time step

#### Data Variables and their Attributes
- [ ] **The variable(s) for model comparison are present**
  - [ ] the variables are linked to the `time`, `lat`, and `lon` dimensions
  - [ ] `long_name` attribute is specified
  - [ ] `units` attribute is specified
  - [ ] (If applicable) `ancillary_variables` attribute is specified when an uncertainty variable is provided
  - [ ] (Optional) Float32 data type
  - [ ] (Optional) No-data values masked as NaN
- [ ] **If applicable, a data uncertainty variable is present** (e.g., standard_deviation or standard_error)
  - [ ] the variable is linked to the `time`, `lat`, and `lon` dimensions
  - [ ] `long_name` attribute is specified (e.g., cSoil standard_deviation)
  - [ ] `units` attribute is specified (use `1` if the uncertainty measure is unitless)
- [ ] **A time_bounds variable is present**
  - [ ] the variable is linked to the `time` and `nv` dimensions
  - [ ] `long_name` attribute is specified as `time_bounds`
  - [ ] `units` is encoded as `days since YYYY-MM-DD HH:MM:SS`

#### Global Attributes
- [ ] `title`
- [ ] `institution`
- [ ] `source`
- [ ] `history`
- [ ] `references`
  - [ ] @reference-type{ref-name, author={Last, First and}, title={}, journal={}, year={}, volume={}, pages={num--num}, doi={no-https}}
- [ ] `comment`
- [ ] `Conventions`
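
For orientation, here is a minimal sketch of how the dimensions, variables, and encodings in the checklist above might be assembled with `xarray` in python3. The `cSoil` variable, the dummy data, the placeholder global attributes, and the 0.5-degree grid are purely illustrative and are not taken from any existing convert script in this repository.

```python
# Illustrative only: a hypothetical cSoil dataset on a 0.5-degree grid with
# one year of monthly, noleap time steps and dummy (all-NaN) data values.
import numpy as np
import xarray as xr

lat = np.arange(-89.75, 90.0, 0.5)
lon = np.arange(-179.75, 180.0, 0.5)
edges = xr.cftime_range("2000-01-01", periods=13, freq="MS", calendar="noleap")
time = edges[:-1]                                       # interval start dates
time_bounds = np.column_stack([edges[:-1], edges[1:]])  # shape (time, nv)

data = np.full((time.size, lat.size, lon.size), np.nan, dtype="float32")

ds = xr.Dataset(
    data_vars={
        "cSoil": (
            ("time", "lat", "lon"),
            data,
            {"long_name": "carbon mass in soil pool", "units": "kg m-2"},
        ),
        "time_bounds": (("time", "nv"), time_bounds, {"long_name": "time_bounds"}),
    },
    coords={
        "time": ("time", time, {"axis": "T", "long_name": "time", "bounds": "time_bounds"}),
        "lat": ("lat", lat, {"axis": "Y", "units": "degrees_north", "long_name": "latitude"}),
        "lon": ("lon", lon, {"axis": "X", "units": "degrees_east", "long_name": "longitude"}),
    },
    attrs={
        "title": "",        # fill in the global attributes listed above
        "institution": "",
        "source": "",
        "history": "",
        "references": "",
        "comment": "",
        "Conventions": "CF-1.11",
    },
)

# time units and calendar belong in the encoding rather than the attributes
time_encoding = {"units": "days since 2000-01-01 00:00:00", "calendar": "noleap"}
ds.to_netcdf(
    "cSoil_example.nc",
    encoding={"time": time_encoding, "time_bounds": time_encoding},
)
```

Running `ncdump -h` on the resulting file is a quick way to step through the checklist items.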
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,8 +1,9 @@
*.tif
*.nc
*.ipynb
*.sha256
*.sqlite
*.ipynb_checkpoints
ILAMB_sample/
HWSD2_RASTER/
wise_30sec_v1/
44 changes: 29 additions & 15 deletions README.md
@@ -1,34 +1,48 @@
ILAMB-Data
==========

This repository stores the scripts used to download observational data from various sources and format it in a [CF-compliant](http://cfconventions.org/), netCDF4 file which can be used for model benchmarking via [ILAMB](https://github.com/rubisco-sfa/ILAMB).
This repository stores the scripts used to download observational data from various sources and format it into a [CF-compliant](http://cfconventions.org/) netCDF4 file which can be used for model benchmarking via [ILAMB](https://github.com/rubisco-sfa/ILAMB).

Please note that the repository contains no data. If you need to download our observational data, please see the [`ilamb-fetch`](https://www.ilamb.org/doc/ilamb_fetch.html) tutorial. This collection of scripts is to:
Please note that the repository contains no data. If you need to download our observational data, see the [`ilamb-fetch`](https://www.ilamb.org/doc/ilamb_fetch.html) tutorial. This collection of scripts is to:

* archive how we have produced the datasets compared to models with ILAMB
* archive how we have produced the model comparison datasets with ILAMB
* expose the details of our formatting choices for transparency
* provide the community a path to contributing new datasets as well as pointing out errors in the current collection

Contributing
============

If you have an suggestion or issue with the observational data ILAMB uses, we encourage you to use the issue tracker associated with this repository rather than that of the ILAMB codebase. This is because the ILAMB codebase is meant to be a general framework for model-data intercomparison and ignorant of the source of the observational data. Here are a few ways you can contribute to this work:
If you have a suggestion or issue with the observational data ILAMB uses, we encourage you to use the issue tracker associated with this repository rather than that of the ILAMB codebase. This is because the ILAMB codebase is meant to be a general framework for model-data intercomparison and is ignorant of observational data sources. Here are a few ways you can contribute to this work:

* If you notice an irregularity/bug/error with a dataset in our collection, please raise an issue here with the tag `bug`. We also welcome pull requests which fix these errors, but please first raise an issue to give a record and location where we can have a dialog about the issue.
* If you know of a dataset which would be a great addition to ILAMB, raise an issue here with the tag `new dataset`. Please provide us with details of where we can find the dataset as well as some reasoning for the recommendation.
* We also encourage pull requests with scripts that encode new datasets and will provide more information about procedure in the next section.
#### Debugging
If you notice an irregularity/bug/error with a dataset in our collection:
1. Raise an issue with the dataset name included in the title (e.g., "NetCDF read error in wang2021.nc") for record keeping and discussion.
2. Tag the issue with `bug`.
3. (Optional) Fork the ILAMB-Data repo and fix the erroneous `convert.py`.
4. (Optional) Submit a pull request for our review.

Formatting Guidelines
=====================
#### Suggesting Datasets
If you know of a dataset that would be a great addition to ILAMB:
1. Raise an issue with the proposed dataset name included in the title (e.g., New Global Forest Watch cSoil dataset).
2. Tag the issue with `new dataset`.
3. Provide us with details of the dataset as well as some reasoning for the recommendation; consider including hyperlinks to papers, websites, etc.
4. (Optional) Fork the ILAMB-Data repo, create a new directory named after the dataset (e.g., GFW), and create a `convert` script that preprocesses and formats the data for ILAMB.
5. (Optional) Submit a pull request with the new directory and `convert` script for our review.

We appreciate the community interest in improving ILAMB. We believe that more quality observational constraints will lead to a better Earth system model ecosystem and so are always interested in new observational data. We ask that you follow this procedure.
**See below for specific guidelines on adding new datasets**

* Before you encode the dataset, you should first search the open and closed issues here on the issue tracker. It may be we have someone already assigned to work on this and do not want to waste your effort. It may also be that we have considered adding the dataset and have a reason its quality is not sufficient.
* If no issue is found, raise a new issue with the tag `new dataset`. This will allow for some discussion and let us know you intend on doing the work.
* You may use any language you wish to encode the dataset, but we strongly encourage you to use python3 if at all possible. You can find examples in this repository to use as a guide. See this [tutorial](https://www.ilamb.org/doc/format_data.html) for details and feel free to ask questions in the issue corresponding to the dataset you are adding.
* Once you have formatted the dataset, we recommend running it against a collection of models and along with other relevant observational datasets using ILAMB. There are [tutorials](https://www.ilamb.org/doc) to help you do this. This will allow the community to evaluate the new addition and decide on if or how it should be included into the curated collection.
* After you have these results, attend one of our conference calls where you can present the results of the intercomparison and the group can discuss. Once the group agrees, then you can submit a pull request and your addition will be included.
Dataset Formatting Guidelines
=============================

We appreciate the community interest in improving ILAMB. We believe that more quality observational constraints will lead to a better Earth system model ecosystem, so we are always interested in new observational data. We ask that you follow this procedure for adding new datasets:

1. **Before encoding the dataset, search the open and closed issues in the issue tracker.** We may already have someone assigned to work on it and do not want to waste your effort, or we may have considered the dataset and decided against adding it after discussion.
2. **If no open or closed issue is found, raise a new issue** with the new dataset name in the title, and be sure to add the `new dataset` tag.
3. **Create a new directory to work in.** We generally name it after the group or project that produced the dataset, but any clear name is fine.
4. **Write the conversion script (e.g., `convert.py`) inside the folder you created.** It should (optionally) download the dataset, load it, and format it into a NetCDF that follows the current [CF Conventions](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html); for gridded datasets, it is also helpful to resample to 0.5 degrees (EPSG:4326). Lastly, try to format variable names and units according to the [accepted MIP variables](https://clipc-services.ceda.ac.uk/dreq/index/var.html) for easier model comparison. A minimal outline is sketched after this list.
5. **Submit a pull request** for us to review the script and outputted dataset.
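
The outline below is only a sketch of the flow described in step 4, not a script from this repository; the source URL, the raw variable name `soil_c`, and the output filename are all placeholders to adapt for your dataset.

```python
# Placeholder convert.py skeleton: download, load, rename, regrid, annotate, write.
from pathlib import Path
from urllib.request import urlretrieve

import numpy as np
import xarray as xr

RAW_URL = "https://example.org/raw_soil_carbon.nc"  # placeholder source
RAW_FILE = Path("raw_soil_carbon.nc")

# (optional) download the raw data
if not RAW_FILE.exists():
    urlretrieve(RAW_URL, str(RAW_FILE))

# load and rename to the accepted MIP variable name
ds = xr.open_dataset(RAW_FILE).rename({"soil_c": "cSoil"})

# if gridded, resample onto a regular 0.5-degree grid
target = xr.Dataset(
    coords={
        "lat": np.arange(-89.75, 90.0, 0.5),
        "lon": np.arange(-179.75, 180.0, 0.5),
    }
)
ds = ds.interp_like(target)

# attach CF attributes and encodings (see the pull request template checklist),
# then write the formatted NetCDF for review
ds["cSoil"].attrs.update({"long_name": "carbon mass in soil pool", "units": "kg m-2"})
ds.to_netcdf("cSoil.nc")
```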


You may use any language you wish to encode the dataset, but we strongly encourage the use of python3. You can find examples in this repository to use as a guide. The [GFW convert.py](https://github.com/rubisco-sfa/ILAMB-Data/blob/master/GFW/convert.py) is a recent gridded data example, and this [Ameriflux convert.py](https://github.com/rubisco-sfa/ILAMB-Data/blob/master/Ameriflux/Diurnal/AMFtoNetCDF4.py) is a recent point dataset example. See this [tutorial](https://www.ilamb.org/doc/format_data.html) for help, and feel free to ask questions in the issue you've created for the dataset.
* Once you have formatted the dataset, we recommend running it against a collection of models, along with other relevant observational datasets using ILAMB. There are [tutorials](https://www.ilamb.org/doc) to help you do this. This will allow the community to evaluate the new addition and decide if or how it should be included in the curated collection.
* After you have these results, consider attending one of our conference calls. Here, you can present the results of the intercomparison, and the group can discuss.