This page collects relevant questions and answers about common workflow issues posted on Zulip. It is meant to serve as a list of frequently asked questions regarding climate datasets, motivated by questions from a variety of employees across UCAR/NCAR.
Try one of the following resources.
- Xarray's How Do I do X? page
- Xarray Github Discussions
- Pangeo Discourse Forum
- NCAR Zulip under #python-questions, #python-dev, or #dask.
Avoid personal emails and prefer a public forum.
Open an issue here
See the xarray ecosystem page. Also see the xarray-contrib and pangeo-data organizations. Some NCAR relevant projects include:
- GeoCAT-comp
- GeoCAT-viz
- cf_xarray
- climpred
- eofs
- MetPy
- rechunker
- xclim
- xesmf
- xgcm
- pop-tools
- xskillscore
Dealing with Python environments can be tricky. A good place to start is to check out this guide on dealing with Python environments.
There are two main steps to installing conda (miniconda in this case) on NCAR HPC resources:
- Download miniconda within your work directory
- Install and activate your installation
There are a few videos which Anderson Banihirwe put together walking through this process - they are embedded below!
<iframe width="560" height="315" src="https://www.youtube.com/embed/GGxUgjlmW2A" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

You may want to move past just your base environment and create a new conda environment! There are a few primary steps to this process:
- Create the environment

  If you are creating an environment from scratch, use the following:

  conda create --name environment_name

  where `environment_name` is the name of your environment. If you have an environment file (e.g. `environment.yml`), use the following:

  conda env create -f environment.yml

  Make sure you include the [`ipykernel`](https://github.com/ipython/ipykernel) package within your environment, which is required for your environment to be available from the [JupyterHub](https://jupyterhub.hpc.ucar.edu/).

- Access your conda environment

  This process will change depending on whether you are using an interactive Jupyter environment. Check out the video which Anderson Banihirwe put together describing this process on NCAR HPC resources:
<iframe width="560" height="315" src="https://www.youtube.com/embed/W4Jb6rY1w1w" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
This is a very common issue when installing a new package or trying to update a package in an existing conda environment. It usually manifests as a conda message along these lines:
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
One solution to this issue is to use mamba, which is a drop-in replacement for conda. Mamba aims to greatly speed up and improve conda functionality, such as solving environments and installing packages.
- Installing Mamba
conda install -n base -c conda-forge mamba
- Set the `conda-forge` and `nodefaults` channels:
conda config --add channels nodefaults
conda config --add channels conda-forge
- To install a package with mamba, you just run
mamba install package_name
- To create/update an environment from an environment file, run:
mamba env update -f environment.yml
See the mamba documentation for more.
The Computational and Information Systems Lab (CISL) at NCAR put together some good documentation on dealing with environments on Casper/Cheyenne
Even after running `conda init bash`, you may notice that upon opening a terminal on the JupyterHub/NCAR HPC resources, your conda environment is not activated right away. You could call `bash`, which would activate your conda environment, but a better solution[^1] is to ensure that your conda environment is activated upon login.
You can do this using the following snippet:
echo ". ~/.bashrc" >> ~/.bash_profile
- Read the xarray documentation on optimizing workflow with dask.
- Read the Best practices for dask array
- Keep track of chunk sizes throughout your workflow. This is especially important when reading in data using `xr.open_mfdataset`. Aim for 100-200 MB chunks.
- Choose chunking appropriate to your analysis. If you're working with time series, then chunk more in space and less along time.
- Avoid indexing with `.where` as much as possible. In particular, `.where(..., drop=True)` will trigger a compute, since it needs to know where NaNs are present in order to drop them. Instead, see if you can write your statement as a `.clip`, `.sel`, `.isel`, or `.query` statement (see the sketch below this list).
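As a minimal sketch of the difference (the dataset below is synthetic and not tied to any particular model output), a label-based `.sel` keeps the computation lazy, while `.where(..., drop=True)` has to evaluate the condition up front:

```python
import numpy as np
import xarray as xr

# Synthetic, dask-backed dataset purely for illustration.
ds = xr.Dataset(
    {"temperature": (("time", "lat"), np.random.rand(10, 4))},
    coords={"time": np.arange(10), "lat": [10.0, 20.0, 30.0, 40.0]},
).chunk({"time": 5})

# With drop=True, xarray must evaluate the condition to decide which
# labels to drop before it can build the result.
subset_eager = ds.where(ds.lat > 15.0, drop=True)

# The same selection expressed on coordinate labels stays lazy.
subset_lazy = ds.sel(lat=slice(20.0, 40.0))
```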
A good first place to start when reading in multiple files is Xarray's multi-file documentation.
For example, if you are trying to read in multiple files and you are interested in concatenating over the time dimension, here is what the `xr.open_mfdataset` call would look like:
ds = xr.open_mfdataset(
    files,
    # Concatenate the files, in the order given, along the time dimension.
    # (Alternatively, drop concat_dim and use combine="by_coords" to let
    # xarray order the datasets by their dimension coordinates.)
    combine="nested",
    concat_dim="time",
    # Specify chunks for dask - explained later
    chunks={"lev": 1, "time": 500},
    # Only data variables in which the dimension already appears are included.
    data_vars="minimal",
    # Only coordinates in which the dimension already appears are included.
    coords="minimal",
    # Skip comparing and pick variable from first dataset.
    compat="override",
    # Open the files in parallel using dask.delayed.
    parallel=True,
)
One option is to use `.compute(scheduler="single-threaded")`. This will run your code as a serial for loop. When an error is raised, you can use the `%debug` magic to drop into the stack and debug from there. See this post for more debugging tips in a serial context.
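As a minimal sketch of that pattern (the dataset below is synthetic; substitute your own computation), in a Jupyter notebook this looks like:

```python
import numpy as np
import xarray as xr

# Synthetic, dask-backed stand-in for your real dataset.
ds = xr.Dataset(
    {"air": (("time", "lat", "lon"), np.random.rand(1000, 25, 53))},
).chunk({"time": 500})

# The single-threaded scheduler turns the dask graph into a plain serial
# loop, so any exception surfaces with a normal Python traceback.
result = ds["air"].mean("time").compute(scheduler="single-threaded")

# If the line above raises, run the %debug magic in the next notebook cell
# to drop into a post-mortem debugger at the failing frame:
# %debug
```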
Keep an eye on the dask dashboard.
- If a lot of the bars in the Memory tab are orange, that means your workers are running out of memory. Reduce your chunk size.
- Try subsetting for just the variable(s) you need. For example, if you are reading in a dataset with ~25 variables and you only need `temperature`, just read in temperature. You can specify which variables to keep by selecting them right after opening the file, following the example of the temperature variable:
ds = xr.open_dataset(file)[['temperature']]
- Take a look at your chunk size; it might not be optimized. When reading in a file using Xarray with Dask, a general rule of thumb is to keep your chunk size down to around 100 MB. For example, let's say you are trying to read in multiple files, each with ~600 time steps. This is a case where each file is very large (several tens of GB) and using Dask to help with data processing is essential. You can check the size of each chunk by subsetting a single DataArray (e.g. `ds['temperature']`); see the sketch after this list. If you have very large chunks, try modifying the chunk sizes you specify within `xr.open_mfdataset(files, ..., chunks={'lev': 1, "time": 500})`, where `lev` and `time` are the vertical and time dimensions, respectively. Check how large each chunk is after modifying the chunk size, and adjust as necessary.
- You do not have enough dask workers. If you have a few large files, having the number of workers equal to the number of input files read in using `xr.open_mfdataset` would be a good practice. If you have a large number of smaller files, you may not run into this issue, and it is suggested you look at the other potential solutions.
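As a minimal sketch of how to inspect chunk sizes (the array shape, variable name, and dimension names below are made up for illustration, roughly mimicking monthly model output):

```python
import dask.array as da
import numpy as np
import xarray as xr

# Synthetic stand-in for a dataset opened with xr.open_mfdataset(...);
# the variable and dimension names are illustrative only.
ds = xr.Dataset(
    {
        "temperature": (
            ("time", "lev", "lat", "lon"),
            da.zeros((1032, 30, 192, 288), chunks=(500, 1, 192, 288), dtype="float32"),
        )
    }
)

temp = ds["temperature"]
print(temp.chunks)          # chunk sizes along each dimension
print(temp.data.chunksize)  # shape of a single chunk, here (500, 1, 192, 288)

# Approximate size of one chunk in MB:
chunk_mb = np.prod(temp.data.chunksize) * temp.dtype.itemsize / 1e6
print(f"~{chunk_mb:.0f} MB per chunk")  # ~111 MB, close to the rule of thumb
```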
Try the rechunker package.
Distributed writes to netCDF are hard.
- Try writing to `zarr` using `Dataset.to_zarr`.
- If you need to write to netCDF and your final dataset can fit in memory, then use `dataset.load().to_netcdf(...)`.
- If you really must write a big dataset to netCDF, try using `save_mfdataset` (see here).
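As a minimal sketch of those options (the synthetic dataset and output paths below are placeholders for your own data):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Placeholder dataset; substitute your own dask-backed dataset.
ds = xr.Dataset(
    {"air": (("time", "lat"), np.random.rand(730, 4))},
    coords={"time": pd.date_range("2000-01-01", periods=730), "lat": [10.0, 20.0, 30.0, 40.0]},
).chunk({"time": 365})

# Option 1: write to zarr, which handles parallel, chunk-aligned writes well.
ds.to_zarr("example.zarr", mode="w")

# Option 2: if the final result fits in memory, load it first and write a
# single netCDF file serially.
ds.load().to_netcdf("example.nc")

# Option 3: split the dataset and write one netCDF file per group with
# save_mfdataset (here, one file per year).
years, datasets = zip(*ds.groupby("time.year"))
paths = [f"example_{year}.nc" for year in years]
xr.save_mfdataset(list(datasets), paths)
```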
Dask worker requests are added to the job queues on Casper and Cheyenne with the cluster.scale()
method. After this method is called, you can verify that they are waiting in the queue with this command:
qstat -u <my_username>
on Cheyenne, and the same command will work on Casper after April 2021.
If you see no pending worker jobs, then verify that you have called `cluster.scale()`.
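As a minimal sketch of how those worker jobs get requested in the first place (the queue name, account code, and resource values below are placeholders, not recommendations), using dask-jobqueue:

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Placeholder queue, account, and resource values; adjust for your allocation.
# (Older dask-jobqueue versions use `project=` instead of `account=`.)
cluster = PBSCluster(
    queue="casper",
    account="PROJECT_CODE",
    cores=1,
    memory="4GB",
    walltime="01:00:00",
)

cluster.scale(4)          # submits 4 worker jobs to the PBS queue
client = Client(cluster)  # connect a client so computations use the cluster

# At this point, qstat -u <my_username> should show the pending or running
# worker jobs.
```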
Beginning August 13, 2021, GitHub will no longer accept account passwords when authenticating git operations. There are essentially two options (personal access tokens over HTTPS, or SSH keys), and GitHub provides documentation for setting up each.
A well-known issue with CESM data is that timestamps for fields saved as averages are placed at the end of the averaging period. For instance, in the following example, the `January/1920` average has a timestamp of `February/1920`:
In [25]: filename = '/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/atm/proc/tseries/monthly/TS/b.e11.B20TRC5CNBDRD.f09_g16.011.cam.h0.TS.192001-200512.nc'
In [33]: ds = xr.open_dataset(filename)
In [34]: ds.time
Out[34]:
<xarray.DataArray 'time' (time: 1032)>
array([cftime.DatetimeNoLeap(1920, 2, 1, 0, 0, 0, 0),
cftime.DatetimeNoLeap(1920, 3, 1, 0, 0, 0, 0),
cftime.DatetimeNoLeap(1920, 4, 1, 0, 0, 0, 0), ...,
cftime.DatetimeNoLeap(2005, 11, 1, 0, 0, 0, 0),
cftime.DatetimeNoLeap(2005, 12, 1, 0, 0, 0, 0),
cftime.DatetimeNoLeap(2006, 1, 1, 0, 0, 0, 0)], dtype=object)
Coordinates:
* time (time) object 1920-02-01 00:00:00 ... 2006-01-01 00:00:00
Attributes:
long_name: time
bounds: time_bnds
A temporary workaround is to fix the issue ourselves by computing a new time axis from the average of the time bounds:
In [29]: import xarray as xr
In [30]: import cf_xarray # use cf-xarray so that we can use CF attributes
In [31]: filename = '/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/atm/proc/tseries/monthly/TS/b.e11.B20TRC5CNBDRD.f09_g16.011.cam.h0.TS.192001-200512.nc'
In [32]: ds = xr.open_dataset(filename)
In [34]: attrs, encoding = ds.time.attrs.copy(), ds.time.encoding.copy()
In [36]: time_bounds = ds.cf.get_bounds('time')
In [37]: time_bounds_dim_name = ds.cf.get_bounds_dim_name('time')
In [38]: ds = ds.assign_coords(time=time_bounds.mean(time_bounds_dim_name))
In [39]: ds.time.attrs, ds.time.encoding = attrs, encoding
In [40]: ds.time
Out[40]:
<xarray.DataArray 'time' (time: 1032)>
array([cftime.DatetimeNoLeap(1920, 1, 16, 12, 0, 0, 0),
cftime.DatetimeNoLeap(1920, 2, 15, 0, 0, 0, 0),
cftime.DatetimeNoLeap(1920, 3, 16, 12, 0, 0, 0), ...,
cftime.DatetimeNoLeap(2005, 10, 16, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2005, 11, 16, 0, 0, 0, 0),
cftime.DatetimeNoLeap(2005, 12, 16, 12, 0, 0, 0)], dtype=object)
Coordinates:
* time (time) object 1920-01-16 12:00:00 ... 2005-12-16 12:00:00
Attributes:
long_name: time
bounds: time_bnds
cf-xarray can be installed via pip or conda. cf-xarray docs are available [here](https://cf-xarray.readthedocs.io/en/latest/).
[^1]: Assuming you are using a bash terminal, which is the default on NCAR HPC.