Writing to H5AD fails if AnnData contains datetime64 objects #455
Comments
We are aiming to support more types, as well as an interface for registering extensions, for reading and writing. Date times seem like a good case for first party support, depending on how complicated this gets. I'm worried this is going to be complicated. Some points on this:
Yeah, I agree it could get complicated so up to maintainers whether it is worth the effort. The
Timezones are hella complicated and change all the time, so what we shouldn’t do is convert them to offsets and store them that way I think …
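To illustrate the point about offsets, here is a minimal sketch using only the Python standard library. The zone `Europe/Berlin` and the dates are arbitrary choices for illustration, not something from this thread. It shows that an offset captured from one timestamp gives the wrong local time when applied to another timestamp in the same zone, while the zone name does not.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Illustrative zone and dates (assumptions, not from the issue).
berlin = ZoneInfo("Europe/Berlin")
winter = datetime(2021, 1, 15, 12, 0, tzinfo=berlin)
summer = datetime(2021, 7, 15, 12, 0, tzinfo=berlin)

# The same named zone has different UTC offsets across the year.
winter_offset = winter.utcoffset()
summer_offset = summer.utcoffset()

# Suppose only the winter offset had been stored for the whole column.
fixed = timezone(winter_offset)
summer_utc = summer.astimezone(timezone.utc)

wrong_local = summer_utc.astimezone(fixed)   # offset-based reconstruction
right_local = summer_utc.astimezone(berlin)  # name-based reconstruction

print(winter_offset, summer_offset)
print(wrong_local.isoformat(), right_local.isoformat())
```

Storing the zone name alongside UTC instants keeps the daylight-saving rules out of the file and leaves their interpretation to the reader's timezone database.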
@flying-sheep, I think we would need to store them in a way that

Looking at the arrow implementation, this might actually be straightforward. A date time column would be stored as a dataset of 64 bit integers. The attributes include the time unit (typically nanoseconds) and optionally a timezone. All times in a column must share a timezone (pandas represents these as

Draft functions (these will need decorators for dispatch, additional metadata etc.):

    import numpy as np
    import pandas as pd
    import h5py

    def write_datetime(f, k, v, *, dataset_kwargs=None):
        # Store datetime64 values as int64 and record the time unit as an attribute.
        d = f.create_dataset(k, data=v.view(np.int64), **(dataset_kwargs or {}))
        d.attrs["unit"] = np.datetime_data(v.dtype)[0]
        return d

    def read_datetime(d: h5py.Dataset) -> np.ndarray:
        # Reinterpret the stored integers using the recorded unit.
        unit = d.attrs["unit"]
        return d[...].view(f"datetime64[{unit}]")

    def write_timestamp(f, k, v, *, dataset_kwargs=None):
        # tz-aware values are written as UTC instants; the zone goes in an attribute.
        values = v if v.tz is None else v.tz_convert(None)
        d = write_datetime(f, k, np.asarray(values), dataset_kwargs=dataset_kwargs)
        if v.tz is not None:
            d.attrs["tz"] = str(v.tz)

    def read_timestamp(d: h5py.Dataset) -> pd.arrays.DatetimeArray:
        arr = read_datetime(d)
        if "tz" in d.attrs:
            tz = d.attrs["tz"]
            unit = d.attrs["unit"]
            dtype = pd.DatetimeTZDtype(unit=unit, tz=tz)
        else:
            dtype = arr.dtype
        # Draft: the DatetimeArray constructor's behaviour depends on the pandas version.
        return pd.arrays.DatetimeArray(arr, dtype=dtype)

This should be similar for
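As a quick check of the storage scheme described above (int64 values plus a unit and an optional timezone attribute), here is a small round-trip sketch that needs only numpy and pandas. A plain dict named `stored` stands in for the HDF5 dataset and its attributes; that dict, the example dates, and the zone are assumptions made for illustration, not part of the proposed anndata API.

```python
import numpy as np
import pandas as pd

# A tz-aware column, as pandas might hold it (illustrative values).
original = pd.DatetimeIndex(
    ["2021-01-15 12:00", "2021-07-15 12:00"]
).tz_localize("Europe/Berlin")

# "Write": convert to UTC, drop the tz, and keep int64 values plus attributes.
utc_naive = original.tz_convert(None)
values = utc_naive.to_numpy()
stored = {
    "data": values.view(np.int64),
    "unit": np.datetime_data(values.dtype)[0],
    "tz": str(original.tz),
}

# "Read": reinterpret the integers with the unit, then reattach the timezone.
restored_naive = stored["data"].view(f"datetime64[{stored['unit']}]")
restored = (
    pd.DatetimeIndex(restored_naive).tz_localize("UTC").tz_convert(stored["tz"])
)

print(restored)
```

Because the integers are UTC instants, the timezone attribute only affects how the values are displayed, not which moments in time they denote.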
So, it turns out datetimes are super complicated and contentious. To some extent we are beholden to whatever pandas does, but it looks like what pandas does is under active development (pandas-dev/pandas#40932). Some notes on the state of time types in our dependencies:
Other libraries:
Writing an AnnData object to disk fails if it contains numpy datetime64 objects. Example:
ERROR MESSAGE
This behaviour is from h5py rather than anndata. From v3.0.0 they suggest storing these values using opaque dtypes https://docs.h5py.org/en/stable/special.html#opaque-dtypes.
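For reference, the opaque-dtype approach from the h5py documentation linked above looks roughly like the sketch below. It follows the pattern in the h5py docs for version 3.0 and later; the in-memory file (`driver="core"`, `backing_store=False`), the file name, and the dataset name `data` are choices made here so the example leaves nothing on disk.

```python
import numpy as np
import h5py

arr = np.array([np.datetime64("2019-09-22T17:38:30")])

# In-memory HDF5 file so nothing is written to disk (illustrative choice).
with h5py.File("opaque_demo.h5", "w", driver="core", backing_store=False) as f:
    # Tag the datetime64 dtype as opaque so h5py will accept it.
    f["data"] = arr.astype(h5py.opaque_dtype(arr.dtype))
    roundtrip = f["data"][:]

print(roundtrip)
```

Data stored this way is opaque to HDF5 itself, so other HDF5 tools may not be able to interpret it. That is one reason an explicit integer-plus-attributes encoding may be preferable for an interchange format.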
Not sure how big a problem this is but it came up here theislab/zellkonverter#24 so I just wanted to raise it. Would be nice if anndata could handle this (at least with a nicer error) but maybe not a priority.