Handling materialization of lazy arrays #748
Comments
I think this topic will have to be addressed in v2024, as it's too big to be squeezed in v2023 which we're trying very hard to wrap up 😅 |
A few quick comments:
|
No pressure. 😉
Thanks Ralf -- That'd be a big help indeed. Materializing an entire array, as opposed to one element, is something that should be a common API across libraries, IMHO. I changed the title to reflect that. |
Cross linking #728 as it may be relevant to this discussion. |
Just wanted to point out that it may be common but not universal. For instance, ndonnx arrays may not have any data that can be materialized. Such arrays do have data types and shapes and enable ONNX export of Array API compatible code. ONNX models are serializable computation graphs that you can load later, and so these "data-less" arrays denote model inputs that can be supplied at an entirely different point in time (in a completely different environment). There are some inherently eager functions like |
I think that xarray ends up surfacing closely-related issues to this - see pydata/xarray#8733 (comment) for a summary of the problems. |
One thing I've been using is
Pre-allocated arrays make a big difference in my mind in big data applications. |
Thanks for sharing @TomNicholas. I stared at your linked xarray comment for a while, but I'm missing too much context to fully understand it, I'm afraid. You're touching on
Dask does (A), at least when one calls
@TomNicholas, are you hitting case (C) here with Xarray? And if so, is that for interchange between libraries only, or something else? |
Sorry @rgommers! I'll try to explain the context here:
In xarray we try to wrap all sorts of arrays, including multiple types of lazy arrays. Originally xarray wrapped numpy arrays, then it gained an intermediate layer of its own internal lazy indexing classes which wrap numpy arrays, then it also gained the ability to wrap dask arrays (but special-cased them).
More recently I tried to generalize this so that xarray could wrap other lazily-evaluated chunked arrays (in particular cubed arrays, which act like a drop-in replacement for
A common problem is different semantics for computing the lazy array type. Coercing to numpy via
Dask and Cubed are also special in that they have
More recently again we've realised there's another type of array we want to wrap: chunked arrays that are not necessarily computable. This is what that issue I linked was originally about. The comment I linked to is trying to suggest how we might separate out and distinguish between all these cases from within xarray, with the maximum amount of things "just working".
Not really - I'm mostly just talking about lazy/duck array -> numpy so far. |
Background
Some colleagues and I were doing some work on sparse when we stumbled onto a limitation of the current Array API Standard, and @kgryte was kind enough to point out that it might have wider implications than just sparse, so it would be prudent to discuss it with other relevant parties within the community before settling on an API design to avoid fragmentation.
Problem Statement
There are two notable things missing from the Array API standard today, which sparse, and potentially Dask, JAX and other relevant libraries, might also need.
Storage formats: for sparse, this would be the format of the sparse array (CRS, CCS, COO, ...).
Lazy arrays and materialization: sparse/JAX might use this to build up kernels before running a computation.
Potential solutions
Overload the Array.device attribute and the Array.to_device method.
One option is to overload the objects returned/accepted by these to contain a device + storage object. Something like the following:
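A minimal sketch of how a combined device + storage object could behave, written as plain Python. Every name here except Array.device, Array.to_device and default_device is hypothetical (Storage, its fields, the "lazy"/"dense" format strings and the thunk-based toy array are assumptions for illustration, not part of the Array API standard or of any library mentioned in this thread):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Storage:
    """Hypothetical device + storage-format pair (not in the standard)."""
    device: str  # e.g. "cpu"
    format: str  # e.g. "dense", "coo", "lazy"


def default_device() -> Storage:
    # Assumption: the default storage is eager, dense data on the CPU.
    return Storage(device="cpu", format="dense")


class Array:
    """Toy array: either materialized (holds data) or lazy (holds a thunk)."""

    def __init__(self, data=None, thunk=None, storage=None):
        self._data = data
        self._thunk = thunk
        self._storage = storage if storage is not None else default_device()

    @property
    def device(self) -> Storage:
        # Overloaded: returns device *and* storage format together.
        return self._storage

    def to_device(self, target: Storage) -> "Array":
        if target.format == "lazy":
            # Stay (or become) lazy: wrap existing data in a thunk if needed.
            thunk = self._thunk
            if thunk is None:
                data = self._data
                thunk = lambda: data
            return Array(thunk=thunk, storage=target)
        # Any non-lazy target forces the computation to run.
        data = self._thunk() if self._thunk is not None else self._data
        return Array(data=data, storage=target)


lazy = Array(thunk=lambda: [1, 2, 3], storage=Storage("cpu", "lazy"))
eager = lazy.to_device(default_device())  # materializes the array
```

In this toy model, materialization is just a move to the default storage, which is the pattern the proposal describes.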
To materialize an array, one could use to_device(default_device()) (possible after #689 is merged).
Advantages
As far as I can see, it's compatible with how the Array API standard works today.
Disadvantages
We're mixing the concepts of an execution context and storage format, and in particular overloading operators in a rather weird way.
Introduce an Array.format attribute and Array.to_format method.
Advantages
We can get the API right, maybe even introduce xp.can_mix_formats(...).
Disadvantages
Would need to wait till the 2024 revision of the standard at least.
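For comparison, a rough sketch of the second option. The names Array.format, Array.to_format and can_mix_formats come from the proposal above, but their signatures and semantics are not specified there, so everything below is an assumption: the toy Array class, the format strings, and the "all formats must match" mixing rule are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Array:
    """Toy array carrying an explicit storage format (hypothetical)."""
    data: tuple   # placeholder payload, not a real sparse layout
    format: str   # e.g. "coo", "crs", "ccs", "dense"

    def to_format(self, fmt: str) -> "Array":
        # A real library would convert the underlying buffers here;
        # this sketch only relabels the format.
        return Array(self.data, fmt)


def can_mix_formats(*arrays: Array) -> bool:
    # Assumed rule: operands may be combined only if all formats are equal.
    return len({a.format for a in arrays}) <= 1


a = Array((1.0, 2.0), format="coo")
b = Array((3.0, 4.0), format="crs")

# Convert explicitly before combining operands with different formats.
if not can_mix_formats(a, b):
    b = b.to_format(a.format)
```

Keeping the format separate from the device avoids mixing execution context with storage layout, at the cost of a new attribute and method in the standard.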
Tagging potentially interested parties: