
Array Ecosystem Interoperability #113

Open · pbower opened this issue Dec 28, 2024 · 5 comments
Labels: enhancement (New feature or request)

Comments


pbower commented Dec 28, 2024

Hey,

Awesome library, Lachlan. Great to see another Aussie getting into it.

I've got a quick question about planned interoperability between zarrs and other array frameworks, essentially for hooking them into the wider ecosystem.

I'm wondering if you have any plans for compatibility with the DLPack protocol, or otherwise conversion to NumPy, that sort of thing?
Otherwise, is the memory layout pretty much the same, and if pushed through to Python (e.g. via FFI, or PyO3) will it essentially translate?

I'm asking because the lazy-loading chunked storage is really powerful, but so are the functions in that ecosystem, so I'm trying to find the line that gives the best of both worlds.

Thanks heaps,
Pete

niklasmueboe (Contributor) commented Dec 29, 2024

Regarding the conversion to numpy, you can have a look at pyo3-numpy, which provides functionality for interfacing Python's numpy with Rust's ndarray.
Rust's ndarray is already supported by zarrs, so that should probably cover most use cases.

LDeakin (Owner) commented Dec 29, 2024

Otherwise, is the memory layout pretty much the same, and if pushed through to Python (e.g. via FFI, or PyO3) will it essentially translate?

We are already doing some numpy interop here: https://github.com/ilan-gold/zarrs-python. This package slots nicely into the Python ecosystem - users can continue to use zarr-python, xarray etc with a performance boost from zarrs.

Wondering if you have any plans for compatibility with the DL Pack protocol

@ilan-gold did look into using DLPack with zarrs-python for a bit, but we shifted strategy after this PR: ilan-gold/zarrs-python#22. Relevant:

@pbower , how would you envision using DLPack, would you want to stay entirely in Python land? I am hesitant to expose more of zarrs to Python (at least for now), because Python already has zarr-python (+ zarrs-python) and tensorstore for Zarr V3. Alternatively, it would probably be straightforward to add DLPack support to zarrs on the Rust side and leave Python interop to the zarrs consumer.

ilan-gold commented Dec 29, 2024

Hi, just to chime in: despite the theoretical niceties of DLPack, in practice it's probably easier to use existing APIs, i.e. return a numpy array from Rust to Python and then use numpy's DLPack functionality. I'm not sure I see the added benefit of doing it in Rust. I'm even considering closing that issue in zarr-python: I tried implementing it, and having a defined API instead of just a DLPack struct is actually quite helpful (outweighing the benefit of DLPack under the hood in zarr-python), but I'll need to look back over it to be sure.

pbower (Author) commented Dec 30, 2024

Hey @LDeakin and @ilan-gold ,

Thanks for the prompt and thorough response. It’s much appreciated.

Also, @niklasmueboe, thanks for the helpful link.

Firstly, regarding 'Alternatively, it would probably be straightforward to add DLPack support to zarrs on the Rust side and leave Python interop to the zarrs consumer': yes, this would be the most helpful. To share some context, this is to support application use cases where the main data objects are stored in Rust, but call-outs are made to Python to help leverage that ecosystem (with various concurrency optimisations).

There are a few reasons that I view this as useful:

  1. Highly concurrent web servers (e.g. tokio) working directly with these arrays, occasionally needing to send data to Python for processing (zero-copy via FFI / PyO3), but otherwise having a fast read copy available that is not locked by the GIL.
  2. DLPack is endorsed by the consortium of Python array standards as the interchange format (https://arrow.apache.org/docs/python/dlpack.html); like Arrow, but for arrays. Having this ready to go for direct interchange with a specific library helps avoid numpy as a transitory data structure, which can matter for certain workloads at scale. Admittedly, this is similar to going via ndarray on the Rust side, but given frameworks now support DLPack, it is super helpful for pluggability.
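To sketch the direct-interchange idea in point 2 (hypothetical names, numpy-only): any object exposing `__dlpack__` / `__dlpack_device__` can be consumed by a DLPack-aware framework directly, without numpy as an intermediate. Here a thin wrapper class, `RustArrayHandle`, stands in for a Rust-owned buffer and simply delegates to numpy to show the protocol's shape.

```python
import numpy as np

# Hypothetical stand-in for an array whose buffer is owned on the Rust side.
# It exposes the two methods of the DLPack protocol by delegating to an
# underlying numpy array; a real zarrs binding would hand out its own capsule.
class RustArrayHandle:
    def __init__(self, buf):
        self._buf = buf

    def __dlpack__(self, **kwargs):
        # Forward protocol kwargs (e.g. stream, max_version) unchanged.
        return self._buf.__dlpack__(**kwargs)

    def __dlpack_device__(self):
        return self._buf.__dlpack_device__()

handle = RustArrayHandle(np.arange(6, dtype=np.float64))

# Any DLPack consumer (numpy here, but equally torch / jax / cupy) can
# ingest the handle directly; the result shares memory with the buffer.
out = np.from_dlpack(handle)
assert np.shares_memory(out, handle._buf)
```

The point of the sketch is that the consumer never needed to know the producer was numpy; it only needed the protocol methods.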

Appreciating that these requirements may be viewed as niche, I'd find it valuable to have DLPack pluggability on the Rust side of this framework. Then there would be 'chunk-based n-dimensional arrays' available in the global array interchange format, in what's arguably the fastest modern programming language, easily pluggable into the Python / user-facing ecosystem.

A comparable application back-end is https://tiledb.com/, which seems aligned with the chunked-storage approach in this framework.

Acknowledging, though, that it's work on your side, so if it's not achievable at this time, no issues!

Thanks for considering it.

Regards,
Pete

LDeakin added the enhancement (New feature or request) label on Dec 30, 2024
ilan-gold commented Jan 7, 2025

@pbower Some more questions:

here the main data objects are stored in rust, but call outs are made to python

Does this mean you are calling out to Python from inside Rust on the objects stored in Rust? I assume (although that's the extent of my knowledge) that using numpy-ndarray for Rust-numpy interop is zero-copy. I would be surprised if there were a copy, and if there were, I would be interested in contributing a no-copy version to their repo. You could then go from Python numpy to DLPack without any further copy.

assists with avoiding numpy as a transitory data structure, which can help with certain workloads at scale.

I am actually curious about this. Do you have any experience with performance hits? We had DLPack for a bit and, crudely speaking, we didn't see any performance hit from switching away from DLPack to using numpy-ndarray in Rust. I was under the impression the whole objective of DLPack interop was to prevent any copying, so I'm curious about the overhead involved with these zero-copy approaches.
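For what it's worth, the cost being argued about can be probed with a crude numpy-only micro-measurement (a sketch; the array size is illustrative, not from the thread): what zero-copy interchange avoids is the memcpy of the buffer, while creating a view is effectively free.

```python
import time
import numpy as np

# ~64 MiB of float32 data, large enough for a copy to be measurable.
a = np.zeros((4096, 4096), dtype=np.float32)

t0 = time.perf_counter()
view = a[:]               # zero-copy: a new array object over the same buffer
t_view = time.perf_counter() - t0

t0 = time.perf_counter()
copied = a.copy()         # full memcpy of the ~64 MiB buffer
t_copy = time.perf_counter() - t0

assert np.shares_memory(a, view)
assert not np.shares_memory(a, copied)
print(f"view: {t_view * 1e6:.1f} us, copy: {t_copy * 1e3:.1f} ms")
```

Whether that memcpy matters in practice depends on how large the chunks are relative to the surrounding decode and I/O work, which is consistent with not observing a hit after switching away from DLPack.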

Lastly, Rust-side DLPack support is a bit minimal at the moment. I tried out https://github.com/SunDoge/dlpark but it took me a non-trivial amount of time to get it working. I'm happy to share my experience.


4 participants