Initial direct I/O support in FilesystemStore
#58
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main      #58      +/-   ##
==========================================
+ Coverage   82.26%   82.28%   +0.02%
==========================================
  Files         160      160
  Lines       22183    22228      +45
==========================================
+ Hits        18249    18291      +42
- Misses       3934     3937       +3
```

☔ View full report in Codecov by Sentry.
This is great stuff, thanks for this. I've not had time to look at this thoroughly or test it, but here are my first thoughts.

Open questions

This has so much overlap with the standard filesystem store, so just keep it as an option.

What about a

Let's aim for incremental improvements and get this in soon.

Yep, the rest of the crate panics on allocation failure.

I wonder how

Fortunately it is feasible to allocate aligned

Reducing allocations

Would you benefit from a function like

```rust
// unsafe because the responsibility is on the caller to ensure the chunk is encoded correctly
// impl<TStorage: WritableStorageTraits + ..> Array<TStorage>
pub unsafe fn store_encoded_chunk(
    &self,
    chunk_indices: &[u64],
    encoded_chunk_bytes: bytes::Bytes,
) -> Result<(), ArrayError> {
    // part of the store_chunk_opt() implementation
    ...
    crate::storage::store_chunk(
        &*storage_transformer,
        self.path(),
        chunk_indices,
        self.chunk_key_encoding(),
        encoded_chunk_bytes,
    )?;
    Ok(())
}
```
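A rough sketch of how a caller might use such a function, purely for illustration; `encode_chunk_into_aligned_buffer`, `chunk_data` and `chunk_indices` are hypothetical names on the caller's side, not existing zarrs API:

```rust
// Hypothetical usage: the caller encodes into its own (page-aligned) buffer
// and hands the bytes straight to the store, skipping the intermediate copy.
let encoded: bytes::Bytes = encode_chunk_into_aligned_buffer(&chunk_data)?;
// unsafe: the caller guarantees the bytes are a correctly encoded chunk.
unsafe {
    array.store_encoded_chunk(&chunk_indices, encoded)?;
}
```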
Longer term

I need to think on this for a while... but one potential path that might benefit all
Thanks, I'll push an update tomorrow.
Done.
👍
I'll need to look a bit more into how the
I think it would, yes. I do realize that we have a use-case like in the
I think it's a good idea to sit on this a bit, and to first play around with other local I/O solutions like
What if the codecs (or at least the last codec in the chain) instead ask the storage for a buffer? I haven't thought this to completion, but basically if the storage could provide (or even "loan" for re-use) buffers out to the upper layers, it could hide implementation details like alignment, or any other special treatment that is needed for the specific storage backend. Thinking about the "pass-through" case:
In the non-pass-through case, this chain would be interrupted wherever the data needs to change - let's imagine a single compression codec, which writes into a buffer provided by the storage, and provides a different buffer for the upper codec layers to write their data into. I hope I have the right mental model of how this codec chain works...
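To make the buffer-loaning idea a bit more concrete, here is a minimal sketch of what such an interface could look like; all names are hypothetical and nothing like this exists in zarrs today:

```rust
/// Hypothetical extension trait for stores that can hand out write buffers
/// satisfying their own requirements (e.g. page-size alignment for direct I/O).
pub trait ProvideWriteBuffer {
    /// Buffer type handed up to the codec chain; gives mutable access to `len`
    /// bytes, possibly backed by a larger, aligned and padded allocation.
    type WriteBuffer: AsMut<[u8]>;

    /// Loan a buffer suitable for eventually writing `len` bytes to this store.
    fn write_buffer(&self, len: usize) -> Self::WriteBuffer;

    /// Persist the filled buffer under `key`, ideally recycling the allocation.
    fn store_buffer(&self, key: &str, buf: Self::WriteBuffer) -> std::io::Result<()>;
}
```

The last codec in the chain would write its output into a loaned buffer, so alignment, padding, and buffer re-use stay implementation details of the store.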
Agreed! IMHO they can also be opt-in, if the added API surface is not too large, so the user can decide where they want to land on the performance vs usability scale.
Check for zero-size writes and disable `direct_io` for these.
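For illustration, such a guard can be very small; a sketch with hypothetical variable names:

```rust
// A zero-length write gains nothing from O_DIRECT and would only pay the
// aligned-buffer overhead, so fall back to the buffered path in that case.
let use_direct_io = options.direct_io && !encoded_bytes.is_empty();
```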
See #64 for a first try at implementing this. The answer is, yes, it performs 🥳
Looks good! Just needs a few `#[cfg(unix)]` to pass CI.
macOS also doesn't like `O_DIRECT`, so limit to Linux instead of Unix for now.
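For illustration, one way such a gate might look; a sketch only, assuming the `libc` crate is available, with `open_options` and `direct_io` as made-up names rather than what this PR actually uses:

```rust
use std::fs::OpenOptions;

// Build OpenOptions for writing a chunk file, requesting O_DIRECT only on Linux.
// macOS has no O_DIRECT (it offers fcntl(F_NOCACHE) instead), so the flag is
// compiled in for Linux targets only.
fn open_options(direct_io: bool) -> OpenOptions {
    let mut options = OpenOptions::new();
    options.write(true).create(true);
    #[cfg(target_os = "linux")]
    if direct_io {
        use std::os::unix::fs::OpenOptionsExt;
        options.custom_flags(libc::O_DIRECT);
    }
    // On non-Linux targets the flag is simply ignored.
    #[cfg(not(target_os = "linux"))]
    let _ = direct_io;
    options
}
```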
Thanks!
👍
Yep, that could probably work. The standard encode/decode methods of codecs do not provide any access to storage. But what you are after is not so different from the partial decoders and partial encoders (#45). They pass input/output handles through the codec chain and the last codec ends up calling storage methods via StoragePartialDecoder / StoragePartialEncoder. The same ideas could be applied to non-partial encoding.
This is great stuff! @sk1p wrote:
I'm happy to help think through using

I did some quick benchmarking on a PCIe 5 SSD on an AMD Epyc workstation. The headline is that

However, when reading the entire files at once,

So, when reading many small chunks (256 KiB),

However, it's a lot of effort to achieve this throughput! My first
Thank you for the offer! I already came across your project while researching
These are impressive numbers! I'm fortunate that we can often work in more comfortable block sizes - 1MiB+ is common for us, and for writing data, often even larger (something like 8MiB for
I can imagine, and I guess you learned a lot with that project! I didn't yet look into your benchmarks in detail - does

From your README:
It's interesting, I'm kind of going in the opposite direction - in LiberTEM, we first worked on doing efficient out-of-core, thread-per-core streaming computation, mostly on vendor-specific file formats, and are now moving towards writing data directly into zarr files from the detector systems (while also doing some "live" stream processing, think previews/immediate feedback etc.). Someone recently used a nice term for this - having the data "born FAIR". I'm definitely interested in staying in contact, let me know if you'd like to chat some time!
My understanding is that the requirements for buffer alignment are defined by the filesystem. So, for example, if your filesystem requires that the start and end of memory buffers are aligned to 512-byte boundaries, then you'll have to align your buffers to 512-byte boundaries, no matter if you're using

IIRC, one of the benefits of
My understanding is that it depends on your read pattern. The page cache is the default because, for most read patterns, you do get better performance if you use the Linux page cache. But, for example, if you want to read lots of small chunks which are randomly distributed, and if you'll never re-read those same chunks (or if you implement your own cache), then `O_DIRECT` can be the faster option.
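As an aside, the alignment requirement discussed above (aligned start and padded length) has to be handled explicitly when allocating; a minimal illustration in Rust, not the PR's actual `PageAlignedBuffer`:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Illustrative aligned buffer: start address and length are both multiples of
/// `align` (e.g. 512 or the page size), which is what direct I/O typically wants.
struct AlignedBuffer {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuffer {
    fn new(len: usize, align: usize) -> Self {
        // Round the requested length up to a whole number of alignment blocks
        // (at least one), so the end of the buffer is aligned too.
        let padded = len.div_ceil(align).max(1) * align;
        let layout = Layout::from_size_align(padded, align).expect("invalid layout");
        // Zero-initialize so the padding never contains uninitialized bytes.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failure");
        Self { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuffer {
    fn drop(&mut self) {
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```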
Absolutely! Sounds good!
Refs #53. I've updated the example in https://github.com/LiberTEM/zarr-dio-proto/ with a case that uses this branch for writing. Let me know if a discussion directly on the PR works for you, otherwise I can move this to the issue. Performance already looks better than the buffered case:
(performance in the buffered case also depends on whether we are overwriting an existing array; in that case it takes ~51s)
Open questions

- A separate `FilesystemDIOStore`, or stay an option, as implemented now?
- Is `new_with_options` forwards-compatible? Should it use a builder, or is that overkill for now?
- What if the `Layout` can't be constructed? Is `panic!` in case of allocation failure OK?
- `seek` also has to be page-size aligned, which is a bit annoying. One could maybe return the buffers with padding and deref to the actually requested `Bytes` somehow?

Profile discussion

An annotated flamegraph of running the example code, from left to right:

- `memcpy` comes from `Bytes::from(chunk_encoded.into_owned())`, as far as I can see
- `PageAlignedBuffer::drop`, see below
- Allocation of the `PageAlignedBuffer` and its initialization to zero beforehand
- The actual syscalls (`mkdir`, `stat`, `open`, `write` etc.)

There are two parts to the remaining overhead: 1) the fact that we have to touch the memory three times (initialization and two copies), and 2) that the allocations aren't re-used, and a significant amount of time is spent in page faults and freeing memory back to the operating system.
The page faults and the cost of `PageAlignedBuffer::drop` are caused by the default system-wide `malloc` not re-using large allocations by default; re-running with `MALLOC_TRIM_THRESHOLD_=18388608 MALLOC_MMAP_THRESHOLD_=18388608` (which happen to be thresholds larger than the per-chunk buffers) reduces the total run time to ~11.7s. This is great, that's already within 2x of the "ideal". It has the following profile:

As it is hard to control these thresholds from a library, an alternative would be to explicitly re-use buffers across multiple chunk writes, but I'm not yet sure how to do this in a non-hacky way (a hacky way: just keep the buffer around in a thread-local `RefCell`... works, but also never `free`'s). Maybe one of the arena crates can be used?

As for the memory copies and initialization:

- The initialization of the `PageAlignedBuffer` can be amortized by re-using the initialized buffer over the life of, for example, the `FilesystemStore`, and/or initializing directly from the given slice of bytes
- Ask the `FilesystemStore` for a suitable buffer, which would already be page-size aligned (removing the last copy)
- For `store_chunk_elements`, maybe the user could instead be presented with a mutable slice (or `ArrayViewMut<T>` or similar) where they write the data, which is backed by a suitable buffer. I think this would get rid of the copy by `into_owned`, as the buffer would be managed by the `FilesystemStore`.

For the last point, a sketch of an API could be:

In the uncompressed case, `buf` would then be handed down without copying all the way to the `FilesystemStore`, where it can be directly used for I/O. Thoughts?

For comparison, a before-profile using buffered I/O:

(all profiles taken on Linux 6.1 / Debian stable with `perf record --call-graph=dwarf -F 1234 <bin>`, converted using `perf script -F +pid` and visualized using https://profiler.firefox.com/)
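To illustrate the "mutable slice backed by a store-provided buffer" idea from the last list item above, a hypothetical shape such an API could take; every name below is illustrative and none of it is existing zarrs API:

```rust
// Hypothetical API shape: the array lends the caller a mutable, store-backed
// slice to fill in place; afterwards the same (page-aligned) buffer is handed
// down to the store for I/O, so both the `into_owned` copy and the extra copy
// into an aligned buffer disappear.
pub trait StoreChunkInPlace<T: Copy> {
    /// Borrow a writable slice over a store-managed buffer for the chunk at
    /// `chunk_indices`, pass it to `fill`, then persist it.
    fn store_chunk_with<F>(&self, chunk_indices: &[u64], fill: F) -> std::io::Result<()>
    where
        F: FnOnce(&mut [T]);
}

// Sketch of intended usage:
// array.store_chunk_with(&[0, 0], |chunk: &mut [f32]| chunk.fill(42.0))?;
```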