Add io_uring event loop #15634
base: master
Conversation
Abstraction of `io_uring` syscalls to create a ring, map the kernel buffers into userspace, submit operations and iterate completions. Also provides optional support for SQPOLL with proper wakeup of the SQ thread when it has gone to sleep.
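For reference, a minimal sketch of that SQPOLL wakeup logic, assuming hypothetical ring helpers (`load_sq_kflags`, `enter`) on top of the kernel's flag names (which the PR's LibC bindings are assumed to expose):

```crystal
# Sketch only: with IORING_SETUP_SQPOLL the kernel thread consumes the SQ by
# itself, but it goes to sleep when idle; the SQ flags then advertise
# IORING_SQ_NEED_WAKEUP and we must call io_uring_enter(2) to wake it.
def wakeup_sq_thread_if_needed(ring) : Nil
  if ring.load_sq_kflags & LibC::IORING_SQ_NEED_WAKEUP != 0
    ring.enter(to_submit: 0_u32, flags: LibC::IORING_ENTER_SQ_WAKEUP)
  end
end
```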
Since read can also fail with EINTR, we may always have to retry...
We must submit operation chains in a single shot, that is, update the SQ tail shared with the kernel (sq_ktail) in a single STORE after populating all the SQEs to chain together. This led to an overhaul of the System::IoUring abstraction and the EventLoop::IoUring async helpers. It fixes the issue where CLOSE happens after ASYNC_CANCEL when closing a file descriptor, and makes sure that LINK_TIMEOUT will always be correctly registered to the previous READ, WRITE or POLL.
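A minimal sketch of the single-shot idea, with invented helper names (`sq_local_tail`, `sqe_at`, `store_sq_ktail`) rather than the refactored API: every SQE of the chain is filled in first, and only then is the shared tail published to the kernel in one store.

```crystal
# Illustrative only: chain a READ with its LINK_TIMEOUT and publish both at
# once. The kernel (or the SQPOLL thread) must never observe a half-written
# chain, so sq_ktail is written exactly once, after all SQEs are populated.
def submit_read_with_timeout(ring, fd : Int32, buf : Bytes, ts : LibC::Timespec*) : Nil
  tail = ring.sq_local_tail

  read = ring.sqe_at(tail)
  read.value.opcode = LibC::IORING_OP_READ
  read.value.fd = fd
  read.value.addr = buf.to_unsafe.address
  read.value.len = buf.size.to_u32
  read.value.flags = LibC::IOSQE_IO_LINK # ties the timeout to this READ
  tail &+= 1

  lto = ring.sqe_at(tail)
  lto.value.opcode = LibC::IORING_OP_LINK_TIMEOUT
  lto.value.addr = ts.address
  lto.value.len = 1_u32
  tail &+= 1

  ring.store_sq_ktail(tail) # single release-ordered store of the shared tail
end
```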
# one thread closing a fd won't interrupt reads or writes happening in
# other threads, for example a blocked read on a fifo will keep blocking,
# while close would have finished and closed the fd; we thus explicitly
# cancel any pending operations on the fd before we try to close
The close(2) manpage explicitly states that some systems interrupt any blocking read or write, but the Linux behavior is to 🙈
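To make the cancel-before-close sequence concrete, here is a hedged sketch; the ring helpers and SQE field names are assumptions about the bindings, and IORING_ASYNC_CANCEL_FD/IORING_ASYNC_CANCEL_ALL require a 5.19+ kernel:

```crystal
# Sketch: cancel everything pending on `fd`, then close it. The hard link
# keeps CLOSE queued behind the cancel even if the cancel itself completes
# with an error (e.g. nothing left to cancel).
def cancel_then_close(ring, fd : Int32) : Nil
  cancel = ring.next_sqe
  cancel.value.opcode = LibC::IORING_OP_ASYNC_CANCEL
  cancel.value.fd = fd
  cancel.value.cancel_flags = LibC::IORING_ASYNC_CANCEL_FD | LibC::IORING_ASYNC_CANCEL_ALL
  cancel.value.flags = LibC::IOSQE_IO_HARDLINK

  close = ring.next_sqe
  close.value.opcode = LibC::IORING_OP_CLOSE
  close.value.fd = fd

  ring.submit # both SQEs are published with a single tail update (see above)
end
```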
# TODO: we could check if tail changed and iterate more, until we reach the
# maximum iterations count
end
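For context, a sketch of the loop that TODO refers to, with invented accessors for the shared head/tail and an assumed CQE struct name:

```crystal
# Sketch: consume CQEs up to the published tail, then re-read the tail a
# bounded number of times to pick up completions that arrived while we were
# processing, without letting a busy ring starve the rest of the event loop.
def each_completion(ring, max_rounds = 4, & : Pointer(LibC::IoUringCqe) ->) : Nil
  head = ring.cq_local_head

  max_rounds.times do
    tail = ring.load_cq_ktail # acquire-ordered read of the shared tail
    break if head == tail

    while head != tail
      yield ring.cqe_at(head)
      head &+= 1
    end

    ring.store_cq_khead(head) # hand the consumed entries back to the kernel
  end
end
```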
The following enums are only used to enhance Crystal.trace.
def interrupt : Nil
  # the atomic makes sure we only write once (no need to write multiple times)
  @eventfd.write(1) if @interrupted.test_and_set
This is broken: there is no @eventfd.
Well, someone (cough) has already done that job, though I can definitely understand not wanting the extra dependency. That said, there is a flag that can be passed when building liburing that doesn't inline anything (thanks Rust people!), but assuming that build to be available is optimistic, unless we build it ourselves.
(I will look at the actual code later, but a take on it that might perhaps inspire is https://github.com/yxhuvud/nested_scheduler_io_uring_context/blob/main/src/nested_scheduler/io_uring_context.cr, which is a plugin for nested_scheduler to use io_uring. It is definitely broken in some aspects, not even counting the general shift of the codebase that has happened since the nested_scheduler set of monkeypatches worked. FWIW, the best part of nested_scheduler was how much it made it possible to clean up the specs.)
EDIT: Oh, and there is a nice io_uring Discord available if you want to bounce ideas with people. Some of your musings, like the close-fd parts, may have good ideas or at least answers there: https://discord.gg/T9WqsqPZ
Neat.
Regarding the TODO "push MORE things to the event-loop (mkdir, listen, bind, ...) 🤑": more importantly, the nonsocket file write, read, fsync, fstat, etc. that live in FileDescriptor.
end

def finalize
  close
drain could perhaps be necessary. But perhaps it is like exit and flushing writes 🤷
Let's err on the safe side and say not. We'll need the ability to drain a ring if we want to shutdown an execution context or thread anyway.
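A very rough sketch of what such a drain could look like, all helper names invented:

```crystal
# Sketch: stop producing new SQEs, then keep waiting for completions until
# nothing is in flight anymore, so the ring can be torn down without losing
# CQEs for operations still running in the kernel.
def drain(ring) : Nil
  until ring.inflight.zero?
    ring.enter(wait_nr: 1_u32) # block in io_uring_enter(2) for at least one CQE
    ring.each_completion do |cqe|
      ring.complete(cqe) # resume/notify whoever was waiting on this operation
    end
  end
end
```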
IORING_FEAT_LINKED_FILE  = 1_u32 << 12
IORING_FEAT_REG_REG_RING = 1_u32 << 13

IORING_OP_NOP = 0_u32
I find this weird, as the op field in the struct is a u8 and not a u32, but the weirdness is also present in liburing, so I guess it doesn't matter. The same confusion exists in SQE_FLAGS.
The age-old question of whether to mirror the C files' structure or to use properly sized enums, I guess. Compare:
enum Op : UInt8
  NOP
  READV
  ..
end
Which may be a bit less prone to copy-pasta errors as long as the order is kept correct.
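For what it's worth, a tiny self-contained illustration of the typed-enum variant (values follow the kernel's io_uring_op order), which would also give readable names in Crystal.trace output for free:

```crystal
enum Opcode : UInt8
  NOP    # 0
  READV  # 1
  WRITEV # 2
  FSYNC  # 3
end

op = Opcode::READV
puts op.value # => 1, fits the SQE's UInt8 opcode field directly
puts op       # => READV, handy for tracing
```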
def delete_timer(event : Event*) : Nil
  sqe = @ring.next_sqe
  sqe.value.opcode = LibC::IORING_OP_TIMEOUT_REMOVE
  sqe.value.flags = LibC::IOSQE_CQE_SKIP_SUCCESS
I have a hard time seeing that this is enough requests to matter either way (though I am open to being shown wrong).
Isn't it easier to just put a user_data on it that triggers a nop? I used 0 for this. Or check the CQE result for the canceled result, if that is what you are trying to avoid?
It's really just a "don't even bother pushing a CQE I don't care about".
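Side by side, the two options discussed here would look roughly like this; the ring helper and argument names are stand-ins, not the PR's API:

```crystal
# Sketch: remove a pending timeout identified by the user_data of its
# original TIMEOUT SQE, either suppressing the success CQE entirely or
# tagging it with a sentinel the completion loop ignores.
def remove_timer(ring, timer_user_data : UInt64, skip_cqe : Bool) : Nil
  sqe = ring.next_sqe
  sqe.value.opcode = LibC::IORING_OP_TIMEOUT_REMOVE
  sqe.value.addr = timer_user_data

  if skip_cqe
    # 1. don't even post a CQE when the removal succeeds
    sqe.value.flags = LibC::IOSQE_CQE_SKIP_SUCCESS
  else
    # 2. post the CQE but tag it so the completion loop skips it (the
    #    reviewer used user_data == 0 as the "nothing to do" marker)
    sqe.value.user_data = 0_u64
  end
end
```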
  async_poll(socket, LibC::POLLOUT, socket.@write_timeout) { "Write timed out" }
end

def accept(socket : ::Socket) : ::Socket::Handle?
If the socket is in nonblocking mode, then this will give you an EAGAIN if there isn't an ongoing connection attempt, which in practice means it needs to fall back to wait_readable in that case.
Nope. It just blocks. This is how I noticed that I need to shutdown a socket before close, because accept would NEVER return a CQE in one of the std specs.
Are you certain you didn't open the socket in blocking mode? Because I know for a fact that I spent a long time swearing over that behavior when opening it in nonblocking mode as it was essentially forcing me to either change the default mode or to do a wait_readable-loop instead of just issuing accept/read etc.
But perhaps it was changed at some point as it isn't very useful..
Well, sockets are always nonblocking by default. I had two cases in the std specs where the fiber would never be resumed:
- spec/std/io/io_spec.cr:104 (read blocks)
- spec/std/socket/unix_server_spec.cr:98 (accept blocks)
I just checked and they're non-blocking, and I never get a CQE until I shutdown the sockfd or manually cancel the operation.
I replicated the UnixServer#accept spec for TCPServer, and same behavior: the socket has O_NONBLOCK and io_uring blocks on accept.
Ok. That is really weird, but I guess the weirdness is in the right direction so it is fine 😅
Oh, good find! That'd explain it. I knew about that thread but that was before it had a million replies and an actual resolution :)
Hmm, I wonder if this has made the close-situation worse? 🤔
I'm not sure I follow the thread correctly. It might have been fixed, or they expect the fd to not have O_NONBLOCK, or they only fixed something in IORING_OP_POLL 🤷
Since O_NONBLOCK is at best pointless with io_uring and at worst can cause backward compatibility issues with older Linux kernels, we might want to introduce a method on Crystal::EventLoop to set O_NONBLOCK for new file descriptors (open, socket, accept, dup) when the backend needs it (poll, epoll, kqueue) and do nothing otherwise (IOCP, io_uring).
That would still allow an epoll fallback, when io_uring isn't available, to set O_NONBLOCK. Same for cases where the fd comes from an external library, maybe?
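A rough sketch of what such a hook could look like; the method name and the default/override split are invented here for discussion, not the PR's (or stdlib's) API:

```crystal
# Hypothetical hook: completion-based backends (io_uring, IOCP) keep fds
# blocking, readiness-based backends flip them to O_NONBLOCK.
module Crystal
  abstract class EventLoop
    # Called right after open/socket/accept/dup returns a new fd.
    def configure_blocking(fd : Int32) : Nil
      # default: nothing to do
    end
  end

  abstract class EventLoop::Polling < EventLoop
    # poll/epoll/kqueue must never block the thread on a read/write/accept:
    def configure_blocking(fd : Int32) : Nil
      flags = LibC.fcntl(fd, LibC::F_GETFL, 0)
      LibC.fcntl(fd, LibC::F_SETFL, flags | LibC::O_NONBLOCK)
    end
  end
end
```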
Yeah, it might not be helping with close. I'll have to check.
At least
Ah, I got confused by the *_fully methods.
Said differently: we don't need the sys/uio header just to bring the iovec struct because sys/socket shall define it.
Initial attempt at writing an EventLoop backed by io_uring for Linux targets.
It requires a number of features that were implemented in different versions of the kernel (I'm not sure exactly which), one of the most important being "don't drop any CQE as long as there is memory" 😅
It supports SQPOLL (implicit submissions) to further reduce the number of syscalls.
Unlike the polling EventLoop (libevent, epoll, kqueue) it's fully async, meaning that any attempt to read or write will yield the current fiber, regardless of whether there might already have been something to read. Expect lots of fiber yields!
WARNING: THREAD UNSAFE! MT will segfault. Only try it with a single thread 💣 💣 💣
PREREQUISITES:
- Crystal::EventLoop#reopened(FileDescriptor) hook #15640
- Crystal::EventLoop#close do the actual close (not just cleanup) #15641
- O_NONBLOCK (see Add io_uring event loop #15634 (comment))
- blocking arg on File (?)

TODO:
- .remove_impl(FileDescriptor|Socket) so we don't leak fibers

BONUSES (over Epoll):
FOLLOW UP:
- ... (blocking arg);
- push more things to the event loop (open, fsync, fstat, mkdir, link, listen, bind, ...) 🤑

NOTES:
- I didn't use liburing to avoid bringing an external dependency; plus, the io_uring_prep_* functions are inlined in the io_uring.h header and would have had to be rewritten anyway. The Crystal::System::IoUring struct does most of the job, then Crystal::EventLoop::IoUring directly fills the SQEs.
- Adding support for MT and EC involves a few hurdles. See #10740 (comment)