Add io_uring event loop #15634
base: master
Conversation
Abstraction of `io_uring` syscalls to create a ring, map the kernel buffers into userspace, submit operations and iterate completions. Also provides optional support for SQPOLL with proper wakeup of the SQ thread when it has gone to sleep.
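For reference, a minimal sketch of that SQPOLL wakeup logic, assuming hypothetical ring helpers (`load_sq_kflags`, `enter`) on top of the kernel's flag names (which the PR's LibC bindings are assumed to expose):

```crystal
# Sketch only: with IORING_SETUP_SQPOLL the kernel thread consumes the SQ by
# itself, but it goes to sleep when idle; the SQ flags then advertise
# IORING_SQ_NEED_WAKEUP and we must call io_uring_enter(2) to wake it.
def wakeup_sq_thread_if_needed(ring) : Nil
  if ring.load_sq_kflags & LibC::IORING_SQ_NEED_WAKEUP != 0
    ring.enter(to_submit: 0_u32, flags: LibC::IORING_ENTER_SQ_WAKEUP)
  end
end
```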
Since read can also fail with EINTR, we may always have to retry...
We must submit operation chains in a single shot, that is, update the SQ tail shared with the kernel (sq_ktail) in a single STORE after populating all the SQEs to chain together. This led to an overhaul of the System::IoUring abstraction and the EventLoop::IoUring async helpers. It fixes the issue where CLOSE happens after ASYNC_CANCEL when closing a file descriptor, and makes sure that LINK_TIMEOUT will always be correctly registered to the previous READ, WRITE or POLL.
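A minimal sketch of the single-shot idea, with invented helper names (`sq_local_tail`, `sqe_at`, `store_sq_ktail`) rather than the refactored API: every SQE of the chain is filled in first, and only then is the shared tail published to the kernel in one store.

```crystal
# Illustrative only: chain a READ with its LINK_TIMEOUT and publish both at
# once. The kernel (or the SQPOLL thread) must never observe a half-written
# chain, so sq_ktail is written exactly once, after all SQEs are populated.
def submit_read_with_timeout(ring, fd : Int32, buf : Bytes, ts : LibC::Timespec*) : Nil
  tail = ring.sq_local_tail

  read = ring.sqe_at(tail)
  read.value.opcode = LibC::IORING_OP_READ
  read.value.fd = fd
  read.value.addr = buf.to_unsafe.address
  read.value.len = buf.size.to_u32
  read.value.flags = LibC::IOSQE_IO_LINK # ties the timeout to this READ
  tail &+= 1

  lto = ring.sqe_at(tail)
  lto.value.opcode = LibC::IORING_OP_LINK_TIMEOUT
  lto.value.addr = ts.address
  lto.value.len = 1_u32
  tail &+= 1

  ring.store_sq_ktail(tail) # single release-ordered store of the shared tail
end
```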
# one thread closing a fd won't interrupt reads or writes happening in
# other threads, for example a blocked read on a fifo will keep blocking,
# while close would have finished and closed the fd; we thus explicitly
# cancel any pending operations on the fd before we try to close
The close(2) manpage explicitly states that some systems interrupt any blocking read or write, but the Linux behavior is to 🙈
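To make the cancel-before-close sequence concrete, here is a hedged sketch; the ring helpers and SQE field names are assumptions about the bindings, and IORING_ASYNC_CANCEL_FD/IORING_ASYNC_CANCEL_ALL require a 5.19+ kernel:

```crystal
# Sketch: cancel everything pending on `fd`, then close it. The hard link
# keeps CLOSE queued behind the cancel even if the cancel itself completes
# with an error (e.g. nothing left to cancel).
def cancel_then_close(ring, fd : Int32) : Nil
  cancel = ring.next_sqe
  cancel.value.opcode = LibC::IORING_OP_ASYNC_CANCEL
  cancel.value.fd = fd
  cancel.value.cancel_flags = LibC::IORING_ASYNC_CANCEL_FD | LibC::IORING_ASYNC_CANCEL_ALL
  cancel.value.flags = LibC::IOSQE_IO_HARDLINK

  close = ring.next_sqe
  close.value.opcode = LibC::IORING_OP_CLOSE
  close.value.fd = fd

  ring.submit # both SQEs are published with a single tail update (see above)
end
```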
# TODO: we could check if tail changed and iterate more, until we reach the
# maximum iterations count
end
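For context, a sketch of the loop that TODO refers to, with invented accessors for the shared head/tail and an assumed CQE struct name:

```crystal
# Sketch: consume CQEs up to the published tail, then re-read the tail a
# bounded number of times to pick up completions that arrived while we were
# processing, without letting a busy ring starve the rest of the event loop.
def each_completion(ring, max_rounds = 4, & : Pointer(LibC::IoUringCqe) ->) : Nil
  head = ring.cq_local_head

  max_rounds.times do
    tail = ring.load_cq_ktail # acquire-ordered read of the shared tail
    break if head == tail

    while head != tail
      yield ring.cqe_at(head)
      head &+= 1
    end

    ring.store_cq_khead(head) # hand the consumed entries back to the kernel
  end
end
```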
The following enums are only used to enhance Crystal.trace.
def interrupt : Nil
  # the atomic makes sure we only write once (no need to write multiple times)
  @eventfd.write(1) if @interrupted.test_and_set
This is broken: there is no @eventfd.
Well, someone (cough) has already done that job, though I can definitely understand not wanting the extra dependency. That said, there is a flag that can be passed when building liburing that doesn't inline anything (thanks Rust people!), but assuming that build to be available is optimistic, unless we build it ourselves.
(I will look at the actual code later, but a take on it that might perhaps inspire is https://github.com/yxhuvud/nested_scheduler_io_uring_context/blob/main/src/nested_scheduler/io_uring_context.cr, which is a plugin for nested_scheduler to use io_uring. It is definitely broken in some aspects, not even counting the general shift of the codebase that has happened since the nested_scheduler set of monkeypatches worked. FWIW, the best part of nested_scheduler was how much it made it possible to clean up the specs.)
EDIT: Oh, and there is a nice io_uring Discord available if you want to bounce ideas with people. Some of your musings, like the close-fd parts, may have good ideas or at least answers there: https://discord.gg/T9WqsqPZ
Neat.
Regarding the TODO "push MORE things to the event-loop (mkdir, listen, bind, ...) 🤑": more importantly, the nonsocket file write, read, fsync, fstat, etc. that live in FileDescriptor.
end

def finalize
  close
drain could perhaps be necessary. But perhaps it is like exit and flushing writes 🤷
Let's err on the safe side and say not. We'll need the ability to drain a ring if we want to shutdown an execution context or thread anyway.
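A very rough sketch of what such a drain could look like, all helper names invented:

```crystal
# Sketch: stop producing new SQEs, then keep waiting for completions until
# nothing is in flight anymore, so the ring can be torn down without losing
# CQEs for operations still running in the kernel.
def drain(ring) : Nil
  until ring.inflight.zero?
    ring.enter(wait_nr: 1_u32) # block in io_uring_enter(2) for at least one CQE
    ring.each_completion do |cqe|
      ring.complete(cqe) # resume/notify whoever was waiting on this operation
    end
  end
end
```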
IORING_FEAT_LINKED_FILE  = 1_u32 << 12
IORING_FEAT_REG_REG_RING = 1_u32 << 13

IORING_OP_NOP = 0_u32
I find this weird, as the op field in the struct is a u8 and not a u32, but the weirdness is also present in liburing, so I guess it doesn't matter. The same confusion exists in SQE_FLAGS.
The age-old question of whether to mirror the C files' structure or to use properly sized enums, I guess. Compare:
enum Op : UInt8
  NOP
  READV
  ..
end
Which may be a bit less prone to copy-pasta errors as long as the order is kept correct.
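For what it's worth, a tiny self-contained illustration of the typed-enum variant (values follow the kernel's io_uring_op order), which would also give readable names in Crystal.trace output for free:

```crystal
enum Opcode : UInt8
  NOP    # 0
  READV  # 1
  WRITEV # 2
  FSYNC  # 3
end

op = Opcode::READV
puts op.value # => 1, fits the SQE's UInt8 opcode field directly
puts op       # => READV, handy for tracing
```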
def delete_timer(event : Event*) : Nil
  sqe = @ring.next_sqe
  sqe.value.opcode = LibC::IORING_OP_TIMEOUT_REMOVE
  sqe.value.flags = LibC::IOSQE_CQE_SKIP_SUCCESS
I have a hard time seeing that this is enough requests to matter either way (though I am open to being shown wrong).
Isn't it easier to just put a user_data on it that triggers a nop? I used 0 for this. Or check the CQE result for the canceled result, if that is what you are trying to avoid?
It's really just a "don't even bother pushing a CQE I don't care about".
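Side by side, the two options discussed here would look roughly like this; the ring helper and argument names are stand-ins, not the PR's API:

```crystal
# Sketch: remove a pending timeout identified by the user_data of its
# original TIMEOUT SQE, either suppressing the success CQE entirely or
# tagging it with a sentinel the completion loop ignores.
def remove_timer(ring, timer_user_data : UInt64, skip_cqe : Bool) : Nil
  sqe = ring.next_sqe
  sqe.value.opcode = LibC::IORING_OP_TIMEOUT_REMOVE
  sqe.value.addr = timer_user_data

  if skip_cqe
    # 1. don't even post a CQE when the removal succeeds
    sqe.value.flags = LibC::IOSQE_CQE_SKIP_SUCCESS
  else
    # 2. post the CQE but tag it so the completion loop skips it (the
    #    reviewer used user_data == 0 as the "nothing to do" marker)
    sqe.value.user_data = 0_u64
  end
end
```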
  async_poll(socket, LibC::POLLOUT, socket.@write_timeout) { "Write timed out" }
end

def accept(socket : ::Socket) : ::Socket::Handle?
If the socket is in nonblocking mode, then this will give you an EAGAIN if there isn't an ongoing connection attempt, which in practice means it needs to fall back to wait_readable in that case.
Nope. It just blocks. This is how I noticed that I need to shutdown a socket before close, because accept would NEVER return a CQE in one of the std specs.
Are you certain you didn't open the socket in blocking mode? Because I know for a fact that I spent a long time swearing over that behavior when opening it in nonblocking mode as it was essentially forcing me to either change the default mode or to do a wait_readable-loop instead of just issuing accept/read etc.
But perhaps it was changed at some point as it isn't very useful..
Well, sockets are always nonblocking by default. I had two cases in the std specs where the fiber would never be resumed:
- spec/std/io/io_spec.cr:104 (read blocks)
- spec/std/socket/unix_server_spec.cr:98 (accept blocks)
I just checked and they're non-blocking, and I never get a CQE until I shutdown the sockfd or manually cancel the operation.
I replicated the UnixServer#accept spec for TCPServer, and same behavior: the socket has O_NONBLOCK and io_uring blocks on accept.
Ok. That is really weird, but I guess the weirdness is in the right direction so it is fine 😅
Oh, good find! That'd explain it. I knew about that thread but that was before it had a million replies and an actual resolution :)
Hmm, I wonder if this has made the close-situation worse? 🤔
I'm not sure I follow the thread correctly. It might have been fixed, or they expect the fd to not have O_NONBLOCK, or they only fixed something in IORING_OP_POLL 🤷
Since O_NONBLOCK is at best pointless with io_uring and at worst can cause backward compatibility issues with older Linux kernels, we might want to introduce a method on Crystal::EventLoop to set O_NONBLOCK for new file descriptors (open, socket, accept, dup) when the backend needs it (poll, epoll, kqueue) and do nothing otherwise (IOCP, io_uring).
That would still allow an epoll fallback, when io_uring isn't available, to set O_NONBLOCK. Same for cases where the fd comes from an external library, maybe?
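A rough sketch of what such a hook could look like; the method name and the default/override split are invented here for discussion, not the PR's (or stdlib's) API:

```crystal
# Hypothetical hook: completion-based backends (io_uring, IOCP) keep fds
# blocking, readiness-based backends flip them to O_NONBLOCK.
module Crystal
  abstract class EventLoop
    # Called right after open/socket/accept/dup returns a new fd.
    def configure_blocking(fd : Int32) : Nil
      # default: nothing to do
    end
  end

  abstract class EventLoop::Polling < EventLoop
    # poll/epoll/kqueue must never block the thread on a read/write/accept:
    def configure_blocking(fd : Int32) : Nil
      flags = LibC.fcntl(fd, LibC::F_GETFL, 0)
      LibC.fcntl(fd, LibC::F_SETFL, flags | LibC::O_NONBLOCK)
    end
  end
end
```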
Yeah, it might not be helping with close. I'll have to check.
At least
Ah, I got confused by the *_fully methods.
Said differently: we don't need the sys/uio header just to bring the iovec struct because sys/socket shall define it.
Initial attempt at writing an EventLoop backed by io_uring for Linux targets.
It requires a number of features that were implemented in different versions of the kernel (I'm not sure exactly which), one of the most important being "don't drop any CQE as long as there is memory" 😅
It supports SQPOLL (implicit submissions) to further reduce the number of syscalls.
Unlike the polling EventLoop (libevent, epoll, kqueue) it's fully async, meaning that any attempt to read or write will yield the current fiber, regardless of whether there might already have been something to read. Expect lots of fiber yields!
WARNING: THREAD UNSAFE! MT will segfault. Only try it with a single thread 💣 💣 💣
PREREQUISITES:
- Crystal::EventLoop#reopened(FileDescriptor) hook #15640
- Crystal::EventLoop#close do the actual close (not just cleanup) #15641
- O_NONBLOCK (see Add io_uring event loop #15634 (comment))
- blocking arg on File (?)

TODO:
- .remove_impl(FileDescriptor|Socket) so we don't leak fibers

BONUSES (over Epoll):
FOLLOW UP:
- ... (blocking arg);
- push more things to the event loop (open, fsync, fstat, mkdir, link, listen, bind, ...) 🤑

NOTES:
- I didn't use liburing to avoid bringing an external dependency; plus, the io_uring_prep_* functions are inlined in the io_uring.h header and would have had to be rewritten anyway. The Crystal::System::IoUring struct does most of the job, then Crystal::EventLoop::IoUring directly fills the SQEs.
- Adding support for MT and EC involves a few hurdles. See #10740 (comment)