[BUG] I/O error on Linux kernel with 64KiB base page size #450

Open
flx42 opened this issue Jun 10, 2024 · 0 comments

Describe the bug
The aio reader code assumes that direct I/O operations only ever need to be aligned to 4096 bytes:
https://github.com/NVIDIA-Merlin/HugeCTR/blob/v24.04.00/HugeCTR/src/data_readers/multi_hot/detail/aio_context.cpp#L122
However, the required alignment depends on multiple factors, such as the filesystem type and the base page size of the kernel.

On systems with a 64KiB page size, direct I/O on the Lustre filesystem fails if accesses are only aligned to 4096 bytes. From the Lustre manual:
https://doc.lustre.org/lustre_manual.xhtml#performing_directio

"Applications using the read() and write() calls must supply buffers aligned on a page boundary (usually 4 K). If the alignment is not correct, the call returns -EINVAL."

However, on the same system, a 4KiB alignment might still work on another filesystem, for example ext4 on NVMe SSDs.

If we expect that I/O operations will always read very large chunks of the file, then always setting the alignment to the page size should be fine, so we can use sysconf(_SC_PAGESIZE) instead of 4096. If we expect smaller reads, then using a larger-than-necessary alignment could be detrimental to performance, e.g. having to read 64KiB when you only want 1KiB. In this latter case, the code might need to figure out the alignment requirements at runtime, or expose knobs so that users can override the alignment if their application crashes.
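
A minimal sketch of the first option, combined with a user override. This is not the HugeCTR implementation: the environment variable name `HUGECTR_DIO_ALIGNMENT` and the helper names are hypothetical and only illustrate the idea of replacing the hard-coded 4096 with a value chosen at runtime.

```cpp
#include <unistd.h>

#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Returns the alignment to use for direct I/O offsets, lengths and buffers.
// Assumption: HUGECTR_DIO_ALIGNMENT is a hypothetical override knob, it does
// not exist in HugeCTR today. Only non-zero powers of two are accepted.
static std::size_t direct_io_alignment() {
  if (const char* s = std::getenv("HUGECTR_DIO_ALIGNMENT")) {
    char* end = nullptr;
    const unsigned long long v = std::strtoull(s, &end, 10);
    const bool parsed = (end != s) && (*end == '\0');
    if (parsed && v != 0 && (v & (v - 1)) == 0) {
      return static_cast<std::size_t>(v);
    }
  }
  // Default: the kernel base page size (65536 on a 64KiB-page kernel).
  const long page = sysconf(_SC_PAGESIZE);
  return page > 0 ? static_cast<std::size_t>(page) : 4096;
}

// Round x down to a multiple of a (a must be a power of two).
static std::uint64_t align_down(std::uint64_t x, std::size_t a) {
  return x & ~(static_cast<std::uint64_t>(a) - 1);
}

// Round x up to a multiple of a (a must be a power of two).
static std::uint64_t align_up(std::uint64_t x, std::size_t a) {
  const std::uint64_t mask = static_cast<std::uint64_t>(a) - 1;
  return (x + mask) & ~mask;
}
```

With a 64KiB alignment, a 1KiB read at offset 70000 is widened to the range [align_down(70000, 65536), align_up(71024, 65536)) = [65536, 131072), which is the read amplification described above. The destination buffer would also need the same alignment, for example via posix_memalign.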

To Reproduce
On a system with 64KiB pages, try loading a file from a Lustre filesystem with HugeCTR. It fails with:

terminate called after throwing an instance of 'std::runtime_error'
  what():  io_getevents returned failed event: Invalid argument