
Implement Numa affinity for worker threads #1130

Merged: 15 commits into NVIDIA:main on Dec 16, 2023

Conversation

maikel (Collaborator) commented Nov 1, 2023

This PR introduces NUMA awareness for the static_thread_pool.

Summary of Changes

  • There is a new CMake option STDEXEC_ENABLE_NUMA to explicitly opt in to NUMA awareness. Note that stdexec must be linked against libnuma when this option is enabled. We might also introduce a dedicated CMake target for exec or for exec::static_thread_pool.
  • The constructor of static_thread_pool takes an additional pointer to exec::numa_policy, which defaults to exec::get_default_numa_policy(). This NUMA policy defines a distribution mapping from a thread index to a NUMA node and provides a member function to bind the current thread to a specified NUMA node. If STDEXEC_ENABLE_NUMA is false, the NUMA policy does nothing. (A brief usage sketch follows this list.)
  • We use a stateful exec::numa_allocator<T> that allocates memory on a specified NUMA node.
  • New member functions static_thread_pool::get_scheduler_on(numa_node_mask) and static_thread_pool::get_scheduler_on(cpu_mask) return schedulers that schedule under the specified constraints.
  • Worker threads first try to steal from other worker threads on their own NUMA node. When stealing fails a certain number of times, every other worker thread becomes a stealing target.
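
As a rough illustration of how these pieces fit together, here is a minimal usage sketch. It assumes the names mentioned above (exec::static_thread_pool, exec::bwos_params, exec::get_default_numa_policy(), exec::numa_allocator<T>) and that the allocator is constructed from a node index; the exact signatures in the PR may differ.

    #include <vector>

    #include "exec/static_thread_pool.hpp"

    int main() {
      // Distribute 8 worker threads over NUMA nodes according to the default
      // policy (the default arguments are spelled out for clarity).
      exec::static_thread_pool pool{8, exec::bwos_params{}, exec::get_default_numa_policy()};

      // Stateful allocator that places its memory on NUMA node 0
      // (a node-index constructor is assumed here).
      exec::numa_allocator<int> alloc{0};
      std::vector<int, exec::numa_allocator<int>> data(1024, 0, alloc);

      // ... submit work to the pool ...
      return 0;
    }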

API Design

This PR introduces the numa_policy interface, which is defined as

  struct numa_policy {
    // Number of NUMA nodes available to the pool.
    virtual std::size_t num_nodes() = 0;
    // Number of CPUs on the given node.
    virtual std::size_t num_cpus(int node) = 0;
    // Bind the calling thread to the given node.
    virtual int bind_to_node(int node) = 0;
    // Map a worker-thread index to the NUMA node it should run on.
    virtual std::size_t thread_index_to_node(std::size_t index) = 0;
  };
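
For illustration, here is a minimal sketch of a policy that does nothing, roughly the behaviour described above for builds with STDEXEC_ENABLE_NUMA turned off. It is not the PR's actual fallback implementation; the class name and return-value conventions are assumptions.

    #include <cstddef>
    #include <thread>

    struct no_numa_policy : numa_policy {
      // Pretend the machine is a single node.
      std::size_t num_nodes() override { return 1; }
      // Report all hardware threads as belonging to that node.
      std::size_t num_cpus(int) override { return std::thread::hardware_concurrency(); }
      // Binding is a no-op.
      int bind_to_node(int) override { return 0; }
      // Every worker thread maps to node 0.
      std::size_t thread_index_to_node(std::size_t) override { return 0; }
    };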

The thread pool takes a pointer to numa_policy as an optional argument to customize the distribution of worker threads to NUMA nodes.

The static_thread_pool provides the following ways to obtain a scheduler with particular properties:

    // Returns a scheduler without any constraints
    scheduler get_scheduler() noexcept;

    // Returns a scheduler that schedules on a specific worker thread (enumerated from 0...N-1)
    scheduler get_scheduler_on_thread(std::size_t threadIndex) noexcept;
    
    // Returns a scheduler that schedules only on worker threads that run on one of the specified nodes
    scheduler get_constrained_scheduler(const nodemask& constraints) noexcept;
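
As a short usage sketch of these accessors, the snippet below obtains each kind of scheduler and runs a small task on the constrained one. Only the three accessors above come from the PR text; the nodemask construction (default construction plus a set(node) member) and the header paths are assumptions.

    #include <utility>

    #include "exec/static_thread_pool.hpp"
    #include "stdexec/execution.hpp"

    int main() {
      exec::static_thread_pool pool{28};

      // Unconstrained scheduler.
      auto any_sched = pool.get_scheduler();

      // Scheduler pinned to worker thread 0.
      auto thread0_sched = pool.get_scheduler_on_thread(0);

      // Scheduler restricted to workers running on NUMA node 1.
      exec::nodemask mask{};   // hypothetical construction
      mask.set(1);             // hypothetical member function
      auto node1_sched = pool.get_constrained_scheduler(mask);

      // Run a small task on the constrained scheduler.
      auto work = stdexec::schedule(node1_sched)
                | stdexec::then([] { return 42; });
      auto [value] = stdexec::sync_wait(std::move(work)).value();
      return value == 42 ? 0 : 1;
    }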

Benchmarks

We ran a benchmark on a machine with two NUMA nodes; each node has 14 cores (28 hardware threads).

  Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  14
    Socket(s):           2

We ran the nested schedule benchmark from the previous PR and observe better scaling with NUMA affinity enabled.
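
For readers unfamiliar with that benchmark, the sketch below shows the general shape of a nested-schedule workload (an outer task that itself schedules an inner task on the same pool). This is only an interpretation for illustration, not the benchmark code that produced the numbers below.

    #include <utility>

    #include "exec/static_thread_pool.hpp"
    #include "stdexec/execution.hpp"

    int main() {
      exec::static_thread_pool pool{28};
      auto sched = pool.get_scheduler();

      // Each outer task schedules an inner task on the same scheduler.
      auto nested = stdexec::schedule(sched)
                  | stdexec::let_value([sched] {
                      return stdexec::schedule(sched) | stdexec::then([] { return 1; });
                    });
      auto [n] = stdexec::sync_wait(std::move(nested)).value();
      return n == 1 ? 0 : 1;
    }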

The OS scheduler does an excellent job when fewer than 28 threads are used.

[Figure: numa_vs_non_numa, throughput scaling with and without NUMA affinity]

Max throughput

  Without NUMA    With NUMA
  5.27e+08        6.40e+08

This is an improvement of roughly 20% (6.40e+08 / 5.27e+08 ≈ 1.21).

copy-pr-bot bot commented Nov 1, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mfbalin commented Nov 1, 2023

@maikel I can run this code if you need numa machines to benchmark on.

ericniebler (Collaborator) commented:

/ok to test

maikel (Collaborator, Author) commented Nov 2, 2023

> @maikel I can run this code if you need numa machines to benchmark on.

My current test machine has 2 NUMA nodes:

  Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  14
    Socket(s):           2

I will contact you for more data once I'm done with an initial attempt.

ericniebler (Collaborator) commented:

/ok to test

ericniebler (Collaborator) commented:

/ok to test

Review thread on the static_thread_pool constructor:

    static_thread_pool(
      std::uint32_t threadCount,
      bwos_params params = {},
      numa_policy* numa = get_numa_policy());
Collaborator commented:

i'm not a fan of raw pointers in public interfaces. why does numa_policy need to be dynamically polymorphic?

Collaborator (PR author) replied:

Would you prefer a template parameter for the NUMA policy? There is actually no reason to use type erasure when all member functions are defined inline anyway. That way I wouldn't need to hard-code an allocator either.
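
To make the discussed alternative concrete, here is a rough sketch of a policy-as-template-parameter design. It only illustrates the idea; default_numa_policy, the class name, and the constructor shape are hypothetical, and this is not what the PR merged.

    #include <cstddef>
    #include <cstdint>
    #include <utility>

    // Hypothetical value-semantic policy used as the default.
    struct default_numa_policy {
      std::size_t num_nodes() { return 1; }
      std::size_t num_cpus(int) { return 1; }
      int bind_to_node(int) { return 0; }
      std::size_t thread_index_to_node(std::size_t) { return 0; }
    };

    template <class NumaPolicy = default_numa_policy>
    class basic_static_thread_pool {
     public:
      explicit basic_static_thread_pool(std::uint32_t threadCount, NumaPolicy numa = NumaPolicy{})
        : numa_(std::move(numa)) {
        // Worker start-up would call numa_.bind_to_node(numa_.thread_index_to_node(i))
        // directly; the calls can be inlined, and no raw pointer or virtual
        // dispatch appears in the public interface.
        (void) threadCount;
      }

     private:
      NumaPolicy numa_;
    };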

maikel (Collaborator, Author) commented Dec 15, 2023

We have discussed that the planned improvements to this will happen in a follow-up PR.

maikel marked this pull request as ready for review on December 15, 2023 at 08:20
ericniebler (Collaborator) commented:

/ok to test

ericniebler merged commit e7cd275 into NVIDIA:main on Dec 16, 2023
13 checks passed