
Implement Numa affinity for worker threads #1130

Merged: 15 commits into NVIDIA:main on Dec 16, 2023

Conversation

maikel (Collaborator) commented Nov 1, 2023

This PR introduces NUMA awareness for the static_thread_pool.

Summary of Changes

  • There is a new CMake option STDEXEC_ENABLE_NUMA to explicitly opt in to NUMA awareness. Note that stdexec must be linked against libnuma when this option is enabled. We might also introduce a dedicated CMake target for exec or for exec::static_thread_pool.
  • The constructor of static_thread_pool takes an additional pointer to exec::numa_policy, which defaults to exec::get_default_numa_policy(). This NUMA policy defines a distribution mapping from a thread index to a NUMA node and provides a member function to bind the current thread to a specified NUMA node. If STDEXEC_ENABLE_NUMA is false, the NUMA policy does nothing. (A brief usage sketch follows this list.)
  • We use a stateful exec::numa_allocator<T> that allocates memory on a specified NUMA node.
  • New member functions static_thread_pool::get_scheduler_on(numa_node_mask) and static_thread_pool::get_scheduler_on(cpu_mask) return schedulers that schedule under the specified constraints.
  • Worker threads first try to steal from other worker threads on their own NUMA node. When stealing fails a certain number of times, every other worker thread becomes a stealing target.
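
As a rough illustration of how these pieces fit together, here is a minimal usage sketch. It assumes the names mentioned above (exec::static_thread_pool, exec::bwos_params, exec::get_default_numa_policy(), exec::numa_allocator<T>) and that the allocator is constructed from a node index; the exact signatures in the PR may differ.

    #include <vector>

    #include "exec/static_thread_pool.hpp"

    int main() {
      // Distribute 8 worker threads over NUMA nodes according to the default
      // policy (the default arguments are spelled out for clarity).
      exec::static_thread_pool pool{8, exec::bwos_params{}, exec::get_default_numa_policy()};

      // Stateful allocator that places its memory on NUMA node 0
      // (a node-index constructor is assumed here).
      exec::numa_allocator<int> alloc{0};
      std::vector<int, exec::numa_allocator<int>> data(1024, 0, alloc);

      // ... submit work to the pool ...
      return 0;
    }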

API Design

This PR introduces the numa_policy interface, which is defined as

  struct numa_policy {
    // Number of NUMA nodes available to the pool.
    virtual std::size_t num_nodes() = 0;
    // Number of CPUs on the given node.
    virtual std::size_t num_cpus(int node) = 0;
    // Bind the calling thread to the given node.
    virtual int bind_to_node(int node) = 0;
    // Map a worker-thread index to the NUMA node it should run on.
    virtual std::size_t thread_index_to_node(std::size_t index) = 0;
  };
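
For illustration, here is a minimal sketch of a policy that does nothing, roughly the behaviour described above for builds with STDEXEC_ENABLE_NUMA turned off. It is not the PR's actual fallback implementation; the class name and return-value conventions are assumptions.

    #include <cstddef>
    #include <thread>

    struct no_numa_policy : numa_policy {
      // Pretend the machine is a single node.
      std::size_t num_nodes() override { return 1; }
      // Report all hardware threads as belonging to that node.
      std::size_t num_cpus(int) override { return std::thread::hardware_concurrency(); }
      // Binding is a no-op.
      int bind_to_node(int) override { return 0; }
      // Every worker thread maps to node 0.
      std::size_t thread_index_to_node(std::size_t) override { return 0; }
    };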

The thread pool takes a pointer to numa_policy as an optional argument to customize the distribution of worker threads to NUMA nodes.

The static_thread_pool provides the following ways to obtain a scheduler with particular properties:

    // Returns a scheduler without any constraints
    scheduler get_scheduler() noexcept;

    // Returns a scheduler that schedules on a specific worker thread (enumerated from 0...N-1)
    scheduler get_scheduler_on_thread(std::size_t threadIndex) noexcept;
    
    // Returns a scheduler that schedules only on worker threads that run on one of the specified nodes
    scheduler get_constrained_scheduler(const nodemask& constraints) noexcept;
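
As a short usage sketch of these accessors, the snippet below obtains each kind of scheduler and runs a small task on the constrained one. Only the three accessors above come from the PR text; the nodemask construction (default construction plus a set(node) member) and the header paths are assumptions.

    #include <utility>

    #include "exec/static_thread_pool.hpp"
    #include "stdexec/execution.hpp"

    int main() {
      exec::static_thread_pool pool{28};

      // Unconstrained scheduler.
      auto any_sched = pool.get_scheduler();

      // Scheduler pinned to worker thread 0.
      auto thread0_sched = pool.get_scheduler_on_thread(0);

      // Scheduler restricted to workers running on NUMA node 1.
      exec::nodemask mask{};   // hypothetical construction
      mask.set(1);             // hypothetical member function
      auto node1_sched = pool.get_constrained_scheduler(mask);

      // Run a small task on the constrained scheduler.
      auto work = stdexec::schedule(node1_sched)
                | stdexec::then([] { return 42; });
      auto [value] = stdexec::sync_wait(std::move(work)).value();
      return value == 42 ? 0 : 1;
    }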

Benchmarks

We ran a benchmark on a machine with two NUMA nodes; each node has 14 cores (28 hardware threads).

  Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  14
    Socket(s):           2

We ran the nested schedule benchmark from the previous PR and observe better scaling with NUMA affinity enabled.
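
For readers unfamiliar with that benchmark, the sketch below shows the general shape of a nested-schedule workload (an outer task that itself schedules an inner task on the same pool). This is only an interpretation for illustration, not the benchmark code that produced the numbers below.

    #include <utility>

    #include "exec/static_thread_pool.hpp"
    #include "stdexec/execution.hpp"

    int main() {
      exec::static_thread_pool pool{28};
      auto sched = pool.get_scheduler();

      // Each outer task schedules an inner task on the same scheduler.
      auto nested = stdexec::schedule(sched)
                  | stdexec::let_value([sched] {
                      return stdexec::schedule(sched) | stdexec::then([] { return 1; });
                    });
      auto [n] = stdexec::sync_wait(std::move(nested)).value();
      return n == 1 ? 0 : 1;
    }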

The OS scheduler does an excellent job when fewer than 28 threads are used.

[Figure: numa_vs_non_numa, throughput scaling with and without NUMA affinity]

Max throughput

  Without NUMA    With NUMA
  5.27e+08        6.40e+08

This is an improvement of roughly 20% (6.40e+08 / 5.27e+08 ≈ 1.21).

copy-pr-bot bot commented Nov 1, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mfbalin commented Nov 1, 2023

@maikel I can run this code if you need numa machines to benchmark on.

ericniebler (Collaborator) commented:

/ok to test

maikel (Collaborator, Author) commented Nov 2, 2023

> @maikel I can run this code if you need numa machines to benchmark on.

My current test machine has 2 NUMA nodes:

  Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  14
    Socket(s):           2

I will contact you for more data once I'm done with an initial attempt.

ericniebler (Collaborator) commented:

/ok to test

ericniebler (Collaborator) commented:

/ok to test

Review thread on the static_thread_pool constructor:

    static_thread_pool(
      std::uint32_t threadCount,
      bwos_params params = {},
      numa_policy* numa = get_numa_policy());
Collaborator commented:

i'm not a fan of raw pointers in public interfaces. why does numa_policy need to be dynamically polymorphic?

Collaborator (PR author) replied:

Would you prefer a template parameter for the NUMA policy? There is actually no reason to use type erasure when all member functions are defined inline anyway. That way I wouldn't need to hard-code an allocator either.
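
To make the discussed alternative concrete, here is a rough sketch of a policy-as-template-parameter design. It only illustrates the idea; default_numa_policy, the class name, and the constructor shape are hypothetical, and this is not what the PR merged.

    #include <cstddef>
    #include <cstdint>
    #include <utility>

    // Hypothetical value-semantic policy used as the default.
    struct default_numa_policy {
      std::size_t num_nodes() { return 1; }
      std::size_t num_cpus(int) { return 1; }
      int bind_to_node(int) { return 0; }
      std::size_t thread_index_to_node(std::size_t) { return 0; }
    };

    template <class NumaPolicy = default_numa_policy>
    class basic_static_thread_pool {
     public:
      explicit basic_static_thread_pool(std::uint32_t threadCount, NumaPolicy numa = NumaPolicy{})
        : numa_(std::move(numa)) {
        // Worker start-up would call numa_.bind_to_node(numa_.thread_index_to_node(i))
        // directly; the calls can be inlined, and no raw pointer or virtual
        // dispatch appears in the public interface.
        (void) threadCount;
      }

     private:
      NumaPolicy numa_;
    };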

maikel (Collaborator, Author) commented Dec 15, 2023

We have discussed that the planned improvements to this will happen in a follow-up PR.

maikel marked this pull request as ready for review on December 15, 2023 at 08:20
ericniebler (Collaborator) commented:

/ok to test

ericniebler merged commit e7cd275 into NVIDIA:main on Dec 16, 2023
13 checks passed