Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit #415

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

caiomcbr
Copy link
Contributor

No description provided.

@@ -11,6 +11,8 @@

namespace mscclpp {

constexpr int MAX_INFLIGHT_REQUEST = 500;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the default CQ size. It's a constant number or can be configured by users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value is 1024, depending on the hardware the value can be bigger, for example the Mellanox HCA can support up to 2048. So the user can configure a different value, but must be small than the hardware specification.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value is 1024, depending on the hardware the value can be bigger, for example the Mellanox HCA can support up to 2048. So the user can configure a different value, but must be small than the hardware specification.

Can we make this value configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we are using the default value every time that we create IB connection at the executor:
image

Maybe we could use some env variable to allow the user change it.

@@ -85,6 +93,7 @@ ProxyHandlerResult ProxyService::handleTrigger(ProxyTrigger triggerRaw) {
if (trigger->fields.type & TriggerSync) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we just say trigger->fields.type & TriggerSync || inflightRequests > MAX_INFLIGHT_REQUEST here to merge the same block above

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current way might have performance issue. When we call flush, it does busy polling until all inflight requests appear in CQ. This might waste some CPU cycles which instead could be used for submitting new requests. A better logic could be:

while (proxy_thread_running) {
  if (there_are_new_requests_to_submit) submit_requests();
  inflight += num_submitted_requests;
  if (there_are_new_requests_completed) complete_requests();
  inflight -= num_completed_requests;
}

But current connection::flush api doesn't do this. Perhaps we need to expose something like https://github.com/microsoft/mscclpp/blob/main/src/ib.cc#L288 to proxy thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants