Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify the releasing Zap endpoints from IO thread process #1303

Merged
merged 1 commit into from
Jan 5, 2024

Conversation

nichamon
Copy link
Collaborator

@nichamon nichamon commented Dec 4, 2023

The patch separates the process of reducing the number of Zap endpoints associated with an IO thread from the process of releasing the endpoint from the IO thread. Additionally, it defers the nullification of zap->thread until the endpoint reference reaches zero.

zap_sock and zap_ugni need to remove their events from epoll before delivering the disconnected event to applications, preventing the reception of additional epoll events. However, without this patch, this results in resetting zap->thread to NULL. Thus, the applications cannot retrive Zap thread information in the disconnected path. The patch addresses this limitation.

@nichamon
Copy link
Collaborator Author

nichamon commented Dec 4, 2023

The patch is a part of the bigger patch to improve the thread_stats command.

@nichamon nichamon force-pushed the zap-ep-release branch 2 times, most recently from abbc8eb to 2daaf53 Compare December 4, 2023 18:51
@@ -478,14 +478,23 @@ double zap_env_dbl(char *name, double default_value);
zap_err_t zap_io_thread_ep_assign(zap_ep_t ep, int tpi);

/**
* Release \c ep from the zap io thread.
* Disable \c ep assigned to the zap io thread.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The explanation on zap_io_thread_ep_release() and zap_io_thread_ep_remove() are confusing to me. Not sure how to make it better though ... maybe just saying release means the endpoint won't be processed by the thread any more (i.e. no more event callbacks to the application) but the endpoint still keep a reference to the thread for further stat data access, while the remove means disassociate the endpoint from the thread and reducing thead's ep load? I'm not sure ...

@narategithub
Copy link
Collaborator

@nichamon I think the code looks OK to me. Just a nit on the function documentation.

The patch separates the process of reducing the number of Zap endpoints
associated with an IO thread from the process of releasing the endpoint
from the IO thread. Additionally, it defers the nullification of
`zap->thread` until the endpoint reference reaches zero.

`zap_sock` and `zap_ugni` need to remove their events from epoll before
delivering the disconnected event to applications, preventing the
reception of additional epoll events. However, without this patch, this
results in resetting `zap->thread` to NULL. Thus, the applications
cannot retrive Zap thread information in the disconnected path. The
patch addresses this limitation.
@tom95858
Copy link
Collaborator

I think this creates a race between disconnect and ldms_xprt_update. With this change isn't it possible to have an ldms_xprt_update succeed after a disconnect?

@nichamon
Copy link
Collaborator Author

nichamon commented Jan 2, 2024

I think this creates a race between disconnect and ldms_xprt_update. With this change isn't it possible to have an ldms_xprt_update succeed after a disconnect?

The patch keeps the existing logic for event removal from the epoll by zap_sock, zap_ugni, and zap_fabric, so it should not cause a race between the disconnected and update complete events. I also went through (my code inspection) every if-condition that checks for ep->thread whether it would cause any race condition by the patch but I didn't see any problems.

I made this patch so that Zap's applications can obtain the IO thread statistics of a Zap endpoint in the disconnected path.

@narategithub has suggested another change to facilitate this. The suggestion was to add a pointer to the zap_thrstat structure to the zap_ep structure so that Zap could retrieve the thread stats when ep->thread is NULL. However, this could mess up the reference count when an endpoint has been released from the IO thread and t->_n_eps and t->stats->n_ep have been decremented; although, the endpoint is still able to access the thread statistics, which is a part of the thread structure. Note that t->_n_eps and t->stats->n_ep are always in sync.

@tom95858
Copy link
Collaborator

tom95858 commented Jan 4, 2024

@nichamon thank you for looking into this...

@tom95858
Copy link
Collaborator

tom95858 commented Jan 4, 2024

Is everyone happy with this??

@narategithub
Copy link
Collaborator

@tom95858 I think this patch looks OK to me.

@tom95858 tom95858 merged commit 5e80cd8 into ovis-hpc:OVIS-4 Jan 5, 2024
14 checks passed
@nichamon nichamon deleted the zap-ep-release branch January 5, 2024 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants