Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xdsclient: ensure xDS node ID in included in NACK and connectivity errors #8103

Merged
merged 2 commits into from
Feb 20, 2025

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Feb 18, 2025

This PR modifies existing tests for NACK and connectivity errors, to ensure that the node ID is reported to the watcher in these cases.

Addresses bullet points 2 and 3 of #7931.

RELEASE NOTES: none

@easwars easwars requested a review from dfawley February 18, 2025 13:52
@easwars easwars added the Type: Feature New features or improvements in behavior label Feb 18, 2025
@easwars easwars added this to the 1.71 Release milestone Feb 18, 2025
Copy link

codecov bot commented Feb 18, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.21%. Comparing base (8528f43) to head (626e2e2).
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8103      +/-   ##
==========================================
- Coverage   82.24%   82.21%   -0.04%     
==========================================
  Files         387      387              
  Lines       38967    38947      -20     
==========================================
- Hits        32048    32019      -29     
- Misses       5599     5606       +7     
- Partials     1320     1322       +2     

see 24 files with indirect coverage changes

@easwars easwars requested review from purnesh42H and removed request for dfawley February 18, 2025 17:30
@easwars easwars assigned purnesh42H and unassigned dfawley Feb 18, 2025
@arjan-bal arjan-bal modified the milestones: 1.71 Release, 1.72 Release Feb 19, 2025
// To prevent indefinite blocking during xDS client close, which is caused
// by a blocking backoff channel write, cancel the test context early given
// that the test is complete.
cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain this more why we need to cancel explicitly for this test even though we have defer cancel() above? Why don't we need to do it on line 152 and line 133 as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way the ADS stream implementation works is as follows:

  • The runner goroutine runs the following logic with backoff. See: https://github.com/grpc/grpc-go/blob/master/xds/internal/xdsclient/transport/ads/ads_stream.go#L249. It uses the backoff.RunF method.
    • Creates a new ADS stream
    • Sends the stream on a channel that is being read by the send goroutine, to send messages on the stream
    • Calls the recv method to read messages from the stream
  • Typically, the backoff function returns immediately, but returns a duration for which to backoff.
  • The backoff.RunF method has a select on the context that it was passed and a timer created for the backoff duration. So, when the context is canceled while it is waiting for backoff to expire, the select will ensure that the function exits immediately (instead of waiting for backoff to expire).
  • So, when the xdschannel is being closed, it triggers a close of the ADS stream implementation, which cancels the context passed to the backoff.RunF method.
  • Therefore things exit cleanly even when in backoff.

But in our test, the backoff function is actually blocked on a write to a channel, which is read by the test. This means that the the backoff.RunF method is blocked on calling the passed backoff function and not in the select. So, if we don't ensure that the backoff function passed to backoff.RunF returns early (which can be done by canceling the context that is uses), then this test will run for defaultTestTimeout duration.

I've changed the test to create a separate context for the backoff function and I only cancel that here. Let me know if this sounds ok to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks for explaining.

@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Feb 20, 2025
@easwars easwars merged commit c7db760 into grpc:master Feb 20, 2025
15 checks passed
@easwars easwars deleted the xds_node_id_in_nack_errors branch February 20, 2025 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants