Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

intraprocess bond::Bond::BondStatusCB use after free #4841

Open
wants to merge 8 commits into
base: ipc
Choose a base branch
from

Conversation

ewak
Copy link
Contributor

@ewak ewak commented Jan 10, 2025

The Bond mechanism includes creation of a subscription using a reference to a member function(bondStatusCB) of the Bond class.

This member function operates on member variables.

The lifecycle_node was calling bond_.reset() which releases the memory as far as the lifecycle_node
is concerned but this is not immediately released
from the rclcpp internal mechanisms (especially intraprocess). As a result the bondStatusCB function can called after it has been freed. This use after free shows up reliably with asan when running the test_bond test.

This change allows the test_bond to suceed by
calling bond_->breakBond() (rather than bond_.reset()) to break the bond rather than expecting it to
be done cleanly by the ~Bond() destructor.

Is it enough is TBC.

Other possibilities might be to get the Bond to
inherit from std::enable_shared_from_this(), as Ros2 Nodes do, so that the pointer to the Bond member function bondStatusCB function remains valid until the subscription is released during destruction.


Basic Info

Info Please fill out this column
Ticket(s) this addresses #4691
Primary OS tested on Ubuntu
Robotic platform tested on none
Does this PR contain AI generated software? no

Description of contribution in a few bullet points

Call bond_->breakBond() in on_deactivate rather than bond_.reset() to avoid use after free caused by dangling pointer held by rclcpp intraprocess subscription callback mechanism.

Description of documentation updates required from your changes

none


Future work that may be required in bullet points

  • Consider effect of only using breakBond() rather than reset().
  • Should LifeCycleNode::destroyBond() wait for some timeout for the bond->breakBond() to take affect using waitUntilBroken() ?
  • Figure out an improvement to rclcpp intraprocess subscription mechanism to check for memory lifetime of passed in callback function.
  • Maybe make changes to bondcpp to have Bond inherit from std::enable_shared_from_this() to extend the lifetime of the subscription callback function.
  • Consider creating the bond_ as a shared_ptr rather than a unique_ptr to allow the internal mechanisms of rclcpp to hold onto valid memory until they have a chance to cleanup properly.

For Maintainers:

  • Check that any new parameters added are updated in docs.nav2.org
  • Check that any significant change is added to the migration guide
  • Check that any new features OR changes to existing behaviors are reflected in the tuning guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins is added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

Copy link
Contributor

mergify bot commented Jan 10, 2025

@ewak, all pull requests must be targeted towards the main development branch.
Once merged into main, it is possible to backport to @ipc, but it must be in main
to have these changes reflected into new distributions.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 10, 2025

I thought about that potentially, but that is called in the Bond Destructor already when we reset the shared pointer [1], so that should be done first thing as the object is being destroyed. That was on my list to try to see if it virtually fixed the issue, but I think we should dig a bit deeper so that this issue doesn't come back to bite us in another place or other bond users can do this too.

Should LifeCycleNode::destroyBond() wait for some timeout for the bond->breakBond() to take affect using waitUntilBroken() ?

destroyBond resets the shared pointer which calls the ~Bond method, we should have already broken in the destructor and then waited. If it failed after waiting, we should have seen a log from ~Bond before .reset() returns.

You provide some example fixes for bond / how we use bond (use a shared pointer instead of a unique ptr; I do want to do more than just breakBond since that would still leave the timers on which I want completely off in deactivate). These are probably good ways to go that could resolve our issues to move forward quickly, but I think given the issues you report in rclcpp below:

but this is not immediately released from the rclcpp internal mechanisms (especially intraprocess).
As a result the bondStatusCB function can called after it has been freed. This use after free shows up reliably with asan when running the test_bond test.
Figure out an improvement to rclcpp intraprocess subscription mechanism to check for memory lifetime of passed in callback function.

It sounds like your confirming a bug in rclcpp for IPC handling that we should detail in the rclcpp ticket for a more general resolution ros2/rclcpp#2721. We shouldn't be having IPC callbacks triggered after the object containing them is destroyed. Its common to have helper objects composed into your node that have sub/pubs for various path, parameter, visualization, etc handling. Can you update with some detailed results that you found in that ticket for Alberto and Tomoya if there are potential action items for rclcpp IPC handling fixes that this uncovered (if any)?

[1] https://github.com/ros/bond_core/blob/ros2/bondcpp/src/bond.cpp#L163

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 11, 2025

Consider creating the bond_ as a shared_ptr rather than a unique_ptr to allow the internal mechanisms of rclcpp to hold onto valid memory until they have a chance to cleanup properly.

By the way, I tried this, it didn't work.

The rclcpp maintainers pointed to something they think is related / the cause in this comment ros2/rclcpp#2678 (comment)

@ewak
Copy link
Contributor Author

ewak commented Jan 11, 2025

See ros/bond_core#108 for changes that use enable_shared_from_this to ensure valid lifetime of the member function subscription callback. I also commented on ros2/rclcpp#2678 (comment) suggesting that a helper function like createSafeSubscriptionMemFuncCallback() could be added to the rclcpp namespace. It could have an intermediate home in nav2_utils too.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 13, 2025

I'm trying to understand better here the set of options from these PRs / issue tickets.

  1. Here in Nav2, I'm seeing that the issue is resolved by only breakBond() rather than destroying the object due to the rclcpp issue. So I'm a little unclear why the shared_ptr was migrated, unless the shared_ptr + the bondcpp PR resolves the issue. tl;dr: don't we need one or the other, for this limited scope problem? Edit: I did some testing, I don't think breakBond() is enough anymore - still seeing 4x system tests fail reliably

  2. In bondcpp, I see how that could resolve the topic within the thread from rclcpp, but wouldn't it be best to resolve it at the rclcpp level, since this only fixes bond? I assume we're on the same page about that

  3. In rclcpp, Why not have something like createSafeSubscriptionMemFuncCallback inside of rclcpp that is always used in create_subscription and/or adding stop_callbacks to rclcpp::Subscription's destructor so that resetting a subscription's shared pointer automatically handles this? I don't see that suggestion considered in the threads, which seems like the easiest, low-impact solution. Michael's suggestion of demoting everything, everywhere to weak_ptr would probably be best, but I don't think much functionally different.

@ewak
Copy link
Contributor Author

ewak commented Jan 14, 2025

I'll have a go at explaining what I currently think for each of your points,

1. Here in Nav2, I'm seeing that the issue is resolved by only `breakBond()` rather than destroying the object due to the rclcpp issue. So I'm a little unclear why the `shared_ptr` was migrated, unless the `shared_ptr` + the bondcpp PR resolves the issue. tl;dr: don't we need one or the other, for this limited scope problem?

You suggested earlier, #4841 (comment), and I was concerted that breakBond() is not enough as you also wanted the timers to be stopped. So I reverted that change.

Then for any of the the weak_ptr expiry checking mechanism to work the bond must be created(in nav2) as a shared_ptr(not a unique_ptr) to allow use of shared_from_this().

2. In bondcpp, I see how that could resolve the topic within the thread from `rclcpp`, but wouldn't it be best to resolve it at the `rclcpp` level, since this only fixes bond? I assume we're on the same page about that

I agree that it would be best to resolve the issue in rclcpp. With that in mind I explored a way that I thought might do that in a rclcpp PR ros2/rclcpp#2725. This has prompted some discussion but I think the actual changes I made are likely to be rejected in favor of a stop_callbacks mechanism.

3. In rclcpp, Why not have something like `createSafeSubscriptionMemFuncCallback` inside of rclcpp that is always used in `create_subscription` and/or adding `stop_callbacks` to `rclcpp::Subscription`'s destructor so that resetting a subscription's shared pointer automatically handles this? I don't see that suggestion considered in the threads, which seems like the easiest, low-impact solution. 

I think the argument against something like rclcpp::createSafeSubscriptionMemFuncCallback() is the onus is put on the user to use it.

IOW, The create_subscription API does not take an argument that can be used as a lifetime pointer to capture with the callback function in an internally constructed lambda. So I don't think it can be done within the current create_subscription API call.

I proposed and coded up an addition of a callback_lifetime weak_ptr to subscription options as an alternative. It works(tm) within rclcpp but still puts onus on the user to use it. ros2/rclcpp#2725

I am still trying to understand the multi-threaded executor reasoning behind the proposed subscription.stop_callbacks mechanism. I still think bondcpp would need to call it before calling reset() on its shared_ptr to the subscriber. It's not enough to put it in the rclcpp::Subscription's destructor because that won't fire until all instances of it are released. For instance, Here is a place where the WaitSet can hold a subscription shared_ptr https://github.com/ros2/rclcpp/blob/80768ed14e5ebbeb4fbbff6de42149c9c526409d/rclcpp/include/rclcpp/wait_set_policies/dynamic_storage.hpp#L205

3.1 Michael's suggestion of demoting everything, everywhere to weak_ptr would probably be best, but I don't think much functionally different.

I think the demotion of weak_ptr of everything everywhere within rclcpp ONLY handles the actual subscription entity itself. There is currently not enough information provided in the create_subscription API to handle the lifetime of the registered callback function. (Now just repeating myself I think)

Hope the above helps rather than hinders.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 14, 2025

Got it, thank you. I understood about 80% of that but enough to get the gist of the issue, but perhaps not all of the specific details of the proposed solutions. I agree that not putting the onus on the user is very important since that is a huge quality of use / life issue for ROS 2 users (and subtle to explain when it matters).

I am still trying to understand the multi-threaded executor reasoning behind the proposed subscription.stop_callbacks mechanism. I still think bondcpp would need to call it before calling reset() on its shared_ptr to the subscriber. It's not enough to put it in the rclcpp::Subscription's destructor because that won't fire until all instances of it are released. For instance, Here is a place where the WaitSet can hold a subscription shared_ptr https://github.com/ros2/rclcpp/blob/80768ed14e5ebbeb4fbbff6de42149c9c526409d/rclcpp/include/rclcpp/wait_set_policies/dynamic_storage.hpp#L205

Wouldn't Michael's comment on demoting everything to weak pointers though resolve that issue?


Currently compiling your bond branch + Nav2 and running the system tests to see what I see. I think Nav2's use of IPC might be blocked by a full solution in rclcpp, but at least allows me to do my final testing so I can finish my work, validate everything, and then hold to merge until ready. If Bond's PR can be merged and that's the only issue within Nav2 that this issue rears its head, then I suppose we could move forward here too - but I'd feel better with a full solution that removes this 'class' of problem, since we have nodes that contain objects that contain sub/pub in Nav2 in several places.

Edit: I'm still seeing my Nav2 system tests job running 45 minutes later. That usually means that every single test is failing and waiting for its full timeout before force exiting the test (usually takes ~22 minutes). This is running your bond branch + the changes in this PR. I'm curious if you tested those changes or if I should try again?

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 14, 2025

I tried again without avail - still shows virtually every system test failing using this branch + the bond PR branch. I don't think this resolves the issue unfortunately :(

@ewak
Copy link
Contributor Author

ewak commented Jan 20, 2025

@SteveMacenski looking for a sanity check.

Do you expect a single threaded executor that has a single simple/plain lifecycle node, i.e does nothing but log its various state changes, to exit from its spin() (without requiring Ctrl+C) once the LifecycleNode has been transitioned to the shutdown state? Eg `ros2 lifecycle set /my_plain_lifecycle_node shutdown'

It's not doing that for me on rolling right now. Irregardless of use_intraprocess.

To my reading something has to call rclcpp::shutdown() to break the executor out of its spin()

Where I am going with this is...
Should an on_broken() callback be registered to the bond to help actually stop the executor from spin()ing, call rclcpp::shutdown() And hence actually stop and shut down the system tests rather than waiting for them to timeout.

https://github.com/ros/bond_core/blob/dbb995761b4fa81ed6b7bb87c4806e024702c531/bondcpp/include/bondcpp/bond.hpp#L98

@SteveMacenski
Copy link
Member

Do you expect a single threaded executor that has a single simple/plain lifecycle node, i.e does nothing but log its various state changes, to exit from its spin() (without requiring Ctrl+C) once the LifecycleNode has been transitioned to the shutdown state? Eg `ros2 lifecycle set /my_plain_lifecycle_node shutdown'

No, the process doesn't end, it it just ready to then be terminated & stops doing work. One can transition up a lifecycle node back up from shutdown to active again (which is a pretty common application). If you do finalize though, I'm not sure if that terminates the node, I think not, but that is a final state that cannot be re-transitioned up. Finalized is done-done.

Should an on_broken() callback be registered to the bond to help actually stop the executor from spin()ing, call rclcpp::shutdown() And hence actually stop and shut down the system tests rather than waiting for them to timeout.

I don't think so for the reasons above :-)

rclcpp::shutdown is typically called in the main() function after rclcpp::spin(node);, to run once its interrupted. But either way, the launch system or system event handler should be the one processing SIGs for shutting down the process. I don't think that's relate to our issue

@ewak
Copy link
Contributor Author

ewak commented Jan 25, 2025

So I went deep on launch_testing (https://github.com/ros2/launch/blob/rolling/launch_testing/launch_testing/legacy/__init__.py) and found out how tests shutdown things. Long story short they wrap event handling around the test process ie RegisterEventHandler(OnProcessExit( ... EmitEvent(event=Shutdown(... etc so me trying to replicate things with ros2 run is not a true representation of how the tests run.

But I have found another fix to bondcpp that allows the tests to run for me. See ros/bond_core#108

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 28, 2025

@ewak if you uncomment the bond line in our underlay repos file and insert your branch, we can test this in CI as well automatically

https://github.com/ros-navigation/navigation2/blob/main/tools/underlay.repos#L14

Should ros2/rclcpp#2725 be reviewed for inclusion or is the status of the rclcpp changes needed adjusted based on the discussions in the various rclcpp tickets? I haven't seen any movement on the tickets in the last couple of weeks and just wondering if there's anything I can do to lubricate the process if you're needing a review (i.e. ping maintainers). I think the way I read the threads is that there are other solutions in mind which take time to implement, but I just wanted to make sure that you're not blocked by anything in my semi-control.

ewak added 6 commits January 28, 2025 16:41
The Bond mechanism includes creation of a subscription
using a reference to a member function(bondStatusCB)
of the Bond class.

This member function operates on member variables.

The lifecycle_node was calling bond_.reset() which
releases the memory as far as the lifecycle_node
is concerned but this is not immediately released
from the rclcpp internal mechanisms (especially intraprocess).
As a result the bondStatusCB function can called after it
has been freed.  This use after free shows up reliably with
asan when running the test_bond test.

This change allows the test_bond to suceed by
calling bond_->breakBond() (rather than bond_.reset())
to break the bond rather than expecting it to
be done cleanly by the ~Bond() destructor.

Is it enough is TBC.

Other possibilities might be to get the Bond to
inherit from std::enable_shared_from_this(), as Ros2 Nodes do,
so that the pointer to the Bond member function bondStatusCB
function remains valid until the subscription is released
during destruction.

Signed-off-by: Mike Wake <macwake@gmail.com>
In order for Bond to inherit from std::enable_shared_from_this
to allow its lifetime to be managed to avoid a use
after free where its subcriber callback is called
before it has been cleaned up from internal rclcpp
mechanism.

This is being explored either by modification to rclcpp
or using a lambda that protects against the the
member function going out of scope by testing a
weak_ptr to shared_from_this for expiry before
the member function bondStatusCB is called.

Signed-off-by: Mike Wake <macwake@gmail.com>
Signed-off-by: Mike Wake <macwake@gmail.com>
This wraps the creation of the node currently (ipc experiment branch)
sets the node option  .use_intra_process_comms(true)

Signed-off-by: Mike Wake <macwake@gmail.com>
…bscription_callback

Signed-off-by: Mike Wake <macwake@gmail.com>
Signed-off-by: Mike Wake <macwake@gmail.com>
@ewak ewak force-pushed the feature/mw/ipc_bond_use_after_free branch from 5fa0d23 to ed40eae Compare January 28, 2025 05:42
Wait some time for tf to propogate.
Allows docking_server time to activate and
not reject initial action request.

Will try to sleep before assert to allow
logs of docking server to be produced before
being killed by end of test shutdown due to
assert.

Signed-off-by: Mike Wake <macwake@gmail.com>
@ewak ewak force-pushed the feature/mw/ipc_bond_use_after_free branch from 637f818 to 0e7e3f6 Compare January 28, 2025 12:38
@SteveMacenski
Copy link
Member

CI seems to pass now with shutdown on all the system tests! I'm going to re-run again 1-2x more just to make sure its consistent, but that's a big step forward.

On the docking changes, mind doing that on the main branch as well? We can merge that in and fix that flaky test while we're pending in rclcpp's IPC (which might be some time)

Committing full debug WIP to simplify flake8 fixup
will back out unnecessary when it works.

Signed-off-by: Mike Wake <macwake@gmail.com>
@ewak ewak force-pushed the feature/mw/ipc_bond_use_after_free branch from 7e7f019 to 10c8272 Compare January 28, 2025 20:37
@ewak
Copy link
Contributor Author

ewak commented Jan 28, 2025

CI seems to pass now with shutdown on all the system tests! I'm going to re-run again 1-2x more just to make sure its consistent, but that's a big step forward.

Woot. I wanted to get a clean build in CI before responding about where to go next on the actual changes to push for in bondcpp and rclcpp.

On the docking changes, mind doing that on the main branch as well? We can merge that in and fix that flaky test while we're pending in rclcpp's IPC (which might be some time)

Yes I will make a clean PR for the docking test flakiness once I am sure its working.

Copy link

codecov bot commented Jan 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines Coverage Δ
nav2_costmap_2d/src/costmap_2d_cloud.cpp 97.22% <100.00%> (+97.22%) ⬆️
nav2_util/include/nav2_util/lifecycle_node.hpp 66.66% <ø> (ø)
nav2_util/src/lifecycle_node.cpp 95.23% <100.00%> (+0.07%) ⬆️

... and 90 files with indirect coverage changes

@SteveMacenski
Copy link
Member

It passed again - trying once more!

@ewak
Copy link
Contributor Author

ewak commented Jan 29, 2025

On the docking changes, mind doing that on the main branch as well? We can merge that in and fix that flaky test while we're pending in rclcpp's IPC (which might be some time)

Yes I will make a clean PR for the docking test flakiness once I am sure its working.

See #4879

@SteveMacenski
Copy link
Member

Woot. I wanted to get a clean build in CI before responding about where to go next on the actual changes to push for in bondcpp and rclcpp.

Let me know how I can help 🔥

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants