Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: recursive RLock() mutex acquision #126

Merged
merged 1 commit into from
Jan 9, 2025

Conversation

smira
Copy link
Contributor

@smira smira commented Dec 24, 2024

There were at least two code paths which might acquire .RLock() recursively:

  1. Setup*() -> createResult()
  2. Setup*() -> Networks().

On its own, it's not a problem, but if .Load() is called concurrently, it might do .Lock() in between recursive .RLock()s, which would lead to a deadlock.

See #125

Fix by introducing a contract on the way mutex is acquired.

Add a test facility to verify for such mistakes via build tags:

  • go test -race or go test -tags deadlocks activates deadlock detection

There were at least two code paths which might acquire `.RLock()`
recursively:

1. `Setup*()` -> `createResult()`
2. `Setup*()` -> `Networks()`.

On its own, it's not a problem, but if `.Load()` is called concurrently,
it might do `.Lock()` in between recursive `.RLock()`s, which would lead
to a deadlock.

See containerd#125

Fix by introducing a contract on the way mutex is acquired.

Add a test facility to verify for such mistakes via build tags:

* `go test -race` or `go test -tags deadlocks` activates deadlock
  detection

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
@buroa
Copy link

buroa commented Dec 24, 2024

Thanks for taking a look at this @smira ❤️

smira added a commit to smira/pkgs that referenced this pull request Dec 24, 2024
@smira
Copy link
Contributor Author

smira commented Dec 24, 2024

Thanks for taking a look at this @smira ❤️

Thanks for finding the bug, fixing is easy :)

@MikeZappa87
Copy link
Contributor

MikeZappa87 commented Dec 24, 2024

@mikebrow this was my original approach, I thought this was the correct way as I was concerned about a deadlock with the merged approach.

smira added a commit to smira/pkgs that referenced this pull request Dec 26, 2024
Fixes siderolabs/talos#9984

Patch with containerd/go-cni#126

See also:

* containerd/go-cni#125
* containerd/containerd#11186
* containerd/go-cni#123

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 0b00e86)
"github.com/sasha-s/go-deadlock"
)

type RWMutex = deadlock.RWMutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an interface for sync.RWMutex seems easier to manage and test explicitly for than using build tags?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well my point is you can have an interface in the struct and set it to use the deadlock.RWMutex in a test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's possible, but build tags allows to exclude deadlock.RWMutex completely from the build if it's not race-enabled or has an explicit deadlocks build tag.

plus interface indirection has a slight performance hit

this change is equivalent to what it was before if not using race/build tag

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm good either way.. not worried about performance of interface indirection for network device setup/tear down :-) Fair to change it in a separated PR if desired.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the issue here is, who is running this test?
CI is not setup for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is setup for it, it runs go test -race which is enough to activate deadlock library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I was doing a find in the ci.yml I see its in the actual Makefile.

@mikebrow
Copy link
Member

mikebrow commented Jan 2, 2025

@mikebrow this was my original approach, I thought this was the correct way as I was concerned about a deadlock with the merged approach.
nod...

the new contract
// Mutex contract:
// - lock in public methods: write lock when mutating the state, read lock when reading the state.
// - never lock in private methods.
is consistent with the "or" path here #123 (comment)

Have not seen the failure path, exactly, yet, but this pattern seems fine and is easier to follow.

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MikeZappa87 MikeZappa87 requested a review from cpuguy83 January 8, 2025 15:27
@smira
Copy link
Contributor Author

smira commented Jan 8, 2025

Have not seen the failure path, exactly, yet, but this pattern seems fine and is easier to follow.

It was in the failure path:

  1. Setup*() -> createResult() (createResult and Setup* both acquire it, and createResult is private)
  2. Setup*() -> Networks(). (Setup calls public Networks, and both acquire it)

@mikebrow mikebrow merged commit f4736bb into containerd:main Jan 9, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants