-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(pebble): catch socket.timeout exception in pebble.Client.exec() #1155
fix(pebble): catch socket.timeout exception in pebble.Client.exec() #1155
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the investigation. Can we also add a regression test for this? For example, alongside test_wait_change_timeout
in test_pebble.py
, which makes the response raise socket.timeout
-- which should fail before this change and succeed after? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a couple of minor suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are also some other issues raised in #826 that I think we should either comment on somewhere (there or in this PR) or adjust. In particular:
- The Unix socket has code for providing a timeout, but I don't think there's any way for a charm (or even the Pebble client) to provide that, so it's unclear why it's there. Am I missing some way that it can be set, or some reason to provide it even though it's never used?
- The Pebble client has a timeout that's used for (I think) everything other than
exec
, but Charm's basically can't use this, because it's not exposed to the model. Is that deliberate? Should a charm be able to change this timeout? - It seems like in the original report that the timeout was happening after ~13s, but it should presumably have been 5s. Have we checked that we do see a timeout after 5s and not a longer time?
To answer the 3rd question first, the 5s timeout is It's not the timeout for executing the command created by The code in the original issue doesn't set timeout for John did mention that the 5s timeout to pebble can't be configured and I checked, it's true, it can't be set in the entire OF code: When creating a model, metadata is used, then passed to Unit, which is created by model, using _ModelCache. Then unit creates containers, container initializes pebble client. However, no where in the whole chain is pebble timeout set, so it's always the default 5s. I think if we want to expose pebble timeout as a configuration item, one way to do so is to add something like I don't know if this change is worth it, because A, 5s seems to be an OK value, and B, users probably only care about Is it worth adding this feature? Need your input here @benhoyt @tonyandrewmeyer. Another thing John mentioned was the |
This is badly named, in my opinion, but we can't do anything about that now. If I understand correctly, if I do
The wait API is a bit odd in that we are basically polling, but it seems to have been added to avoid polling. If I do
So they are basically the same underneath.
Yeah, I get that there are a bunch of different timeouts here, the Although the ticket summary is about the leaked exception, I think the broader problem (of the two issues combined) is "Pebble timeouts are confusing", and it'd be good to solve that, rather than just the specific leaking exception.
I would guess that the 13 comes from three attempts - we tried 2 times to get the change result from Pebble, waiting 4s each time, and that all worked as designed, and then on the 3rd attempt instead of the Pebble call timing out, the socket connection timed out after 5s (4+4+5=13). If so I guess this is fine, because now we'll be catching that socket error and it'll time out after 30s.
I think we probably do not want to expose this, since it's not what users would think of a timeout. I think we do want to explain in the
I looked at this a bit more closely and it's clearer to me now. |
I agree that "Pebble timeouts are confusing" -- at least this is what I was thinking when I was reading the code. I think it's too complicated to be completely refactored, also I don't know what the impact circle would be. Still, maybe we can do some minimal refactoring to make it better, even just slightly so? Off the top of my head, I have the following ideas:
|
Even though this isn't accessible directly by charms, this would break backwards compatibility so we can't really do it. I don't think there's much we can do other than make the documentation clearer on what this value is (unless we do want to expose it to charms, but as we discussed above, I don't think so).
I think a different name would have been great, but this would again break backwards compatibility, so we can't really do it (in ops 2.x). |
Yeah, and can see how the "Client.timeout" name is confusing. I agree with Tony's assessment above:
I don't think we should expose the ability for charmers to change this (no one's asked for that, and 5s has worked well). I'm on board with explaining in the |
According to our discussion, I have added a little explanation in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Let's just remove the 3 extraneous files added under docs
(do we need to add something more to .gitignore
?)
Sorry, it was a bad merge; the history is too complicated so I rebased everything to fix it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work on this!
Tried to run some
container.exec
in a test charm with sleep, with some other long-running commands like apt-update, with or without timeout, but could not reproduce the issue.Changed according to John's opinion: catch
socket.timeout
exception and continue polling till the end of the deadline.Also updated docstring for
model.Container.exec()
andpebble.Client.exec()
.Closes #826.