Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise errors on unrecoverable problems #110

Merged
merged 12 commits into from
Sep 30, 2024
Merged

Raise errors on unrecoverable problems #110

merged 12 commits into from
Sep 30, 2024

Conversation

samuelallan72
Copy link
Contributor

@samuelallan72 samuelallan72 commented Sep 24, 2024

Raise errors for snap install or refresh failure. Raise an error because it's a fatal failure, and juju will auto-retry the hook, which may succeed on retry if the error was transient.

Improve blocked status message for if it detects snap not installed or service not active in the status hook.
I think Blocked is better than Error status here:

  • auto retrying the status hook isn't going to fix anything
  • blocked status lets the charm provide a more useful status message to the user
  • blocked status will automatically update to active on the next update-status hook once the issue is resolved

Partially fixes: #108

For:
- snap install or refresh failure
- snap not installed
- snap service not active

These cases are unexpected, the charm cannot recover automatically,
and the user must manually debug the issue.
Therefore, error status is more appropriate.

Partially fixes: #108
Copy link
Contributor

@rgildein rgildein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change, since it make more sense than block status.
Please fix lint a and unit tests. Can we also add a simple integration test for this?

Copy link
Contributor

@chanchiwai-ray chanchiwai-ray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we insist on raising error, I recommend we rework the _on_collect_unit_status

@samuelallan72
Copy link
Contributor Author

Thanks for reviews everyone - I see this is early stage still, I probably could have left it as a draft.

@samuelallan72
Copy link
Contributor Author

Ok lints and unit tests fixed, ready for review again.

A downside I see of raising errors instead of using a custom blocked status, is that the juju status message is not as useful:

openstack-exporter/0*  error     idle   0        10.168.106.56             hook failed: "update-status"

I'm not sure if we can improve that while still following this pattern of raising the error and letting the ops framework manage the error status.

@samuelallan72
Copy link
Contributor Author

Actually I think this needs more work on the conditions to check if the exporter is running. I just observed this:

openstack-exporter/0*  error     idle   1        10.168.106.220            hook failed: "credentials-relation-joined"

I guess this is because configure is not called for credentials-relation-joined hook, so at the end of the credentials-relation-joined hook, collect_status is run, and at this point the charm has the keystone data, but the service hasn't been started yet. The service will start in the next credentials-relation-configure hook.

@samuelallan72
Copy link
Contributor Author

I updated it to avoid false positives with raising the error:

  • ensure that configuring is run for the join hooks too, so collect_status can always pick up a consistent state (otherwise it could detect that the exporter service should be running based on relations joined and data available, but in fact the charm hasn't started it yet).
  • add early returns to collect_status to avoid adding more logic

@chanchiwai-ray chanchiwai-ray self-requested a review September 26, 2024 08:00
@samuelallan72
Copy link
Contributor Author

Not sure what's going on with the integration tests. :/

@samuelallan72
Copy link
Contributor Author

Integration tests are failing because the upgrade hook is crashing on collect status RuntimeError: charmed-openstack-exporter snap service is not active, but it should be.

It's making too many false positives for error state here. If it was blocked, then this would result in a transient blocked state while the exporter wasn't running during an upgrade.

chanchiwai-ray
chanchiwai-ray previously approved these changes Sep 27, 2024
@rgildein rgildein dismissed their stale review September 30, 2024 05:17

My comments have been resolved.

@samuelallan72 samuelallan72 merged commit 9dc1d8f into main Sep 30, 2024
3 checks passed
@samuelallan72 samuelallan72 deleted the errors branch September 30, 2024 05:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should be in error state if snap service down and uninstallation failed
5 participants