HPCC-31017 Report cause of k8s thorworker job failure #18142
Conversation
Ensure that the cause of the failure to apply the k8s thorworker job is reported back to the workunit. Also suppress the follow-on 'backoff' failure if the primary cause of failure has already been reported.

Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>
https://track.hpccsystems.com/browse/HPCC-31017
Please can you clarify the question regarding suppressing the thormanager exit failure?
if (wu)
{
    relayWuidException(wu, exception);
    retCode = 0; // if successfully reported, suppress thormanager exit failure that would trigger another exception
}
If the exit failure is suppressed, does that mean some error conditions will no longer be resolved by the thormanager exit-failure handling (e.g. restarting Thor)? That is, by continuing as if no error occurred, could the error condition recur when the next job is processed?
This is containerized only; Thor doesn't restart in k8s, existing instances are either reused or a new instance is spun up.
The error, already reported to the workunit, is suppressed so that the launcher of the instance (agentexec), which may be from a much earlier workunit, does not see it and report a k8s error "Job has reached the specified backoff limit".
In other words, if execution gets this far, this Thor instance has handled the error, reported it to our workflow, and is returning success to the agent to prevent it from reporting a spurious and confusing additional error.
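For clarity, a minimal sketch of the flow described above, assuming a simplified shape for the thormanager exit path; relayWuidException and wu come from the snippet earlier in this conversation, while the function signature, parameter types, and the job-launch step are illustrative assumptions only:

// Sketch only (not the actual thormanager code): once the failure has been
// relayed to the workunit, return success so the launcher (agentexec) does not
// also report a "Job has reached the specified backoff limit" error.
int runThorWorkerJob(IConstWorkUnit *wu)
{
    int retCode = 1; // assume failure until the job completes
    Owned<IException> exception;
    try
    {
        // ... apply and run the k8s thorworker job (elided) ...
        retCode = 0;
    }
    catch (IException *e)
    {
        exception.setown(e);
    }
    if (exception && wu)
    {
        relayWuidException(wu, exception); // primary cause recorded on the workunit
        retCode = 0; // suppress the exit failure so no follow-on 'backoff' error is raised
    }
    return retCode;
}

With this shape, agentexec only sees a non-zero exit code when the error could not be relayed to the workunit.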
@shamser - please see reply.
Merged commit 08fc9a9 into hpcc-systems:candidate-9.2.x
https://track.hpccsystems.com/browse/HPCC-31017
Ensure that the cause of the failure to apply the k8s thorworker job is reported back to the workunit.
Also suppress the follow-on 'backoff' failure if the primary cause of failure has already been reported.
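As an illustration of the first half of this change (surfacing the reason the job could not be applied), a hypothetical sketch follows; applyK8sJobYaml is a made-up stand-in for whatever submits the job and collects its error output, while jexcept.hpp/jstring.hpp, StringBuffer, and makeStringExceptionV are standard jlib facilities. It is not the actual HPCC implementation:

#include "jexcept.hpp"  // jlib: IException, makeStringExceptionV
#include "jstring.hpp"  // jlib: StringBuffer

// Hypothetical helper: submits the thorworker job yaml and captures any error text.
extern bool applyK8sJobYaml(const char *jobYaml, StringBuffer &errorOut);

static void applyThorWorkerJob(const char *jobYaml)
{
    StringBuffer errorText;
    if (!applyK8sJobYaml(jobYaml, errorText))
    {
        // Throwing with the underlying cause means the exception relayed to the
        // workunit (via relayWuidException above) explains why the apply failed,
        // rather than only the generic k8s backoff error.
        throw makeStringExceptionV(0, "Failed to apply k8s thorworker job: %s", errorText.str());
    }
}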
Type of change:
Checklist:
Smoketest:
Testing: