HPCC-31017 Report cause of k8s thorworker job failure #18142

jakesmith · 2023-12-13T15:33:15Z

Ensure that the cause of a failure to apply the k8s thorworker job is reported back to the workunit.
Also suppress the follow-on 'backoff' failure if the primary cause of the failure has already been reported.

Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>

github-actions · 2023-12-13T15:33:32Z

https://track.hpccsystems.com/browse/HPCC-31017
Jira updated

github-actions · 2023-12-13T15:43:10Z

https://track.hpccsystems.com/browse/HPCC-31017
This pull request is already registered

shamser

Please can you clarify the question regarding suppressing thormanager exit failure?

shamser · 2023-12-18T10:13:57Z

thorlcr/master/thmastermain.cpp

+                if (wu)
+                {
+                    relayWuidException(wu, exception);
+                    retCode = 0; // if successfully reported, suppress thormanager exit failure that would trigger another exception


If the exit failure is suppressed, does that mean some error conditions will no longer be resolved by the thor manager exit-failure handling (e.g. restarting thor)? In other words, could continuing as if no error occurred allow the error condition to recur when the next job is processed?

jakesmith · 2023-12-18

This is containerized only. Thor doesn't restart in k8s; existing instances are either reused or a new instance is spun up.
The error has already been reported to the workunit, so the exit failure is suppressed to stop the launcher of the instance (agentexec), which may be from a much earlier workunit, from seeing the error and reporting the k8s error "Job has reached the specified backoff limit".

In other words, if it gets this far, this thor instance has handled the error, reported it to our workflow, and is returning success to the agent to prevent it from reporting a spurious and confusing additional error.
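
For readers following the thread, the behaviour described in this reply can be sketched roughly as below. This is a minimal illustration, not the code in thmastermain.cpp: `Workunit`, `exitCodeAfterFailure` and `launcherReportsError` are hypothetical names, the `relayWuidException` here is a stand-in whose signature is assumed, and the launcher check is a deliberate simplification of how agentexec ends up seeing the k8s backoff error.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a workunit: it only collects reported errors.
struct Workunit
{
    std::vector<std::string> exceptions;
};

// Stand-in for relayWuidException() from the diff above.
// The real signature is not shown in this thread; this one is assumed.
void relayWuidException(Workunit &wu, const std::string &error)
{
    wu.exceptions.push_back(error);
}

// Choose the process exit code after failing to apply the worker job.
// wu is null when there is no workunit to report the error to.
int exitCodeAfterFailure(Workunit *wu, const std::string &error)
{
    int retCode = 1; // default: signal failure to the launcher
    if (wu)
    {
        relayWuidException(*wu, error);
        retCode = 0; // already reported, so do not trigger a second error
    }
    return retCode;
}

// Simplified launcher view (assumption): any non-zero exit eventually
// surfaces as an additional error such as the k8s backoff-limit message.
bool launcherReportsError(int exitCode)
{
    return exitCode != 0;
}
```

With a workunit available the error is recorded once and the launcher sees success; without one, the non-zero exit code still reaches the launcher, so the failure is not silently lost.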

@shamser - please see reply.

github-actions · 2023-12-19T17:45:45Z

https://track.hpccsystems.com/browse/HPCC-31017
This pull request is already registered

jakesmith requested a review from shamser December 13, 2023 15:34

shamser reviewed Dec 18, 2023

jakesmith requested a review from shamser December 18, 2023 17:06

shamser approved these changes Dec 19, 2023

jakesmith force-pushed the HPCC-31017-thorworker-k8s-error-report branch 2 times, most recently from 665408b to bf7d04f Compare December 19, 2023 16:54

jakesmith closed this Dec 19, 2023

jakesmith reopened this Dec 19, 2023

ghalliday merged commit 08fc9a9 into hpcc-systems:candidate-9.2.x Dec 19, 2023
83 of 87 checks passed
