
Changing livenessProbe configuration #14

Merged
Jataki merged 1 commit into main from manu-better-probe on Oct 14, 2024

Conversation


@Jataki Jataki commented Oct 11, 2024

Description

I also took the opportunity to rename `env.example` to `.env.example`.

Bug fixing

AWS runs were consistently failing when running in daemon mode on K8s.

2024-10-11 13:35:56,469 - aws-cfcf7197 - INFO - Executing method 'rds_db_instances'
2024-10-11 13:36:03,904 - aws-cfcf7197 - INFO - 10.244.0.1:41604 - "GET /health HTTP/1.1" 200
2024-10-11 13:36:03,904 - aws-cfcf7197 - INFO - 10.244.0.1:41602 - "GET /health HTTP/1.1" 200
Retrying: 1...
Retrying: 1...
Retrying: 1...
2024-10-11 13:36:14,803 - aws-cfcf7197 - INFO - Shutting down
Retrying: 1...
2024-10-11 13:36:18,391 - aws-cfcf7197 - INFO - Waiting for application shutdown.
2024-10-11 13:36:19,561 - aws-cfcf7197 - INFO - Application shutdown complete.
2024-10-11 13:36:19,562 - aws-cfcf7197 - INFO - Finished server process [1]
Job "aws (trigger: date[2024-10-11 13:35:44 UTC], next run at: 2024-10-11 13:35:44 UTC)" raised an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/apscheduler/executors/base_py3.py", line 30, in run_coroutine_job
    retval = await job.func(*job.args, **job.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/galaxy/core/galaxy.py", line 55, in wrapper
    return await func(instance, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/galaxy/core/galaxy.py", line 164, in call_methods
    await queue.join()
  File "/usr/local/lib/python3.11/asyncio/queues.py", line 215, in join
    await self._finished.wait()
  File "/usr/local/lib/python3.11/asyncio/locks.py", line 213, in wait
    await fut
asyncio.exceptions.CancelledError

Why

The root cause is still unknown, but what is happening is quite clear. The default values for the livenessProbe are 10 seconds for the period and 1 second for the timeout. Notice the screenshot below:

[screenshot: pod log showing the timestamps of consecutive /health probe requests]

During the AWS run, the calls to the AWS API take a long time, and the resulting load appears to make the response to the probe take longer than 1 second. In the screenshot above, a probe was sent at 17:13:44, so the next probe log should have appeared by 17:13:55 at the latest, but it only shows up at 17:14:00. Most likely, under the load of these multiple long calls, the probe takes more than a second to get a reply, the probe times out, and the run is abruptly ended.
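For context, this is roughly what the tight default configuration looks like in a Kubernetes manifest; the sketch below is illustrative only, and the port number is an assumption, not the chart's actual value:

```yaml
# Hypothetical probe configuration illustrating the defaults, not the exact chart values.
livenessProbe:
  httpGet:
    path: /health        # shared with the readinessProbe before this change
    port: 8000           # assumed application port
  periodSeconds: 10      # default period between probes
  timeoutSeconds: 1      # default timeout; too tight while long AWS API calls are in flight
```

Raising `timeoutSeconds` (and optionally `failureThreshold`) gives the process room to answer the probe while it is busy with the AWS calls, instead of being restarted mid-run.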

Moreover, following kube-score advice regarding the livenessProbe:

It should never be the same as your readinessProbe.

So I also took the liberty of moving the liveness check to a separate /live endpoint.
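As a rough sketch of the resulting split (the /live and /health paths follow this PR; the port and timing values are assumptions):

```yaml
# Illustrative sketch of the separated probes; port and timing values are assumed, not taken from the chart.
livenessProbe:
  httpGet:
    path: /live          # liveness-only endpoint introduced in this PR
    port: 8000           # assumed application port
  periodSeconds: 10
  timeoutSeconds: 5      # relaxed so a busy process is not killed mid-run
readinessProbe:
  httpGet:
    path: /health        # readiness keeps the existing endpoint
    port: 8000
  periodSeconds: 10
  timeoutSeconds: 1
```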

Next steps

Depending on how long these calls take, there may be a need to turn this into an environment variable. Any thoughts on that?
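If it does become necessary, one option would be to expose the probe timing as deployment configuration; the sketch below assumes a Helm-templated manifest and hypothetical value names, which may not match how this chart is actually wired:

```yaml
# values.yaml (hypothetical names)
livenessProbe:
  periodSeconds: 10
  timeoutSeconds: 5
```

```yaml
# templates/deployment.yaml fragment (hypothetical)
livenessProbe:
  httpGet:
    path: /live
    port: 8000
  periodSeconds: {{ .Values.livenessProbe.periodSeconds }}
  timeoutSeconds: {{ .Values.livenessProbe.timeoutSeconds }}
```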


@lufinima lufinima left a comment


LGTM

@Jataki Jataki merged commit 6cd0bff into main Oct 14, 2024
1 check passed
@DeXtroTip DeXtroTip deleted the manu-better-probe branch October 15, 2024 09:59