
Changing livenessProbe configuration #14

Merged
Jataki merged 1 commit into main from manu-better-probe on Oct 14, 2024

Conversation


@Jataki Jataki commented Oct 11, 2024

Description

I also took the opportunity to rename `env.example` to `.env.example`.

Bug fixing

AWS runs were consistently failing when running in daemon mode on K8s.

2024-10-11 13:35:56,469 - aws-cfcf7197 - INFO - Executing method 'rds_db_instances'
2024-10-11 13:36:03,904 - aws-cfcf7197 - INFO - 10.244.0.1:41604 - "GET /health HTTP/1.1" 200
2024-10-11 13:36:03,904 - aws-cfcf7197 - INFO - 10.244.0.1:41602 - "GET /health HTTP/1.1" 200
Retrying: 1...
Retrying: 1...
Retrying: 1...
2024-10-11 13:36:14,803 - aws-cfcf7197 - INFO - Shutting down
Retrying: 1...
2024-10-11 13:36:18,391 - aws-cfcf7197 - INFO - Waiting for application shutdown.
2024-10-11 13:36:19,561 - aws-cfcf7197 - INFO - Application shutdown complete.
2024-10-11 13:36:19,562 - aws-cfcf7197 - INFO - Finished server process [1]
Job "aws (trigger: date[2024-10-11 13:35:44 UTC], next run at: 2024-10-11 13:35:44 UTC)" raised an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/apscheduler/executors/base_py3.py", line 30, in run_coroutine_job
    retval = await job.func(*job.args, **job.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/galaxy/core/galaxy.py", line 55, in wrapper
    return await func(instance, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/galaxy/core/galaxy.py", line 164, in call_methods
    await queue.join()
  File "/usr/local/lib/python3.11/asyncio/queues.py", line 215, in join
    await self._finished.wait()
  File "/usr/local/lib/python3.11/asyncio/locks.py", line 213, in wait
    await fut
asyncio.exceptions.CancelledError

Why

The root cause is still unknown, but what is happening is quite clear. The default values for the livenessProbe are 10 seconds for the period and 1 second for the timeout. Notice the screenshot below:

[screenshot: pod log showing the timestamps of consecutive /health probe requests]

During the AWS run, the calls to the AWS API take a long time, and the resulting load appears to make the response to the probe take longer than 1 second. In the screenshot above, a probe was sent at 17:13:44, so the next probe log should have appeared by 17:13:55 at the latest, but it only shows up at 17:14:00. Most likely, under the load of these multiple long calls, the probe takes more than a second to get a reply, the probe times out, and the run is abruptly ended.
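For context, this is roughly what the tight default configuration looks like in a Kubernetes manifest; the sketch below is illustrative only, and the port number is an assumption, not the chart's actual value:

```yaml
# Hypothetical probe configuration illustrating the defaults, not the exact chart values.
livenessProbe:
  httpGet:
    path: /health        # shared with the readinessProbe before this change
    port: 8000           # assumed application port
  periodSeconds: 10      # default period between probes
  timeoutSeconds: 1      # default timeout; too tight while long AWS API calls are in flight
```

Raising `timeoutSeconds` (and optionally `failureThreshold`) gives the process room to answer the probe while it is busy with the AWS calls, instead of being restarted mid-run.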

Moreover, following kube-score advice regarding the livenessProbe:

It should never be the same as your readinessProbe.

So I also took the liberty of moving the liveness check to a separate /live endpoint.
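As a rough sketch of the resulting split (the /live and /health paths follow this PR; the port and timing values are assumptions):

```yaml
# Illustrative sketch of the separated probes; port and timing values are assumed, not taken from the chart.
livenessProbe:
  httpGet:
    path: /live          # liveness-only endpoint introduced in this PR
    port: 8000           # assumed application port
  periodSeconds: 10
  timeoutSeconds: 5      # relaxed so a busy process is not killed mid-run
readinessProbe:
  httpGet:
    path: /health        # readiness keeps the existing endpoint
    port: 8000
  periodSeconds: 10
  timeoutSeconds: 1
```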

Next steps

Depending on how long these calls take, there may be a need to turn this into an environment variable. Any thoughts on that?
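If it does become necessary, one option would be to expose the probe timing as deployment configuration; the sketch below assumes a Helm-templated manifest and hypothetical value names, which may not match how this chart is actually wired:

```yaml
# values.yaml (hypothetical names)
livenessProbe:
  periodSeconds: 10
  timeoutSeconds: 5
```

```yaml
# templates/deployment.yaml fragment (hypothetical)
livenessProbe:
  httpGet:
    path: /live
    port: 8000
  periodSeconds: {{ .Values.livenessProbe.periodSeconds }}
  timeoutSeconds: {{ .Values.livenessProbe.timeoutSeconds }}
```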


@lufinima lufinima left a comment


LGTM

@Jataki Jataki merged commit 6cd0bff into main Oct 14, 2024
1 check passed
@DeXtroTip DeXtroTip deleted the manu-better-probe branch October 15, 2024 09:59