I noticed some of the flow runs are failing when multiple flow runs are scheduled to run consecutively.
These errors are observed in the flow run log.
I investigated based on the error logs above and found the following:
The source of the error traces back to the Prefect Kubernetes worker's _upsert_secret function.
When the worker creates a Kubernetes job, it calls _upsert_secret, which attempts to upsert the Kubernetes secret (the Prefect API key, in the format <worker_name>-<api-key>).
However, when a worker is first created, the secret does not exist. It is only created in _upsert_secret when the worker creates its first job.
When a worker polls a work pool for flow runs, it fetches all the scheduled flow runs and submits all of them for execution (creating a job for each flow run).
Hence, when a worker creates several jobs back to back, there are concurrent calls to _upsert_secret, which triggers the above error and crashes the job.
This is because _upsert_secret first checks whether the secret exists and, if it does not, attempts to create it. When more than one _upsert_secret call runs concurrently, several of them can determine that the secret does not exist and each attempt to create it. Once one call has created the secret, the others receive a 409 Conflict error, which is not caught or handled gracefully, and the job crashes.
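One possible way to make the upsert tolerant of this race is to treat a 409 on create as "someone else created it first" and fall back to replacing the secret. The sketch below is only an illustration of that idea, not the worker's actual code: FakeSecretStore, ConflictError, upsert_secret, and the secret name "worker-api-key" are all hypothetical stand-ins for the Kubernetes API and the worker internals, and threads with a barrier are used to force the interleaving described above.

```python
import threading


class ConflictError(Exception):
    """Hypothetical stand-in for an HTTP 409 Conflict from the Kubernetes API."""


class FakeSecretStore:
    """Hypothetical in-memory stand-in for the namespaced secret API."""

    def __init__(self):
        self._secrets = {}
        self._lock = threading.Lock()

    def read(self, name):
        with self._lock:
            return self._secrets.get(name)

    def create(self, name, value):
        with self._lock:
            if name in self._secrets:
                raise ConflictError(name)
            self._secrets[name] = value

    def replace(self, name, value):
        with self._lock:
            self._secrets[name] = value


def upsert_secret(store, name, value, barrier=None):
    """Check-then-create upsert that tolerates losing the creation race."""
    existing = store.read(name)
    if barrier is not None:
        # Make every caller finish its existence check before any of them creates.
        barrier.wait()
    if existing is None:
        try:
            store.create(name, value)
        except ConflictError:
            # Another caller created the secret first; update it instead of crashing.
            store.replace(name, value)
    else:
        store.replace(name, value)


def run_concurrent_upserts(n=5):
    """Run n upserts that all observe a missing secret, then return the outcome."""
    store = FakeSecretStore()
    barrier = threading.Barrier(n)
    errors = []

    def task():
        try:
            upsert_secret(store, "worker-api-key", "token", barrier)
        except Exception as exc:  # collect anything that would have crashed a job
            errors.append(exc)

    threads = [threading.Thread(target=task) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.read("worker-api-key"), errors
```

With the fallback in place, every caller finishes without an error even though all of them saw the secret as missing. Whether replacing, re-reading, or simply ignoring the conflict is the right recovery depends on how the worker uses the secret, so this choice is an assumption.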
Version info
Version: 2.14.20
API version: 0.8.4
Python version: 3.11.7
Git commit: 8ceb0962
Built: Thu, Feb 1, 2024 6:30 PM
OS/Arch: linux/x86_64
Profile: default
Server type: cloud
Additional context
To reproduce the issue, we need:
A newly created worker that has not created any jobs.
The newly created worker receives more than one scheduled flow run after polling the work pool.
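The two conditions above can be modeled outside a cluster. The following is a rough simulation under stated assumptions, not a reproduction against a real worker: InMemorySecrets, Conflict, naive_upsert, and the secret name "worker-api-key" are hypothetical, and a barrier is used to guarantee that every call checks for the secret before any call creates it.

```python
import threading


class Conflict(Exception):
    """Hypothetical stand-in for an HTTP 409 Conflict response."""


class InMemorySecrets:
    """Hypothetical stand-in for the secret API of a fresh namespace."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def exists(self, name):
        with self._lock:
            return name in self._data

    def create(self, name, value):
        with self._lock:
            if name in self._data:
                raise Conflict(name)
            self._data[name] = value


def naive_upsert(secrets, name, value, barrier):
    """Check-then-create without any handling of a lost race."""
    missing = not secrets.exists(name)
    barrier.wait()  # all callers have checked before anyone creates
    if missing:
        secrets.create(name, value)  # raises Conflict for every caller but one


def count_conflicts(flow_runs):
    """Simulate a new worker submitting several flow runs at once."""
    secrets = InMemorySecrets()  # new worker: the secret does not exist yet
    barrier = threading.Barrier(flow_runs)
    conflicts = []

    def submit():
        try:
            naive_upsert(secrets, "worker-api-key", "token", barrier)
        except Conflict as exc:
            conflicts.append(exc)

    threads = [threading.Thread(target=submit) for _ in range(flow_runs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(conflicts)
```

In this model exactly one create succeeds and every other concurrent call hits the conflict, which matches the observation that only some of the flow runs fail.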
tanchangsheng changed the title from "Flow run fails when worker tries to create Kubernetes secret concurrently across consecutive flow run starts" to "Flow run fails when newly started worker tries to create Kubernetes secret concurrently in concurrent create job function calls" on Dec 19, 2024