Flow run fails when newly started worker tries to create Kubernetes secret concurrently in concurrent create job function calls #16447

Open
tanchangsheng opened this issue Dec 19, 2024 · 0 comments
Labels
bug Something isn't working

Bug summary

I noticed that some flow runs fail when multiple flow runs are scheduled to run back to back.

These errors are observed in the flow run log.
[two screenshots of the error logs]

I investigated based on the error logs above and found the following:

  • The error traces back to the _upsert_secret function of the Prefect Kubernetes worker.
  • When the worker creates a Kubernetes job, it calls _upsert_secret, which attempts to upsert the Kubernetes secret (the Prefect API key, in the format <worker_name>-<api-key>).
  • However, when a worker is first created, the secret does not exist yet. It is only created in _upsert_secret when the worker creates its first job.
  • When a worker polls a work pool for flow runs, it fetches all the scheduled flow runs and submits all of them for execution (creating one job per flow run).
  • Hence, when a worker runs several create-job calls back to back, _upsert_secret is called concurrently, which triggers the above error and crashes the job.
  • This is because _upsert_secret first checks whether the secret exists and, if it doesn't, attempts to create it. When more than one _upsert_secret call runs concurrently, several of them may determine that the secret does not exist and all attempt to create it. One of them succeeds first; the others then receive a 409 Conflict error, which is not caught and handled gracefully, so the job crashes.
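
To illustrate the race described above and one possible way to handle it, here is a minimal sketch. It does not use Prefect's actual code or the real Kubernetes client: FakeSecretApi, ApiError, naive_upsert, tolerant_upsert and run_concurrently are all hypothetical names, and the in-memory store only imitates the 404/409 responses a cluster would give. The idea is that a 409 on create means another call already created the secret, so the caller could fall back to replacing it instead of failing.

```python
import asyncio


class ApiError(Exception):
    """Hypothetical stand-in for a Kubernetes API error with an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


class FakeSecretApi:
    """Hypothetical in-memory imitation of the cluster's secret endpoints."""

    def __init__(self):
        self.secrets = {}

    async def read(self, name):
        await asyncio.sleep(0)  # yield control, as a network call would
        if name not in self.secrets:
            raise ApiError(404)
        return self.secrets[name]

    async def create(self, name, value):
        await asyncio.sleep(0)
        if name in self.secrets:
            raise ApiError(409)  # already created by someone else
        self.secrets[name] = value

    async def replace(self, name, value):
        await asyncio.sleep(0)
        if name not in self.secrets:
            raise ApiError(404)
        self.secrets[name] = value


async def naive_upsert(api, name, value):
    # Check-then-create as described in the issue: racy under concurrency.
    try:
        await api.read(name)
    except ApiError as exc:
        if exc.status != 404:
            raise
        await api.create(name, value)  # a 409 here propagates to the caller
        return
    await api.replace(name, value)


async def tolerant_upsert(api, name, value):
    # Same flow, but a 409 on create is treated as "the secret exists now",
    # so the call falls through to replace instead of crashing.
    try:
        await api.read(name)
    except ApiError as exc:
        if exc.status != 404:
            raise
        try:
            await api.create(name, value)
            return
        except ApiError as create_exc:
            if create_exc.status != 409:
                raise
    await api.replace(name, value)


async def run_concurrently(upsert, n=5):
    """Run n upserts at once on a fresh store; return the errors raised."""
    api = FakeSecretApi()
    results = await asyncio.gather(
        *(upsert(api, "worker-api-key", "s3cr3t") for _ in range(n)),
        return_exceptions=True,
    )
    return [r for r in results if isinstance(r, ApiError)]
```

Calling asyncio.run(run_concurrently(naive_upsert)) returns the 409 errors raised by the calls that lost the race, while the tolerant variant returns an empty list. Whether falling back to replace is the right behavior for the real worker is a design choice for the maintainers; serializing the upsert behind a lock would be another option.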

Version info

Version:             2.14.20
API version:         0.8.4
Python version:      3.11.7
Git commit:          8ceb0962
Built:               Thu, Feb 1, 2024 6:30 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         cloud

Additional context

To reproduce the issue, we need:

  • A newly created worker that has not created any jobs yet.
  • The worker receives more than one scheduled flow run when it polls the work pool.
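
The two conditions above can be checked with a small simulation. This is only a sketch under assumptions: submit_job and poll_once are hypothetical names that model one job submission as "check, then create" against an in-memory set, not Prefect's real worker code. It shows that crashes only appear when the secret does not exist yet and more than one run is submitted in the same polling cycle.

```python
import asyncio


class Conflict(Exception):
    """Hypothetical stand-in for an HTTP 409 from the Kubernetes API."""


async def submit_job(secrets: set, name: str) -> None:
    # Model of one job submission: check for the secret, then create it.
    exists = name in secrets
    await asyncio.sleep(0)  # the check and the create are separate API calls
    if not exists:
        if name in secrets:
            raise Conflict(name)  # another submission created it in between
        secrets.add(name)


async def poll_once(scheduled_runs: int, secret_exists: bool) -> int:
    """Return how many submissions crash in one simulated polling cycle."""
    secrets = {"worker-api-key"} if secret_exists else set()
    results = await asyncio.gather(
        *(submit_job(secrets, "worker-api-key") for _ in range(scheduled_runs)),
        return_exceptions=True,
    )
    return sum(isinstance(r, Conflict) for r in results)
```

In this model, asyncio.run(poll_once(3, secret_exists=False)) reports crashes, while a single run on a fresh worker, or several runs on a worker whose secret already exists, report none.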
tanchangsheng changed the title from "Flow run fails when worker tries to create Kubernetes secret concurrently across consecutive flow run starts" to "Flow run fails when newly started worker tries to create Kubernetes secret concurrently in concurrent create job function calls" on Dec 19, 2024