Cleanup Docker Containers immediately for samples with errors #706

Merged
merged 1 commit into from
Oct 16, 2024

Conversation

jjallaire-aisi
Collaborator

For the Docker sandbox we have deferred cleanup logic that lets users see console progress during cleanup. This was initially put in because cleanup could take a very long time (in which case users might Ctrl+C, leaking containers).

At the time there was no concept of multiple concurrent tasks and no concept of fail_on_error, so in practice "deferred" cleanup deferred things by only a few hundred milliseconds (because any error implied that the entire eval was ending). In the meantime, a number of things have changed:

(1) We have resolved the long cleanup issues with a combination of init: true and stop_grace_period: 1s (which means that containers typically terminate in 1 second or less).

(2) We have added two independent ways that evals can continue in the presence of errors: (a) max_tasks > 1 means that one task can fail but others keep running; and (b) fail_on_error means that a sample can fail but the others keep running.

The implication of (2) is that our deferral can leave many containers hanging around far longer than they are needed (whereas before this could never happen because errors ended the entire eval).

This change modifies how we set the interrupted flag: we now set it only for asyncio.CancelledError (which will deterministically end the entire eval). This means that when ordinary sample errors occur, their container(s) will be cleaned up immediately.
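
For readers who want to see the shape of the new behavior, here is a minimal sketch of the pattern described above. It is not the actual code from this PR: `run_sample`, `cleanup_container`, and the two tracking lists are hypothetical names used only for illustration, and the container cleanup is a stand-in that does no Docker work. The only parts taken from the description are the `interrupted` flag and the fact that it is set solely for `asyncio.CancelledError`.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical bookkeeping so the sketch can show which path each sample took.
cleaned_immediately: list[str] = []
deferred: list[str] = []


async def cleanup_container(name: str) -> None:
    # Stand-in for tearing down a sample's container(s); no Docker calls here.
    await asyncio.sleep(0)
    cleaned_immediately.append(name)


async def run_sample(name: str, work: Callable[[], Awaitable[None]]) -> None:
    interrupted = False
    try:
        await work()
    except asyncio.CancelledError:
        # Cancellation ends the whole eval, so cleanup is left to the
        # deferred (end-of-eval) path that reports progress on the console.
        interrupted = True
        raise
    finally:
        if interrupted:
            deferred.append(name)
        else:
            # Success or an ordinary sample error: clean up right away.
            await cleanup_container(name)


async def failing() -> None:
    raise ValueError("sample error")


async def hanging() -> None:
    await asyncio.sleep(3600)


async def main() -> None:
    # An ordinary sample error: the container is cleaned up immediately.
    try:
        await run_sample("sample-error", failing)
    except ValueError:
        pass

    # A cancelled sample: cleanup is deferred.
    task = asyncio.create_task(run_sample("sample-cancelled", hanging))
    await asyncio.sleep(0)  # give the task a chance to start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


asyncio.run(main())
```

In this sketch the sample that raises `ValueError` ends up in `cleaned_immediately`, while the cancelled one ends up in `deferred`, which mirrors the distinction the change draws between ordinary sample errors and `asyncio.CancelledError`.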

Collaborator

@craigwalton-dsit craigwalton-dsit left a comment


Great! Yeah, the way we did it initially made good sense, but I agree that given the many new ways we've introduced parallelism, this new approach is safer.

I've verified this with my min repro. Thanks for seeing to this so efficiently!

@jjallaire-aisi jjallaire-aisi merged commit af177cd into main Oct 16, 2024
9 checks passed
@jjallaire-aisi jjallaire-aisi deleted the bugfix/docker-cleanup-on-error branch October 16, 2024 13:29
3 participants