Cleanup Docker Containers immediately for samples with errors #706

jjallaire-aisi · 2024-10-16T12:47:05Z

The Docker sandbox has deferred cleanup logic that lets users see console progress while containers are cleaned up. This was initially added because cleanup could take a very long time (in which case users might Ctrl+C, leaking containers).

At the time there was no concept of multiple concurrent tasks and no concept of fail_on_error, which meant that in practice "deferred" cleanup only deferred by a few hundred milliseconds (because any error implied that the entire eval was ending). Since then, a number of things have changed:

(1) We have resolved the long cleanup issues with a combination of init: true and stop_grace_period: 1s (which means that containers typically terminate in 1 second or less)

(2) We have added two independent ways that evals can continue in the presence of errors: (a) max_tasks > 1 means that one task can fail but others keep running; and (b) fail_on_error means that a sample can fail but the others keep running.

The implication of (2) is that our deferral can leave many containers hanging around far longer than they are needed (whereas before this could never happen because errors ended the entire eval).

This change modifies how we set the interrupted flag: we now set it only for asyncio.CancelledError (which will deterministically end the entire eval). This means that when ordinary sample errors occur, their container(s) will be cleaned up immediately.
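
For readers who want to see the shape of the behavior described above, here is a minimal, self-contained sketch. It is not the actual Inspect code: FakeSandbox, run_sample, and the deferred list are hypothetical names invented for illustration. The only thing taken from the description is the rule itself: the interrupted flag is set only for asyncio.CancelledError, and any other sample error triggers immediate cleanup.

```python
import asyncio


class FakeSandbox:
    """Hypothetical stand-in for a per-sample Docker sandbox."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    async def cleanup(self):
        # A real sandbox would stop and remove its containers here.
        self.log.append(f"cleanup:{self.name}")


async def run_sample(name, work, log, deferred):
    """Run one sample and decide when its sandbox gets cleaned up."""
    sandbox = FakeSandbox(name, log)
    interrupted = False
    try:
        await work()
        log.append(f"done:{name}")
    except asyncio.CancelledError:
        # Cancellation ends the entire eval, so cleanup is deferred
        # to a single pass at the end (assumed to live elsewhere).
        interrupted = True
        raise
    except Exception:
        # Ordinary sample error: other samples may keep running,
        # so do NOT mark the sandbox as interrupted.
        log.append(f"error:{name}")
    finally:
        if interrupted:
            deferred.append(sandbox)
        else:
            await sandbox.cleanup()


async def demo_error():
    """One sample fails immediately while another keeps running."""
    log, deferred = [], []

    async def ok():
        await asyncio.sleep(0.05)

    async def boom():
        raise RuntimeError("sample failed")

    await asyncio.gather(
        run_sample("a", boom, log, deferred),
        run_sample("b", ok, log, deferred),
    )
    return log, deferred


async def demo_cancel():
    """A cancelled sample defers cleanup instead of running it inline."""
    log, deferred = [], []

    async def slow():
        await asyncio.sleep(10)

    task = asyncio.create_task(run_sample("c", slow, log, deferred))
    await asyncio.sleep(0)  # let the task start before cancelling it
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return log, deferred
```

In demo_error the failing sample's sandbox is cleaned up before the other sample finishes, which is the case this PR targets. In demo_cancel nothing is cleaned up inline; the sandbox is handed to the deferred list instead.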

craigwalton-dsit

Great! Yeah, the original approach made good sense at the time, but I agree that given the many new ways we've introduced parallelism, this new approach is safer.

I've verified this with my min repro. Thanks for seeing to this so efficiently!

Cleanup Docker Containers immediately for samples with errors

3c08425

jjallaire-aisi requested a review from craigwalton-dsit October 16, 2024 12:47

craigwalton-dsit approved these changes Oct 16, 2024


jjallaire-aisi merged commit af177cd into main Oct 16, 2024
9 checks passed

jjallaire-aisi deleted the bugfix/docker-cleanup-on-error branch October 16, 2024 13:29
