
Improve shutdown process in Multiprocess mode #2010

Closed
wants to merge 1 commit

Conversation


@rbtz-openai rbtz-openai commented Jun 17, 2023

Summary

This PR explicitly closes listening sockets in the parent process early in the shutdown process. That way, no new connections can hit workers after shutdown is initiated by the parent process.

While here, speed up shutdown by splitting process shutdown into separate terminate and join loops, so all child processes are shut down in parallel.
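
Below is a minimal sketch of that shutdown order (hypothetical helper and variable names, not the actual uvicorn supervisor code):

import multiprocessing
import socket


def shutdown(sockets: list[socket.socket], workers: list[multiprocessing.Process]) -> None:
    # Close the parent's copies of the listening sockets up front; once each
    # worker also closes its inherited copy, the port fully unbinds and clients
    # get a retryable "connection refused" instead of hanging requests.
    for sock in sockets:
        sock.close()

    # Signal every worker first (terminate loop)...
    for worker in workers:
        worker.terminate()

    # ...then wait for all of them (join loop), so workers shut down in
    # parallel rather than one after another.
    for worker in workers:
        worker.join()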

Current behaviour

Here is a sample application:

import asyncio

from fastapi import FastAPI

app = FastAPI()


@app.get("/")
async def root():
    return {"Hello": "World"}


@app.on_event("shutdown")
async def shutdown_event():
    print("Shutdown event")
    await asyncio.sleep(10)
    print("Shutdown event done")


@app.get("/sleep")
async def sleep():
    """can be used to test graceful shutdown"""
    await asyncio.sleep(60)

It is run with:

uvicorn app:app --workers 2

In a second window we can run a simulated traffic generator, courtesy of curl:

while :; do curl --connect-timeout 0.1 --max-time 1 localhost:8000; sleep 1; done

In a third window we pkill(1) the parent:

pkill -f 'uvicorn app:app'

What is seen in uvicorn logs:

(fast) ➜  app uvicorn app:app --workers 2
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started parent process [87167]
INFO:     Started server process [87194]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Started server process [87195]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Normal traffic:

INFO:     127.0.0.1:63935 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63936 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63938 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63939 - "GET / HTTP/1.1" 200 OK

SIGTERM received:

INFO:     Shutting down
INFO:     Waiting for application shutdown.
Shutdown event

At this point only a single worker receives traffic:

INFO:     127.0.0.1:63940 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63942 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63943 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63944 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63945 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63946 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63947 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63948 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63949 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:63950 - "GET / HTTP/1.1" 200 OK
Shutdown event done
INFO:     Application shutdown complete.
INFO:     Finished server process [87194]
INFO:     Shutting down
INFO:     Waiting for application shutdown.

Now the last worker has closed its sockets, but the parent still holds an open listening socket, so requests are dropped on the floor:

Shutdown event
Shutdown event done
INFO:     Application shutdown complete.
INFO:     Finished server process [87195]
INFO:     Stopping parent process [87167]

After that point clients get connection refused.

From the client it looks like:

$ while :; do curl --connect-timeout 0.1 --max-time 1 localhost:8000; sleep 1; done
{"Hello":"World"}{"Hello":"World"}{"Hello":"World"}
# At this point we sent SIGTERM.  All the following responses come from a single alive worker:
{"Hello":"World"}{"Hello":"World"}{"Hello":"World"}{"Hello":"World"}{"Hello":"World"}{"Hello":"World"}{"Hello":"World"}{"Hello":"World"}{"Hello":"World"}{"Hello":"World"}{"Hello":"World"}
# Here this worker dies and we start dropping requests on the floor (non-retryable).
curl: (28) Operation timed out after 1001 milliseconds with 0 bytes received
curl: (28) Operation timed out after 1003 milliseconds with 0 bytes received
curl: (28) Operation timed out after 1004 milliseconds with 0 bytes received
curl: (28) Operation timed out after 1002 milliseconds with 0 bytes received
curl: (28) Operation timed out after 1005 milliseconds with 0 bytes received
# Here the master quits and we get a clean (retryable) connection refused
curl: (7) Failed to connect to localhost port 8000 after 4 ms: Couldn't connect to server
curl: (7) Failed to connect to localhost port 8000 after 4 ms: Couldn't connect to server
curl: (7) Failed to connect to localhost port 8000 after 4 ms: Couldn't connect to server

New behaviour

New version:

  • Closes the listening sockets in the parent process immediately.
  • Shuts down all worker processes in parallel, which has two side effects:
    • No load imbalance between workers during shutdown.
    • Faster shutdown process.

PS. The same situation can be reproduced without the shutdown lifecycle event, using --timeout-graceful-shutdown 30 and an endless stream of slow requests.

Checklist

  • I understand that this PR may be closed in case there was no previous discussion. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.

Comment on lines +66 to +68
for sock in self.sockets:
sock.close()

Sponsor Member

Why do you need to close the sockets here again?

Author

@rbtz-openai rbtz-openai Jun 18, 2023

In UNIX-based OSes one needs to close all instances of a listening socket for it to unbind (including the ones inherited by forks). In this case this includes:

  • the master process.
  • all children.

If you're hinting that in some OSes (e.g. Windows, where my knowledge is quite limited) this would lead to an error, I can wrap this in a try/except socket.error.
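
As a rough sketch of that defensive close (illustrative only; the helper name is hypothetical):

import socket


def close_listening_sockets(sockets: list[socket.socket]) -> None:
    # Every process (the parent and each forked child) holds its own copy of
    # the listening socket; the port only unbinds once every copy is closed.
    for sock in sockets:
        try:
            sock.close()
        except OSError:  # socket.error is an alias of OSError on Python 3
            pass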

Author

@rbtz-openai rbtz-openai Jun 19, 2023

You can observe the behaviour of stalled HTTP requests during shutdown with a server that has a health check HTTP handler and a shutdown lifespan handler:

    @app.on_event("shutdown")
    async def shutdown():
        await asyncio.sleep(60)
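
For completeness, the health check handler mentioned above could be as simple as this (hypothetical route, not part of the PR):

    @app.get("/healthz")
    async def healthz():
        # Keeps answering while the worker's listening socket is still open,
        # which makes the stalled-request window during shutdown visible.
        return {"status": "ok"}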

@Kludex Kludex added and removed the waiting author (Waiting for author's reply) label Jun 17, 2023
@rbtz-openai
Author

@Kludex A friendly ping on this ^

@rbtz-openai rbtz-openai requested a review from Kludex July 13, 2023 19:20
Explicitly close listening sockets in both parent and child processes early in
the shutdown process.  That way, no new connections can hit workers after
shutdown is initiated by the parent process.

While here, speed up shutdown by splitting process shutdown into separate
terminate and join loops, so all processes are shut down in parallel.
Comment on lines -270 to -271
for sock in sockets or []:
sock.close()
Author

This is not needed.

asyncio.loop.create_server specifies:

Note The sock argument transfers ownership of the socket to the server created. To close the socket, call the server’s close() method.

asyncio.Server confirms that with:

close() (https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.Server.close)
Stop serving: close listening sockets and set the sockets attribute to None.
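
A minimal illustration of that ownership transfer (not uvicorn code; the coroutine name is made up):

import asyncio
import socket


async def serve_and_stop(sock: socket.socket) -> None:
    # create_server(sock=...) takes ownership of the socket, so no separate
    # sock.close() is needed: server.close() closes the listening socket and
    # wait_closed() waits for the close to complete.
    server = await asyncio.get_running_loop().create_server(
        asyncio.Protocol, sock=sock
    )
    server.close()
    await server.wait_closed()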

@rbtz-openai rbtz-openai marked this pull request as ready for review July 13, 2023 22:38
@Kludex Kludex added this to the Version 0.24.0 milestone Jul 18, 2023
@rbtz-openai
Author

ref:

@Kludex
Sponsor Member

Kludex commented Apr 13, 2024
