-
Hi, BentoML team! Why did BentoML 1.0 switch from Gunicorn to Circus for process management?
-
Gunicorn mainly manages the processes for Flask, which BentoML was previously based on. Now we've separated the web portion from the model serving portion. We're using Circus at a much lower level to manage the runner/model-serving processes. With the new approach you can control the number of model instances independently from the number of webserver instances. Separately, in order to support async apps, we transitioned from WSGI to ASGI. Instead of Gunicorn and Flask, we're now using Starlette and asyncio, which gives better performance overall. @parano Did I get that right? Any other details you'd like to add or retract? 😂
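
For a rough idea of the ASGI style mentioned above, here's a minimal Starlette endpoint. This is just an illustrative sketch, not BentoML's actual server code; the route and handler names are made up:

```python
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route

# Handlers are native coroutines, so the worker can serve other
# requests while this one awaits I/O (the WSGI model can't do this).
async def predict(request):
    payload = await request.json()  # non-blocking read of the request body
    return JSONResponse({"received": payload})

app = Starlette(routes=[Route("/predict", predict, methods=["POST"])])

# Run with any ASGI server, e.g.: uvicorn myapp:app
```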
-
Thanks, @timliubentoml. Those are the major reasons.
Circus meets the requirements well.
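
To make that concrete, Circus can be embedded as a library and run heterogeneous groups of processes with independent worker counts. A minimal sketch following the pattern from the Circus docs; the commands below are placeholders, not BentoML's actual entry points:

```python
from circus import get_arbiter

# Two independent watchers: API server workers and a model runner pool,
# each with its own process count that can be tuned separately.
arbiter = get_arbiter([
    {"name": "api_server", "cmd": "python api_server.py", "numprocesses": 2},
    {"name": "model_runner", "cmd": "python model_runner.py", "numprocesses": 4},
])
try:
    arbiter.start()  # blocks, supervising and restarting child processes
finally:
    arbiter.stop()
```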
-
Exactly as @bojiang and @timliubentoml answered. Besides wanting to provide proper async support, the main reason is that Gunicorn, like most tools in the Python web development stack, is designed for running multiple homogeneous processes: all workers run identical web-serving code, and the same process is simply forked into multiple workers for vertical scaling.

However, this is not great for ML model serving workloads. A resource-intensive model may limit how many copies fit on one machine, and models sit idle while pre-processing and post-processing code runs, which leads to low resource utilization.

To address this in BentoML 1.0, we introduced the Runner concept: a unit of computation (typically a model) that is scheduled in its own worker pool (or its own Pod in a Yatai/Kubernetes deployment), separate from the API server processes, and can scale independently. Gunicorn doesn't really support this type of architecture, which is why we moved to Circus, which offers lower-level APIs to create and manage multiple processes.
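
To see the Runner concept in code, here's a sketch along the lines of the BentoML 1.0 quickstart. The `iris_clf:latest` tag and the names are illustrative and assume a model was previously saved with `bentoml.sklearn.save_model`:

```python
import bentoml
from bentoml.io import NumpyNdarray

# Wrap a saved model in a Runner: a unit of computation that gets its
# own worker pool, separate from the API server processes.
iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

# The Service owns the API server; the runner scales independently of it.
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def classify(input_array):
    # async_run dispatches inference to the runner's worker pool, so the
    # (Starlette-based) API worker isn't blocked while the model runs.
    return await iris_clf_runner.predict.async_run(input_array)
```

Served with `bentoml serve`, the API workers and the runner pool are separate process groups supervised by Circus, which is exactly the heterogeneous layout Gunicorn's fork-identical-workers model can't express.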
-
@timliubentoml @bojiang @parano