update multinode doc and add a link (#439)
* update multinode doc and add a link

Signed-off-by: jooho lee <jlee@redhat.com>

* Update docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

Co-authored-by: Edgar Hernández <ehernand@redhat.com>
Signed-off-by: Jooho Lee <ljhiyh@gmail.com>

* Update docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Signed-off-by: Jooho Lee <ljhiyh@gmail.com>

* Update docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md

Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Signed-off-by: Jooho Lee <ljhiyh@gmail.com>

* Update README.md

Signed-off-by: Jooho Lee <ljhiyh@gmail.com>

---------

Signed-off-by: jooho lee <jlee@redhat.com>
Signed-off-by: Jooho Lee <ljhiyh@gmail.com>
Co-authored-by: Edgar Hernández <ehernand@redhat.com>
Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
3 people authored Jan 23, 2025
1 parent 70ec953 commit 154e471
Showing 2 changed files with 24 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/modelserving/v1beta1/llm/huggingface/README.md
@@ -33,6 +33,7 @@ The following examples demonstrate how to deploy and perform inference using the
- [Sequence Classification (Text Classification) using distilBERT](text_classification/README.md)
- [Fill Mask using BERT](fill_mask/README.md)
- [SDK Integration](sdk_integration/README.md)
- [Multi-Node Multi-GPU using Ray](multi-node/README.md)

!!! note
The Hugging Face runtime image has the following environment variables set by default:
23 changes: 23 additions & 0 deletions docs/modelserving/v1beta1/llm/huggingface/multi-node/README.md
@@ -52,6 +52,29 @@ You can set a StartupProbe in the ServingRuntime to fit your own situation.
..
~~~
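As a hedged illustration of the point above, a tuned StartupProbe on the ServingRuntime container might look like the sketch below; the endpoint, port, and thresholds are assumptions for illustration, not KServe defaults:
~~~
spec:
  containers:
    - name: kserve-container
      startupProbe:
        httpGet:
          path: /health      # assumed health endpoint of the serving container
          port: 8080
        periodSeconds: 30
        failureThreshold: 40  # allow roughly 20 minutes for large model loading
~~~
Raising `failureThreshold` (rather than `periodSeconds` alone) keeps liveness checks responsive while still tolerating a long model download and load.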
Multi-node setups typically use the `RollingUpdate` deployment strategy, which keeps the existing service operational until the new service becomes Ready. However, this approach requires more than double the resources during a rollout. During development, therefore, the `Recreate` strategy is more appropriate.
~~~
spec:
predictor:
deploymentStrategy:
type: Recreate
model:
modelFormat:
name: huggingface
runtime: kserve-huggingfaceserver-multinode
storageUri: pvc://XXXX
workerSpec: {}
~~~
Additionally, modifying the `PipelineParallelSize` (increasing or decreasing it) affects the existing service because of the default rollout behavior of the Deployment resource. Note that **PipelineParallelSize is not an autoscaling concept**; it determines how many nodes are used to run the model. For this reason, it is strongly recommended not to modify this setting in production environments.
If the `Recreate` deployment strategy is not used and you need to change the `PipelineParallelSize`, the best approach is to delete the existing InferenceService (ISVC) and create a new one with the desired configuration. The same recommendation applies to `TensorParallelSize`: changing it dynamically can also affect the service's stability and performance.
!!! note
To reiterate, **PipelineParallelSize is not a general-purpose autoscaling mechanism**, and changes to it should be handled with caution, especially in production environments.
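For orientation, the sketch below shows where these parallelism settings sit on the InferenceService; the sizes are illustrative assumptions, not recommendations:
~~~
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: kserve-huggingfaceserver-multinode
      storageUri: pvc://XXXX
    workerSpec:
      pipelineParallelSize: 2  # number of nodes running the model; not an autoscaling knob
      tensorParallelSize: 1    # GPUs used per node
~~~
Because these values shape the Deployment itself, changing them triggers the rollout behavior described above rather than a scale-out of the running service.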
## WorkerSpec and ServingRuntime
To enable multi-node/multi-GPU inference, `workerSpec` must be configured in both the ServingRuntime and the InferenceService. The `huggingface-server-multinode` `ServingRuntime` already includes this field and is built on **vLLM**, which supports multi-node/multi-GPU inference. Note that this setup is **not compatible with Triton**.
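As an abridged sketch of the InferenceService side (the container name and resource values are assumptions for illustration), overriding the GPU allocation of each worker could look like:
~~~
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: kserve-huggingfaceserver-multinode
      storageUri: pvc://XXXX
    workerSpec:
      containers:
        - name: worker-container   # assumed name of the runtime's worker container
          resources:
            limits:
              nvidia.com/gpu: "1"  # GPUs requested on each worker node
~~~
An empty `workerSpec: {}`, as in the earlier example, simply opts in to the worker defaults defined by the ServingRuntime.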
