Deployed 154e471 to master with MkDocs 1.6.1 and mike 2.1.3
github-actions[bot] committed Jan 23, 2025
1 parent 2d883ad commit 5c2f95e
Showing 5 changed files with 209 additions and 190 deletions.
1 change: 1 addition & 0 deletions master/modelserving/v1beta1/llm/huggingface/index.html
@@ -1273,6 +1273,7 @@ <h2 id="examples">Examples<a class="headerlink" href="#examples" title="Permanent link">
<li><a href="text_classification/">Sequence Classification (Text Classification) using distilBERT</a></li>
<li><a href="fill_mask/">Fill Mask using BERT</a></li>
<li><a href="sdk_integration/">SDK Integration</a></li>
<li><a href="multi-node/">Multi-Node Multi-GPU using Ray</a></li>
</ul>
<div class="admonition note">
<p class="admonition-title">Note</p>
18 changes: 18 additions & 0 deletions master/modelserving/v1beta1/llm/huggingface/multi-node/index.html
@@ -1331,6 +1331,24 @@ <h3 id="consideration">Consideration<a class="headerlink" href="#consideration" title="Permanent link">
initialDelaySeconds: 20
..
</code></pre></div></p>
<p>Multi-node setups typically use the <code>RollingUpdate</code> deployment strategy, which keeps the existing service operational until the new service becomes Ready. However, this approach temporarily requires more than twice the resources while the old and new services run side by side. During the development phase, the <code>Recreate</code> strategy is therefore more appropriate.</p>
<p><div class="highlight"><pre><span></span><code>spec:
  predictor:
    deploymentStrategy:
      type: Recreate
    model:
      modelFormat:
        name: huggingface
      runtime: kserve-huggingfaceserver-multinode
      storageUri: pvc://XXXX
    workerSpec: {}
</code></pre></div>
Additionally, modifying the <code>PipelineParallelSize</code> (either increasing or decreasing it) triggers a rollout of the underlying Deployment, which can disrupt the existing service. It is important to note that <strong>PipelineParallelSize is not an autoscaling concept</strong>; rather, it determines how many nodes are used to run the model. For this reason, it is strongly recommended not to modify this setting in production environments.</p>
<p>If the <code>Recreate</code> deployment strategy is not used and you need to change the <code>PipelineParallelSize</code>, the best approach is to delete the existing InferenceService (ISVC) and create a new one with the desired configuration, as sketched below. The same recommendation applies to <code>TensorParallelSize</code>, since modifying it dynamically can also affect the service's stability and performance.</p>
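<p>As an illustration, the delete-and-recreate flow might look like the following <code>kubectl</code> sketch; the InferenceService name <code>huggingface-llama3</code>, the namespace <code>kserve-test</code>, and the manifest file name are hypothetical:</p>
<p><div class="highlight"><pre><span></span><code># Delete the existing InferenceService (ISVC)
kubectl delete inferenceservice huggingface-llama3 -n kserve-test

# Re-create it from a manifest that carries the new PipelineParallelSize/TensorParallelSize
kubectl apply -f isvc-new-parallelism.yaml -n kserve-test
</code></pre></div></p>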
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>To reiterate, <strong>PipelineParallelSize is not a general-purpose autoscaling mechanism</strong>, and changes to it should be handled with caution, especially in production environments.</p>
</div>
<h2 id="workerspec-and-servingruntime">WorkerSpec and ServingRuntime<a class="headerlink" href="#workerspec-and-servingruntime" title="Permanent link"></a></h2>
<p>To enable multi-node/multi-GPU inference, <code>workerSpec</code> must be configured in both the ServingRuntime and the InferenceService. The <code>kserve-huggingfaceserver-multinode</code> <code>ServingRuntime</code> already includes this field and is built on <strong>vLLM</strong>, which supports multi-node/multi-GPU inference. Note that this setup is <strong>not compatible with Triton</strong>.</p>
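<p>For illustration, a <code>workerSpec</code> with explicit parallelism settings might look like the sketch below. The field names follow the <code>PipelineParallelSize</code>/<code>TensorParallelSize</code> settings discussed above, while the values and the PVC path are placeholders, not recommendations:</p>
<p><div class="highlight"><pre><span></span><code>spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: kserve-huggingfaceserver-multinode
      storageUri: pvc://XXXX
    workerSpec:
      # Number of nodes used to serve the model; not an autoscaling knob
      pipelineParallelSize: 2
      # Degree of tensor parallelism within each node
      tensorParallelSize: 1
</code></pre></div></p>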
<div class="admonition note">
2 changes: 1 addition & 1 deletion master/search/search_index.json

Large diffs are not rendered by default.
