Deployed 154e471 to master with MkDocs 1.6.1 and mike 2.1.3
github-actions[bot] committed Jan 23, 2025
1 parent 2d883ad commit 5c2f95e
Showing 5 changed files with 209 additions and 190 deletions.
1 change: 1 addition & 0 deletions master/modelserving/v1beta1/llm/huggingface/index.html
@@ -1273,6 +1273,7 @@ <h2 id="examples">Examples<a class="headerlink" href="#examples" title="Permanent link">
<li><a href="text_classification/">Sequence Classification (Text Classification) using distilBERT</a></li>
<li><a href="fill_mask/">Fill Mask using BERT</a></li>
<li><a href="sdk_integration/">SDK Integration</a></li>
<li><a href="multi-node/">Multi-Node Multi-GPU using Ray</a></li>
</ul>
<div class="admonition note">
<p class="admonition-title">Note</p>
18 changes: 18 additions & 0 deletions master/modelserving/v1beta1/llm/huggingface/multi-node/index.html
@@ -1331,6 +1331,24 @@ <h3 id="consideration">Consideration<a class="headerlink" href="#consideration" title="Permanent link">
initialDelaySeconds: 20
..
</code></pre></div></p>
<p>Multi-node setups typically use the <code>RollingUpdate</code> deployment strategy, which keeps the existing service operational until the new service becomes Ready. However, this approach temporarily requires more than twice the resources while the old and new services run side by side. During the development phase, the <code>Recreate</code> strategy is therefore more appropriate.</p>
<p><div class="highlight"><pre><span></span><code>spec:
  predictor:
    deploymentStrategy:
      type: Recreate
    model:
      modelFormat:
        name: huggingface
      runtime: kserve-huggingfaceserver-multinode
      storageUri: pvc://XXXX
    workerSpec: {}
</code></pre></div>
Additionally, modifying the <code>PipelineParallelSize</code> (either increasing or decreasing it) triggers a rollout of the underlying Deployment, which can disrupt the existing service. It is important to note that <strong>PipelineParallelSize is not an autoscaling concept</strong>; rather, it determines how many nodes are used to run the model. For this reason, it is strongly recommended not to modify this setting in production environments.</p>
<p>If the <code>Recreate</code> deployment strategy is not used and you need to change the <code>PipelineParallelSize</code>, the best approach is to delete the existing InferenceService (ISVC) and create a new one with the desired configuration, as sketched below. The same recommendation applies to <code>TensorParallelSize</code>, since modifying it dynamically can also affect the service's stability and performance.</p>
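<p>As an illustration, the delete-and-recreate flow might look like the following <code>kubectl</code> sketch; the InferenceService name <code>huggingface-llama3</code>, the namespace <code>kserve-test</code>, and the manifest file name are hypothetical:</p>
<p><div class="highlight"><pre><span></span><code># Delete the existing InferenceService (ISVC)
kubectl delete inferenceservice huggingface-llama3 -n kserve-test

# Re-create it from a manifest that carries the new PipelineParallelSize/TensorParallelSize
kubectl apply -f isvc-new-parallelism.yaml -n kserve-test
</code></pre></div></p>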
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>To reiterate, <strong>PipelineParallelSize is not a general-purpose autoscaling mechanism</strong>, and changes to it should be handled with caution, especially in production environments.</p>
</div>
<h2 id="workerspec-and-servingruntime">WorkerSpec and ServingRuntime<a class="headerlink" href="#workerspec-and-servingruntime" title="Permanent link"></a></h2>
<p>To enable multi-node/multi-GPU inference, <code>workerSpec</code> must be configured in both the ServingRuntime and the InferenceService. The <code>kserve-huggingfaceserver-multinode</code> <code>ServingRuntime</code> already includes this field and is built on <strong>vLLM</strong>, which supports multi-node/multi-GPU inference. Note that this setup is <strong>not compatible with Triton</strong>.</p>
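<p>For illustration, a <code>workerSpec</code> with explicit parallelism settings might look like the sketch below. The field names follow the <code>PipelineParallelSize</code>/<code>TensorParallelSize</code> settings discussed above, while the values and the PVC path are placeholders, not recommendations:</p>
<p><div class="highlight"><pre><span></span><code>spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: kserve-huggingfaceserver-multinode
      storageUri: pvc://XXXX
    workerSpec:
      # Number of nodes used to serve the model; not an autoscaling knob
      pipelineParallelSize: 2
      # Degree of tensor parallelism within each node
      tensorParallelSize: 1
</code></pre></div></p>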
<div class="admonition note">
2 changes: 1 addition & 1 deletion master/search/search_index.json

Large diffs are not rendered by default.
