diff --git a/docs/blog/articles/2024-12-13-KServe-0.14-release.md b/docs/blog/articles/2024-12-13-KServe-0.14-release.md
index 15c6d1c84..3be13ec03 100644
--- a/docs/blog/articles/2024-12-13-KServe-0.14-release.md
+++ b/docs/blog/articles/2024-12-13-KServe-0.14-release.md
@@ -13,6 +13,7 @@ Inline with the features documented in issue [#3270](https://github.com/kserve/k
 * The clients are asynchronous
 * Support for HTTP/2 (via [httpx](https://www.python-httpx.org/) library)
 * Support Open Inference Protocol v1 and v2
+* Allow clients to send and receive tensor data in binary format for HTTP/REST requests; see the [binary tensor data extension docs](https://kserve.github.io/website/0.14/modelserving/data_plane/binary_tensor_data_extension/).
 
 As usual, the version 0.14.0 of the KServe Python SDK is [published to PyPI](https://pypi.org/project/kserve/0.14.0/) and available to install via `pip install`.
 
@@ -39,57 +40,28 @@ Modelcars is one implementation option for supporting OCI images for model stora
 Using volume mounts based on OCI artifacts is the optimal implementation, but this is only [recently possible since Kubernetes 1.31](https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/) as a native alpha feature. KServe can now evolve to use this new Kubernetes feature.
 
-## Introducing model cache
+## Introducing Model Cache
 
 With models increasing in size, specially true for LLM models, pulling from storage each time a pod is created can result in unmanageable start-up times. Although OCI storage also has the benefit of model caching, the capabilities are not flexible since the management is delegated to the cluster.
 
-The Model Cache was proposed as another alternative to enhance KServe usability with big models, released in KServe v0.14 as an **alpha** feature. It relies on a PV for storing models and it provides control about which models to store in the cache. The feature was designed to mainly to use node Filesystem as storage. Read the [design document for the details](https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit).
+The Model Cache was proposed as another alternative to enhance KServe usability with large models, and it is released in KServe v0.14 as an **alpha** feature.
+In this release, local node storage is used to store models, and the `LocalModelCache` custom resource provides control over which models to store in the cache.
+The local model cache state can always be rebuilt from the models stored on persistent storage such as a model registry or S3.
+Read the [design document for the details](https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit).
 
-The model cache is currently disabled by default. To enable, you need to modify the `localmodel.enabled` field on the `inferenceservice-config` ConfigMap.
+
 
-You start by creating a node group as follows:
+By caching the models, you get the following benefits:
 
-```yaml
-apiVersion: serving.kserve.io/v1alpha1
-kind: LocalModelNodeGroup
-metadata:
-  name: nodegroup1
-spec:
-  persistentVolumeSpec:
-    accessModes:
-      - ReadWriteOnce
-    volumeMode: Filesystem
-    capacity:
-      storage: 2Gi
-    hostPath:
-      path: /models
-      type: ""
-    persistentVolumeReclaimPolicy: Delete
-    storageClassName: standard
-  persistentVolumeClaimSpec:
-    accessModes:
-      - ReadWriteOnce
-    resources:
-      requests:
-        storage: 2Gi
-    storageClassName: standard
-    volumeMode: Filesystem
-    volumeName: kserve
+- Minimize the time it takes for LLM pods to start serving requests.
-```
+- Share the same storage across pods scheduled on the same GPU node.
 
-Then, you can specify to store an cache a model with the following resource:
+- Scale your AI workloads efficiently without worrying about slow model server container startup.
 
-```yaml
-apiVersion: serving.kserve.io/v1alpha1
-kind: ClusterLocalModel
-metadata:
-  name: iris
-spec:
-  modelSize: 1Gi
-  nodeGroup: nodegroup1
-  sourceModelUri: gs://kfserving-examples/models/sklearn/1.0/model
-```
+The model cache is currently disabled by default. To enable it, modify the `localmodel.enabled` field in the `inferenceservice-config` ConfigMap.
+
+You can follow the [local model cache tutorial](../../modelserving/storage/modelcache/localmodel.md) to cache LLMs on the local NVMe drives of your GPU nodes and deploy them with an `InferenceService` that loads the model from the local cache to accelerate container startup.
 
 <!--
 Related tickets:
@@ -122,6 +94,15 @@ Related tickets:
 * Implement Huggingface model download in storage initializer [#3584](https://github.com/kserve/kserve/pull/3584)
 -->
 
+## Hugging Face vLLM backend changes
+
+* Update the vLLM backend to 0.6.1 [#3948](https://github.com/kserve/kserve/pull/3948)
+* Support the `trust_remote_code` flag for vLLM [#3729](https://github.com/kserve/kserve/pull/3729)
+* Support the text embedding task in the Hugging Face server [#3743](https://github.com/kserve/kserve/pull/3743)
+* Add a health endpoint for the vLLM backend [#3850](https://github.com/kserve/kserve/pull/3850)
+* Added the `hostIPC` field to the `ServingRuntime` CRD to support more than one GPU in Serverless mode [#3791](https://github.com/kserve/kserve/issues/3791); a configuration sketch is shown at the end of this post
+* Support a shared memory volume for the vLLM backend [#3910](https://github.com/kserve/kserve/pull/3910)
+
 ## Other Changes
 
 This release also includes several enhancements and changes:
@@ -130,9 +111,10 @@ This release also includes several enhancements and changes:
 * New flag for automount serviceaccount token by [#3979](https://github.com/kserve/kserve/pull/3979)
 * TLS support for inference loggers [#3837](https://github.com/kserve/kserve/issues/3837)
 * Allow PVC storage to be mounted in ReadWrite mode via an annotation [#3687](https://github.com/kserve/kserve/issues/3687)
+* Support passing HTTP headers to KServe Python custom runtimes [#3669](https://github.com/kserve/kserve/pull/3669)
 
 ### What's Changed?
-* Added `hostIPC` field to `ServingRuntime` CRD, for supporting more than one GPU in Serverless mode [#3791](https://github.com/kserve/kserve/issues/3791)
+* Ray is now an optional dependency [#3834](https://github.com/kserve/kserve/pull/3834)
 * Support for Python 3.12 is added, while support Python 3.8 is removed [#3645](https://github.com/kserve/kserve/pull/3645)
 
 For complete details on the new features and updates, visit our [official release notes](https://github.com/kserve/kserve/releases/tag/v0.14.0).
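+
+For illustration, the new `hostIPC` field and the shared memory volume can be combined in a custom `ServingRuntime` for multi-GPU vLLM deployments. The fragment below is a sketch rather than a complete runtime definition: the runtime name, image tag, and shared memory size are placeholders, and field placement is assumed to follow the pod-spec style of the `ServingRuntime` schema.
+
+```yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: kserve-huggingfaceserver-multigpu  # placeholder name
+spec:
+  supportedModelFormats:
+    - name: huggingface
+      version: "1"
+      autoSelect: true
+  # New in v0.14 (#3791): share the host IPC namespace so that a model can be
+  # served across more than one GPU in Serverless mode.
+  hostIPC: true
+  containers:
+    - name: kserve-container
+      image: kserve/huggingfaceserver:latest  # placeholder image tag
+      volumeMounts:
+        - name: shm
+          mountPath: /dev/shm
+  # Shared memory volume for the vLLM backend (#3910).
+  volumes:
+    - name: shm
+      emptyDir:
+        medium: Memory
+        sizeLimit: 2Gi
+```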
diff --git a/docs/images/localmodelcache.png b/docs/images/localmodelcache.png
new file mode 100644
index 000000000..035e67c22
Binary files /dev/null and b/docs/images/localmodelcache.png differ
diff --git a/docs/modelserving/storage/modelcache/jobstoragecontainer.yaml b/docs/modelserving/storage/modelcache/jobstoragecontainer.yaml
new file mode 100644
index 000000000..d45680393
--- /dev/null
+++ b/docs/modelserving/storage/modelcache/jobstoragecontainer.yaml
@@ -0,0 +1,25 @@
+apiVersion: "serving.kserve.io/v1alpha1"
+kind: ClusterStorageContainer
+metadata:
+  name: hf-hub
+spec:
+  container:
+    name: storage-initializer
+    image: kserve/storage-initializer:latest
+    env:
+    - name: HF_TOKEN # authenticate with the HF Hub using the token from hf-secret
+      valueFrom:
+        secretKeyRef:
+          name: hf-secret
+          key: HF_TOKEN
+          optional: false
+    resources:
+      requests:
+        memory: 100Mi
+        cpu: 100m
+      limits:
+        memory: 1Gi
+        cpu: "1"
+  supportedUriFormats:
+    - prefix: hf://
+  workloadType: localModelDownloadJob
diff --git a/docs/modelserving/storage/modelcache/localmodel.md b/docs/modelserving/storage/modelcache/localmodel.md
new file mode 100644
index 000000000..2c3113caa
--- /dev/null
+++ b/docs/modelserving/storage/modelcache/localmodel.md
@@ -0,0 +1,212 @@
+# KServe Local Model Cache
+
+By caching LLMs locally, the `InferenceService` startup time can be greatly improved. For deployments with more than one replica,
+the local persistent volume can serve multiple pods with the warmed-up model cache.
+
+- `LocalModelCache` is a KServe custom resource that specifies which model from persistent storage to cache on the local storage of the Kubernetes nodes.
+- `LocalModelNodeGroup` is a KServe custom resource that manages the node group for caching the models and the local persistent storage.
+- `LocalModelNode` is a KServe custom resource that tracks the status of the models cached on a given local node.
+
+In this example, we demonstrate how to cache models from the HF Hub on the local NVMe disks of Kubernetes nodes.
+
+## Create the LocalModelNodeGroup
+
+Create the `LocalModelNodeGroup` using a local persistent volume that points to the local NVMe volume path.
+
+- The `storageClassName` should be set to `local-storage`.
+- The `nodeAffinity` should specify, via a node selector, which nodes cache the model.
+- The local path should be specified on the PV as the local storage in which models are cached.
+```yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: LocalModelNodeGroup
+metadata:
+  name: workers
+spec:
+  storageLimit: 1.7T
+  persistentVolumeClaimSpec:
+    accessModes:
+      - ReadWriteOnce
+    resources:
+      requests:
+        storage: 1700G
+    storageClassName: local-storage
+    volumeMode: Filesystem
+    volumeName: models
+  persistentVolumeSpec:
+    accessModes:
+      - ReadWriteOnce
+    volumeMode: Filesystem
+    capacity:
+      storage: 1700G
+    local:
+      path: /models
+    nodeAffinity:
+      required:
+        nodeSelectorTerms:
+          - matchExpressions:
+              - key: nvidia.com/gpu-product
+                operator: In
+                values:
+                  - NVIDIA-A100-SXM4-80GB
+```
+
+## Configure Local Model Download Job Namespace
+Before creating the `LocalModelCache` resource to cache the models, you need to make sure the credentials are configured in the download job namespace.
+The download jobs are created in the configured namespace `kserve-localmodel-jobs`. In this example we are caching models from the HF Hub, so the HF token secret should be created beforehand in the same namespace,
+along with the storage container configuration.
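+
+If the download job namespace does not already exist in your cluster, create it first. The command below is a minimal sketch and assumes the default job namespace name used throughout this example.
+
+```bash
+# Create the namespace that hosts the model download jobs and their credentials.
+kubectl create namespace kserve-localmodel-jobs
+```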
+
+Create the HF Hub token secret.
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: hf-secret
+  namespace: kserve-localmodel-jobs
+type: Opaque
+stringData:
+  HF_TOKEN: xxxx # fill in the hf hub token
+```
+
+Create the HF Hub `ClusterStorageContainer`, which references the HF Hub secret.
+
+```yaml
+apiVersion: "serving.kserve.io/v1alpha1"
+kind: ClusterStorageContainer
+metadata:
+  name: hf-hub
+spec:
+  container:
+    name: storage-initializer
+    image: kserve/storage-initializer:latest
+    env:
+    - name: HF_TOKEN # authenticate with the HF Hub using the token from hf-secret
+      valueFrom:
+        secretKeyRef:
+          name: hf-secret
+          key: HF_TOKEN
+          optional: false
+    resources:
+      requests:
+        memory: 100Mi
+        cpu: 100m
+      limits:
+        memory: 1Gi
+        cpu: "1"
+  supportedUriFormats:
+    - prefix: hf://
+  workloadType: localModelDownloadJob
+```
+
+
+## Create the LocalModelCache
+
+Create the `LocalModelCache`, specifying the source model storage URI from which to pre-download the model to the local NVMe volumes and warm up the cache.
+
+- `sourceModelUri` is the persistent storage location from which the model is downloaded for the local cache.
+- `nodeGroups` indicates which node groups should cache the model.
+
+
+```yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: LocalModelCache
+metadata:
+  name: meta-llama3-8b-instruct
+spec:
+  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
+  modelSize: 10Gi
+  nodeGroups:
+    - workers
+```
+
+After the `LocalModelCache` is created, KServe creates a download job on each node in the group to cache the model in local storage.
+
+```bash
+kubectl get jobs meta-llama3-8b-instruct-kind-worker -n kserve-localmodel-jobs
+NAME                                  STATUS     COMPLETIONS   DURATION   AGE
+meta-llama3-8b-instruct-kind-worker   Complete   1/1           4m21s      5d17h
+```
+
+The download job is created using the provisioned PV/PVC.
+```bash
+kubectl get pvc meta-llama3-8b-instruct -n kserve-localmodel-jobs
+NAME                      STATUS   VOLUME                             CAPACITY   ACCESS MODES   STORAGECLASS    VOLUMEATTRIBUTESCLASS   AGE
+meta-llama3-8b-instruct   Bound    meta-llama3-8b-instruct-download   10Gi       RWO            local-storage   <unset>                 9h
+```
+
+## Check the LocalModelCache Status
+
+`LocalModelCache` shows the model download status for each node in the group.
+
+```bash
+kubectl get localmodelcache meta-llama3-8b-instruct -oyaml
+```
+```yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: LocalModelCache
+metadata:
+  name: meta-llama3-8b-instruct
+spec:
+  modelSize: 10Gi
+  nodeGroups:
+    - workers
+  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
+status:
+  copies:
+    available: 1
+    total: 1
+  nodeStatus:
+    kind-worker: NodeDownloaded
+```
+
+`LocalModelNode` shows the download status of each model expected to be cached on the given node.
+
+```bash
+kubectl get localmodelnode kind-worker -oyaml
+```
+
+```yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: LocalModelNode
+metadata:
+  name: kind-worker
+spec:
+  localModels:
+    - modelName: meta-llama3-8b-instruct
+      sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
+status:
+  modelStatus:
+    meta-llama3-8b-instruct: ModelDownloaded
+```
+
+## Deploy InferenceService using the LocalModelCache
+
+Finally, you can deploy the LLM with an `InferenceService` that uses the local model cache, as long as the model was previously cached
+by a `LocalModelCache` resource whose `sourceModelUri` matches the `storageUri` of the `InferenceService`.
+
+The model cache is currently disabled by default. To enable it, modify the `localmodel.enabled` field in the `inferenceservice-config` ConfigMap, as sketched below.
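+
+A minimal way to enable the feature is sketched below. The exact key layout inside `inferenceservice-config` can differ between KServe versions, so treat this as an assumption and verify it against the ConfigMap shipped with your installation.
+
+```bash
+# Open the KServe configuration and set "enabled": true in the localmodel section.
+kubectl edit configmap inferenceservice-config -n kserve
+# The controller reads this ConfigMap at startup, so you may need to restart the
+# kserve-controller-manager deployment for the change to take effect.
+```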
+ +=== "Yaml" + + ```yaml + kubectl apply -f - <<EOF + apiVersion: serving.kserve.io/v1beta1 + kind: InferenceService + metadata: + name: huggingface-llama3 + spec: + predictor: + model: + modelFormat: + name: huggingface + args: + - --model_name=llama3 + - --model_id=meta-llama/meta-llama-3-8b-instruct + storageUri: hf://meta-llama/meta-llama-3-8b-instruct + resources: + limits: + cpu: "6" + memory: 24Gi + nvidia.com/gpu: "1" + requests: + cpu: "6" + memory: 24Gi + nvidia.com/gpu: "1" + EOF + ``` \ No newline at end of file diff --git a/docs/modelserving/storage/modelcache/localmodelcache.yaml b/docs/modelserving/storage/modelcache/localmodelcache.yaml new file mode 100644 index 000000000..7e42c7f6c --- /dev/null +++ b/docs/modelserving/storage/modelcache/localmodelcache.yaml @@ -0,0 +1,9 @@ +apiVersion: serving.kserve.io/v1alpha1 +kind: LocalModelCache +metadata: + name: meta-llama3-8b-instruct +spec: + sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct + modelSize: 10Gi + nodeGroups: + - workers diff --git a/docs/modelserving/storage/modelcache/secret.yaml b/docs/modelserving/storage/modelcache/secret.yaml new file mode 100644 index 000000000..39e3a8b8a --- /dev/null +++ b/docs/modelserving/storage/modelcache/secret.yaml @@ -0,0 +1,8 @@ +apiVersion: v1 +kind: Secret +metadata: + name: hf-secret + namespace: kserve-localmodel-jobs +type: Opaque +stringData: + HF_TOKEN: xxxx # fill in the hf hub token diff --git a/docs/modelserving/storage/modelcache/storage.yaml b/docs/modelserving/storage/modelcache/storage.yaml new file mode 100644 index 000000000..2ddb89da2 --- /dev/null +++ b/docs/modelserving/storage/modelcache/storage.yaml @@ -0,0 +1,33 @@ +apiVersion: serving.kserve.io/v1alpha1 +kind: LocalModelNodeGroup +metadata: + name: workers +spec: + storageLimit: 10Gi + persistentVolumeClaimSpec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 10Gi + storageClassName: local-storage + volumeMode: Filesystem + volumeName: models + persistentVolumeSpec: + accessModes: + - ReadWriteOnce + volumeMode: Filesystem + capacity: + storage: 10Gi + local: + path: /models + persistentVolumeReclaimPolicy: Delete + storageClassName: local-storage + nodeAffinity: + required: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: In + values: + - kind-worker diff --git a/mkdocs.yml b/mkdocs.yml index 7e07f95f8..f82030c64 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -67,14 +67,15 @@ nav: - Image classification inference graph: modelserving/inference_graph/image_pipeline/README.md - Model Storage: - Storage Containers: modelserving/storage/storagecontainers.md + - Configure CA Certificate: modelserving/certificate/kserve.md - Azure: modelserving/storage/azure/azure.md - PVC: modelserving/storage/pvc/pvc.md - S3: modelserving/storage/s3/s3.md - OCI: modelserving/storage/oci.md - URI: modelserving/storage/uri/uri.md - - CA Certificate: modelserving/certificate/kserve.md - GCS: modelserving/storage/gcs/gcs.md - Hugging Face: modelserving/storage/huggingface/hf.md + - Model Cache: modelserving/storage/modelcache/localmodel.md - Model Explainability: - Concept: modelserving/explainer/explainer.md - TrustyAI Explainer: modelserving/explainer/trustyai/README.md @@ -122,8 +123,6 @@ nav: - KServe 0.9 Release: blog/articles/2022-07-21-KServe-0.9-release.md - KServe 0.8 Release: blog/articles/2022-02-18-KServe-0.8-release.md - KServe 0.7 Release: blog/articles/2021-10-11-KServe-0.7-release.md - - Articles: - - KFserving Transition: 
blog/articles/2021-09-27-kfserving-transition.md - Community: - How to Get Involved: community/get_involved.md - Adopters: community/adopters.md diff --git a/overrides/main.html b/overrides/main.html index fd161d471..8ad8ecc9e 100644 --- a/overrides/main.html +++ b/overrides/main.html @@ -2,6 +2,6 @@ {% block announce %} <h1> - <b>KServe v0.13 is Released</b>, <a href="/website/0.13/blog/articles/2024-05-15-KServe-0.13-release/">Read blog >></a> + <b>KServe v0.14 is Released</b>, <a href="/website/0.14/blog/articles/2024-12-13-KServe-0.14-release/">Read blog >></a> </h1> {% endblock %}