Enhancements to LLM Instance Gateway: Scheduling Logic, and Documentation Updates #78

Merged: 17 commits into kubernetes-sigs:main on Dec 10, 2024

Conversation

@kaushikmitr (Contributor) commented Dec 8, 2024

This pull request includes updates to the documentation and significant enhancements to the request scheduling and handling logic in the pkg/ext-proc package. The most important changes involve refining the handling of response headers, improving the scheduling algorithms for better load balancing and resource utilization, and updating the README file to include a new scheduling flowchart.

Documentation Updates:

  • pkg/README.md: Updated the deployment steps to include new configurations for LLM Service and LLMServerPool, and restructured the steps for deploying the gateway and ext-proc. Added a new section on the scheduling package and included a flowchart for the scheduling algorithm. [1] [2] [3]

Response Handling Enhancements:

  • pkg/ext-proc/handlers/response.go: Modified the HandleResponseHeaders method to include additional headers when a target pod is specified, and added a fallback response for cases where no target pod is provided. [1] [2]

Scheduling Algorithm Improvements:

  • pkg/ext-proc/scheduling/filter.go: Added new filter functions lowQueuingFilterFunc, loRAAffinityPredicate, and minLoRAPredicate to enhance the scheduling logic for better load balancing and resource utilization. [1] [2]
  • pkg/ext-proc/scheduling/scheduler.go: Introduced new queueing thresholds and updated the scheduling filters to incorporate the new filter functions. Added lowLatencyFilterModified to dynamically adjust scheduling based on queueing thresholds. [1] [2] [3] (A sketch of the filter node structure follows this list.)
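
For readers new to the package, the scheduling decision tree is built from filter nodes roughly like the sketch below. This is a minimal, self-contained sketch inferred from the snippets quoted later in this review; the stand-in types and exact field layout are assumptions, not the actual pkg/ext-proc source.

package scheduling

// Stand-in types so the sketch compiles on its own; the real definitions
// live elsewhere in pkg/ext-proc.
type LLMRequest struct{ ResolvedTargetModel string }
type PodMetrics struct{ WaitingQueueSize int }

// filterFunc narrows the candidate pod set for a request.
type filterFunc func(req *LLMRequest, pods []*PodMetrics) ([]*PodMetrics, error)

// filter is one node of the decision tree: it applies filterFunc, then
// continues down a branch depending on whether any pods survived.
type filter struct {
	name                   string
	filter                 filterFunc
	nextOnSuccess          *filter // followed when at least one pod remains
	nextOnFailure          *filter // followed when no pods remain
	nextOnSuccessOrFailure *filter // followed in either case
}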

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 8, 2024
@liu-cong (Contributor) commented Dec 9, 2024

It'll be easier to split the algo change and the user guide change (I think they are not coupled). And we can get the user guide update checked in much faster.

Did you run some benchmarks to see how the updated algo improves things?

@kaushikmitr changed the title from "Enhancements to LLM Instance Gateway: Scheduling Logic, Manifests, and Documentation Updates" to "Enhancements to LLM Instance Gateway: Scheduling Logic, and Documentation Updates" on Dec 9, 2024
@kaushikmitr (Contributor Author):
This PR only has the filter-related changes. The manifest updates are in a separate PR: #81

@kaushikmitr (Contributor Author):
Yes, we have new benchmarks based on the updated scheduling logic shared internally

@kfswain (Contributor) commented Dec 9, 2024

> Yes, we have new benchmarks based on the updated scheduling logic shared internally

Can we share something external? This is an OSS repo, and anyone in the future that wants to follow the story of this PR should be able to have all the context

@kaushikmitr (Contributor Author):
> Yes, we have new benchmarks based on the updated scheduling logic shared internally

> Can we share something external? This is an OSS repo, and anyone in the future that wants to follow the story of this PR should be able to have all the context

Yes, I absolutely agree. We should definitely have an external version of our benchmarking docs accompanying this PR, and the OSS repo in general. Let's discuss more.

pkg/ext-proc/scheduling/filter.go (outdated, resolved)
pkg/ext-proc/scheduling/scheduler.go (resolved)
pkg/ext-proc/scheduling/filter.go (outdated, resolved)
nextOnFailure: &filter{
name: "min cost LoRA",
filter: toFilterFunc(minLoRAPredicate),
nextOnSuccessOrFailure: lowLatencyFilterNoLoRA,
Contributor:
I wonder what happens if you just use the lowLatencyFilterLoRA filter. If that works well, then we don't need the lowLatencyFilterNoLoRA. It will make the code much cleaner.

Contributor Author:
lowLatencyFilterLoRA is needed when we first prioritize LoRA affinity, followed by lowLatencyFilterNoLoRA (queueing + least KV cache). We could probably reuse lowLatencyFilterLoRA, but it would be very confusing.

Contributor:
I had the same thought; I was also expecting that we would reuse lowLatencyFilterLoRA, so this filter would look like:

lowLatencyFilterModified = &filter{
	name:                   "low queueing filter",
	filter:                 toFilterFunc(lowQueueingPodPredicate),
	nextOnSuccessOrFailure: lowLatencyFilterLoRA,
}

Contributor Author:
I see, so lowLatencyFilterLoRA prioritized queuing over LoRA affinity (this is how we had it originally), i.e. Least Queueing -> Min Cost LoRA -> Least KV Cache.

Contributor:
Oh, I just noticed that we flip the order here compared to lowLatencyFilterLoRA:

Here we do: LoRA -> queue length -> KV cache
There we do: queue length -> LoRA -> KV cache

Is this by design?

Contributor Author:
Yes, I realized the names are very confusing, so I renamed the filters. And yes, it's by design, to make LoRA affinity stronger.

examples/poc/manifests/vllm/vllm-lora-deployment.yaml (outdated, resolved)
@kaushikmitr (Contributor Author):
> Yes, we have new benchmarks based on the updated scheduling logic shared internally

> Can we share something external? This is an OSS repo, and anyone in the future that wants to follow the story of this PR should be able to have all the context

> Yes, I absolutely agree. We should definitely have an external version of our benchmarking docs accompanying this PR, and the OSS repo in general. Let's discuss more.

@kfswain @ahg-g: created this issue for the external benchmarking doc: #88

pkg/README.md (outdated)

1. **Update Envoy Gateway Config to enable Patch Policy**
2. **Deploy LLM Service and LLMServerPool**
Contributor:
If you keep them as "1.", they will automatically be numbered as a sequence.

Contributor:
Please revert to the "1." format so that the list numbers are set automatically.

Key: "x-went-into-resp-headers",
RawValue: []byte("true"),
var resp *extProcPb.ProcessingResponse
if reqCtx.TargetPod != nil {
Contributor:
Why do we need to set the targetPod in the response header?

Also, if we must, then you can do this:

headers := []*configPb.HeaderValueOption{
	{
		Header: &configPb.HeaderValue{
			// This is for debugging purpose only.
			Key:      "x-went-into-resp-headers",
			RawValue: []byte("true"),
		},
	},
}

if reqCtx.TargetPod != nil {
	headers = append(headers, &configPb.HeaderValueOption{
		Header: &configPb.HeaderValue{
			Key:      "x-target-pod",
			RawValue: []byte(reqCtx.TargetPod.Name),
		},
	})
}

resp = &extProcPb.ProcessingResponse{
	.....
	SetHeaders: headers,
}

Contributor Author:
This is purely for debugging purposes, not really needed. I thought it might be useful to the user.

Contributor:
OK, can you refactor the code as suggested above, please?

Contributor Author:
Let me remove it for now; this change is unrelated to the main goal of this PR.

@@ -121,6 +121,11 @@ func leastQueuingFilterFunc(req *LLMRequest, pods []*backend.PodMetrics) ([]*bac
return filtered, nil
}

// loRAAffinityPredicate is a filter function to check whether a pod has affinity to the lora requested.
Contributor:
update the comment

nextOnFailure: sheddableRequestFilter,
}

// lowLatencyFilter tries to minimize the latency. The heuristic is to pick a server with lower
// cost to load an adapter and has low KV cache, which typically yields lower latency.
lowLatencyFilter = &filter{
lowLatencyFilterLoRA = &filter{
name: "least queuing",
Contributor:
update the name?

Contributor Author:
I changed the filter names to be more descriptive.


@@ -29,7 +33,7 @@ var (

// lowLatencyFilter tries to minimize the latency. The heuristic is to pick a server with lower
Contributor:
update the comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 10, 2024
filter: toFilterFunc((lowQueueingPodPredicate)),
nextOnSuccess: &filter{
name: "affinity LoRA",
filter: toFilterFunc(loRAAffinityPredicate),
Contributor:
Why not use lowLoRACostPredicate with nextOnSuccessOrFailure: queueAndKVCacheFilter instead of doing loRAAffinityPredicate and canAcceptNewLoraPredicate separately?

Contributor Author:
lowLoRACostPredicate picks pods that satisfy either canAcceptNewLoraPredicate or loRAAffinityPredicate. For stronger affinity we want to pick only pods that satisfy loRAAffinityPredicate, and only if no such pod is present fall back to canAcceptNewLoraPredicate.

Contributor:
Why not do that for the other branch too then?

Contributor Author (@kaushikmitr, Dec 10, 2024):
The lowLoRACostPredicate ensures weak affinity by spreading the load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to a single pod. This gave good performance in our initial benchmarking results in the scenario where the number of LoRA slots > the number of LoRA adapters. loRAAffinityPredicate, on the other hand, ensures strong affinity, i.e. it pins requests to a single pod with that adapter. Depending on the scenario, one or the other might be better.

Contributor:
Can we document this reasoning please?

Contributor Author:
I added a comment to lowLoRACostPredicate with the reasoning, like we have in leastKVCacheFilterFunc.
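
To make the weak vs. strong affinity distinction in this thread concrete, here is a minimal sketch of the three predicates as described above. The stand-in types and field names such as ActiveModels and MaxActiveModels are illustrative assumptions, not the actual backend.PodMetrics fields.

package scheduling

// Stand-in types so the sketch compiles on its own.
type LLMRequest struct{ ResolvedTargetModel string }
type PodMetrics struct {
	ActiveModels    map[string]int // adapters currently loaded on the pod
	MaxActiveModels int            // adapter slots available on the pod
}

// Strong affinity: keep only pods that already have the requested adapter loaded.
func loRAAffinityPredicate(req *LLMRequest, pod *PodMetrics) bool {
	_, ok := pod.ActiveModels[req.ResolvedTargetModel]
	return ok
}

// Room to load: keep pods that can take one more adapter.
func canAcceptNewLoraPredicate(req *LLMRequest, pod *PodMetrics) bool {
	return len(pod.ActiveModels) < pod.MaxActiveModels
}

// Weak affinity: either of the above, which spreads an adapter's load across
// multiple pods instead of pinning all requests to a single pod.
func lowLoRACostPredicate(req *LLMRequest, pod *PodMetrics) bool {
	return loRAAffinityPredicate(req, pod) || canAcceptNewLoraPredicate(req, pod)
}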

@ahg-g (Contributor) commented Dec 10, 2024

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Dec 10, 2024
nextOnFailure: &filter{
name: "can accept LoRA Adapter",
filter: toFilterFunc(canAcceptNewLoraPredicate),
nextOnSuccessOrFailure: queueAndKVCacheFilter,
Contributor:
I think if we replace queueAndKVCacheFilter here and 4 lines above with queueLoRAAndKVCacheFilter, the effect should be the same? queueLoRAAndKVCacheFilter will add the lowCostLoRA filter in between, but given the pods are already filtered by LoRA affinity, it should be a no-op.

This will simplify the code, though at the cost of potentially more confusion with the no-op step. It's up to you.

Contributor Author:
I agree, but I also think this would make it more confusing. Also, I think queueAndKVCacheFilter is something we might need in the future. For example, when the request does not need a LoRA adapter, we can directly apply queueAndKVCacheFilter instead of checking for LoRA affinity.

@liu-cong (Contributor):
/lgtm

Thanks for the deep dives and for improving the algo! Unfortunately the decision tree is getting more complex, and hopefully we can find ways to simplify it in the future.

One option is to at least simplify the non-LoRA mode. We can add a flag to indicate whether LoRA is enabled or not. Another reason to do this is that vLLM doesn't expose LoRA metrics when LoRA is not enabled, which resulted in noisy error logs.
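
A minimal sketch of the kind of switch described above; the flag name and wiring are hypothetical, not part of this PR or the existing ext-proc flags.

package main

import (
	"flag"
	"fmt"
)

// Hypothetical flag: when LoRA is disabled, the scheduler could skip the
// LoRA branches of the decision tree and the metrics scraper could skip
// LoRA metrics entirely, avoiding the noisy error logs mentioned above.
var enableLoRA = flag.Bool("enable-lora", false,
	"Enable LoRA-aware scheduling filters and LoRA metrics scraping.")

func main() {
	flag.Parse()
	mode := "queue + KV-cache filters only (no LoRA metrics scraped)"
	if *enableLoRA {
		mode = "full LoRA-aware decision tree"
	}
	fmt.Println("scheduler mode:", mode)
}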

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2024
// model server has room to load the adapter
// model server has room to load the adapter. The lowLoRACostPredicate ensures weak affinity by spreading the
// load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to a single pod.
// This gave good performance in our initial benchmarking results in the scenario where # of lora slots > # of lora adapters.
func lowLoRACostPredicate(req *LLMRequest, pod *backend.PodMetrics) bool {
Contributor:
Leaving this comment here, but it doesn't need to be addressed in this PR.

We can potentially refactor this predicate to prefer affinity first, then fall back to canAcceptNewLoRA if no affinity is found. In that case we should be able to consolidate many of the decision tree's branches. It will of course need some benchmarking to see the impact.
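
A rough sketch of what that consolidation might look like, written against stand-in types; the function name loRASoftAffinityFilter and the fields are hypothetical, and as noted above any such change would need benchmarking.

package scheduling

// Stand-in types so the sketch compiles on its own.
type LLMRequest struct{ ResolvedTargetModel string }
type PodMetrics struct {
	ActiveModels    map[string]int
	MaxActiveModels int
}

// loRASoftAffinityFilter keeps pods that already have the requested adapter
// loaded; only if none do, it falls back to pods that can accept a new
// adapter. This folds affinity and canAcceptNewLoRA into one filter step.
func loRASoftAffinityFilter(req *LLMRequest, pods []*PodMetrics) []*PodMetrics {
	affinity := []*PodMetrics{}
	canLoad := []*PodMetrics{}
	for _, pod := range pods {
		if _, ok := pod.ActiveModels[req.ResolvedTargetModel]; ok {
			affinity = append(affinity, pod)
		} else if len(pod.ActiveModels) < pod.MaxActiveModels {
			canLoad = append(canLoad, pod)
		}
	}
	if len(affinity) > 0 {
		return affinity
	}
	return canLoad
}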

@ahg-g (Contributor) commented Dec 10, 2024

/approve
/lgtm

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kaushikmitr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit 5372efb into kubernetes-sigs:main Dec 10, 2024
2 checks passed