Enhancements to LLM Instance Gateway: Scheduling Logic, and Documentation Updates #78

Merged: 17 commits into kubernetes-sigs:main on Dec 10, 2024

Conversation

@kaushikmitr (Contributor) commented Dec 8, 2024

This pull request includes updates to the documentation and significant enhancements to the request scheduling and handling logic in the pkg/ext-proc package. The most important changes involve refining the handling of response headers, improving the scheduling algorithms for better load balancing and resource utilization, and updating the README file to include a new scheduling flowchart.

Documentation Updates:

  • pkg/README.md: Updated the deployment steps to include new configurations for LLM Service and LLMServerPool, and restructured the steps for deploying the gateway and ext-proc. Added a new section on the scheduling package and included a flowchart for the scheduling algorithm. [1] [2] [3]

Response Handling Enhancements:

  • pkg/ext-proc/handlers/response.go: Modified the HandleResponseHeaders method to include additional headers when a target pod is specified, and added a fallback response for cases where no target pod is provided. [1] [2]

Scheduling Algorithm Improvements:

  • pkg/ext-proc/scheduling/filter.go: Added new filter functions lowQueuingFilterFunc, loRAAffinityPredicate, and minLoRAPredicate to enhance the scheduling logic for better load balancing and resource utilization. [1] [2]
  • pkg/ext-proc/scheduling/scheduler.go: Introduced new queueing thresholds and updated the scheduling filters to incorporate the new filter functions. Added lowLatencyFilterModified to dynamically adjust scheduling based on queueing thresholds. [1] [2] [3] (A sketch of the filter node structure follows this list.)
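
For readers new to the package, the scheduling decision tree is built from filter nodes roughly like the sketch below. This is a minimal, self-contained sketch inferred from the snippets quoted later in this review; the stand-in types and exact field layout are assumptions, not the actual pkg/ext-proc source.

package scheduling

// Stand-in types so the sketch compiles on its own; the real definitions
// live elsewhere in pkg/ext-proc.
type LLMRequest struct{ ResolvedTargetModel string }
type PodMetrics struct{ WaitingQueueSize int }

// filterFunc narrows the candidate pod set for a request.
type filterFunc func(req *LLMRequest, pods []*PodMetrics) ([]*PodMetrics, error)

// filter is one node of the decision tree: it applies filterFunc, then
// continues down a branch depending on whether any pods survived.
type filter struct {
	name                   string
	filter                 filterFunc
	nextOnSuccess          *filter // followed when at least one pod remains
	nextOnFailure          *filter // followed when no pods remain
	nextOnSuccessOrFailure *filter // followed in either case
}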

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 8, 2024
@liu-cong (Contributor) commented Dec 9, 2024

It'll be easier to split the algo change and the user guide change (I think they are not coupled). And we can get the user guide update checked in much faster.

Did you run some benchmarks to see how the updated algo improves things?

@kaushikmitr changed the title from "Enhancements to LLM Instance Gateway: Scheduling Logic, Manifests, and Documentation Updates" to "Enhancements to LLM Instance Gateway: Scheduling Logic, and Documentation Updates" on Dec 9, 2024
@kaushikmitr (Contributor Author):
This PR only has the filter-related changes. The manifest updates are in a separate PR: #81

@kaushikmitr (Contributor Author):
Yes, we have new benchmarks based on the updated scheduling logic shared internally

@kfswain (Contributor) commented Dec 9, 2024

> Yes, we have new benchmarks based on the updated scheduling logic shared internally

Can we share something external? This is an OSS repo, and anyone in the future that wants to follow the story of this PR should be able to have all the context

@kaushikmitr (Contributor Author):
> Yes, we have new benchmarks based on the updated scheduling logic shared internally

> Can we share something external? This is an OSS repo, and anyone in the future that wants to follow the story of this PR should be able to have all the context

Yes, I absolutely agree. We should definitely have an external version of our benchmarking docs accompanying this PR, and the OSS repo in general. Let's discuss more.

pkg/ext-proc/scheduling/filter.go (outdated, resolved)
pkg/ext-proc/scheduling/scheduler.go (resolved)
pkg/ext-proc/scheduling/filter.go (outdated, resolved)
nextOnFailure: &filter{
name: "min cost LoRA",
filter: toFilterFunc(minLoRAPredicate),
nextOnSuccessOrFailure: lowLatencyFilterNoLoRA,
Contributor:
I wonder what happens if you just use the lowLatencyFilterLoRA filter. If that works well, then we don't need the lowLatencyFilterNoLoRA. It will make the code much cleaner.

Contributor Author:
lowLatencyFilterLoRA is needed when we first prioritize LoRA affinity, followed by lowLatencyFilterNoLoRA (queueing + least KV cache). We could probably reuse lowLatencyFilterLoRA, but it would be very confusing.

Contributor:
I had the same thought; I was also expecting that we would reuse lowLatencyFilterLoRA, so this filter would look like:

lowLatencyFilterModified = &filter{
	name:                   "low queueing filter",
	filter:                 toFilterFunc(lowQueueingPodPredicate),
	nextOnSuccessOrFailure: lowLatencyFilterLoRA,
}

Contributor Author:
I see, so lowLatencyFilterLoRA prioritized queuing over LoRA affinity (this is how we had it originally), i.e. Least Queueing -> Min Cost LoRA -> Least KV Cache.

Contributor:
Oh, I just noticed that we flip the order here compared to lowLatencyFilterLoRA:

Here we do: LoRA -> queue length -> KV cache
There we do: queue length -> LoRA -> KV cache

Is this by design?

Contributor Author:
Yes, I realized the names are very confusing, so I renamed the filters. And yes, it's by design, to make LoRA affinity stronger.

examples/poc/manifests/vllm/vllm-lora-deployment.yaml (outdated, resolved)
@kaushikmitr (Contributor Author):
> Yes, we have new benchmarks based on the updated scheduling logic shared internally

> Can we share something external? This is an OSS repo, and anyone in the future that wants to follow the story of this PR should be able to have all the context

> Yes, I absolutely agree. We should definitely have an external version of our benchmarking docs accompanying this PR, and the OSS repo in general. Let's discuss more.

@kfswain @ahg-g: created this issue for the external benchmarking doc: #88

pkg/README.md (outdated)

1. **Update Envoy Gateway Config to enable Patch Policy**
2. **Deploy LLM Service and LLMServerPool**
Contributor:
If you keep them as "1.", they will automatically be numbered as a sequence.

Contributor:
Please revert to the "1." format so that the list numbers are set automatically.

Key: "x-went-into-resp-headers",
RawValue: []byte("true"),
var resp *extProcPb.ProcessingResponse
if reqCtx.TargetPod != nil {
Contributor:
Why do we need to set the targetPod in the response header?

Also, if we must, then you can do this:

headers := []*configPb.HeaderValueOption{
	{
		Header: &configPb.HeaderValue{
			// This is for debugging purpose only.
			Key:      "x-went-into-resp-headers",
			RawValue: []byte("true"),
		},
	},
}

if reqCtx.TargetPod != nil {
	headers = append(headers, &configPb.HeaderValueOption{
		Header: &configPb.HeaderValue{
			Key:      "x-target-pod",
			RawValue: []byte(reqCtx.TargetPod.Name),
		},
	})
}

resp = &extProcPb.ProcessingResponse{
	.....
	SetHeaders: headers,
}

Contributor Author:
This is purely for debugging purposes, not really needed. I thought it might be useful to the user.

Contributor:
OK, can you refactor the code as suggested above, please?

Contributor Author:
Let me remove it for now; this change is unrelated to the main goal of this PR.

@@ -121,6 +121,11 @@ func leastQueuingFilterFunc(req *LLMRequest, pods []*backend.PodMetrics) ([]*bac
return filtered, nil
}

// loRAAffinityPredicate is a filter function to check whether a pod has affinity to the lora requested.
Contributor:
update the comment

nextOnFailure: sheddableRequestFilter,
}

// lowLatencyFilter tries to minimize the latency. The heuristic is to pick a server with lower
// cost to load an adapter and has low KV cache, which typically yields lower latency.
lowLatencyFilter = &filter{
lowLatencyFilterLoRA = &filter{
name: "least queuing",
Contributor:
update the name?

Contributor Author:
I changed the filter names to be more descriptive.


@@ -29,7 +33,7 @@ var (

// lowLatencyFilter tries to minimize the latency. The heuristic is to pick a server with lower
Contributor:
update the comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 10, 2024
filter: toFilterFunc((lowQueueingPodPredicate)),
nextOnSuccess: &filter{
name: "affinity LoRA",
filter: toFilterFunc(loRAAffinityPredicate),
Contributor:
Why not use lowLoRACostPredicate with nextOnSuccessOrFailure: queueAndKVCacheFilter instead of doing loRAAffinityPredicate and canAcceptNewLoraPredicate separately?

Contributor Author:
lowLoRACostPredicate picks pods that satisfy either canAcceptNewLoraPredicate or loRAAffinityPredicate. For stronger affinity we want to pick only pods that satisfy loRAAffinityPredicate, and only if no such pod is present fall back to canAcceptNewLoraPredicate.

Contributor:
Why not do that for the other branch too then?

Contributor Author (@kaushikmitr, Dec 10, 2024):
The lowLoRACostPredicate ensures weak affinity by spreading the load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to a single pod. This gave good performance in our initial benchmarking results in the scenario where the number of LoRA slots > the number of LoRA adapters. loRAAffinityPredicate, on the other hand, ensures strong affinity, i.e. it pins requests to a single pod with that adapter. Depending on the scenario, one or the other might be better.

Contributor:
Can we document this reasoning please?

Contributor Author:
I added a comment to lowLoRACostPredicate with the reasoning, like we have in leastKVCacheFilterFunc.
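
To make the weak vs. strong affinity distinction in this thread concrete, here is a minimal sketch of the three predicates as described above. The stand-in types and field names such as ActiveModels and MaxActiveModels are illustrative assumptions, not the actual backend.PodMetrics fields.

package scheduling

// Stand-in types so the sketch compiles on its own.
type LLMRequest struct{ ResolvedTargetModel string }
type PodMetrics struct {
	ActiveModels    map[string]int // adapters currently loaded on the pod
	MaxActiveModels int            // adapter slots available on the pod
}

// Strong affinity: keep only pods that already have the requested adapter loaded.
func loRAAffinityPredicate(req *LLMRequest, pod *PodMetrics) bool {
	_, ok := pod.ActiveModels[req.ResolvedTargetModel]
	return ok
}

// Room to load: keep pods that can take one more adapter.
func canAcceptNewLoraPredicate(req *LLMRequest, pod *PodMetrics) bool {
	return len(pod.ActiveModels) < pod.MaxActiveModels
}

// Weak affinity: either of the above, which spreads an adapter's load across
// multiple pods instead of pinning all requests to a single pod.
func lowLoRACostPredicate(req *LLMRequest, pod *PodMetrics) bool {
	return loRAAffinityPredicate(req, pod) || canAcceptNewLoraPredicate(req, pod)
}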

@ahg-g (Contributor) commented Dec 10, 2024

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Dec 10, 2024
nextOnFailure: &filter{
name: "can accept LoRA Adapter",
filter: toFilterFunc(canAcceptNewLoraPredicate),
nextOnSuccessOrFailure: queueAndKVCacheFilter,
Contributor:
I think if we replace queueAndKVCacheFilter here and 4 lines above with queueLoRAAndKVCacheFilter, the effect should be the same? queueLoRAAndKVCacheFilter will add the lowCostLoRA filter in between, but given the pods are already filtered by LoRA affinity, it should be a no-op.

This will simplify the code, though at the cost of potentially more confusion with the no-op step. It's up to you.

Contributor Author:
I agree, but I also think this would make it more confusing. Also, I think queueAndKVCacheFilter is something we might need in the future. For example, when the request does not need a LoRA adapter, we can directly apply queueAndKVCacheFilter instead of checking for LoRA affinity.

@liu-cong (Contributor):
/lgtm

Thanks for the deep dives and for improving the algo! Unfortunately the decision tree is getting more complex, and hopefully we can find ways to simplify it in the future.

One option is to at least simplify the non-LoRA mode. We can add a flag to indicate whether LoRA is enabled or not. Another reason to do this is that vLLM doesn't expose LoRA metrics when LoRA is not enabled, which resulted in noisy error logs.
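
A minimal sketch of the kind of switch described above; the flag name and wiring are hypothetical, not part of this PR or the existing ext-proc flags.

package main

import (
	"flag"
	"fmt"
)

// Hypothetical flag: when LoRA is disabled, the scheduler could skip the
// LoRA branches of the decision tree and the metrics scraper could skip
// LoRA metrics entirely, avoiding the noisy error logs mentioned above.
var enableLoRA = flag.Bool("enable-lora", false,
	"Enable LoRA-aware scheduling filters and LoRA metrics scraping.")

func main() {
	flag.Parse()
	mode := "queue + KV-cache filters only (no LoRA metrics scraped)"
	if *enableLoRA {
		mode = "full LoRA-aware decision tree"
	}
	fmt.Println("scheduler mode:", mode)
}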

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2024
// model server has room to load the adapter
// model server has room to load the adapter. The lowLoRACostPredicate ensures weak affinity by spreading the
// load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to a single pod.
// This gave good performance in our initial benchmarking results in the scenario where # of lora slots > # of lora adapters.
func lowLoRACostPredicate(req *LLMRequest, pod *backend.PodMetrics) bool {
Contributor:
Leaving this comment here, but it doesn't need to be addressed in this PR.

We can potentially refactor this predicate to prefer affinity first, then fall back to canAcceptNewLoRA if no affinity is found. In that case we should be able to consolidate many of the decision tree's branches. It will of course need some benchmarking to see the impact.
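
A rough sketch of what that consolidation might look like, written against stand-in types; the function name loRASoftAffinityFilter and the fields are hypothetical, and as noted above any such change would need benchmarking.

package scheduling

// Stand-in types so the sketch compiles on its own.
type LLMRequest struct{ ResolvedTargetModel string }
type PodMetrics struct {
	ActiveModels    map[string]int
	MaxActiveModels int
}

// loRASoftAffinityFilter keeps pods that already have the requested adapter
// loaded; only if none do, it falls back to pods that can accept a new
// adapter. This folds affinity and canAcceptNewLoRA into one filter step.
func loRASoftAffinityFilter(req *LLMRequest, pods []*PodMetrics) []*PodMetrics {
	affinity := []*PodMetrics{}
	canLoad := []*PodMetrics{}
	for _, pod := range pods {
		if _, ok := pod.ActiveModels[req.ResolvedTargetModel]; ok {
			affinity = append(affinity, pod)
		} else if len(pod.ActiveModels) < pod.MaxActiveModels {
			canLoad = append(canLoad, pod)
		}
	}
	if len(affinity) > 0 {
		return affinity
	}
	return canLoad
}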

@ahg-g (Contributor) commented Dec 10, 2024

/approve
/lgtm

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kaushikmitr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit 5372efb into kubernetes-sigs:main Dec 10, 2024
2 checks passed