Enhancements to LLM Instance Gateway: Scheduling Logic, and Documentation Updates #78
Conversation
It'll be easier to split the algo change and the user guide change (I think they are not coupled), and we can get the user guide update checked in much faster. Did you run some benchmarks to see how the updated algo improves things?
This PR only has the filter-related changes. The manifest updates are in a separate PR: #81
Yes, we have new benchmarks based on the updated scheduling logic, shared internally.
Can we share something external? This is an OSS repo, and anyone who wants to follow the story of this PR in the future should be able to have all the context.
Yes, I absolutely agree. We should definitely have an external version of our benchmarking docs accompanying this PR, and the OSS repo in general. Let's discuss more.
pkg/ext-proc/scheduling/scheduler.go
Outdated
nextOnFailure: &filter{
	name:   "min cost LoRA",
	filter: toFilterFunc(minLoRAPredicate),
	nextOnSuccessOrFailure: lowLatencyFilterNoLoRA,
I wonder what happens if you just use the lowLatencyFilterLoRA filter. If that works well, then we don't need the lowLatencyFilterNoLoRA. It will make the code much cleaner.
lowLatencyFilterLoRA is needed when we first prioritize LoRA affinity and then apply lowLatencyFilterNoLoRA (least queueing + least KV cache). We could probably reuse lowLatencyFilterLoRA, but it would be very confusing.
I had the same thought; I was also expecting that we would reuse lowLatencyFilterLoRA, and so this filter would look like:
```go
lowLatencyFilterModified = &filter{
	name:                   "low queueing filter",
	filter:                 toFilterFunc((lowQueueingPodPredicate)),
	nextOnSuccessOrFailure: lowLatencyFilterLoRA,
}
```
I see, so lowLatencyFilterLoRA prioritizes queueing over LoRA affinity (this is how we had it originally), i.e. Least Queueing -> Min Cost LoRA -> Least KV Cache.
oh, I just noticed that we flip the order here compared to lowLatencyFilterLoRA
Here we do: LoRA -> queue length -> kv-cache
There we do: queue length -> LoRA -> kv-cache
is this by design?
Yes, so I realized the names are very confusing and renamed the filters. And yes, it's by design, to make LoRA affinity stronger.
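To make the two orderings concrete, here is a minimal, self-contained sketch; the step type and chain helper are illustrative stand-ins, not the repo's actual filter plumbing in pkg/ext-proc/scheduling:

```go
package main

import "fmt"

// step is a toy stand-in for the scheduler's filter type; the real filters
// also carry predicate functions and separate nextOnSuccess/nextOnFailure
// branches.
type step struct {
	name string
	next *step
}

// chain builds a linear sequence of steps from a list of names.
func chain(names ...string) *step {
	var head *step
	for i := len(names) - 1; i >= 0; i-- {
		head = &step{name: names[i], next: head}
	}
	return head
}

func (s *step) String() string {
	if s.next == nil {
		return s.name
	}
	return s.name + " -> " + s.next.String()
}

func main() {
	// Queue-first ordering (how lowLatencyFilterLoRA was originally wired).
	queueFirst := chain("least queueing", "min cost LoRA", "least KV cache")
	// LoRA-first ordering discussed in this thread: affinity is considered
	// before queue depth, which makes LoRA affinity "stronger".
	loraFirst := chain("LoRA affinity", "least queueing", "least KV cache")

	fmt.Println("queue-first:", queueFirst)
	fmt.Println("LoRA-first: ", loraFirst)
}
```

The only difference is which dimension prunes the candidate set first, which is exactly what makes LoRA affinity "stronger" in the new ordering.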
@kfswain @ahg-g I created this issue to track the external benchmarking doc: #88
pkg/README.md
Outdated
1. **Update Envoy Gateway Config to enable Patch Policy**
2. **Deploy LLM Service and LLMServerPool**
if you keep them as "1.", then they will automatically be set as a sequence
please revert to the "1." format so that the list numbers are automatically set.
pkg/ext-proc/handlers/response.go
Outdated
Key: "x-went-into-resp-headers", | ||
RawValue: []byte("true"), | ||
var resp *extProcPb.ProcessingResponse | ||
if reqCtx.TargetPod != nil { |
Why do we need to set the targetPod in the response header?
also, if we must, then you can do this:
```go
headers := []*configPb.HeaderValueOption{
	{
		Header: &configPb.HeaderValue{
			// This is for debugging purpose only.
			Key:      "x-went-into-resp-headers",
			RawValue: []byte("true"),
		},
	},
}
if reqCtx.TargetPod != nil {
	headers = append(headers, &configPb.HeaderValueOption{
		Header: &configPb.HeaderValue{
			Key:      "x-target-pod",
			RawValue: []byte(targetpod.Name),
		},
	})
}
resp = &extProcPb.ProcessingResponse{
	.....
	SetHeaders: headers,
}
```
This is purely for debugging purposes, not really needed. I thought it might be useful to the user.
ok, can you refactor the code as suggested above please?
Let me remove it for now; this change is unrelated to the main goal of this PR.
pkg/ext-proc/scheduling/filter.go
Outdated
@@ -121,6 +121,11 @@ func leastQueuingFilterFunc(req *LLMRequest, pods []*backend.PodMetrics) ([]*bac
	return filtered, nil
}

// loRAAffinityPredicate is a filter function to check whether a pod has affinity to the lora requested.
update the comment
	nextOnFailure: sheddableRequestFilter,
}

// lowLatencyFilter tries to minimize the latency. The heuristic is to pick a server with lower
// cost to load an adapter and has low KV cache, which typically yields lower latency.
lowLatencyFilter = &filter{
lowLatencyFilterLoRA = &filter{
	name: "least queuing",
update the name?
I changed the filter names to be more descriptive.
pkg/ext-proc/scheduling/scheduler.go
Outdated
@@ -29,7 +33,7 @@ var (

// lowLatencyFilter tries to minimize the latency. The heuristic is to pick a server with lower
update the comment
filter: toFilterFunc((lowQueueingPodPredicate)),
nextOnSuccess: &filter{
	name:   "affinity LoRA",
	filter: toFilterFunc(loRAAffinityPredicate),
Why not use lowLoRACostPredicate with nextOnSuccessOrFailure: queueAndKVCacheFilter, instead of doing loRAAffinityPredicate and canAcceptNewLoraPredicate separately?
lowLoRACostPredicate picks pods that satisfy either canAcceptNewLoraPredicate or loRAAffinityPredicate. For stronger affinity we want to pick only pods that satisfy loRAAffinityPredicate, and only if no such pod is present fall back to canAcceptNewLoraPredicate.
Why not do that for the other branch too then?
The lowLoRACostPredicate ensures weak affinity by spreading the load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to a single pod. This gave good performance in our initial benchmarking results in the scenario where # of LoRA slots > # of LoRA adapters. loRAAffinityPredicate, on the other hand, ensures strong affinity, i.e. it pins requests to a single pod with that adapter. Depending on the scenario, one or the other might be better.
Can we document this reasoning please?
I added a comment to lowLoRACostPredicate with the reasoning, like we have in leastKVCacheFilterFunc.
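To make the weak-vs-strong distinction concrete, here is a minimal, self-contained sketch; the field and function names are hypothetical simplifications of what backend.PodMetrics and the predicates in filter.go actually expose:

```go
package main

import "fmt"

// podMetrics is a hypothetical simplification of backend.PodMetrics; the real
// struct tracks adapter state reported by the model server's metrics.
type podMetrics struct {
	activeAdapters map[string]bool // adapters currently loaded on the pod
	maxAdapters    int             // LoRA slots available on the server
}

// Strong affinity: only pods that already have the requested adapter loaded.
func loRAAffinity(adapter string, p podMetrics) bool {
	return p.activeAdapters[adapter]
}

// Capacity check: the pod can load a new adapter without evicting one.
func canAcceptNewLora(p podMetrics) bool {
	return len(p.activeAdapters) < p.maxAdapters
}

// Weak affinity (lowLoRACost-style): either condition is acceptable, which
// spreads one adapter's load across several pods instead of pinning it.
func lowLoRACost(adapter string, p podMetrics) bool {
	return loRAAffinity(adapter, p) || canAcceptNewLora(p)
}

func main() {
	pods := []podMetrics{
		{activeAdapters: map[string]bool{"sql-lora": true}, maxAdapters: 4},
		{activeAdapters: map[string]bool{}, maxAdapters: 4},
	}
	for i, p := range pods {
		fmt.Printf("pod %d: strong affinity=%v, weak affinity=%v\n",
			i, loRAAffinity("sql-lora", p), lowLoRACost("sql-lora", p))
	}
}
```

Pod 1 fails the strong check but passes the weak one because it still has free slots, which is the spreading behavior described above.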
/label tide/merge-method-squash
nextOnFailure: &filter{
	name:   "can accept LoRA Adapter",
	filter: toFilterFunc(canAcceptNewLoraPredicate),
	nextOnSuccessOrFailure: queueAndKVCacheFilter,
I think if we replace queueAndKVCacheFilter here and 4 lines above with queueLoRAAndKVCacheFilter, the effect should be the same? queueLoRAAndKVCacheFilter will add the lowCostLoRA filter in between, but given the pods are already filtered by LoRA affinity, it should be a noop. This will simplify the code, albeit at the cost of potentially more confusion from the noop step. It's up to you.
I agree, but I also think this would make it more confusing. Also, queueAndKVCacheFilter is something we might need in the future: for example, when the request does not need a LoRA adapter, we can directly apply queueAndKVCacheFilter instead of checking for LoRA affinity.
/lgtm Thanks for the deep dives and improving the algo! Unfortunately the decision tree is getting more complex, and hopefully we can find ways to simplify it in the future. One option is to at least simplify the non-LoRA mode: we can add a flag to indicate whether LoRA is enabled or not. Another reason to do this is that vLLM doesn't expose LoRA metrics when LoRA is not enabled, which results in noisy error logs.
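A rough sketch of that flag idea (the flag name and the chain selection below are assumptions, not the repo's actual wiring):

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical flag: when LoRA is disabled we could skip every
	// LoRA-related filter (and avoid scraping LoRA metrics that vLLM
	// does not expose in that mode).
	loraEnabled := flag.Bool("enable-lora", true,
		"whether the target model servers serve LoRA adapters")
	flag.Parse()

	if *loraEnabled {
		fmt.Println("scheduling with: LoRA affinity -> least queueing -> least KV cache")
	} else {
		fmt.Println("scheduling with: least queueing -> least KV cache")
	}
}
```

Gating on a flag like this could also let the metrics collection skip LoRA gauges entirely when the model server doesn't export them, avoiding the noisy error logs mentioned above.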
// model server has room to load the adapter
// model server has room to load the adapter. The lowLoRACostPredicate ensures weak affinity by spreading the
// load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to a single pod.
// This gave good performance in our initial benchmarking results in the scenario where # of lora slots > # of lora adapters.
func lowLoRACostPredicate(req *LLMRequest, pod *backend.PodMetrics) bool {
Leaving this comment here, but it doesn't need to be addressed in this PR.
We can potentially refactor this predicate to prefer affinity first, then fall back to canAcceptNewLoRA if no affinity is found. In that case we should be able to consolidate many of the different decision tree branches. It will of course need some benchmarking to see the impact.
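A hypothetical sketch of that consolidation (the names and types are assumed for illustration; this is not what the PR implements):

```go
package main

import "fmt"

// pod is a hypothetical, simplified view of backend.PodMetrics.
type pod struct {
	name    string
	active  map[string]bool // adapters currently loaded
	maxLoRA int             // adapter slots on the server
}

// preferAffinityThenCapacity keeps only the pods that already have the
// requested adapter; if none do, it falls back to pods that still have room
// to load it. A single step like this could replace the separate
// loRAAffinityPredicate / canAcceptNewLoraPredicate branches.
func preferAffinityThenCapacity(adapter string, pods []pod) []pod {
	var withAffinity, withRoom []pod
	for _, p := range pods {
		switch {
		case p.active[adapter]:
			withAffinity = append(withAffinity, p)
		case len(p.active) < p.maxLoRA:
			withRoom = append(withRoom, p)
		}
	}
	if len(withAffinity) > 0 {
		return withAffinity
	}
	return withRoom
}

func main() {
	pods := []pod{
		{name: "pod-a", active: map[string]bool{"sql-lora": true}, maxLoRA: 4},
		{name: "pod-b", active: map[string]bool{}, maxLoRA: 4},
	}
	for _, p := range preferAffinityThenCapacity("sql-lora", pods) {
		fmt.Println("candidate:", p.name)
	}
}
```

Because the fallback only triggers when the affinity set is empty, a single step like this captures the affinity-then-capacity ordering without separate tree branches.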
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ahg-g, kaushikmitr
The full list of commands accepted by this bot can be found here. The pull request process is described here.
This pull request includes updates to the documentation and significant enhancements to the request scheduling and handling logic in the pkg/ext-proc package. The most important changes involve refining the handling of response headers, improving the scheduling algorithms for better load balancing and resource utilization, and updating the README file to include a new scheduling flowchart.

Documentation Updates:
pkg/README.md: Updated the deployment steps to include new configurations for LLM Service and LLMServerPool, and restructured the steps for deploying the gateway and ext-proc. Added a new section on the scheduling package and included a flowchart for the scheduling algorithm.

Response Handling Enhancements:
pkg/ext-proc/handlers/response.go: Modified the HandleResponseHeaders method to include additional headers when a target pod is specified, and added a fallback response for cases where no target pod is provided.

Scheduling Algorithm Improvements:
pkg/ext-proc/scheduling/filter.go: Added new filter functions lowQueuingFilterFunc, loRAAffinityPredicate, and minLoRAPredicate to enhance the scheduling logic for better load balancing and resource utilization.
pkg/ext-proc/scheduling/scheduler.go: Introduced new queueing thresholds and updated the scheduling filters to incorporate the new filter functions. Added lowLatencyFilterModified to dynamically adjust scheduling based on queueing thresholds.
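For readers following the thread without the diff open, here is a condensed sketch of one reading of the resulting decision tree; node is a simplified stand-in, and the authoritative wiring, predicates, and toFilterFunc live in pkg/ext-proc/scheduling:

```go
package main

import "fmt"

// node is a simplified stand-in for the scheduler's filter struct; the real
// type also carries a predicate (wrapped by toFilterFunc) at each step.
type node struct {
	name                   string
	nextOnSuccess          *node
	nextOnFailure          *node
	nextOnSuccessOrFailure *node
}

func main() {
	// Stand-in for the shared queue-depth + KV-cache tie-breaker chain.
	queueAndKVCache := &node{name: "least queueing -> least KV cache"}

	// One reading of the tree discussed in this review: below the queueing
	// threshold, prefer pods that already have the adapter loaded; if none,
	// fall back to pods with room to load it; then break ties on queue depth
	// and KV-cache utilization.
	lowLatencyFilterModified := &node{
		name: "low queueing filter",
		nextOnSuccess: &node{
			name:          "affinity LoRA",
			nextOnSuccess: queueAndKVCache,
			nextOnFailure: &node{
				name:                   "can accept LoRA Adapter",
				nextOnSuccessOrFailure: queueAndKVCache,
			},
		},
		// Above the queueing threshold the scheduler takes a different branch
		// (not reconstructed here; see the diff for the actual wiring).
		nextOnFailure: nil,
	}

	fmt.Println("root filter:", lowLatencyFilterModified.name)
}
```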