# [2018 NIPS] Dynamic Space-Time Scheduling for GPU Inference

## Summary

The authors evaluate different time- and space-multiplexing techniques for ML inference on GPUs and propose an approach that achieves the best tradeoff across the evaluation criteria.

## Background & Motivation

Almost all cloud inference service providers/frameworks assign each model an exclusive GPU. This, combined with the small batch sizes used in an online setting, results in low hardware utilization. Current approaches that multiplex workloads have different tradeoffs, and there is no single solution that wins on all criteria.

| Approach | Utilization | Performance (throughput/latency) | Predictability / Performance Isolation |
| --- | --- | --- | --- |
| Exclusive access | Poor | Good | Good |
| Time multiplexing (CUDA context switching) | Average | Poor | Good |
| Spatial multiplexing | Good | Average | Poor |
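
To make the spatial-multiplexing row concrete, here is a minimal CUDA sketch (my illustration, not from the paper) in which two inference jobs submit work on separate CUDA streams. Kernels on different streams may run concurrently when neither saturates the GPU, which raises utilization, but the jobs then contend for SMs and memory bandwidth, hence the weak isolation. The `infer` kernel and buffer names are placeholders.

```cuda
#include <cuda_runtime.h>

// Stand-in for one model's inference work.
__global__ void infer(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f + 1.0f;  // placeholder compute
}

int main() {
    const int n = 1 << 20;
    float *inA, *outA, *inB, *outB;
    cudaMalloc(&inA, n * sizeof(float));
    cudaMalloc(&outA, n * sizeof(float));
    cudaMalloc(&inB, n * sizeof(float));
    cudaMalloc(&outB, n * sizeof(float));

    // One stream per job: kernels on distinct streams are allowed to
    // execute concurrently, sharing the GPU's SMs (spatial multiplexing).
    cudaStream_t jobA, jobB;
    cudaStreamCreate(&jobA);
    cudaStreamCreate(&jobB);

    const int block = 256, grid = (n + block - 1) / block;
    infer<<<grid, block, 0, jobA>>>(inA, outA, n);  // job A's request
    infer<<<grid, block, 0, jobB>>>(inB, outB, n);  // job B's request

    cudaDeviceSynchronize();  // wait for both jobs to finish

    cudaStreamDestroy(jobA);
    cudaStreamDestroy(jobB);
    cudaFree(inA); cudaFree(outA); cudaFree(inB); cudaFree(outB);
    return 0;
}
```

Streams cover sharing within one process; across processes, spatial sharing is typically done with NVIDIA MPS instead, with the same tradeoff of utilization against isolation.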

## Design & Implementation

The authors propose software-level fusion of kernel operators across multiple inference jobs, aiming to combine the high utilization of spatial multiplexing with the predictability of exclusive access; a sketch of the idea follows below.
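
A minimal sketch, assuming a simple horizontal fusion scheme (my illustration, not the paper's implementation): two jobs' operators are fused into a single kernel launch, with the thread blocks partitioned between them so that one launch keeps the GPU busy while the split remains under software control.

```cuda
// Hypothetical horizontally fused kernel: one launch serves both jobs,
// with the first `blocksA` thread blocks assigned to job A and the
// rest to job B. The per-job computations are placeholders.
__global__ void fused_infer(const float *inA, float *outA, int nA,
                            const float *inB, float *outB, int nB,
                            int blocksA) {
    if (blockIdx.x < blocksA) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nA) outA[i] = inA[i] * 2.0f + 1.0f;  // job A's operator
    } else {
        int i = (blockIdx.x - blocksA) * blockDim.x + threadIdx.x;
        if (i < nB) outB[i] = inB[i] * 0.5f;         // job B's operator
    }
}
```

Because the scheduler picks `blocksA` in software, it decides how GPU resources are divided between jobs, rather than leaving the interleaving to the hardware as pure spatial multiplexing does.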

## Links & References