Add NVIDIA GPU runners and run CUDA tests again #18814

ScottTodd · 2024-10-17T16:10:07Z

We used to have some Linux runners with NVIDIA T4 and A100 GPUs on the GCP runner cluster. Our new Azure runner cluster currently only has Linux CPU runners.

We have unit tests and larger test suites that use the CUDA and Vulkan HAL backends. None of these are particularly CPU-heavy, so we could get by with a 4- or 8-core CPU and an attached GPU, if such a configuration is available.

The target is presubmit (pull_request and push events), with a load of around 50-100 runs per day (up to 400).
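For sizing, here is a rough sketch of how the daily run counts above could translate into a number of concurrent runners. The function name `runners_needed`, the 15-minute job duration, the 8-hour busy window, and the 70% utilization target are all placeholder assumptions for illustration, not measurements from our CI.

```python
import math


def runners_needed(
    runs_per_day: int,
    minutes_per_run: float,
    busy_hours: float = 24.0,
    target_utilization: float = 0.7,
) -> int:
    """Estimate how many runners are needed to keep up with a daily load.

    All parameters other than runs_per_day are assumptions to be replaced
    with measured values.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if minutes_per_run <= 0 or busy_hours <= 0:
        raise ValueError("minutes_per_run and busy_hours must be positive")

    # Total GPU-runner minutes requested per day.
    demand_minutes = runs_per_day * minutes_per_run
    # Minutes one runner can serve per day while leaving headroom for bursts.
    capacity_per_runner = busy_hours * 60 * target_utilization
    return max(1, math.ceil(demand_minutes / capacity_per_runner))


# Typical and peak loads from this issue, with hypothetical job parameters.
for runs in (50, 100, 400):
    count = runners_needed(runs, minutes_per_run=15, busy_hours=8)
    print(f"{runs} runs/day -> {count} runner(s)")
```

Shortening `busy_hours` models presubmit traffic that clusters in working hours rather than spreading evenly over the day, which is why the estimate is higher than a naive 24-hour average would suggest.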

We can start with a single GPU type, but eventually we could also run nightly tests across a wide range of data center and consumer cards, such as T4, A100, H100, 1080, and 2080. We will also want to run benchmarks eventually, which will need persistent runners with cached model weights and some tuned hardware/driver settings.


ScottTodd added the codegen/nvvm (NVVM code generation compiler backend), codegen/spirv (SPIR-V code generation compiler backend), hal/cuda (Runtime CUDA HAL backend), hal/vulkan (Runtime Vulkan GPU HAL backend), and infrastructure (Relating to build systems, CI, or testing) labels Oct 17, 2024

ScottTodd assigned Eliasj42 Oct 17, 2024

ScottTodd mentioned this issue Oct 17, 2024

Add Apple GPU runners and run Metal tests again #18817

Open
