Perf regression from updating torch_xla from 2.6 stable to Feb 17 nightly #119

Open
tengyifei opened this issue Feb 21, 2025 · 0 comments
tengyifei commented Feb 21, 2025

Stable: https://github.com/AI-Hypercomputer/torchprime/actions/runs/13452568193/job/37590057613
Step time: 3.002 s

Feb 17 nightly: https://github.com/AI-Hypercomputer/torchprime/actions/runs/13465790777/job/37631596736
Step time: 3.499 s (a ~16.6% regression)

A similar regression shows up in Mixtral.

We need to track down this performance regression. Since it affects two models on v6e-4, it probably affects the larger models too. We may need to bisect to find the root cause; ideally, we'd build a tool that, for example, automatically files a batch of PRs to test performance at different commits (a possible bisect helper is sketched below).
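As a starting point for the bisection, a `git bisect run` helper could automate the good/bad classification between the r2.6 commit and the Feb 17 nightly. The sketch below is hypothetical: `benchmark.py`, the "step time" log format, and the threshold are assumptions, not existing torchprime tooling.

```python
#!/usr/bin/env python3
"""Hypothetical helper for `git bisect run`: exit 0 if the measured step time
is still close to the r2.6 baseline, exit 1 if the regression is present."""
import re
import subprocess
import sys

BASELINE_S = 3.002    # step time on the r2.6 stable build
REGRESSED_S = 3.499   # step time on the Feb 17 nightly
# Classify anything past the midpoint as "bad"; tune as needed.
THRESHOLD_S = (BASELINE_S + REGRESSED_S) / 2


def run_benchmark() -> float:
    """Build the current checkout, run one v6e-4 workload, parse step time.

    `benchmark.py` is a stand-in for whatever command actually runs the
    Llama/Mixtral E2E test and prints something like "step time: 3.002 s".
    """
    out = subprocess.run(
        ["python", "benchmark.py"],  # hypothetical entry point
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"step time:\s*([\d.]+)", out, re.IGNORECASE)
    if not match:
        # Exit code 125 tells `git bisect run` to skip this revision.
        sys.exit(125)
    return float(match.group(1))


if __name__ == "__main__":
    step_time = run_benchmark()
    print(f"step time: {step_time:.3f} s (threshold {THRESHOLD_S:.3f} s)")
    sys.exit(0 if step_time < THRESHOLD_S else 1)
```

Usage would be roughly `git bisect start <bad-sha> <good-sha>` followed by `git bisect run python bisect_step_time.py`, letting git walk the commit range while the script decides good/bad from the measured step time.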

However, we also cannot stay on r2.6 forever. We have to bump torch-xla to pick up new changes for scan support.

Down the road, it's probably a good idea to run the torchprime v6e-4 E2E tests from torch_xla PRs to catch these regressions early. The E2E test only takes 30 minutes, so it's faster than the TPU CI test (>1 hr).

cc @zpcore @bhavya01
