We need to track down the performance regression. Since it affects two models on v6e-4, it probably affects the larger models too. We may need to bisect the torch_xla nightlies to find the root cause. Ideally, we'd build a tool to automate this, e.g. one that files a batch of PRs pinned to different revisions and compares their benchmark results; a rough bisection sketch is below.
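A minimal sketch of what such a bisection could look like, assuming we can install a dated torch_xla nightly and measure the torchprime step time. The two helpers, the "last good" date, and the threshold are placeholders for illustration, not existing torchprime tooling:

```python
from datetime import date, timedelta

GOOD = date(2025, 2, 10)   # assumed last nightly still matching the ~3.0 s step time
BAD = date(2025, 2, 17)    # first nightly observed at ~3.5 s step time
THRESHOLD_S = 3.25         # roughly midway between the good and bad step times

def install_nightly(day: date) -> None:
    """Placeholder: install the torch_xla nightly built on `day`."""
    raise NotImplementedError

def measure_step_time(day: date) -> float:
    """Placeholder: run the torchprime v6e-4 E2E benchmark and return the step time (s)."""
    raise NotImplementedError

def first_bad_nightly(good: date, bad: date) -> date:
    """Binary-search the nightly dates for the first one that regresses."""
    while (bad - good).days > 1:
        mid = good + timedelta(days=(bad - good).days // 2)
        install_nightly(mid)
        if measure_step_time(mid) > THRESHOLD_S:
            bad = mid    # regression is already present at `mid`
        else:
            good = mid   # `mid` is still fast; regression landed later
    return bad

if __name__ == "__main__":
    print("First regressed nightly:", first_bad_nightly(GOOD, BAD))
```

Once the offending nightly is found, the same loop could be rerun over the torch_xla commits in that nightly's range to pin down the exact change.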
However, we also cannot stay on r2.6 forever. We have to bump torch-xla to pick up new changes for scan support.
Down the road, it's probably a good idea to run the torchprime v6e-4 E2E tests from torch_xla PRs to catch these regressions early. The E2E test only takes about 30 minutes, so it's faster than the TPU CI test (>1 hr).
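If we go that route, one lightweight option would be for the torch_xla PR job to kick off the torchprime E2E workflow via GitHub's `workflow_dispatch` API. This is only a sketch: the workflow file name and the `torch_xla_wheel` input are hypothetical, not existing torchprime CI.

```python
import os
import requests

OWNER_REPO = "AI-Hypercomputer/torchprime"
WORKFLOW_FILE = "e2e_test.yml"  # hypothetical workflow file name
# URL of the torch_xla wheel built from the PR under test (assumed to be provided by the caller).
TORCH_XLA_WHEEL_URL = os.environ["TORCH_XLA_WHEEL_URL"]

resp = requests.post(
    f"https://api.github.com/repos/{OWNER_REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "ref": "main",
        # `torch_xla_wheel` is a hypothetical workflow input that would point the
        # E2E run at the PR's torch_xla build instead of the pinned release.
        "inputs": {"torch_xla_wheel": TORCH_XLA_WHEEL_URL},
    },
    timeout=30,
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
```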
Stable: https://github.com/AI-Hypercomputer/torchprime/actions/runs/13452568193/job/37590057613
Step time: 3.002 s
Feb 17 nightly: https://github.com/AI-Hypercomputer/torchprime/actions/runs/13465790777/job/37631596736
Step time: 3.499 s
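That's a (3.499 - 3.002) / 3.002 ≈ 16.6% increase in step time relative to stable.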
Similar regression in Mixtral.
cc @zpcore @bhavya01