Perf regression from updating torch_xla from 2.6 stable to Feb 17 nightly #119

Open
tengyifei opened this issue Feb 21, 2025 · 0 comments
tengyifei commented Feb 21, 2025

Stable: https://github.com/AI-Hypercomputer/torchprime/actions/runs/13452568193/job/37590057613
Step time: 3.002 s

Feb 17 nightly: https://github.com/AI-Hypercomputer/torchprime/actions/runs/13465790777/job/37631596736
Step time: 3.499 s (a ~16.6% regression)

A similar regression shows up in Mixtral.

We need to track down this performance regression. Since it affects two models on v6e-4, it probably affects the larger models too. We may need to bisect to find the root cause; ideally, we'd build a tool that, for example, automatically files a batch of PRs to test performance at different commits (a possible bisect helper is sketched below).
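As a starting point for the bisection, a `git bisect run` helper could automate the good/bad classification between the r2.6 commit and the Feb 17 nightly. The sketch below is hypothetical: `benchmark.py`, the "step time" log format, and the threshold are assumptions, not existing torchprime tooling.

```python
#!/usr/bin/env python3
"""Hypothetical helper for `git bisect run`: exit 0 if the measured step time
is still close to the r2.6 baseline, exit 1 if the regression is present."""
import re
import subprocess
import sys

BASELINE_S = 3.002    # step time on the r2.6 stable build
REGRESSED_S = 3.499   # step time on the Feb 17 nightly
# Classify anything past the midpoint as "bad"; tune as needed.
THRESHOLD_S = (BASELINE_S + REGRESSED_S) / 2


def run_benchmark() -> float:
    """Build the current checkout, run one v6e-4 workload, parse step time.

    `benchmark.py` is a stand-in for whatever command actually runs the
    Llama/Mixtral E2E test and prints something like "step time: 3.002 s".
    """
    out = subprocess.run(
        ["python", "benchmark.py"],  # hypothetical entry point
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"step time:\s*([\d.]+)", out, re.IGNORECASE)
    if not match:
        # Exit code 125 tells `git bisect run` to skip this revision.
        sys.exit(125)
    return float(match.group(1))


if __name__ == "__main__":
    step_time = run_benchmark()
    print(f"step time: {step_time:.3f} s (threshold {THRESHOLD_S:.3f} s)")
    sys.exit(0 if step_time < THRESHOLD_S else 1)
```

Usage would be roughly `git bisect start <bad-sha> <good-sha>` followed by `git bisect run python bisect_step_time.py`, letting git walk the commit range while the script decides good/bad from the measured step time.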

However, we also cannot stay on r2.6 forever. We have to bump torch-xla to pick up new changes for scan support.

Down the road, it's probably a good idea to run the torchprime v6e-4 E2E tests from torch_xla PRs to catch these regressions early. The E2E test only takes 30 minutes, so it's faster than the TPU CI test (>1 hr).

cc @zpcore @bhavya01
