Precursor to #67.
Imports https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/utils/profile_convert.py
and improves it.
Specifically, I noticed that there is sometimes an empty gap between two step
markers in the profile. Averaging event durations would ignore that gap,
underestimating the step time and overestimating the MFU. Instead, this now
averages the delta between the starting time offsets of neighboring events.
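The approach above can be sketched as follows. This is a minimal illustration, not the actual code from `profile_convert.py`; the function name `average_step_time` and the `(start_us, duration_us)` tuple format are assumptions made for the example.

```python
def average_step_time(events):
    """Estimate the step time from profiler step markers.

    `events` is a hypothetical list of (start_us, duration_us) tuples,
    one per step marker, sorted by start time. Averaging the durations
    would miss any idle gap between steps, so we instead average the
    delta between the start offsets of neighboring markers.
    """
    if len(events) < 2:
        raise ValueError("Need at least two step markers to measure step time")
    starts = [start for start, _ in events]
    # Delta between each pair of neighboring start offsets.
    deltas = [b - a for a, b in zip(starts, starts[1:])]
    return sum(deltas) / len(deltas)


# With a 10 us gap after each 90 us step, averaging durations would
# give 90 us, but the true step-to-step time is 100 us.
print(average_step_time([(0, 90), (100, 90), (200, 90)]))
```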
Now that we can print the step time from the profile, I removed the step
time measurement from the training loop. It added a bunch of delays and was
actually pretty inaccurate (1.7 s vs. 1.85 s in local testing).
Tested:
XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 python3 torchprime/torch_xla_models/train.py mesh.fsdp=8 profile_step=4 model=llama-3-8b
XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 python3 torchprime/torch_xla_models/train.py mesh.fsdp=8 profile_step=4 model=mixtral-8x7b