In b/394906793, customers are confused about why the llama throughput we report differs from what HF's metrics report. When we set the --include_tokens_per_second flag, HF reports a train_tokens_per_second metric; however, it differs from what we report in launchpad. When we compute throughput/sec/chip we use our own formula, while HF divides num_token by train_runtime, where num_token depends on the number of tokens in each batch of the dataloader.
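To make the comparison concrete, here is a minimal sketch of the two computations. The function names and the per-chip normalization are illustrative assumptions, not torchprime's or HF's actual code:

```python
# Minimal sketch of the two throughput definitions (names are hypothetical).

def our_tokens_per_sec_per_chip(global_batch_size: int, seq_len: int,
                                step_time_s: float, num_chips: int) -> float:
    # torchprime-style: derived from the profiled steady-state step time,
    # normalized per chip (assumed; the exact formula is not shown in this issue).
    return global_batch_size * seq_len / step_time_s / num_chips

def hf_train_tokens_per_second(num_train_tokens: int,
                               train_runtime_s: float) -> float:
    # HF-style: total tokens counted from the dataloader divided by the full
    # wall-clock training runtime (which starts at step 0).
    return num_train_tokens / train_runtime_s
```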
The difference comes from two aspects:
1. When minibatch=False, HF computes num_token incorrectly. Below are the metrics I captured showing the size of the difference (a small arithmetic sketch reproducing them follows the two runs). I am testing a llama3 toy model with 2 hidden layers on v4-8 (4 devices), using global batch size 4, sequence length 1024, and max steps 5.
- Enable minibatch:
  * From the HF metric report we can see:
    num_train_tokens: 20480 tokens
    train_runtime: 54883 ms (54.883 s)
    train_tokens_per_second = 20480/54.883 = 373.2
  * From what we manually computed:
    step time: 1/0.091 = 11 s/step
    train_tokens_per_second: 4*1024/11 = 372
- Disable minibatch:
  * From the HF metric report we can see:
    num_train_tokens: 5120 tokens
    train_runtime: 57089 ms (57.089 s)
    train_tokens_per_second = 5120/57.089 = 89.7
  * From what we manually computed:
    step time: 1/0.086 = 11.63 s/step
    train_tokens_per_second: 4*1024/11.63 = 352
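The numbers above can be reproduced with a few lines of arithmetic. Note that the per-device interpretation of the minibatch=False count (1 sequence per device instead of the global batch of 4) is an inference from the 4x gap between the two num_train_tokens values:

```python
# v4-8 toy run: 4 devices, global batch size 4, seq len 1024, 5 steps.
num_devices, global_batch_size, seq_len, max_steps = 4, 4, 1024, 5

# minibatch=True: HF counts the global batch, matching expectations.
tokens_enabled = global_batch_size * seq_len * max_steps            # 20480
print(tokens_enabled / 54.883)                                      # ~373.2 tokens/s

# minibatch=False: HF appears to count only the per-device batch,
# undercounting num_token by a factor of num_devices (inferred, see above).
tokens_disabled = (global_batch_size // num_devices) * seq_len * max_steps  # 5120
print(tokens_disabled / 57.089)                                     # ~89.7 tokens/s

# Our manual computation from the profiled step rate (0.086 steps/s):
step_time_s = 1 / 0.086                                             # ~11.63 s/step
print(global_batch_size * seq_len / step_time_s)                    # ~352.3 tokens/s
```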
However, in most cases we already set minibatch=True.
2. As pointed out by @bhavya01, the profile we capture usually starts from step 3, which excludes compilation time. HF, however, counts from step 0.
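To illustrate this second aspect, a hedged sketch follows; the per-step times below are made up to show the effect of counting step 0, not values measured from the run above:

```python
# Hypothetical per-step wall times in seconds; step 0 includes compilation.
step_times_s = [30.0, 6.5, 6.3, 6.2, 6.2]
tokens_per_step = 4 * 1024

# HF-style: every step from step 0 is counted, so compile time drags
# the reported throughput down.
hf_tps = tokens_per_step * len(step_times_s) / sum(step_times_s)

# Profile-style: only steady-state steps (from step 3 onward) are counted.
steady = step_times_s[3:]
profiled_tps = tokens_per_step * len(steady) / sum(steady)

print(f"HF-style: {hf_tps:.1f} tokens/s, profiled: {profiled_tps:.1f} tokens/s")
```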
We need to make the metric report more transparent to users. This issue will be used to gather ideas about what the report should look like for torchprime.