[Performance Report] What metrics to report? #96

Open
zpcore opened this issue Feb 7, 2025 · 0 comments
zpcore commented Feb 7, 2025

In b/394906793, customers are confused about why the llama throughput we report differs from HF's metrics. When we set the --include_tokens_per_second flag, HF produces a report like:

***** train metrics *****
  epoch                    =     0.0003
  total_flos               =   828013GF
  train_loss               =    12.1873
  train_runtime            = 0:00:06.43
  train_samples            =     118132
  train_samples_per_second =      6.213
  train_steps_per_second   =      0.777
  train_tokens_per_second  =   6361.818

However, the train_tokens_per_second differs from what we reported in launchpad.

When we compute throughput/sec/chip, we use the formula:

global_batch_size * seq_len * 1000 / avg_step_time_ms / chip_count
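As a sketch, this formula can be written as a small helper (the function and argument names here are illustrative, not torchprime's actual API):

```python
def throughput_per_chip(global_batch_size: int, seq_len: int,
                        avg_step_time_ms: float, chip_count: int) -> float:
    """Tokens/sec/chip per the formula above.

    avg_step_time_ms is the average step time in milliseconds.
    """
    return global_batch_size * seq_len * 1000 / avg_step_time_ms / chip_count

# Illustrative numbers: global batch 4, seq len 1024, ~11 s/step, 4 chips
print(throughput_per_chip(4, 1024, 11_000, 4))  # ≈ 93.1 tokens/sec/chip
```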

HF, on the other hand, computes the total training token count as (ref):

num_token * max_steps * gradient_accumulation_steps

where num_token depends on the number of tokens in each batch from the dataloader.
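A hedged sketch of the HF-side metric, assuming num_token is the per-step token count and the total is divided by wall-clock runtime (consistent with the train_tokens_per_second line in the report above):

```python
def hf_train_tokens_per_second(num_token: int, max_steps: int,
                               gradient_accumulation_steps: int,
                               train_runtime_s: float) -> float:
    """Sketch of HF's metric: total tokens seen / wall-clock runtime.

    num_token depends on the number of tokens in each dataloader batch,
    which is where the minibatch discrepancy below comes from.
    """
    total_tokens = num_token * max_steps * gradient_accumulation_steps
    return total_tokens / train_runtime_s

# With 4096 tokens/step, 5 steps, no accumulation, 54.883 s runtime:
print(hf_train_tokens_per_second(4096, 5, 1, 54.883))  # ≈ 373.2 tokens/sec
```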

The difference comes from two sources:

  1. When minibatch=False, HF incorrectly computes num_token. Below are the metrics I captured showing how large the difference is.
    I am testing a llama3 toy model with 2 hidden layers on v4-8 (4 devices), using global batch size: 4, sequence length: 1024, max step = 5.
- Enable minibatch:
* From the HF metric report we can see:
num_train_tokens: 20480 tokens
train_runtime: 54883ms
train_tokens_per_second = 20480/54.883 = 373.2
* From what we manually computed:
step time: 1/0.091 ≈ 11 s/step (0.091 steps/sec)
train_tokens_per_second: 4*1024/11 = 372

- Disable minibatch:
* From the HF metric report we can see:
num_train_tokens: 5120 tokens
train_runtime: 57089ms
train_tokens_per_second = 5120/57.089 ≈ 89.7
* From what we manually computed:
step time: 1/0.086 ≈ 11.627 s/step (0.086 steps/sec)
train_tokens_per_second: 4*1024/11.627 = 352
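The arithmetic in the two experiments above can be checked with a short script (numbers copied from the reports; note that 5120 = 1024 × 5, which suggests HF counts only the per-device batch when minibatch is disabled, hence the roughly 4x undercount on 4 devices):

```python
# Reproduce the arithmetic from the two experiments above.
global_batch, seq_len, max_steps = 4, 1024, 5

# Minibatch enabled: HF counted all 4 * 1024 * 5 = 20480 tokens.
assert global_batch * seq_len * max_steps == 20480
hf_enabled = 20480 / 54.883                            # ≈ 373.2 tokens/sec
manual_enabled = global_batch * seq_len / (1 / 0.091)  # step time ≈ 11 s

# Minibatch disabled: HF counted only 1024 * 5 = 5120 tokens,
# i.e. the per-device batch rather than the global batch.
hf_disabled = 5120 / 57.089                            # ≈ 89.7 tokens/sec
manual_disabled = global_batch * seq_len / (1 / 0.086)

print(round(hf_enabled, 1), round(manual_enabled, 1),
      round(hf_disabled, 1), round(manual_disabled, 1))
```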

However, in most cases we already set minibatch=True.

  2. As pointed out by @bhavya01, the profile we capture usually starts from step 3, which doesn't count the compilation time. HF, however, counts from step 0.
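For illustration, a warm-up-aware step timer might look like the following (a minimal sketch, not torchprime's actual profiling code; the warmup of 3 steps mirrors the "from step 3" convention above):

```python
import time

def avg_step_time(step_fn, num_steps: int, warmup: int = 3) -> float:
    """Average step time in seconds, skipping the first `warmup` steps,
    which include one-time costs such as compilation."""
    times = []
    for i in range(num_steps):
        start = time.perf_counter()
        step_fn(i)
        elapsed = time.perf_counter() - start
        if i >= warmup:
            times.append(elapsed)
    return sum(times) / len(times)
```

HF's metric effectively uses warmup = 0, so its runtime includes compilation, which lowers the reported tokens/sec relative to a steady-state measurement.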

We need to make the metric report more transparent to users. This issue will be used to gather ideas about what the report should look like for torchprime.
