In b/394906793, customers are confused about why the llama throughput we report differs from what HF's metrics report. When we set the --include_tokens_per_second flag, HF reports a train_tokens_per_second metric; however, it differs from what we report in launchpad. When we compute throughput/sec/chip we use our own formula, while HF divides num_token by train_runtime, where num_token depends on the number of tokens in each batch of the dataloader.
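To make the comparison concrete, here is a minimal sketch of the two computations. The function names and the per-chip normalization are illustrative assumptions, not torchprime's or HF's actual code:

```python
# Minimal sketch of the two throughput definitions (names are hypothetical).

def our_tokens_per_sec_per_chip(global_batch_size: int, seq_len: int,
                                step_time_s: float, num_chips: int) -> float:
    # torchprime-style: derived from the profiled steady-state step time,
    # normalized per chip (assumed; the exact formula is not shown in this issue).
    return global_batch_size * seq_len / step_time_s / num_chips

def hf_train_tokens_per_second(num_train_tokens: int,
                               train_runtime_s: float) -> float:
    # HF-style: total tokens counted from the dataloader divided by the full
    # wall-clock training runtime (which starts at step 0).
    return num_train_tokens / train_runtime_s
```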
The difference comes from two aspects:
1. When minibatch=False, HF computes num_token incorrectly. Below are the metrics I captured showing the size of the difference (a small arithmetic sketch reproducing them follows the two runs). I am testing a llama3 toy model with 2 hidden layers on v4-8 (4 devices), using global batch size 4, sequence length 1024, and max steps 5.
- Enable minibatch:
  * From the HF metric report we can see:
    num_train_tokens: 20480 tokens
    train_runtime: 54883 ms (54.883 s)
    train_tokens_per_second = 20480/54.883 = 373.2
  * From what we manually computed:
    step time: 1/0.091 = 11 s/step
    train_tokens_per_second: 4*1024/11 = 372
- Disable minibatch:
  * From the HF metric report we can see:
    num_train_tokens: 5120 tokens
    train_runtime: 57089 ms (57.089 s)
    train_tokens_per_second = 5120/57.089 = 89.7
  * From what we manually computed:
    step time: 1/0.086 = 11.63 s/step
    train_tokens_per_second: 4*1024/11.63 = 352
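The numbers above can be reproduced with a few lines of arithmetic. Note that the per-device interpretation of the minibatch=False count (1 sequence per device instead of the global batch of 4) is an inference from the 4x gap between the two num_train_tokens values:

```python
# v4-8 toy run: 4 devices, global batch size 4, seq len 1024, 5 steps.
num_devices, global_batch_size, seq_len, max_steps = 4, 4, 1024, 5

# minibatch=True: HF counts the global batch, matching expectations.
tokens_enabled = global_batch_size * seq_len * max_steps            # 20480
print(tokens_enabled / 54.883)                                      # ~373.2 tokens/s

# minibatch=False: HF appears to count only the per-device batch,
# undercounting num_token by a factor of num_devices (inferred, see above).
tokens_disabled = (global_batch_size // num_devices) * seq_len * max_steps  # 5120
print(tokens_disabled / 57.089)                                     # ~89.7 tokens/s

# Our manual computation from the profiled step rate (0.086 steps/s):
step_time_s = 1 / 0.086                                             # ~11.63 s/step
print(global_batch_size * seq_len / step_time_s)                    # ~352.3 tokens/s
```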
However, in most cases we already set minibatch=True.
2. As pointed out by @bhavya01, the profile we capture usually starts from step 3, which excludes compilation time. HF, however, counts from step 0.
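To illustrate this second aspect, a hedged sketch follows; the per-step times below are made up to show the effect of counting step 0, not values measured from the run above:

```python
# Hypothetical per-step wall times in seconds; step 0 includes compilation.
step_times_s = [30.0, 6.5, 6.3, 6.2, 6.2]
tokens_per_step = 4 * 1024

# HF-style: every step from step 0 is counted, so compile time drags
# the reported throughput down.
hf_tps = tokens_per_step * len(step_times_s) / sum(step_times_s)

# Profile-style: only steady-state steps (from step 3 onward) are counted.
steady = step_times_s[3:]
profiled_tps = tokens_per_step * len(steady) / sum(steady)

print(f"HF-style: {hf_tps:.1f} tokens/s, profiled: {profiled_tps:.1f} tokens/s")
```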
We need to make the metric report more transparent to users. This issue will be used to gather ideas about what the report should look like for torchprime.