During distributed training with PyTorch, the number of training steps reported by the Trainer increases with the number of processes.
To reproduce:
Transformers: 4.41.2
Torch: 2.3.1
Accelerate: 0.31.0
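For completeness, a minimal script along these lines shows the behavior (the model, dataset, and hyperparameters below are placeholders, not the exact setup from this report); launch with e.g. `torchrun --nproc_per_node=4 repro.py`:

```python
# Hypothetical minimal repro -- model, dataset, and hyperparameters are placeholders.
# Launch with e.g. `torchrun --nproc_per_node=4 repro.py` or `accelerate launch repro.py`.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


def main():
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

    dataset = load_dataset("imdb", split="train")
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True,
    )

    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    )
    trainer = Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
    # Compare the max_steps / progress-bar total reported with 1 process vs. 4 processes:
    # with a duplicated dataloader it comes out ~4x larger than expected on 4 processes.
    trainer.train()


if __name__ == "__main__":
    main()
```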
This happens because the dataloader is duplicated across the 4 GPUs instead of being sharded. HF moved the sharding logic into Accelerate: the `Accelerator` is what prepares the dataloader for the current training configuration.
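For reference, this is roughly what the sharding is supposed to do (a sketch, not the Trainer's exact code path): `accelerator.prepare()` wraps the dataloader so each process only iterates over its own shard, which is what shortens the per-process length.

```python
# Sketch: how Accelerate shards a dataloader across processes.
# Run with e.g. `accelerate launch --num_processes 4 shard_demo.py` (names are illustrative).
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

dataset = TensorDataset(torch.arange(1_000))
dataloader = DataLoader(dataset, batch_size=10)

# Before prepare(): every process sees the full 100 batches.
print(f"rank {accelerator.process_index}: unprepared len = {len(dataloader)}")

# After prepare(): each process only sees its shard, ~25 batches with 4 processes.
dataloader = accelerator.prepare(dataloader)
print(f"rank {accelerator.process_index}: prepared len = {len(dataloader)}")
```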
https://github.com/huggingface/transformers/blob/e65502951593a76844e872fee9c56b805598538a/src/transformers/trainer.py#L904
Here, the correct number of training steps should be 125K.
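As a rough back-of-the-envelope illustration of the inflation (the numbers below are placeholders, not the actual dataset size from this report): the Trainer derives `max_steps` from `len(train_dataloader)`, so if every rank iterates over the full dataset, the reported step count ends up `world_size` times too large.

```python
# Placeholder numbers to illustrate the step-count inflation (not the real dataset).
dataset_len = 80_000
per_device_batch_size = 8
world_size = 4

# Per-rank batches with a properly sharded dataloader:
sharded_steps_per_epoch = dataset_len // (per_device_batch_size * world_size)   # 2_500

# Per-rank batches when the dataloader is duplicated on every rank:
duplicated_steps_per_epoch = dataset_len // per_device_batch_size               # 10_000

# The Trainer computes max_steps from len(train_dataloader), so duplication
# multiplies the reported number of training steps by world_size.
print(sharded_steps_per_epoch, duplicated_steps_per_epoch)
```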