Randomization with seed not working #3398

Open
nasosger opened this issue Feb 15, 2025 · 1 comment

@nasosger
System Info

- `Accelerate` version: 0.32.1
- Platform: Linux-6.5.0-1022-aws-x86_64-with-glibc2.35
- Python version: 3.10.14
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1121.80 GB
- GPU type: NVIDIA A100-SXM4-40GB

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Basically, what I am trying to do is run a finetuning script in a multi-GPU setting (Distributed Data Parallel). My setup looks like this: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm_no_trainer.py#L311.

Let me provide a toy example that exhibits the problem.

import torch
from accelerate import Accelerator, DataLoaderConfiguration  # DataLoaderConfiguration is used in the variant discussed below
from accelerate.utils import set_seed

# Seed everything before creating the Accelerator and the dataloader.
set_seed(0)

accelerator = Accelerator()
dataset = range(10)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=True)
accelerator.print(type(dataloader))

# prepare() wraps the dataloader for distributed use.
dataloader = accelerator.prepare(dataloader)
accelerator.print(type(dataloader))

epochs = 2

for epoch in range(epochs):
    accelerator.wait_for_everyone()
    accelerator.print("==============================================")
    accelerator.print(f"Epoch {epoch}")

    accelerator.wait_for_everyone()
    for idx, batch in enumerate(dataloader):
        if idx == 0:
            # Print the first batch seen by each process.
            print(f"Batch {idx}: {batch}")
        

Expected behavior

If I run this script with python (a single process), setting the random seed ensures reproducibility, and changing the seed gives a different shuffling of the data.

But if I run the script with accelerate launch (multiple processes), the dataloader shuffles the data in the same way every time, regardless of the random seed. For example, after changing the random seed, the first batch remains exactly the same and the data is shuffled in the same order.
This is strange to me, since set_seed (transformers' or accelerate's) achieves reproducibility in my single-process experiments, but as soon as I ran experiments in a distributed (multi-GPU) scenario I noticed this behavior.

You can run the script I provided with multiple seeds (or even with no seed, if you comment out the set_seed line), and the fetched batches will be exactly the same.
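For reference, these are the two ways I launch the script (the file name and the process count here are just placeholders):

python repro_script.py
accelerate launch --num_processes 2 repro_script.py
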
I also tried to include

dataloader_config=DataLoaderConfiguration(
    use_seedable_sampler=True,
)

in the Accelerator initialization. By doing this, the random seeds worked for the first epoch (I mean that when I changed the seed, the data was fetched in a different order). BUT, another problem occurred: the data was not re-shuffled before each subsequent epoch. So this was not a proper fix (?).
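
For clarity, the Accelerator initialization in that attempt looked roughly like this (only the construction changes; the rest of the toy script above stays the same):

from accelerate import Accelerator, DataLoaderConfiguration

# Opt in to the seedable sampler when the dataloader is prepared.
dataloader_config = DataLoaderConfiguration(
    use_seedable_sampler=True,
)
accelerator = Accelerator(dataloader_config=dataloader_config)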

Please let me know if there is a proper way to fix this, so that experiments with different seeds are reproducible. I cannot find a satisfactory answer on this; it seems like a bug to me.

@nasosger (Author)

I am sorry to tag you, but could you please take a quick look at this? I can't figure out what the problem is, and I suppose others may face it too.
@sgugger @muellerzr
