Randomization with seed not working #3398

Open
nasosger opened this issue Feb 15, 2025 · 1 comment

@nasosger
System Info

- `Accelerate` version: 0.32.1
- Platform: Linux-6.5.0-1022-aws-x86_64-with-glibc2.35
- Python version: 3.10.14
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1121.80 GB
- GPU type: NVIDIA A100-SXM4-40GB

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Basically, what I am trying to do is run a finetuning script in a multi-GPU setting (Distributed Data Parallel). My setup looks like this: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm_no_trainer.py#L311.

Let me provide a toy example that exhibits the problem.

import torch
from accelerate import Accelerator, DataLoaderConfiguration  # DataLoaderConfiguration is used in the variant discussed below
from accelerate.utils import set_seed

# Seed everything before creating the Accelerator and the dataloader.
set_seed(0)

accelerator = Accelerator()
dataset = range(10)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=True)
accelerator.print(type(dataloader))

# prepare() wraps the dataloader for distributed use.
dataloader = accelerator.prepare(dataloader)
accelerator.print(type(dataloader))

epochs = 2

for epoch in range(epochs):
    accelerator.wait_for_everyone()
    accelerator.print("==============================================")
    accelerator.print(f"Epoch {epoch}")

    accelerator.wait_for_everyone()
    for idx, batch in enumerate(dataloader):
        if idx == 0:
            # Print the first batch seen by each process.
            print(f"Batch {idx}: {batch}")
        

Expected behavior

If I run this script with python (a single process), setting the random seed ensures reproducibility, and changing the seed gives a different shuffling of the data.

But if I run the script with accelerate launch (multiple processes), the dataloader shuffles the data in the same way every time, regardless of the random seed. For example, after changing the random seed, the first batch remains exactly the same and the data is shuffled in the same order.
This is strange to me, since set_seed (transformers' or accelerate's) achieves reproducibility in my single-process experiments, but as soon as I ran experiments in a distributed (multi-GPU) scenario I noticed this behavior.

You can run the script I provided with multiple seeds (or even with no seed, if you comment out the set_seed line), and the fetched batches will be exactly the same.
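For reference, these are the two ways I launch the script (the file name and the process count here are just placeholders):

python repro_script.py
accelerate launch --num_processes 2 repro_script.py
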
I also tried to include

dataloader_config=DataLoaderConfiguration(
    use_seedable_sampler=True,
)

in the Accelerator initialization. By doing this, the random seeds worked for the first epoch (I mean that when I changed the seed, the data was fetched in a different order). BUT, another problem occurred: the data was not re-shuffled before each subsequent epoch. So this was not a proper fix (?).
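
For clarity, the Accelerator initialization in that attempt looked roughly like this (only the construction changes; the rest of the toy script above stays the same):

from accelerate import Accelerator, DataLoaderConfiguration

# Opt in to the seedable sampler when the dataloader is prepared.
dataloader_config = DataLoaderConfiguration(
    use_seedable_sampler=True,
)
accelerator = Accelerator(dataloader_config=dataloader_config)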

Please let me know if there is a proper way to fix this, so that experiments with different seeds are reproducible. I cannot find a satisfactory answer on this; it seems like a bug to me.

@nasosger (Author)

I am sorry to tag you, but could you please take a quick look at this? I can't figure out what the problem is, and I suppose others may face it too.
@sgugger @muellerzr
