System Info
Information
Tasks
One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Basically, what I am trying to do is run a fine-tuning script in a multi-GPU setting (Distributed Data Parallel). My setup looks like this: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm_no_trainer.py#L311.
Let me provide a toy example to illustrate the problem.
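A minimal sketch of the kind of setup I mean (the file name, dataset, batch size, and seed below are just placeholders for illustration):

```python
# toy_shuffle.py -- minimal sketch of the setup, with placeholder data.
import torch
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator
from accelerate.utils import set_seed

SEED = 42  # change this between runs to compare the fetched batches

set_seed(SEED)
accelerator = Accelerator()

# Tiny synthetic dataset so the first fetched batch is easy to eyeball.
dataset = TensorDataset(torch.arange(64))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
dataloader = accelerator.prepare(dataloader)

for epoch in range(2):
    first_batch = next(iter(dataloader))[0]
    accelerator.print(f"epoch={epoch} seed={SEED} first batch: {first_batch.tolist()}")
```

Running this with `python toy_shuffle.py` versus `accelerate launch toy_shuffle.py`, with different values of `SEED`, is enough to compare the first fetched batch in each case.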
Expected behavior
If I run this script with plain python (one process), setting the random seed ensures reproducibility, and changing the seed gives a different shuffling of the data.
But if I run the script with accelerate launch (multi-process), the dataloader shuffles the data in the same way every time, regardless of the random seed: even after changing the seed, the first batch remains exactly the same, and the data is shuffled in the same order.
This is a bit strange to me, because the set_seed function (transformers' or accelerate's) does achieve reproducibility in my one-process experiments; I only noticed this when I started running experiments in a distributed (multi-GPU) scenario.
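(For reference, as far as I understand, set_seed is essentially a convenience wrapper around the individual RNGs, roughly like the simplified sketch below; the function name here is just illustrative.)

```python
import random

import numpy as np
import torch


def set_seed_sketch(seed: int) -> None:
    # Simplified sketch of what transformers'/accelerate's set_seed does:
    # seed Python's, NumPy's, and PyTorch's (CPU + all CUDA devices) RNGs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```

This is why a single-process run becomes reproducible once the seed is set.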
You can run the script I provided with multiple seeds (or even with no seed, if you comment out the set_seed line), and the fetched batches will be exactly the same.
I also tried to include an extra argument in the Accelerator initialization. By doing this, the random seeds worked for the first epoch (I mean that when I changed the seed, the data was fetched in a different order), BUT another problem occurred: the re-shuffling before each epoch no longer worked. So this was not a proper fix (?).
Please let me know if there is a proper way of fixing this, so that experiments with multiple seeds are reproducible. I cannot find a sufficient answer on this anywhere; it seems like a bug to me.
I am sorry to tag you guys, but please, can you take a quick look at this? I can't figure out what the problem is, and I suppose others may face it too. @sgugger @muellerzr