I try to train our model with stylegan-2, find a bug, how I can fix it #3404

lingtengqiu · 2025-02-20T06:42:21Z

Thanks for your great work.
When I use accelerate to train our model with a StyleGAN2 discriminator, StyleGAN2 tries to compile its custom CUDA op and the run fails with the error below.
Could you tell me how to fix it?

Setting up PyTorch plugin "bias_act_plugin"... /usr/local/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Done.
E0220 14:42:00.513000 139816943049600 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 60422) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
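
A note on reading the log above: the launcher reports `exitcode: -11`, and a negative exit code from a child process conventionally means it was killed by the signal with that number. Signal 11 is SIGSEGV, so the worker appears to have crashed with a segmentation fault rather than a Python exception, which would explain why the `ChildFailedError` carries no Python traceback from the training script. A minimal sketch of that decoding, where `describe_exit_code` is a hypothetical helper and not part of accelerate or torch:

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a child-process exit code into a readable description.

    Hypothetical helper for illustration; not an accelerate or torch API.
    """
    if exitcode < 0:
        # A negative code means the process was terminated by signal -exitcode.
        signum = -exitcode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {exitcode}"


# The value reported in the log above.
print(describe_exit_code(-11))
```

If the crash is a segfault, Python's standard `faulthandler` module may help locate it: calling `faulthandler.enable()` at the top of the training script, or setting `PYTHONFAULTHANDLER=1` in the environment, dumps the Python stack at the moment of the fault.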

jeejaybee · 2025-02-23T23:25:30Z

I'm having a similar issue with StyleGAN3: I get the "UserWarning: TORCH_CUDA_ARCH_LIST is not set" message during training. I've tried specifying the architectures on the command line when launching the training script, like so:

TORCH_CUDA_ARCH_LIST="8.6;8.9" python train.py --outdir=training-runs \
    --cfg=stylegan3-t \
    --data=/data \
    --gpus=1 \
    --batch=32 \
    --mbstd-group=1 \
    --gamma=5.0 \
    --snap=100

I also tried setting the variable in the Dockerfile, but that didn't work either. Have you found a solution yet?
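
For reference, the warning text itself only says that, with `TORCH_CUDA_ARCH_LIST` unset, all architectures of the visible cards are included in the compilation, so it may be unrelated to the crash. If the goal is just to pin the architectures, one option is to set the variable from inside the script before any plugin is compiled. The sketch below assumes the variable is read when the extension is built, as the warning suggests; `format_arch_list` and `detect_capabilities` are hypothetical helper names, and the only torch calls used are `torch.cuda.is_available`, `torch.cuda.device_count`, and `torch.cuda.get_device_capability`.

```python
import os


def format_arch_list(capabilities):
    """Build a TORCH_CUDA_ARCH_LIST value from (major, minor) pairs.

    Hypothetical helper: duplicates are dropped and entries are sorted.
    """
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def detect_capabilities():
    """Return compute capabilities of the visible GPUs, or [] if none are usable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    ]


caps = detect_capabilities()
if caps:
    # setdefault keeps any value already exported in the shell or Dockerfile.
    os.environ.setdefault("TORCH_CUDA_ARCH_LIST", format_arch_list(caps))

# Shows what a later extension build in this process would see.
print(os.environ.get("TORCH_CUDA_ARCH_LIST", "<not set>"))
```

This would need to run at the very top of the training script, before the first custom op is set up. Whether it changes the failure reported in this thread is not confirmed here.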
