I try to train our model with stylegan-2, find a bug, how I can fix it #3404

Open
lingtengqiu opened this issue Feb 20, 2025 · 1 comment

Comments

@lingtengqiu

Thanks for your great work.
When I use accelerate to train our model with a StyleGAN2 discriminator, I get the following error as StyleGAN2 tries to compile its CUDA op.
Could you tell me how to fix it?

Setting up PyTorch plugin "bias_act_plugin"... /usr/local/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Done.
E0220 14:42:00.513000 139816943049600 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 60422) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
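
A note on reading the log above: by the usual convention, a negative exit code such as `exitcode: -11` means the worker was killed by a signal, and on Linux signal 11 is SIGSEGV. So the process appears to have died from a segmentation fault rather than from a Python exception. A minimal sketch for decoding such codes follows; the helper name `describe_exit_code` is hypothetical and not part of torch or accelerate.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Turn a process exit code into a readable description.

    Negative codes follow the convention "-N means killed by signal N".
    """
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {-code}"
        return f"killed by signal {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    # The exit code reported in the log above.
    print(describe_exit_code(-11))
```

Assuming the crash happens inside the training process itself, calling the standard-library `faulthandler.enable()` near the top of the training script should print the Python stack at the moment of the segfault, which may help show whether the compiled plugin is involved.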
@jeejaybee

I'm having a similar issue with StyleGAN3: I get the "UserWarning: TORCH_CUDA_ARCH_LIST is not set" message during training. I've tried specifying the architectures when launching the training script, like so:

TORCH_CUDA_ARCH_LIST="8.6;8.9" python train.py --outdir=training-runs \
    --cfg=stylegan3-t \
    --data=/data \
    --gpus=1 \
    --batch=32 \
    --mbstd-group=1 \
    --gamma=5.0 \
    --snap=100

I also tried setting TORCH_CUDA_ARCH_LIST in the Dockerfile, but that didn't work either. Have you found a solution yet?
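
One thing that might be worth trying, as a sketch rather than a confirmed fix: set the variable from inside Python, before any plugin is built, using the compute capabilities of the GPUs that are actually visible. Note that the message is only a warning. Per its own text, leaving the variable unset means all archs for visible cards are compiled, so it may not be the cause of a failure. The helper names below are hypothetical; `torch.cuda.device_count` and `torch.cuda.get_device_capability` are existing PyTorch calls.

```python
import os


def arch_list_from_capabilities(capabilities):
    """Format (major, minor) pairs as a TORCH_CUDA_ARCH_LIST value.

    Duplicates are dropped and entries are sorted, e.g. '8.6;8.9'.
    """
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def set_arch_list_from_visible_gpus():
    """Set TORCH_CUDA_ARCH_LIST from the visible GPUs, if it is not already set.

    Assumption: this runs before the first custom op is compiled.
    """
    import torch  # imported lazily so the formatting helper works without torch

    capabilities = [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]
    if capabilities:
        # setdefault keeps any value the user already exported in the shell.
        os.environ.setdefault(
            "TORCH_CUDA_ARCH_LIST", arch_list_from_capabilities(capabilities)
        )
    return os.environ.get("TORCH_CUDA_ARCH_LIST")
```

Calling `set_arch_list_from_visible_gpus()` at the very top of the training script, before the networks are constructed, is the intended usage under that assumption.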
