Model not getting trained on single GPU #32

Open
aryanmangal769 opened this issue Aug 28, 2023 · 16 comments

@aryanmangal769

When I try to train on a single GPU, the error keeps increasing and I cannot see any good results even at the 38th epoch.

train_class_error starts at 97.88, and from the 19th to the 37th epoch it is consistently 100. Could you help debug this?

Please let me know if you need any more information.

@yrcong
Owner

yrcong commented Oct 20, 2023

We train the model for 150 epochs; the 38th epoch might still be within the warm-up phase. Maybe you can try loading some pretrained weights to accelerate training?
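Loading pretrained weights before training typically looks something like the sketch below; the tiny model and the checkpoint path are placeholders, not names taken from this repository.

```python
import torch
import torch.nn as nn

# Placeholder model and checkpoint path -- substitute the real network and file.
model = nn.Linear(10, 2)
checkpoint_path = "path/to/pretrained_checkpoint.pth"

checkpoint = torch.load(checkpoint_path, map_location="cpu")
# Some checkpoints nest the weights under a "model" key; fall back to the raw dict otherwise.
state_dict = checkpoint.get("model", checkpoint)

# strict=False tolerates keys that are missing from or unexpected in the checkpoint
# (e.g. task-specific heads) and returns them for inspection.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```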

@qql-Kenneth

@aryanmangal769 How do you train the model on one GPU?

@qql-Kenneth

Answering my own question: I added os.environ['MASTER_PORT'] = '8889' in main.py.

@yrcong
Owner

yrcong commented Apr 22, 2024

It is not related to the port. Please set --nproc_per_node=1.
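For anyone hitting the same problem, the two workarounds above (fixing the port and launching a single process) boil down to something like the sketch below; the launch commands, port number, and environment values are examples rather than settings taken from this repository.

```python
# Assumed single-GPU launch (adapt main.py and flags to the actual project):
#   python -m torch.distributed.launch --nproc_per_node=1 main.py
# or, on newer PyTorch:
#   torchrun --nproc_per_node=1 main.py
import os

import torch
import torch.distributed as dist

# Setting these explicitly avoids init failures when the launcher does not export them;
# the port number is only an example.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "8889")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, rank=0, world_size=1)
print("initialized with", dist.get_world_size(), "process(es)")
dist.destroy_process_group()
```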

@AlphaGoooo

I trained for 70 epochs but the results are still bad, including the errors and the loss. The loss stays around 33 or 34. Is this normal, or has something gone wrong?
[attached screenshot: 微信图片_20240712155952]

@wuzhiwei2001

> It is not related to the port. Please set --nproc_per_node=1.

I set --nproc_per_node=1, but I am still getting the error torch.distributed.elastic.multiprocessing.errors.ChildFailedError. How can I resolve this issue? Thanks for your reply.

@A11en4z

A11en4z commented Sep 5, 2024

> > It is not related to the port. Please set --nproc_per_node=1.
>
> I set --nproc_per_node=1, but I am still getting the error torch.distributed.elastic.multiprocessing.errors.ChildFailedError. How can I resolve this issue? Thanks for your reply.

Maybe you should update the PyTorch version; torch 1.6 + CUDA 10.1 doesn't support the latest graphics cards.
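A quick way to check whether the installed PyTorch/CUDA build actually supports the GPU is a sanity-check script along these lines; nothing here is specific to this repository, and torch.cuda.get_arch_list() needs a reasonably recent PyTorch.

```python
import torch

print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        major, minor = torch.cuda.get_device_capability(i)
        print(f"GPU {i}: {name}, compute capability {major}.{minor}")
    # Old builds such as torch 1.6 + CUDA 10.1 were not compiled for newer architectures
    # (e.g. Ampere, compute capability 8.x), which typically surfaces as crashed worker
    # processes and a ChildFailedError from the elastic launcher.
    print("compiled for:", torch.cuda.get_arch_list())
```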

@wuzhiwei2001

> Maybe you should update the PyTorch version; torch 1.6 + CUDA 10.1 doesn't support the latest graphics cards.

Thank you very much for your reply! I can run it now!

@A11en4z

A11en4z commented Sep 9, 2024

I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.

@wuzhiwei2001

> I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.

How low does your loss get before it stops decreasing? Thanks!

@A11en4z

A11en4z commented Sep 9, 2024

> > I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.
>
> How low does your loss get before it stops decreasing? Thanks!

Before I used 4 GPUs, the loss would float around 32-33 and the accuracy was poor. I then tried the weights posted by the author on a subset and the loss was around 15; you can use that as a reference.

@wuzhiwei2001

> > > I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.
> >
> > How low does your loss get before it stops decreasing? Thanks!
>
> Before I used 4 GPUs, the loss would float around 32-33 and the accuracy was poor. I then tried the weights posted by the author on a subset and the loss was around 15; you can use that as a reference.

Thanks!
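When fewer GPUs stall training like this, one generic workaround is gradient accumulation, which restores the larger effective batch size of the 4-GPU run on a single card; the sketch below is a minimal illustration with toy stand-ins, not this repository's actual training loop.

```python
import torch
import torch.nn as nn

# Toy placeholders for the real model, criterion, and data loader.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [(torch.randn(2, 10), torch.randint(0, 2, (2,))) for _ in range(8)]

accumulation_steps = 4  # e.g. 1 GPU x batch 2 x 4 steps ~ 4 GPUs x batch 2

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    # Scale the loss so the accumulated gradients average over the effective batch.
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```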

@iamagoodboyyeyeye

> I trained for 70 epochs but the results are still bad, including the errors and the loss. The loss stays around 33 or 34. Is this normal, or has something gone wrong? [attached screenshot: 微信图片_20240712155952]

Hello, may I ask how many GPUs you used for training, and have you obtained good results by now? I also ran into this problem with a single GPU.

@iamagoodboyyeyeye

> I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.

I also have problems when using a single GPU, so is it a problem with the number of GPUs used?

@A11en4z

A11en4z commented Oct 21, 2024

> > I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.
>
> I also have problems when using a single GPU, so is it a problem with the number of GPUs used?

Hello, I tried 2 GPUs and increased the batch size, and it didn't work, but I was able to train fine with 4 GPUs.

@iamagoodboyyeyeye

> > > I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.
> >
> > I also have problems when using a single GPU, so is it a problem with the number of GPUs used?
>
> Hello, I tried 2 GPUs and increased the batch size, and it didn't work, but I was able to train fine with 4 GPUs.

Thank you for your reply. If possible, I would like to add you on WeChat so we can discuss this. My WeChat ID is zhl15042182325.
