Model not getting trained on single GPU #32

Open
aryanmangal769 opened this issue Aug 28, 2023 · 16 comments

@aryanmangal769

When I try to train on a single GPU, the error keeps increasing and I cannot see any good results even at the 38th epoch.

train_class_error starts at 97.88, and from the 19th to the 37th epoch it is consistently 100. Could you help debug this?

Please let me know if you need any more information.

@yrcong
Owner

yrcong commented Oct 20, 2023

We train the model for 150 epochs; the 38th epoch might still be within the warm-up phase. Maybe you can try loading some pretrained weights to accelerate training?
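Loading pretrained weights before training typically looks something like the sketch below; the tiny model and the checkpoint path are placeholders, not names taken from this repository.

```python
import torch
import torch.nn as nn

# Placeholder model and checkpoint path -- substitute the real network and file.
model = nn.Linear(10, 2)
checkpoint_path = "path/to/pretrained_checkpoint.pth"

checkpoint = torch.load(checkpoint_path, map_location="cpu")
# Some checkpoints nest the weights under a "model" key; fall back to the raw dict otherwise.
state_dict = checkpoint.get("model", checkpoint)

# strict=False tolerates keys that are missing from or unexpected in the checkpoint
# (e.g. task-specific heads) and returns them for inspection.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```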

@qql-Kenneth

@aryanmangal769 How do you train the model on one GPU?

@qql-Kenneth

Answering my own question: I added os.environ['MASTER_PORT'] = '8889' in main.py.

@yrcong
Owner

yrcong commented Apr 22, 2024

It is not related to the port. Please set --nproc_per_node=1.
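For anyone hitting the same problem, the two workarounds above (fixing the port and launching a single process) boil down to something like the sketch below; the launch commands, port number, and environment values are examples rather than settings taken from this repository.

```python
# Assumed single-GPU launch (adapt main.py and flags to the actual project):
#   python -m torch.distributed.launch --nproc_per_node=1 main.py
# or, on newer PyTorch:
#   torchrun --nproc_per_node=1 main.py
import os

import torch
import torch.distributed as dist

# Setting these explicitly avoids init failures when the launcher does not export them;
# the port number is only an example.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "8889")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, rank=0, world_size=1)
print("initialized with", dist.get_world_size(), "process(es)")
dist.destroy_process_group()
```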

@AlphaGoooo

I trained for 70 epochs but the results are still bad, including the errors and the loss. The loss stays around 33 or 34. Is this normal, or has something gone wrong?
[attached screenshot: 微信图片_20240712155952]

@wuzhiwei2001

> It is not related to the port. Please set --nproc_per_node=1.

I set --nproc_per_node=1, but I am still getting the error torch.distributed.elastic.multiprocessing.errors.ChildFailedError. How can I resolve this issue? Thanks for your reply.

@A11en4z

A11en4z commented Sep 5, 2024

> > It is not related to the port. Please set --nproc_per_node=1.
>
> I set --nproc_per_node=1, but I am still getting the error torch.distributed.elastic.multiprocessing.errors.ChildFailedError. How can I resolve this issue? Thanks for your reply.

Maybe you should update the PyTorch version; torch 1.6 + CUDA 10.1 doesn't support the latest graphics cards.
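A quick way to check whether the installed PyTorch/CUDA build actually supports the GPU is a sanity-check script along these lines; nothing here is specific to this repository, and torch.cuda.get_arch_list() needs a reasonably recent PyTorch.

```python
import torch

print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        major, minor = torch.cuda.get_device_capability(i)
        print(f"GPU {i}: {name}, compute capability {major}.{minor}")
    # Old builds such as torch 1.6 + CUDA 10.1 were not compiled for newer architectures
    # (e.g. Ampere, compute capability 8.x), which typically surfaces as crashed worker
    # processes and a ChildFailedError from the elastic launcher.
    print("compiled for:", torch.cuda.get_arch_list())
```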

@wuzhiwei2001

> Maybe you should update the PyTorch version; torch 1.6 + CUDA 10.1 doesn't support the latest graphics cards.

Thank you very much for your reply! I can run it now!

@A11en4z

A11en4z commented Sep 9, 2024

I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.

@wuzhiwei2001

> I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.

How low does your loss get before it stops decreasing? Thanks!

@A11en4z

A11en4z commented Sep 9, 2024

> > I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.
>
> How low does your loss get before it stops decreasing? Thanks!

Before I used 4 GPUs, the loss would float around 32-33 and the accuracy was poor. I then tried the weights posted by the author on a subset and the loss was around 15; you can use that as a reference.

@wuzhiwei2001

> > > I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.
> >
> > How low does your loss get before it stops decreasing? Thanks!
>
> Before I used 4 GPUs, the loss would float around 32-33 and the accuracy was poor. I then tried the weights posted by the author on a subset and the loss was around 15; you can use that as a reference.

Thanks!
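When fewer GPUs stall training like this, one generic workaround is gradient accumulation, which restores the larger effective batch size of the 4-GPU run on a single card; the sketch below is a minimal illustration with toy stand-ins, not this repository's actual training loop.

```python
import torch
import torch.nn as nn

# Toy placeholders for the real model, criterion, and data loader.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [(torch.randn(2, 10), torch.randint(0, 2, (2,))) for _ in range(8)]

accumulation_steps = 4  # e.g. 1 GPU x batch 2 x 4 steps ~ 4 GPUs x batch 2

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    # Scale the loss so the accumulated gradients average over the effective batch.
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```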

@iamagoodboyyeyeye

> I trained for 70 epochs but the results are still bad, including the errors and the loss. The loss stays around 33 or 34. Is this normal, or has something gone wrong? [attached screenshot: 微信图片_20240712155952]

Hello, may I ask how many GPUs you used for training, and have you obtained good results by now? I also ran into this problem with a single GPU.

@iamagoodboyyeyeye

> I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.

I also have problems when using a single GPU, so is it a problem with the number of GPUs used?

@A11en4z

A11en4z commented Oct 21, 2024

> > I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.
>
> I also have problems when using a single GPU, so is it a problem with the number of GPUs used?

Hello, I tried 2 GPUs and increased the batch size, and it didn't work, but I was able to train fine with 4 GPUs.

@iamagoodboyyeyeye

> > > I tried a single GPU and dual GPUs with batch sizes of 2 and 4, and none of those setups trained effectively. But when I use 4 GPUs with batch=2, the loss decreases as expected.
> >
> > I also have problems when using a single GPU, so is it a problem with the number of GPUs used?
>
> Hello, I tried 2 GPUs and increased the batch size, and it didn't work, but I was able to train fine with 4 GPUs.

Thank you for your reply. If possible, I would like to add you on WeChat so we can discuss this. My WeChat ID is zhl15042182325.
