RuntimeError: Function 'DotBackward0' returned nan values in its 0th output. #8

Open
Bingo-1996 opened this issue Jun 27, 2022 · 9 comments

Comments

@Bingo-1996

Hi~ Thank you for releasing the code. When I run the training code, the loss becomes NaN after several epochs. I have tried three times and hit the same problem each time. I did not modify any parameters. Can you give me some advice?

Screenshot from 2022-06-27 12-00-23

@lolrudy
Owner

lolrudy commented Jun 27, 2022

Which version of code are you using? The main branch or the shape prior one?

@lolrudy
Owner

lolrudy commented Jun 27, 2022

I encountered this error before. There are two potential reasons:

  1. The sampled point cloud contains no points because the ground truth mask is erroneous.
  2. The bounding box voting process computes a matrix inverse. A matrix that is not full rank triggers the error.

You might need to check the loss value before calling backward on it and mask out any NaN.
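
A minimal sketch of such a check, assuming a standard PyTorch training loop. The names `safe_step`, `model`, and `optimizer` are hypothetical and do not come from this repository. It skips the parameter update whenever the loss or any gradient is not finite:

```python
import torch


def safe_step(loss, model, optimizer):
    """Hypothetical helper: update parameters only if loss and gradients are finite.

    Returns True if the optimizer step was taken, False if the batch was skipped.
    """
    optimizer.zero_grad()
    # Skip the batch when the forward pass already produced NaN or Inf.
    if not torch.isfinite(loss).all():
        return False
    loss.backward()
    # A finite loss can still yield non-finite gradients, so check those too.
    for param in model.parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            optimizer.zero_grad()
            return False
    optimizer.step()
    return True
```

In a training loop this would replace the usual `loss.backward(); optimizer.step()` pair, for example `stepped = safe_step(total_loss, model, optimizer)`. If anomaly detection is enabled, `backward` raises a RuntimeError like the one in the issue title instead of returning NaN gradients, so the call would additionally need a `try`/`except`.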

@Bingo-1996
Author

> Which version of code are you using? The main branch or the shape prior one?

I use the main branch.

@Bingo-1996
Author

Thank you for your advice. I will try it

@HannahHaensen

@Bingo-1996 can you post how you solved this?
or have you solved it?
:) thanks in advance

@HannahHaensen

@lolrudy and @Bingo-1996
like that?

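import torch

# Before backward: filter NaN entries out of the loss.
# Assumes total_loss is a tensor with a leading batch dimension, not a scalar.
shape = total_loss.shape
total_loss = total_loss.reshape(shape[0], -1)
# Drop all rows containing any NaN:
total_loss = total_loss[~torch.any(total_loss.isnan(), dim=1)]
# Reshape back to the original trailing dimensions:
total_loss = total_loss.reshape(total_loss.shape[0], *shape[1:])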

@Bingo-1996
Author

@HannahHaensen I haven't solved this problem, and there are other problems when I use the shape prior branch. Did you solve the problem?

@lolrudy
Owner

lolrudy commented Jul 21, 2022

Yes, this should work. Or you can set all nan values to 0.
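
A minimal sketch of the second option, assuming the loss is available as a tensor with one value per sample. The helper name `zero_nan_mean` is hypothetical and not part of this repository. It sets NaN entries to 0 and averages over the remaining entries only, so the zeros do not dilute the mean:

```python
import torch


def zero_nan_mean(per_sample_loss):
    """Hypothetical helper: set NaN entries to 0, then average over non-NaN entries."""
    valid = ~torch.isnan(per_sample_loss)
    # nan=0.0 replaces NaN with 0; by default Inf is mapped to the largest finite value.
    cleaned = torch.nan_to_num(per_sample_loss, nan=0.0)
    # clamp avoids division by zero when every entry is NaN.
    return cleaned.sum() / valid.sum().clamp(min=1)
```

This only cleans the loss value. If the NaN originates inside the backward pass of an earlier operation, the gradients may still be NaN, so checking the gradients after `backward` can still be worthwhile.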

@HannahHaensen

@Bingo-1996 Not sure yet. The error occurred for me after ~30 epochs and has not shown up again so far. If the training completes, I can confirm whether the fix works :) I am on the main branch, not the shape prior one.

@lolrudy thanks!
