Encountered NaN while trying to train #6
Comments
Could you provide a minimal reproducible example? I have tried single GPU training outlined in the
FWIW, I also trained a mini-model (in 32-bit mode). I did not notice instability or particularly bad training performance.
@liujuncn Can you elaborate on your training setup? What model config? What dataset? What mixed precision regime do you use (compute, accumulate, params dtypes)? When does NaN appear in training? On what layer?
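For context on what the mixed precision question refers to, a typical PyTorch AMP loop looks roughly like this (a generic sketch, not this repo's training code; `model`, `optimizer`, and `dataloader` are placeholders):

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # loss scaling to avoid fp16 gradient underflow

for batch in dataloader:  # `dataloader`, `model`, `optimizer` are placeholders
    optimizer.zero_grad(set_to_none=True)
    # compute dtype is fp16/bf16 inside autocast; params and optimizer state stay fp32
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    # gradient clipping often helps with NaN spikes in half precision
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
```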
Unfortunately I don't have a minimal example either, but I also encountered these NaN/infinity problems while training a bigger model. They happened during the forward pass, most commonly right before GroupNorm. My model's config was:
That model was special in that I copied all the embeddings from a different model (Falcon-7B, hence the hidden_size) without any rescaling, and also turned off gradients for them to prevent them from being trained along with the MLPs (my questionable rationale being that the embeddings were already pretrained and good enough and should not be disturbed). It was mentioned in another issue that there was a fix to torchscale's implementation aimed at improving numerical stability (microsoft/torchscale#47), so there may well be a similar problem here. However, I did not manage to port that fix without compromising the consistency of the recurrent/parallel pass, so I gave up on it. In the end, to get out of the instability, I applied a dirty hack of inserting torch.nan_to_num at various vulnerable locations. Miraculously the weights converged to a more numerically stable configuration with that, and I could even remove these protections. But it surely does not seem like the correct approach.
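For illustration, the kind of torch.nan_to_num stop-gap described above looks roughly like this (a minimal sketch with a hypothetical retention-style block; the actual layer names and insertion points in the real model differ):

```python
import torch
import torch.nn as nn

class GuardedRetentionBlock(nn.Module):
    """Hypothetical block with a nan_to_num guard before GroupNorm.

    Only a sketch of the 'dirty hack' described above, not the repo's code.
    """

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.retention = nn.Linear(hidden_size, hidden_size)  # stand-in for real retention
        self.group_norm = nn.GroupNorm(num_heads, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.retention(x)
        # Replace NaN/inf produced upstream before they reach GroupNorm,
        # which is where the bad values were most often observed.
        y = torch.nan_to_num(y, nan=0.0, posinf=1e4, neginf=-1e4)
        # GroupNorm normalizes over channels, so move hidden dim to position 1.
        y = self.group_norm(y.transpose(1, 2)).transpose(1, 2)
        return y
```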
@jploski Thank you very much for your detailed comment. I've faced the same instability issues with a model of similar size and I'm trying to figure it out.
I couldn't try this model in a large training setting yet, and on my tiny synthetic dataset I didn't have issues. But it's great to know about these issues; let's work together to solve them! I have been working on porting the official torchscale implementation to HF, which is almost finished except for chunkwise forward. In fact, I have pushed the branch (
@syncdoth There are definitely some issues with stability in this implementation, whilst everything is fine with
Why not just copy-paste the original
@daskol That's one way to do it, but some things can't be done that way, the most important of which is

Still, I followed your advice and have the

I can also confirm that the model training is stable with
This might answer your query. microsoft/torchscale#48
I am trying to compare with other transformer architectures, but as soon as training starts, the gradients become NaN.
With other transformer architectures on the same dataset and hyperparameters, this does not happen.
I don't know where the numerical stability problems are.
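One way to narrow down where the NaNs first appear (a debugging sketch, not part of this repo; `model` stands for whatever RetNet instance is being trained) is to attach forward hooks that flag the first module emitting non-finite values:

```python
import torch

def register_nan_hooks(model: torch.nn.Module):
    """Attach forward hooks that report modules whose output contains NaN/inf."""
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite output in: {name} ({module.__class__.__name__})")
                    break
        return hook

    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
    return handles  # call h.remove() on each handle when done

# torch.autograd.set_detect_anomaly(True) additionally pinpoints the backward
# op that produces NaN gradients, at the cost of much slower training.
```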