
encountered nan while trying to train #6

Open
liujuncn opened this issue Jul 25, 2023 · 10 comments

Comments

@liujuncn

liujuncn commented Jul 25, 2023

[screenshot of the training run]

I'm trying to compare this with other transformer architectures, but as soon as training starts, the gradients become NaN.
Other transformer architectures trained on the same dataset with the same hyperparameters do not show this.

I don't know where the numerical stability problem is.
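For reference, a minimal way to confirm it is really the gradients going non-finite (generic PyTorch, not code from this repo) is to scan them right after backward():

```python
import torch

def report_nonfinite_grads(model: torch.nn.Module) -> None:
    """Print every parameter whose gradient contains NaN or Inf.

    Call right after loss.backward() and before optimizer.step().
    """
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in: {name}")
```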

@syncdoth
Owner

Could you provide a minimal reproducible example? I have tried the single-GPU training outlined in the repo's train.py with the Hugging Face Trainer API without encountering numerical instability.

@liujuncn
Author

liujuncn commented Jul 25, 2023

I looked at train.py; I think you are training in 32-bit.

[screenshots of train.py]

I also switched to 32-bit mode and training is fine, but it is very slow (same dataset, but fewer layers to fit in memory).
So the problem is with 16-bit AMP training.
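As a sketch of the difference (assuming the Hugging Face Trainer API used in train.py; the argument values here are placeholders), bf16 AMP keeps the fp32 exponent range and is typically far less overflow-prone than fp16:

```python
from transformers import TrainingArguments

# Placeholder values; only the precision flags matter for this point.
args = TrainingArguments(
    output_dir="out",
    fp16=False,  # fp16 AMP: narrow exponent range, overflows easily -> NaN
    bf16=True,   # bf16 AMP: same exponent range as fp32, usually much more stable
    per_device_train_batch_size=8,
    learning_rate=3e-4,
)
```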

@jploski
Contributor

jploski commented Aug 5, 2023

FWIW, I also trained a mini-model (in 32-bit mode). I did not notice any instability or particularly bad training performance.

@daskol

daskol commented Sep 4, 2023

@liujuncn Can you elaborate on your training setup? What model config? What dataset? What mixed-precision regime do you use (compute, accumulation, and parameter dtypes)? When does the NaN appear in training? On what layer?
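One generic way to answer the "on what layer" question empirically (plain PyTorch, not specific to this repo) is to register forward hooks that report which modules emit non-finite outputs; a sketch:

```python
import torch

def install_nan_hooks(model: torch.nn.Module):
    """Register forward hooks that report every module emitting NaN/Inf outputs."""
    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and not torch.isfinite(out).all():
                print(f"non-finite output from: {name}")
        return hook

    # Keep the handles so the hooks can be removed later with handle.remove().
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
```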

@jploski
Contributor

jploski commented Sep 4, 2023

Unfortunately I don't have a minimal example either, but I also encountered these NaN/infinity problems while training a bigger model. They happened during the forward pass, most commonly right before GroupNorm. My model's config was:

vocab_size: 65024
hidden_size: 4544
num_layers: 4
num_heads: 8
qk_dim: 4544
v_dim: 9088
ffn_proj_size: 9088
use_bias_in_msr: False
use_bias_in_mlp: True
use_bias_in_msr_out: False
use_default_gamma: False
initializer_range: 0.02
output_retentions: False
pad_token_id: 11
eos_token_id: 11
unk_token_id: 11

That model was special in that I copied all the embeddings from a different model (Falcon-7B, hence the hidden_size) without any rescaling, and I also turned off gradients for them to prevent them from being trained along with the MLPs (my questionable rationale being that the embeddings were already pretrained and good enough and should not be disturbed).
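The "copy and freeze the embeddings" part can be done with plain PyTorch along these lines (a sketch assuming an HF-style model with get_input_embeddings(); variable names are hypothetical, not the exact code used):

```python
import torch
from transformers import PreTrainedModel

def copy_and_freeze_embeddings(model: PreTrainedModel, donor_weight: torch.Tensor) -> None:
    """Copy an embedding matrix from a donor model and exclude it from training."""
    emb = model.get_input_embeddings()
    with torch.no_grad():
        emb.weight.copy_(donor_weight)   # no rescaling, as described above
    emb.weight.requires_grad_(False)     # the optimizer will then skip this tensor
```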

It was mentioned in another issue that there was a fix to torchscale's implementation aimed at improving numerical stability (microsoft/torchscale#47), so there may well be a similar problem here.

However, I did not manage to port that fix without compromising the consistency of the recurrent/parallel passes, so I gave up on it.

In the end, to get out of the instability, I applied a dirty hack of inserting torch.nan_to_num at various vulnerable locations. Miraculously, the weights converged to a more numerically stable configuration with that, and I could eventually even remove these protections. But it surely does not seem like the correct approach.
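The same kind of stopgap can be applied from the outside with forward hooks instead of editing the model source; this is a generic PyTorch sketch of the idea, not the exact change that was made:

```python
import torch

def clamp_nonfinite_output(module: torch.nn.Module):
    """Replace NaN/Inf in a module's output with finite values (a stopgap, not a fix)."""
    def hook(mod, inputs, output):
        if torch.is_tensor(output):
            return torch.nan_to_num(output, nan=0.0, posinf=1e4, neginf=-1e4)
        return None  # leave tuple/other outputs untouched

    return module.register_forward_hook(hook)

# e.g. protect every GroupNorm, where the non-finite values were first observed:
# handles = [clamp_nonfinite_output(m) for m in model.modules()
#            if isinstance(m, torch.nn.GroupNorm)]
```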

@daskol

daskol commented Sep 4, 2023

@jploski Thank you very much for your detailed comment. I've faced the same instability issues with a model of similar size and I'm trying to figure them out.

@syncdoth
Owner

I haven't been able to try this model in a large training setting yet, and on my tiny synthetic dataset I didn't have issues. But it's great to know about these problems; let's work together to solve them!

I have been working on porting the official torchscale implementation to HF, which is almost finished except for the chunkwise forward. In fact, I have pushed the branch (official_implementation), so if you want "early access" you can try it out and let me know if you find any bugs. The main difference is some tricks for stability, which will hopefully stabilize training and maybe even enable FP16 :)

@daskol

daskol commented Sep 27, 2023

@syncdoth There are definitely some stability issues with this implementation, whereas everything is fine with torchscale.

"I have been working on porting the official torchscale implementation to HF"

Why not just copy-paste the original torchscale.architecture.retnet with all its dependencies and wrap it with HF's model mix-ins? The torchscale repo is under MIT: no legal limitations.

@syncdoth
Owner

syncdoth commented Oct 3, 2023

@daskol That's one way to do it, but some things can't be done that way, the most important of which is attention_mask handling. This is not in the torchscale repo.

Still, I followed your advice and added the torchscale/ directory, which is basically a copy & paste of the official code with minor changes (args -> config, no MoE, no multiway). I have tests/ that suggest my implementations in retnet/ and torchscale/ are functionally identical in the forward/backward passes.

I can also confirm that model training is stable with bf16. Not sure about FP16, and I'm not planning to test it in the near future :(
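A minimal parity check in the spirit of those tests (a sketch with hypothetical model variables; the tests in tests/ are the authoritative version, and .logits assumes HF-style model outputs) might look like:

```python
import torch

def assert_forward_parity(model_a, model_b, vocab_size=1000, seq_len=32):
    """Check that two implementations produce (near-)identical logits on the same input."""
    model_a.eval(), model_b.eval()
    input_ids = torch.randint(0, vocab_size, (2, seq_len))
    with torch.no_grad():
        # .logits assumes HF-style model outputs; adjust for raw tensors.
        out_a, out_b = model_a(input_ids).logits, model_b(input_ids).logits
    assert torch.allclose(out_a, out_b, atol=1e-5, rtol=1e-4)
```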

@Shreyas-Dongre

This might answer your query. microsoft/torchscale#48
