Issue resuming training on transformer-based NER #6323
Comments
I noticed from the stack trace that the error is raised during evaluation over the dev set, so I reduced its size by half and now it works OK.
I'm a bit confused: when you originally trained the model, didn't you evaluate it on the dev set?
I did evaluate it on the dev set, and it worked, hence my confusion as well: why does it work when training from scratch but fail when attempting to re-train?
And the pipeline was otherwise entirely the same? So there are no differences between the first training run and the resuming run (other than the "source" bit in the config, of course)?
Yes, exactly the same. Also the same train and dev data.
Looking at the output:
It appears that your memory is already occupied somewhere else. I'm not sure how this fits together with the shell command (this should actually not apply here), but PyTorch can sometimes be a bit problematic when it comes to releasing GPU memory. Edit: it could of course be that the allocation happens gradually and only the last part is shown, but it may be worth checking the memory allocation before resuming the training.
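In case it helps, here is a minimal sketch (not from the original thread) of one way to check GPU memory from Python right before resuming; it assumes a PyTorch-backed transformer pipeline and a single CUDA device:

```python
import torch

# Rough picture of how much GPU memory is already in use before the
# resume run starts (assumes PyTorch and a single CUDA device).
if torch.cuda.is_available():
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated() / gib
    reserved = torch.cuda.memory_reserved() / gib
    total = torch.cuda.get_device_properties(0).total_memory / gib
    print(f"allocated by tensors: {allocated:.2f} GiB")
    print(f"reserved by PyTorch:  {reserved:.2f} GiB")
    print(f"total on device:      {total:.2f} GiB")
```

Running `nvidia-smi` just before the training command gives the same picture from outside the process.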
Either way, this part in language.evaluate() is probably the culprit:
This was added just for timing purposes. I think the code should still run without these two lines, though. Any chance you can check whether removing them improves things memory-wise? (Your timing results will be temporarily wrong, but let's worry about that later.)
Oops, yeah, that should be done differently. (But I don't understand why this ends up different in the second round than in the first.)
Great! Thanks a lot for the workaround; I will test this and post an update.
Just FYI, the workaround did not work; I still get the same error on this line:
I pulled your fix, @adrianeboyd, and I still get the same OOM exception at language.py line 1319:
Is this just for timing purposes? Can I safely remove those lines? Thanks!
It isn't just for timing purposes, because you're not actually running the final component (which is the NER model you're trying to train) unless you iterate over that generator. (Earlier versions had the scorer iterate over this generator, and the overall goal here was to separate the pipeline timing from the scorer timing.) I think the previous version was still a bit clunky, so I've reworked it a bit more. Can you try the updated version here? #6386

Looking at this again, I think the problem might actually be that the default batch size used during evaluation is too high. We're also running into some memory issues internally on CPU (for …).

Since this is something that may need to be adjusted and have different defaults for CPU vs. GPU, I think we'll most likely need a way to specify the batch size for evaluate from the config, but I'm not sure exactly how yet. We may need to add a …

(And I still don't know what's going on with the differences between training from scratch and resuming.)
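(Not part of the thread, but a possible workaround sketch in the meantime: `nlp.evaluate` accepts a `batch_size` keyword in spaCy v3, so the dev set can be evaluated manually with a smaller batch to check whether the evaluation batch size is really what runs out of memory. The paths below are hypothetical.)

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

spacy.prefer_gpu()  # use the GPU if one is available

# Hypothetical paths: a saved pipeline and the dev set from this thread.
nlp = spacy.load("model_t/model-best")
doc_bin = DocBin().from_disk("test.spacy")

examples = []
for gold in doc_bin.get_docs(nlp.vocab):
    # Pair a freshly tokenized doc with the gold annotations for scoring.
    examples.append(Example(nlp.make_doc(gold.text), gold))

# A smaller batch_size lowers peak memory during evaluation.
scores = nlp.evaluate(examples, batch_size=32)
print(scores)
```

If this runs cleanly with a small batch size but fails with a larger one, that would support the batch-size explanation above.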
I'm using the nightly version. I have successfully trained a transformer-based NER model and saved it; now I'm trying to resume training on it.
First, I'm not sure whether I have set up the config file correctly; the relevant part looks like this:
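(The poster's actual config is not shown here. For reference, a minimal sketch of how components are typically sourced from a saved pipeline in a spaCy v3 config, with a hypothetical path:)

```ini
[components.transformer]
source = "model_t/model-best"

[components.ner]
source = "model_t/model-best"
```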
Now, after trying to train like this:
!python -m spacy train 'config.cfg' --output='model_t' --gpu-id=0 --paths.train train.spacy --paths.dev test.spacy
I'm getting this error message:
I understand the message is telling me I'm out of memory, but it seems weird that I can train from scratch with no issues yet get this error when trying to resume training on the saved model. Any help is appreciated.
Your Environment