Hi kenchan,

I tried to run the ML training code on my own dataset, whose vocab size is 13911 as shown in the log below. In the command for ML training, I specified vocab_size=28000. However, after training on some batches, it crashes with a RuntimeError: cublas runtime error. Could you help me out here? Thanks in advance. The traceback information is as follows:
01/18/2021 21:22:01 [INFO] train: device : cuda:0
INFO:root:Loading vocab from disk: /users/tr.xiaow/kpRL/keyphrase-generation-rl-master/data/case73/
01/18/2021 21:22:01 [INFO] data_loader: Loading vocab from disk: /users/tr.xiaow/kpRL/keyphrase-generation-rl-master/data/case73/
INFO:root:#(vocab)=13911
01/18/2021 21:22:01 [INFO] data_loader: #(vocab)=13911
INFO:root:#(vocab used)=28000
01/18/2021 21:22:01 [INFO] data_loader: #(vocab used)=28000
INFO:root:Loading train and validate data from '/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/data/case73/'
01/18/2021 21:22:01 [INFO] data_loader: Loading train and validate data from '/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/data/case73/'
INFO:root:#(train data size: #(batch)=20900
01/18/2021 21:22:03 [INFO] data_loader: #(train data size: #(batch)=20900
INFO:root:#(valid data size: #(batch)=2359
01/18/2021 21:22:03 [INFO] data_loader: #(valid data size: #(batch)=2359
INFO:root:Time for loading the data: 1.9
01/18/2021 21:22:03 [INFO] train: Time for loading the data: 1.9
INFO:root:====================== Model Parameters =========================
01/18/2021 21:22:03 [INFO] train: ====================== Model Parameters =========================
INFO:root:Training a seq2seq model with copy mechanism
01/18/2021 21:22:03 [INFO] train: Training a seq2seq model with copy mechanism
/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.1 and num_layers=1
"num_layers={}".format(dropout, num_layers))
INFO:root:====================== Start Training =========================
01/18/2021 21:22:07 [INFO] train_ml: ====================== Start Training =========================
Epoch 1; batch: 0; total batch: 0
Epoch 1; batch: 4000; total batch: 4000
Epoch 1; batch: 8000; total batch: 8000
Epoch 1; batch: 12000; total batch: 12000
Epoch 1; batch: 16000; total batch: 16000
Epoch 1; batch: 20000; total batch: 20000
Traceback (most recent call last):
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0ed90000030e'
Traceback (most recent call last):
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
Traceback (most recent call last):
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
Traceback (most recent call last):
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0ed70000030c'
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0eda0000030b'
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0ed80000030d'
Epoch 2; batch: 3100; total batch: 24000
Traceback (most recent call last):
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
Traceback (most recent call last):
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0edc00000310'
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
Traceback (most recent call last):
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0edd00000312'
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0edb0000030f'
Traceback (most recent call last):
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0ede00000311'
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorScatterGather.cu:151: void THCudaTensor_scatterAddKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [101,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorScatterGather.cu:151: void THCudaTensor_scatterAddKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [71,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorScatterGather.cu:151: void THCudaTensor_scatterAddKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [81,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorScatterGather.cu:151: void THCudaTensor_scatterAddKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [91,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorScatterGather.cu:151: void THCudaTensor_scatterAddKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [51,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorScatterGather.cu:151: void THCudaTensor_scatterAddKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [61,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
Traceback (most recent call last):
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0ed700000313'
Traceback (most recent call last):
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
Traceback (most recent call last):
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0ed800000314'
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0ed900000315'
Traceback (most recent call last):
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
finalizer()
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 486, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs00000000173a0eda00000316'
ERROR:root:message
Traceback (most recent call last):
File "train.py", line 164, in main
train_ml.train_model(model, optimizer_ml, optimizer_rl, criterion, train_data_loader, valid_data_loader, opt)
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/train_ml.py", line 91, in train_model
valid_loss_stat = evaluate_loss(valid_data_loader, model, opt)
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/evaluate.py", line 57, in evaluate_loss
decoder_dist, h_t, attention_dist, encoder_final_state, coverage, _, _, _ = model(src, src_lens, trg, src_oov, max_num_oov, src_mask, title=title, title_lens=title_lens, title_mask=title_mask)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/pykp/model.py", line 392, in forward
self.decoder(y_t, h_t, memory_bank, src_mask, max_num_oov, src_oov, coverage, decoder_memory_bank, h_te_t, g_t)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/pykp/rnn_decoder.py", line 121, in forward
context, attn_dist, coverage = self.attention_layer(last_layer_h_next, memory_bank, src_mask, coverage)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/pykp/attention.py", line 132, in forward
scores = self.score(memory_bank, decoder_state, coverage) # [batch_size, max_input_seq_len]
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/pykp/attention.py", line 63, in score
encoder_feature = self.memory_project(memory_bank_) # [batch_size*max_input_seq_len, decoder size]
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 67, in forward
return F.linear(input, self.weight, self.bias)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/functional.py", line 1354, in linear
output = input.matmul(weight.t())
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCBlas.cu:258
01/18/2021 21:29:43 [ERROR] train: message
Traceback (most recent call last):
File "train.py", line 164, in main
train_ml.train_model(model, optimizer_ml, optimizer_rl, criterion, train_data_loader, valid_data_loader, opt)
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/train_ml.py", line 91, in train_model
valid_loss_stat = evaluate_loss(valid_data_loader, model, opt)
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/evaluate.py", line 57, in evaluate_loss
decoder_dist, h_t, attention_dist, encoder_final_state, coverage, _, _, _ = model(src, src_lens, trg, src_oov, max_num_oov, src_mask, title=title, title_lens=title_lens, title_mask=title_mask)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/pykp/model.py", line 392, in forward
self.decoder(y_t, h_t, memory_bank, src_mask, max_num_oov, src_oov, coverage, decoder_memory_bank, h_te_t, g_t)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/pykp/rnn_decoder.py", line 121, in forward
context, attn_dist, coverage = self.attention_layer(last_layer_h_next, memory_bank, src_mask, coverage)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/pykp/attention.py", line 132, in forward
scores = self.score(memory_bank, decoder_state, coverage) # [batch_size, max_input_seq_len]
File "/users/tr.xiaow/kpRL/keyphrase-generation-rl-master/pykp/attention.py", line 63, in score
encoder_feature = self.memory_project(memory_bank_) # [batch_size*max_input_seq_len, decoder size]
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 67, in forward
return F.linear(input, self.weight, self.bias)
File "/users/tr.xiaow/anaconda3/envs/kpRL/lib/python3.6/site-packages/torch/nn/functional.py", line 1354, in linear
output = input.matmul(weight.t())
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCBlas.cu:258
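For what it's worth, the scatter_add assertions above suggest that some index fed into the copy distribution is out of range, and a failed device-side assertion can leave the CUDA context in a bad state so that a later kernel such as the cuBLAS matmul is what actually reports the error. Below is a minimal sanity check I intend to run on my batches before the forward pass. It is only a sketch under my own assumptions: the function name is mine, and I am assuming the copy mechanism scatter-adds into a distribution of size vocab_size + max_num_oov along dim 1, which may not match the repo's exact implementation.

import torch

def check_copy_indices(src_oov, vocab_size, max_num_oov):
    """Report any index that would overflow the copy distribution.

    Assuming the copy mechanism scatter-adds attention weights into a tensor
    of size [batch_size, vocab_size + max_num_oov], every value in src_oov
    must lie in [0, vocab_size + max_num_oov); otherwise the CUDA assertion
    `indexValue >= 0 && indexValue < tensor.sizes[dim]` fires.
    """
    limit = vocab_size + max_num_oov
    bad = (src_oov < 0) | (src_oov >= limit)
    if bad.any():
        print("src_oov out of range: %d bad indices, max index %d, limit %d"
              % (bad.sum().item(), src_oov.max().item(), limit))
        return False
    return True

# Example usage inside my batch loop (variable names are from my own loader
# and may differ from the repo's):
# ok = check_copy_indices(src_oov, opt.vocab_size, max_num_oov)

If the maximum index reported here turns out to exceed the distribution size, that would point at my vocab_size / OOV handling rather than cuBLAS itself. I am happy to share the output of this check if it helps.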
Xiao