Runtime error #335
Comments
It sounds like you found a solution. What was the issue? I'm glad to hear this repository has been useful for you. |
Thanks for your answer.
I changed the default prefetch to 2, because lib/python3.10/site-packages/torch/utils/data/dataloader.py fails on a lower prefetch value (it asserts prefetch_factor > 0). This seems to do the job. |
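A minimal sketch of that workaround, assuming the installed torch version asserts `prefetch_factor > 0` as shown in the traceback in this issue. The helper name `build_loader_kwargs` is hypothetical and not part of this repository; it only forwards `prefetch_factor` when a value was actually configured, so the library default applies otherwise.

```python
def build_loader_kwargs(n_workers, prefetch_factor=None, pin_memory=False):
    """Hypothetical helper: collect keyword arguments for a DataLoader.

    prefetch_factor is only forwarded when worker processes are used and a
    value was configured, so a missing config key never reaches the
    DataLoader as None.
    """
    kwargs = {"num_workers": n_workers, "pin_memory": pin_memory}
    if n_workers > 0 and prefetch_factor is not None:
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


# Values taken from the log in this issue: n_workers=12, no prefetch_factor set.
print(build_loader_kwargs(12))
# The explicit setting described above.
print(build_loader_kwargs(12, prefetch_factor=2))
```

The resulting dictionary could then be passed on as `DataLoader(dataset, **kwargs)`.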
There was a bug in the example configuration file and the example Jupyter notebook: both used "folds" instead of "n_folds", so the number of folds was not being set properly. PR #337 fixed this issue for me. That being said, using an "n_folds" of 1 causes a different error. So if you do not want to run cross-validation, just take the "cross_validation" key out of the configuration file.
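As a rough illustration, the "cross_validation" key could be dropped from a loaded configuration like this. The example config is hypothetical and heavily trimmed; in particular, placing "n_folds" inside "cross_validation" is an assumption, not something confirmed in this thread.

```python
import json


def without_cross_validation(config):
    """Return a copy of the config dict without the "cross_validation" key."""
    return {key: value for key, value in config.items() if key != "cross_validation"}


# Hypothetical, trimmed config; real configuration files contain more keys.
example = json.loads('{"cross_validation": {"n_folds": 5}, "n_workers": 12}')
print(without_cross_validation(example))
```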
The input and output images are 4D with the channels in the last image dimension. If you load the image using nibabel and print the shape for the brats2020 example the shape will be (128, 128, 128, 4), with the "4" referring to the 4 input images. |
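A short sketch of that shape check. `nibabel.load(path).shape` is the actual nibabel call; the helper names and any file path passed to them are hypothetical. The nibabel import is kept inside the function so the shape helper also works where nibabel is not installed.

```python
def split_channels_last(shape):
    """Split a channels-last shape into (spatial_shape, n_channels)."""
    *spatial, channels = shape
    return tuple(spatial), channels


def image_shape(path):
    """Read the array shape of an image file with nibabel."""
    import nibabel as nib  # imported lazily; only needed when a file is read
    return nib.load(path).shape


# Shape quoted above for the brats2020 example.
print(split_channels_last((128, 128, 128, 4)))
```

With a real file, `split_channels_last(image_shape(path))` would give the spatial size and the number of input images.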
Dear Ellis, the runtime error seems to be a GPU memory issue: the memory is sometimes just enough and sometimes not (2080 Ti). I will check all that and keep you posted. |
excuse the typos :) |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. If you are still wanting followup to this issue, please ping the thread by leaving a comment. You may also contact david.ellis@unmc.edu with questions. |
Thanks for rewriting the program, which has much better data input. The first version was my workhorse over the last years.
I installed the new version and started the program as suggested. After some time I get the following message:
Validation: [71/73] Time 0.352 ( 0.359) Loss 5.7630e-01 (3.0972e-01)
Validation: [72/73] Time 0.353 ( 0.358) Loss 3.5241e-01 (3.1031e-01)
Validation: [73/73] Time 0.361 ( 0.359) Loss 7.5337e-01 (3.1638e-01)
Epoch: [4][ 1/296] Time 1.281 ( 1.281) Data 0.479 ( 0.479) Loss 1.9667e-01 (1.9667e-01)
Epoch: [4][ 2/296] Time 1.934 ( 1.607) Data 0.033 ( 0.256) Loss 2.9035e-01 (2.4351e-01)
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 177, in <module>
main()
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 173, in main
run(config_filename, output_dir, namespace)
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 76, in run
run(_config_filename, work_dir, namespace)
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 131, in run
run_training(model=model.train(), optimizer=optimizer, criterion=criterion,
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/train/train.py", line 55, in run_training
losses.append(epoch_training(training_loader, model, criterion, optimizer=optimizer, epoch=epoch,
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/train/training_utils.py", line 40, in epoch_training
for i, item in enumerate(train_loader):
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
data = self._next_data()
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
idx, data = self._get_data()
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data
success, data = self._try_get_data()
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 297, in rebuild_storage_fd
fd = df.detach()
File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/reduction.py", line 189, in recv_handle
return recvfds(s, 1)[0]
File "/home//anaconda3/envs/ellise_new/lib/python3.10/multiprocessing/reduction.py", line 164, in recvfds
raise RuntimeError('received %d items of ancdata' %
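One commonly reported cause of "received 0 items of ancdata" with DataLoader worker processes is the limit on open file descriptors, since tensors are handed between processes via file descriptors. Whether that applies here is not confirmed in this thread. A small check using only the standard library (Unix only) might look like this:

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"open file descriptors: soft limit {soft}, hard limit {hard}")
```

If the soft limit turns out to be low, raising it with `ulimit -n` in the shell before starting training, or calling `torch.multiprocessing.set_sharing_strategy("file_system")` at the start of the script, are the usual things to try.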
I tried another run with less data. After some time the following popped up:
Epoch: [250][7/8] Time 1.849 ( 1.829) Data 0.008 ( 0.107) Loss 1.1240e-01 (1.4461e-01)
Epoch: [250][8/8] Time 1.847 ( 1.831) Data 0.009 ( 0.095) Loss 1.1719e-01 (1.4118e-01)
Validation: [1/2] Time 0.715 ( 0.715) Loss 1.6420e-01 (1.6420e-01)
Validation: [2/2] Time 0.355 ( 0.535) Loss 2.2418e-01 (1.9419e-01)
2024-02-16 12:19:10,167 - root - DEBUG - Could not find value for key 'validation'; default to {}
2024-02-16 12:19:10,167 - root - DEBUG - Found value '1' for key 'validation_batch_size'
2024-02-16 12:19:10,167 - root - DEBUG - Could not find value for key 'prefetch_factor'; default to None
2024-02-16 12:19:10,167 - root - INFO - Found inference filenames: bratsvalidation (n=10)
2024-02-16 12:19:10,167 - root - DEBUG - Found value '12' for key 'n_workers'
2024-02-16 12:19:10,167 - root - DEBUG - Found value 'False' for key 'pin_memory'
Traceback (most recent call last):
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 177, in <module>
main()
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 173, in main
run(config_filename, output_dir, namespace)
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 76, in run
run(_config_filename, work_dir, namespace)
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py", line 149, in run
for _dataloader, _name in build_inference_loaders_from_config(config,
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/script_utils.py", line 168, in build_inference_loaders_from_config
inference_dataloaders.append([build_inference_loader(filenames=config[key],
File "/home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/script_utils.py", line 189, in build_inference_loader
_loader = DataLoader(_dataset,
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/monai/data/dataloader.py", line 106, in __init__
super().__init__(dataset=dataset, num_workers=num_workers, **kwargs)
File "/home//anaconda3/envs/ellise_new/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 232, in __init__
assert prefetch_factor > 0
TypeError: '>' not supported between instances of 'NoneType' and 'int'
The starting command line was: python /home//aktuelle_NETS/3DUnetCNN_new/unet3d/scripts/train.py --nthreads 12 --config_filename test_brats2020_config.json --output_dir ./out
This seems to be related not to the network software itself but to the Python version. I have 3.10 in this conda env. Which version are you using?
Any other ideas?
Thanks for the help
Alex