It seems that you have not assigned …
Excuse me, I'm new to dpgen and I ran into some problems after installing it. When I run
dpgen run para.json machine.json
I get an error like this:
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(12880, 1), b.shape=(1, 25), m=12880, n=25, k=1
[[node filter_type_1/MatMul_8 (defined at /lib/python3.7/site-packages/deepmd/DescrptSeA.py:370) ]]
(1) Internal: Blas GEMM launch failed : a.shape=(12880, 1), b.shape=(1, 25), m=12880, n=25, k=1
[[node filter_type_1/MatMul_8 (defined at /lib/python3.7/site-packages/deepmd/DescrptSeA.py:370) ]]
[[l2_virial_test/_65]]
0 successful operations.
0 derived errors ignored.
It seems like something is wrong with GPU memory. Our GPU has 8 GB of memory, but dpgen needs more. What causes the memory footprint to be so large? Is this reasonable?
This is the training part of my machine.json.
This is the training part of my para.json. The total number of atoms in the system is 249.
To save memory, the batch size is set to 1.
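For reference, the training part of para.json has roughly the shape sketched below. The values are simplified placeholders rather than my exact settings, and the key layout depends on the deepmd-kit version; the important point is that batch_size is 1.

```python
# Sketch of the "default_training_param" block inside para.json (deepmd-kit 1.x
# style json). The numbers are illustrative placeholders, not the exact settings
# from my file; the key layout also depends on the deepmd-kit version.
import json

default_training_param = {
    "model": {
        "descriptor": {
            "type": "se_a",            # matches the DescrptSeA op in the traceback
            "sel": [60, 60],           # placeholder neighbor limits
            "rcut_smth": 5.8,
            "rcut": 6.0,
            "neuron": [25, 50, 100],   # first filter layer of width 25, as in the MatMul shapes above
            "axis_neuron": 12
        },
        "fitting_net": {"neuron": [240, 240, 240]}
    },
    "learning_rate": {"type": "exp", "start_lr": 0.001, "decay_steps": 5000},
    "loss": {"start_pref_e": 0.02, "limit_pref_e": 1, "start_pref_f": 1000, "limit_pref_f": 1},
    "training": {
        "stop_batch": 400000,
        "batch_size": 1                # reduced to 1 to save GPU memory
    }
}

print(json.dumps({"default_training_param": default_training_param}, indent=2))
```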
I'm really confused by this problem, so I tried another machine.json (a rough sketch follows the log below). This configuration submits four dp training jobs, but only one of them runs properly. The other train.log files show errors like this:
2021-08-30 15:14:35.404081: W tensorflow/core/common_runtime/bfc_allocator.cc:441] ************************************************************************************________________
2021-08-30 15:14:35.405517: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 6.63G (7118275072 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-30 15:14:35.406868: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 6.63G (7118275072 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-30 15:14:35.406914: W tensorflow/core/common_runtime/bfc_allocator.cc:433] Allocator (GPU_0_bfc) ran out of memory trying to allocate 18.24MiB (rounded to 19123200)requested by op SameWorkerRecvDone
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_268_DescrptSeA;0:0
[[{{node DescrptSeA}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Reshape_24/_45]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_268_DescrptSeA;0:0
[[{{node DescrptSeA}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
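For reference, the train part of this second machine.json looks roughly like the sketch below, in the style of the older dpgen dispatcher. The key names are from memory and may differ between dpgen versions; the relevant point is that all four dp training jobs are dispatched to the same local machine with its single 8 GB GPU.

```python
# Rough sketch of the "train" section of the second machine.json (older dpgen
# dispatcher style). Key names may differ between dpgen versions; the paths and
# numbers are placeholders, not the actual file.
import json

train_section = {
    "train": [
        {
            "machine": {
                "batch": "shell",              # run locally, without a queueing system
                "work_path": "/path/to/work"   # placeholder path
            },
            "resources": {
                "numb_node": 1,
                "numb_gpu": 1,                 # the single 8 GB card shared by the jobs
                "task_per_node": 1,
                "source_list": [],
                "module_list": []
            },
            "command": "dp",                   # deepmd-kit entry point used for training
            "group_size": 1
        }
    ]
}

print(json.dumps(train_section, indent=2))
```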
Sorry for any inconvenience caused. Could you please give me some help and advice?