It seems that you have not assigned …
Excuse me, I'm new to dpgen and I ran into some problems after installing it. When I run
dpgen run para.json machine.json
I get an error like this:
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(12880, 1), b.shape=(1, 25), m=12880, n=25, k=1
[[node filter_type_1/MatMul_8 (defined at /lib/python3.7/site-packages/deepmd/DescrptSeA.py:370) ]]
(1) Internal: Blas GEMM launch failed : a.shape=(12880, 1), b.shape=(1, 25), m=12880, n=25, k=1
[[node filter_type_1/MatMul_8 (defined at /lib/python3.7/site-packages/deepmd/DescrptSeA.py:370) ]]
[[l2_virial_test/_65]]
0 successful operations.
0 derived errors ignored.
It seems like something is wrong with GPU memory. Our GPU has 8 GB of memory, but dpgen needs more. What causes the memory footprint to be so large? Is this reasonable?
This is the training part of my machine.json.
This is the training part of my para.json. The total number of atoms in the system is 249.
To save memory, the batch size is set to 1.
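For reference, the training part of para.json has roughly the shape sketched below. The values are simplified placeholders rather than my exact settings, and the key layout depends on the deepmd-kit version; the important point is that batch_size is 1.

```python
# Sketch of the "default_training_param" block inside para.json (deepmd-kit 1.x
# style json). The numbers are illustrative placeholders, not the exact settings
# from my file; the key layout also depends on the deepmd-kit version.
import json

default_training_param = {
    "model": {
        "descriptor": {
            "type": "se_a",            # matches the DescrptSeA op in the traceback
            "sel": [60, 60],           # placeholder neighbor limits
            "rcut_smth": 5.8,
            "rcut": 6.0,
            "neuron": [25, 50, 100],   # first filter layer of width 25, as in the MatMul shapes above
            "axis_neuron": 12
        },
        "fitting_net": {"neuron": [240, 240, 240]}
    },
    "learning_rate": {"type": "exp", "start_lr": 0.001, "decay_steps": 5000},
    "loss": {"start_pref_e": 0.02, "limit_pref_e": 1, "start_pref_f": 1000, "limit_pref_f": 1},
    "training": {
        "stop_batch": 400000,
        "batch_size": 1                # reduced to 1 to save GPU memory
    }
}

print(json.dumps({"default_training_param": default_training_param}, indent=2))
```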
I'm really confused by this problem, so I tried another machine.json (a rough sketch follows the log below). This configuration submits four dp training jobs, but only one of them runs properly. The other train.log files show errors like this:
2021-08-30 15:14:35.404081: W tensorflow/core/common_runtime/bfc_allocator.cc:441] ************************************************************************************________________
2021-08-30 15:14:35.405517: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 6.63G (7118275072 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-30 15:14:35.406868: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 6.63G (7118275072 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-30 15:14:35.406914: W tensorflow/core/common_runtime/bfc_allocator.cc:433] Allocator (GPU_0_bfc) ran out of memory trying to allocate 18.24MiB (rounded to 19123200)requested by op SameWorkerRecvDone
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_268_DescrptSeA;0:0
[[{{node DescrptSeA}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Reshape_24/_45]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_268_DescrptSeA;0:0
[[{{node DescrptSeA}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
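For reference, the train part of this second machine.json looks roughly like the sketch below, in the style of the older dpgen dispatcher. The key names are from memory and may differ between dpgen versions; the relevant point is that all four dp training jobs are dispatched to the same local machine with its single 8 GB GPU.

```python
# Rough sketch of the "train" section of the second machine.json (older dpgen
# dispatcher style). Key names may differ between dpgen versions; the paths and
# numbers are placeholders, not the actual file.
import json

train_section = {
    "train": [
        {
            "machine": {
                "batch": "shell",              # run locally, without a queueing system
                "work_path": "/path/to/work"   # placeholder path
            },
            "resources": {
                "numb_node": 1,
                "numb_gpu": 1,                 # the single 8 GB card shared by the jobs
                "task_per_node": 1,
                "source_list": [],
                "module_list": []
            },
            "command": "dp",                   # deepmd-kit entry point used for training
            "group_size": 1
        }
    ]
}

print(json.dumps(train_section, indent=2))
```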
Sorry for any inconvenience caused. Could you please give me some help and advice?