Four RTX A6000 (48 GB) GPUs, each processing 4 images, hit a GPU memory overflow (IMAGE_MIN_DIM = 1024, IMAGE_MAX_DIM = 1024, batch size = 16) #2895
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 5 root error(s) found.
(1) Resource exhausted: OOM when allocating tensor with shape[800,256,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
(2) Resource exhausted: OOM when allocating tensor with shape[800,256,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
(3) Resource exhausted: OOM when allocating tensor with shape[800,256,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
(4) Resource exhausted: OOM when allocating tensor with shape[800,256,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
0 successful operations.
Hint (attached to each error): If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
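Side note on the hint in the traceback: the flag it mentions lives on the graph-mode RunOptions proto and cannot be used in eager mode. A minimal sketch, assuming a tf.compat.v1 Session workflow (the sess.run call is illustrative, not taken from this repo):

```python
import tensorflow as tf

# Graph mode only: ask TensorFlow to dump the list of allocated tensors when an OOM occurs.
run_options = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)

# Pass it to the Session.run call that triggers the OOM, e.g.:
# sess.run(train_op, feed_dict=feed, options=run_options)
```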
With GPU_COUNT = 1 and IMAGES_PER_GPU = 8, training completed smoothly without any memory overflow, which left me at a loss.
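For reference, a minimal sketch of that single-GPU setup, assuming the usual mrcnn.config.Config subclassing pattern (the class name is hypothetical; NAME, NUM_CLASSES, and the image sizes are copied from the configuration dump below):

```python
from mrcnn.config import Config

class SingleGPUConfig(Config):
    """Hypothetical config class; field values mirror the settings reported in this issue."""
    NAME = "medtest_2464_"
    NUM_CLASSES = 10
    IMAGE_MIN_DIM = 1024
    IMAGE_MAX_DIM = 1024
    GPU_COUNT = 1          # single GPU
    IMAGES_PER_GPU = 8     # effective batch size = GPU_COUNT * IMAGES_PER_GPU = 8
```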
If you are training in Windows,
First of all, I wish all developers good health and all the best!
My training is split into a "heads" phase and an "all" phase, but right at the start of the "all" phase I hit a GPU memory overflow error.
I am very surprised that four RTX A6000 (48 GB) cards run out of GPU memory with GPU_COUNT = 4 and IMAGES_PER_GPU = 4 (so BATCH_SIZE = 16),
because mrcnn/config.py describes the IMAGES_PER_GPU parameter as follows:
"Number of images to train with on each GPU. A 12GB GPU can typically handle 2 images of 1024x1024px. Adjust based on your GPU memory and image sizes. Use the highest number that your GPU can handle for best performance."
What causes such a problem?
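For context, the two phases mentioned above follow the usual Mask_RCNN two-stage training pattern; a minimal sketch (the dataset objects, epoch counts, and the model/config variables are placeholders, not taken from my actual script):

```python
# Phase 1: train only the randomly initialized head layers.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=20,                       # placeholder epoch count
            layers='heads')

# Phase 2: fine-tune all layers; this is where the OOM appears.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=40,                       # placeholder epoch count
            layers='all')
```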
Environment: CUDA 11.3 + cuDNN 8.2.1 + tensorflow-gpu 2.6 + Python 3.8
Configurations:
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 16
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.7
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 4
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 4
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 1024
IMAGE_META_SIZE 22
IMAGE_MIN_DIM 1024
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [1024 1024 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME medtest_2464_
NUM_CLASSES 10
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 308
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK True
USE_RPN_ROIS True
VALIDATION_STEPS 50
WEIGHT_DECAY 0.0001
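For what it's worth, the first dimension of the tensor in the OOM message lines up with these settings; a small arithmetic check (my own reading of the numbers, not something stated in the original report):

```python
# The OOM tensor has shape [800, 256, 28, 28].
TRAIN_ROIS_PER_IMAGE = 200   # from the configuration above
IMAGES_PER_GPU = 4           # from the configuration above
MASK_SHAPE = (28, 28)        # matches the last two dimensions
# 256 presumably corresponds to the 256-channel feature maps in the mask branch.

rois_per_gpu = TRAIN_ROIS_PER_IMAGE * IMAGES_PER_GPU
print(rois_per_gpu)          # 800, matching the first dimension of the OOM tensor
```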