Four RTX A6000 (48 GB) GPUs, each processing 4 images, hit a GPU memory overflow (IMAGE_MIN_DIM = 1024, IMAGE_MAX_DIM = 1024, batch size = 16) #2895
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 5 root error(s) found.
(1) Resource exhausted: OOM when allocating tensor with shape[800,256,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
(2) Resource exhausted: OOM when allocating tensor with shape[800,256,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
(3) Resource exhausted: OOM when allocating tensor with shape[800,256,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
(4) Resource exhausted: OOM when allocating tensor with shape[800,256,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
0 successful operations.
Hint (attached to each error): If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
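Side note on the hint in the traceback: the flag it mentions lives on the graph-mode RunOptions proto and cannot be used in eager mode. A minimal sketch, assuming a tf.compat.v1 Session workflow (the sess.run call is illustrative, not taken from this repo):

```python
import tensorflow as tf

# Graph mode only: ask TensorFlow to dump the list of allocated tensors when an OOM occurs.
run_options = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)

# Pass it to the Session.run call that triggers the OOM, e.g.:
# sess.run(train_op, feed_dict=feed, options=run_options)
```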
With GPU_COUNT = 1 and IMAGES_PER_GPU = 8, training completed smoothly without any memory overflow, which left me at a loss.
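For reference, a minimal sketch of that single-GPU setup, assuming the usual mrcnn.config.Config subclassing pattern (the class name is hypothetical; NAME, NUM_CLASSES, and the image sizes are copied from the configuration dump below):

```python
from mrcnn.config import Config

class SingleGPUConfig(Config):
    """Hypothetical config class; field values mirror the settings reported in this issue."""
    NAME = "medtest_2464_"
    NUM_CLASSES = 10
    IMAGE_MIN_DIM = 1024
    IMAGE_MAX_DIM = 1024
    GPU_COUNT = 1          # single GPU
    IMAGES_PER_GPU = 8     # effective batch size = GPU_COUNT * IMAGES_PER_GPU = 8
```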
If you are training in Windows,
First of all, I wish all developers good health and all the best!
My training is split into a "heads" phase and an "all" phase, but right at the start of the "all" phase I hit a GPU memory overflow error.
I am very surprised that four RTX A6000 (48 GB) cards run out of GPU memory with GPU_COUNT = 4 and IMAGES_PER_GPU = 4 (so BATCH_SIZE = 16),
because mrcnn/config.py describes the IMAGES_PER_GPU parameter as follows:
"Number of images to train with on each GPU. A 12GB GPU can typically handle 2 images of 1024x1024px. Adjust based on your GPU memory and image sizes. Use the highest number that your GPU can handle for best performance."
What causes such a problem?
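For context, the two phases mentioned above follow the usual Mask_RCNN two-stage training pattern; a minimal sketch (the dataset objects, epoch counts, and the model/config variables are placeholders, not taken from my actual script):

```python
# Phase 1: train only the randomly initialized head layers.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=20,                       # placeholder epoch count
            layers='heads')

# Phase 2: fine-tune all layers; this is where the OOM appears.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=40,                       # placeholder epoch count
            layers='all')
```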
Environment: CUDA 11.3 + cuDNN 8.2.1 + tensorflow-gpu 2.6 + Python 3.8
Configurations:
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 16
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.7
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 4
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 4
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 1024
IMAGE_META_SIZE 22
IMAGE_MIN_DIM 1024
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [1024 1024 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME medtest_2464_
NUM_CLASSES 10
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 308
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK True
USE_RPN_ROIS True
VALIDATION_STEPS 50
WEIGHT_DECAY 0.0001
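For what it's worth, the first dimension of the tensor in the OOM message lines up with these settings; a small arithmetic check (my own reading of the numbers, not something stated in the original report):

```python
# The OOM tensor has shape [800, 256, 28, 28].
TRAIN_ROIS_PER_IMAGE = 200   # from the configuration above
IMAGES_PER_GPU = 4           # from the configuration above
MASK_SHAPE = (28, 28)        # matches the last two dimensions
# 256 presumably corresponds to the 256-channel feature maps in the mask branch.

rois_per_gpu = TRAIN_ROIS_PER_IMAGE * IMAGES_PER_GPU
print(rois_per_gpu)          # 800, matching the first dimension of the OOM tensor
```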