Memory allocation in GPU plugin

Allocation types

The GPU plugin supports the following four types of memory allocation. The "usm_" allocation types are allocated using the Intel Unified Shared Memory (USM) extension for OpenCL. For a more detailed explanation of the USM extension, refer to this page.

cl_mem : Standard OpenCL cl_mem allocation
usm_host : Allocated in host memory and accessible by the host and all devices. Not migratable.
usm_shared : Allocated in host and device memory and accessible by the host and all devices. The memory is automatically migrated on demand.
usm_device : Allocated in device memory and accessible only by the device that owns the memory. Not migratable.

Note that the driver imposes the following restrictions on memory allocation:

The size of a single memory object should not exceed the available memory size obtained from CL_DEVICE_GLOBAL_MEM_SIZE.
The total size of the memory objects passed to a kernel (i.e., the sum of the kernel's inputs, intermediate buffers, and outputs) should not exceed the available memory.

For example, to allocate a memory object in device memory, both restrictions must hold with respect to the available device memory. Otherwise, the memory object should be allocated in host memory.
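The two restrictions can be read as a simple placement check. The following is a minimal sketch, not the plugin's actual code: the names alloc_target and choose_target are hypothetical, device_mem_size stands in for the value reported by CL_DEVICE_GLOBAL_MEM_SIZE, and kernel_buffers is assumed to hold the sizes of the buffers already placed in device memory for the same kernel.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration only; not part of the GPU plugin API.
enum class alloc_target { device, host };

// request_bytes:   size of the memory object to be allocated
// kernel_buffers:  sizes of the other buffers used by the same kernel (assumed on device)
// device_mem_size: stand-in for the value of CL_DEVICE_GLOBAL_MEM_SIZE
alloc_target choose_target(std::uint64_t request_bytes,
                           const std::vector<std::uint64_t>& kernel_buffers,
                           std::uint64_t device_mem_size) {
    assert(device_mem_size > 0);

    // Restriction 1: a single memory object must fit in the available memory.
    if (request_bytes > device_mem_size) {
        return alloc_target::host;
    }

    // Restriction 2: the total for one kernel must fit in the available memory.
    // Comparing against the remaining space avoids unsigned overflow.
    std::uint64_t total = request_bytes;
    for (std::uint64_t bytes : kernel_buffers) {
        if (bytes > device_mem_size - total) {
            return alloc_target::host;
        }
        total += bytes;
    }
    return alloc_target::device;
}
```

With a device size of 1000 bytes, a 100-byte request next to buffers of 200 and 300 bytes stays on the device, while the same request next to two 500-byte buffers falls back to host memory.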

Memory allocation API

In the GPU plugin, the actual allocation for each allocation type is done through engine::allocate_memory, which calls the corresponding memory object wrapper for that allocation type: gpu_buffer or gpu_usm.

Also, the total amount of allocated memory for each allocation type is tracked per engine, so you can check the allocation history by setting the environment variable OV_GPU_Verbose=1 for OpenVINO built with ENABLE_DEBUG_CAPS=ON.

...
GPU_Debug: Allocate 58982400 bytes of usm_host allocation type (current=117969612; max=117969612)
GPU_Debug: Allocate 44621568 bytes of usm_device allocation type (current=44626380; max=44626380)
GPU_Debug: Allocate 44236800 bytes of usm_host allocation type (current=162206412; max=162206412)
GPU_Debug: Allocate 14873856 bytes of usm_device allocation type (current=59500236; max=59500236)
...
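The current and max fields in the log above can be read as a running total and a high-water mark kept per allocation type. The following is a minimal sketch of that bookkeeping; usage_tracker, on_allocate, and on_release are hypothetical names for illustration and are not taken from the plugin.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Running total and high-water mark for one allocation type.
struct usage {
    std::uint64_t current = 0;
    std::uint64_t peak = 0;
};

// Hypothetical per-engine tracker keyed by allocation type name
// (for example "usm_host" or "usm_device").
class usage_tracker {
public:
    void on_allocate(const std::string& type, std::uint64_t bytes) {
        usage& u = stats_[type];
        u.current += bytes;
        if (u.current > u.peak) {
            u.peak = u.current;  // remember the largest total seen so far
        }
    }

    void on_release(const std::string& type, std::uint64_t bytes) {
        usage& u = stats_[type];
        assert(u.current >= bytes);  // cannot release more than was allocated
        u.current -= bytes;          // peak is intentionally left unchanged
    }

    // Returns zeroed statistics for a type that has never been used.
    usage get(const std::string& type) const {
        auto it = stats_.find(type);
        return it == stats_.end() ? usage{} : it->second;
    }

private:
    std::map<std::string, usage> stats_;
};
```

Applied to the usm_device lines of the log, a tracker that already holds 4812 bytes reports current=44626380 after the 44621568-byte allocation and current=59500236 after the 14873856-byte allocation, matching the printed values.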

Allocated memory objects

The major allocations done in the GPU plugin can be categorized as follows:

Constant memory allocation: In the GPU plugin, constant data are held by data primitives, and their memory objects are allocated when the topology is created. At that time, the required data are copied from the corresponding blob in ngraph. After all transformations in the program are finished, if the users of those memories are GPU operations and the GPU has device memory, the constants are transferred to device memory. Note that constant data are shared across batches and streams.
Output memory allocation: A memory object that stores the output of each primitive is created when its primitive_inst is created (link), unless the output reuses the input memory or the node is a mutable data node used as a second output. Note that primitive_insts are created in descending order of output memory size so that the memory pool can reuse memory more efficiently.
Intermediate memory allocation: Some primitives that consist of multiple kernels, such as detection_output and non_max_suppression, require intermediate memory to transfer data between those kernels. These intermediate memories are allocated after all primitive_insts have been created (link), because the allocation must follow the processing order: it uses the predecessors' allocation information to decide whether to place the memory in device memory, by checking the memory allocation restrictions described above.

Memory dependency and memory pool

In the GPU plugin, multiple memory objects can be allocated at the same address when there is no dependency between their users. For example, the memory region of program_node A's output can be assigned to the output of another program_node B if A's output is no longer used by any other program_node at the time B's result is stored. In other words, the memory region of A's output can be reused once no other node needs it. This mechanism consists of the following two parts:

Memory dependency : The memory_dependencies of a program_node are added by memory dependency passes. There are two kinds of memory dependency passes:
- basic_memory_dependencies : Assuming in-order-queue execution, this pass adds dependencies to a program_node that are deduced by checking only its direct input and output nodes.
- oooq_memory_dependencies : Assuming out-of-order-queue execution, this pass adds dependencies between all pairs of program_nodes that can potentially be executed at the same time.
Memory pool : The GPU plugin can use a memory_pool that returns a requested memory object either by allocating a new one or by reusing already allocated memory. To decide whether allocated memory can be reused, the memory_pool uses the memory dependencies set by the two passes above. Note that the memory_pool is created per network. Also, the primitive_insts are sorted in descending order of required memory size before their outputs are allocated, which improves the reuse efficiency of the memory_pool.
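The interaction between memory dependencies and the pool can be illustrated with a small model. This is a sketch under simplifying assumptions and not the plugin's memory_pool implementation: pool_entry and simple_pool are hypothetical names, nodes are identified by strings, a buffer is identified by its index, and the dependency set of each node is passed in directly rather than computed by a pass.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One reusable buffer plus the ids of the nodes that share it.
struct pool_entry {
    std::size_t size;
    std::set<std::string> users;
};

// Hypothetical simplified pool: reuses a buffer only if none of its
// current users appears in the requesting node's dependency set.
class simple_pool {
public:
    // Returns the index of the buffer assigned to node `id`.
    // `deps` lists the nodes whose memory must not overlap with `id`.
    std::size_t get(const std::string& id, std::size_t size,
                    const std::set<std::string>& deps) {
        assert(size > 0);
        for (std::size_t i = 0; i < entries_.size(); ++i) {
            if (entries_[i].size >= size && !conflicts(entries_[i], deps)) {
                entries_[i].users.insert(id);  // share the existing buffer
                return i;
            }
        }
        pool_entry fresh{size, {id}};          // no reusable buffer: allocate a new one
        entries_.push_back(fresh);
        return entries_.size() - 1;
    }

    std::size_t buffer_count() const { return entries_.size(); }

private:
    static bool conflicts(const pool_entry& e, const std::set<std::string>& deps) {
        for (const std::string& user : e.users) {
            if (deps.count(user) != 0) {
                return true;
            }
        }
        return false;
    }

    std::vector<pool_entry> entries_;
};
```

For a chain A -> B -> C requested in descending size order (100, 80, 60 bytes), where each node depends only on its direct neighbours, B gets a new buffer because it conflicts with A, while C reuses A's buffer, so only two buffers are allocated. Requesting the largest outputs first means later, smaller requests are more likely to fit into an existing buffer.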
