Memory allocation in GPU plugin
GPU plugin supports 4 types of memory allocation, as listed below. Note that the prefix usm_ indicates an allocation made through the Intel Unified Shared Memory (USM) extension for OpenCL. For more detailed information about the USM extension, refer to this page.
- cl_mem : Standard OpenCL cl_mem allocation.
- usm_host : Allocated in host memory and accessible by both the host and all devices. Not migratable.
- usm_shared : Allocated in both host and device memory and accessible by both; the memory is automatically migrated between host and device on demand.
- usm_device : Allocated in device memory and accessible only by the device which owns it. Not migratable.
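For illustration, the snippet below shows roughly how these four allocation types map onto the underlying OpenCL calls. It is a simplified sketch: in real code the entry points of the cl_intel_unified_shared_memory extension are usually fetched at run time with clGetExtensionFunctionAddressForPlatform rather than called directly, and error handling is omitted.

```cpp
#include <CL/cl.h>
#include <CL/cl_ext.h>  // declares the cl_intel_unified_shared_memory entry points in recent headers

void allocation_type_examples(cl_context context, cl_device_id device, size_t size) {
    cl_int err = CL_SUCCESS;
    const cl_uint alignment = 0;  // 0 lets the driver choose a default alignment

    // cl_mem: standard OpenCL buffer object
    cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, size, nullptr, &err);

    // usm_host: placed in host memory, visible to the host and all devices, not migratable
    void* host_ptr = clHostMemAllocINTEL(context, nullptr, size, alignment, &err);

    // usm_shared: visible to host and devices, migrated between them on demand
    void* shared_ptr = clSharedMemAllocINTEL(context, device, nullptr, size, alignment, &err);

    // usm_device: placed in device-local memory, accessible only by the owning device
    void* device_ptr = clDeviceMemAllocINTEL(context, device, nullptr, size, alignment, &err);

    clReleaseMemObject(buffer);
    clMemFreeINTEL(context, host_ptr);
    clMemFreeINTEL(context, shared_ptr);
    clMemFreeINTEL(context, device_ptr);
}
```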
Note that there are a few restrictions on memory allocation imposed by the driver:
- Allocation of a single memory object should not exceed the available memory size obtained from CL_DEVICE_GLOBAL_MEM_SIZE.
- Total allocation of memory objects for a kernel (i.e., the sum of the kernel's inputs, intermediate buffers, and outputs) should not exceed the target's available memory. For example, if you want to allocate a memory object in device memory, both restrictions above must be satisfied against the device memory; otherwise, the memory object should be allocated in host memory. A sketch of this fallback decision is shown below.
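The following sketch illustrates that fallback decision. It is a hypothetical helper, not the plugin's actual code; the function name and parameters are assumptions used only to make the two restrictions concrete.

```cpp
#include <cstdint>

enum class allocation_type { usm_host, usm_device };

// Hypothetical helper: fall back to host memory when a single allocation, or the
// running total of a kernel's buffers, would exceed the device's available memory
// (as reported by CL_DEVICE_GLOBAL_MEM_SIZE).
allocation_type choose_allocation_type(uint64_t alloc_size,
                                       uint64_t kernel_total_so_far,
                                       uint64_t device_global_mem_size) {
    if (alloc_size > device_global_mem_size)
        return allocation_type::usm_host;   // a single object must not exceed device memory
    if (kernel_total_so_far + alloc_size > device_global_mem_size)
        return allocation_type::usm_host;   // the kernel's total must not exceed device memory
    return allocation_type::usm_device;     // both restrictions satisfied: use device memory
}
```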
In GPU plugin, the actual allocation for each allocation type is done through engine::allocate_memory, which calls the corresponding memory object wrapper for each allocation type: gpu_buffer or gpu_usm.
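As a rough sketch, a memory object can be requested from the engine as below. The exact headers and the layout constructor differ between OpenVINO versions, so treat the snippet as an approximation rather than the exact API.

```cpp
#include "intel_gpu/runtime/engine.hpp"
#include "intel_gpu/runtime/layout.hpp"
#include "intel_gpu/runtime/memory.hpp"

using namespace cldnn;

memory::ptr allocate_example(engine& eng) {
    // Describe the buffer: data type, memory format, and shape.
    layout buf_layout(data_types::f32, format::bfyx, tensor(1, 3, 224, 224));

    // engine::allocate_memory dispatches to the wrapper that matches the requested
    // allocation type (gpu_buffer for cl_mem, gpu_usm for the usm_* types).
    return eng.allocate_memory(buf_layout, allocation_type::usm_device);
}
```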
The total allocated amount of memory for each allocation type is also tracked per engine, so you can check the allocation history by setting the environment variable OV_GPU_Verbose=1 for OpenVINO built with ENABLE_DEBUG_CAPS=ON:
```
...
GPU_Debug: Allocate 58982400 bytes of usm_host allocation type (current=117969612; max=117969612)
GPU_Debug: Allocate 44621568 bytes of usm_device allocation type (current=44626380; max=44626380)
GPU_Debug: Allocate 44236800 bytes of usm_host allocation type (current=162206412; max=162206412)
GPU_Debug: Allocate 14873856 bytes of usm_device allocation type (current=59500236; max=59500236)
...
```
The major allocations done in GPU plugin can be categorized as follows:
- Constant memory allocation: In GPU plugin, constant data is held by data primitives, and the memory objects are allocated when the topology is created. At that time, the required data is copied from the corresponding blob in ngraph. Once all transformations in program have finished, if the users of those memory objects are GPU operations and the GPU has dedicated device memory, the constants are transferred to device memory. Note that constant data is shared across batches and streams.
- Output memory allocation: A memory object to store the output of each primitive is created when the corresponding primitive_inst is created (link), unless the output reuses the input memory or the node is a mutable data node used as a second output. Note that primitive_insts are created in descending order of their output memory size to improve memory reuse efficiency in the memory pool (see the sketch after this list).
- Intermediate memory allocation: Some primitives that consist of multiple kernels, such as detection_output and non_max_suppression, require intermediate memory to transfer data between those kernels. The allocation of such intermediate memory happens after the creation of all primitive_insts is finished (link), since it must be done in processing order: the predecessors' allocation information is needed to decide whether an intermediate buffer can be placed in device memory, by checking the memory allocation restrictions described above.
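The ordering mentioned above for output allocation can be pictured with the small sketch below. The types and function are hypothetical and only illustrate why sorting by descending output size helps the memory pool: the largest buffers are requested first, and smaller ones can then fit into regions that were already reserved.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a primitive instance and its output size.
struct inst_info {
    std::string id;
    size_t output_bytes;
};

// Sort instances so that outputs are allocated from largest to smallest.
std::vector<inst_info> order_for_output_allocation(std::vector<inst_info> insts) {
    std::sort(insts.begin(), insts.end(),
              [](const inst_info& a, const inst_info& b) {
                  return a.output_bytes > b.output_bytes;  // descending output size
              });
    return insts;  // outputs are then requested from the memory pool in this order
}
```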
In GPU plugin, multiple memory objects can be allocated at the same address when there is no dependency between their users. For example, the memory region holding program_node A's output can be allocated for another program_node B's output if A's output is no longer used by any other program_node by the time B's result is stored. In other words, the memory region of A's output can be reused once no other node needs it. This mechanism is realized by the following two parts:
- Memory dependency : memory_dependencies of a program_node are added by the memory dependency passes. There are two kinds of memory dependency passes:
  - basic_memory_dependencies : Assuming in-order-queue execution, this pass adds dependencies to a program_node that are deduced by checking only its direct input and output nodes.
  - oooq_memory_dependencies : Assuming out-of-order-queue execution, this pass adds dependencies to every pair of program_nodes that could potentially be executed at the same time.
- Memory pool : The GPU plugin uses a memory_pool that returns a requested memory object either by allocating a new one or by reusing an already allocated one. To decide whether an allocated memory region can be reused, the memory_pool consults the memory dependencies set by the two passes above; a sketch of this check follows below. Note that a memory_pool is managed per network. Also, as noted above, the primitive_insts are sorted in descending order of their required memory size before the outputs are allocated, to achieve better memory reuse efficiency.
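The reuse test can be sketched as follows. This is a hypothetical simplification, not the actual memory_pool code: a previously allocated region is handed out again only if it is large enough and none of its current users appears in the requesting node's memory-dependency set (i.e., none of them could be alive at the same time).

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical record kept by a memory pool for one allocated region.
struct pool_record {
    size_t bytes;                   // size of the allocated region
    std::set<std::string> users;    // ids of nodes currently bound to this region
};

// Return true if the region can be reused for a node with the given size request
// and memory-dependency set.
bool can_reuse(const pool_record& rec,
               size_t requested_bytes,
               const std::set<std::string>& memory_dependencies) {
    if (rec.bytes < requested_bytes)
        return false;                           // region too small
    for (const auto& user : rec.users) {
        if (memory_dependencies.count(user))
            return false;                       // a current user may overlap in lifetime
    }
    return true;                                // safe to hand this region out again
}
```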