[Performance]: GridSample in converted model runs very slowly on Arc770 dGPU #28448

Open
schrodingho opened this issue Jan 15, 2025 · 10 comments
Labels: performance (Performance related topics), support_request
schrodingho commented Jan 15, 2025

OpenVINO Version

Master Branch

Operating System

Windows System

Device used for inference

dGPU

OpenVINO installation

PyPi

Programming Language

Python

Hardware Architecture

x86 (64 bits)

Model used

https://github.com/autonomousvision/unimatch

Model quantization

No

Target Platform

OS Name: Microsoft Windows 11 Enterprise
OS Version: 10.0.22631 N/A Build 22631
CPU: 13th Gen Intel(R) Core(TM) i9-13900K
GPU.0: Intel(R) UHD Graphics 770
GPU.1: Intel(R) Arc(TM) A770 Graphics
OpenVINO version: 2024.6.0

Performance issue description

I used OpenVINO to accelerate Unimatch flow inference on a dGPU (Arc A770) and profiled the converted model using benchmark_app. The profiling report revealed that GridSample is the bottleneck, accounting for 80% of the total execution time.

To reduce latency, I replaced the PyTorch function F.grid_sample(input, grid, mode="bilinear", padding_mode="zeros", align_corners=True) with a decomposed version (from this implementation). After benchmarking, this modification reduced the latency from 458.70 ms to 215.41 ms without affecting the generated flows. I am curious why the original GridSample operator is slow on the Arc A770. Do you have any insights, or can you suggest other optimizations, such as a custom GridSample OpenCL kernel? I've attached the benchmark_app results and reports for reference (ori_unimatch for the original model and opt_unimatch for the modified one).
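For context, the decomposition replaces the single GridSample op with gathers and arithmetic. Below is a minimal illustrative sketch of that idea for bilinear sampling with zeros padding and align_corners=True (not necessarily the exact code linked above):

import torch

def grid_sample_decomposed(image, grid):
    # image: (N, C, H, W); grid: (N, Hg, Wg, 2) with coords in [-1, 1]
    N, C, H, W = image.shape
    _, Hg, Wg, _ = grid.shape

    # Unnormalize from [-1, 1] to pixel coordinates (align_corners=True)
    x = (grid[..., 0] + 1) * (W - 1) / 2
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0, y0 = torch.floor(x), torch.floor(y)
    x1, y1 = x0 + 1, y0 + 1

    def tap(ix, iy, w):
        # Zeros padding: mask out-of-bounds taps, clamp indices for a safe gather
        mask = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)
        ixc = ix.clamp(0, W - 1).long()
        iyc = iy.clamp(0, H - 1).long()
        idx = (iyc * W + ixc).view(N, 1, Hg * Wg).expand(-1, C, -1)
        vals = image.view(N, C, H * W).gather(2, idx).view(N, C, Hg, Wg)
        return vals * (w * mask).unsqueeze(1)

    # Four bilinear taps weighted by the fractional offsets
    return (tap(x0, y0, (x1 - x) * (y1 - y)) + tap(x1, y0, (x - x0) * (y1 - y)) +
            tap(x0, y1, (x1 - x) * (y - y0)) + tap(x1, y1, (x - x0) * (y - y0)))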

ori_unimatch:
[screenshot: ori_unimatch detailed performance counters]

benchmark_app -m ori_unimatch.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"

[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Turn on performance counters for GPU.1 device since report type is detailed_counters.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 75.63 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : f32 / [...] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : f32 / [...] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_7) : f32 / [...] / [2,2,320,576]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 2
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_7) : f32 / [...] / [2,2,320,576]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 3059.41 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: Model0
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]   PERF_COUNT: True
[ INFO ]   ENABLE_CPU_PINNING: False
[ INFO ]   MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ]   GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ]   GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ]   CACHE_DIR:
[ INFO ]   CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ]   PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   COMPILATION_NUM_THREADS: 32
[ INFO ]   NUM_STREAMS: 2
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   INFERENCE_PRECISION_HINT: f16
[ INFO ]   DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ]   ACTIVATIONS_SCALE_FACTOR: 0.0
[ INFO ]   DEVICE_ID: 1
[ INFO ]   EXECUTION_DEVICES: ['GPU.1']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'img0'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'img1'!. This input will be filled with random values!
[ INFO ] Fill input 'img0' with random values
[ INFO ] Fill input 'img1' with random values 
[Step 10/11] Measuring performance (Start inference synchronously, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 460.74 ms
[Step 11/11] Dumping statistics report
[ INFO ] Performance counters report is stored to 
[ INFO ] Statistics report is stored to 
[ INFO ] Execution Devices:['GPU.1']
[ INFO ] Count:            131 iterations
[ INFO ] Duration:         60207.90 ms
[ INFO ] Latency:
[ INFO ]    Median:        458.77 ms
[ INFO ]    Average:       458.70 ms
[ INFO ]    Min:           452.05 ms
[ INFO ]    Max:           465.72 ms
[ INFO ] Throughput:   4.35 FPS

opt_unimatch:
[screenshot: opt_unimatch detailed performance counters]

benchmark_app -m opt_unimatch.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"

[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Turn on performance counters for GPU.1 device since report type is detailed_counters.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 80.84 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : f32 / [...] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : f32 / [...] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_16) : f32 / [...] / [2,2,320,576]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 2
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_16) : f32 / [...] / [2,2,320,576]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 8530.97 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: Model0
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]   PERF_COUNT: True
[ INFO ]   ENABLE_CPU_PINNING: False
[ INFO ]   MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ]   GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ]   GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ]   CACHE_DIR:
[ INFO ]   CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ]   PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   COMPILATION_NUM_THREADS: 32
[ INFO ]   NUM_STREAMS: 2
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   INFERENCE_PRECISION_HINT: f16
[ INFO ]   DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ]   ACTIVATIONS_SCALE_FACTOR: 0.0
[ INFO ]   DEVICE_ID: 1
[ INFO ]   EXECUTION_DEVICES: ['GPU.1']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'img0'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'img1'!. This input will be filled with random values!
[ INFO ] Fill input 'img0' with random values
[ INFO ] Fill input 'img1' with random values 
[Step 10/11] Measuring performance (Start inference synchronously, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 242.54 ms
[Step 11/11] Dumping statistics report
[ INFO ] Performance counters report is stored to
[ INFO ] Statistics report is stored to 
[ INFO ] Execution Devices:['GPU.1']
[ INFO ] Count:            278 iterations
[ INFO ] Duration:         60109.22 ms
[ INFO ] Latency:
[ INFO ]    Median:        215.37 ms
[ INFO ]    Average:       215.41 ms
[ INFO ]    Min:           205.85 ms
[ INFO ]    Max:           229.31 ms
[ INFO ] Throughput:   9.25 FPS

Step-by-step reproduction

  1. Clone the Unimatch repository.
  2. Download the pretrained model GMFlow-scale2-regrefine6-mixdata from the Model_Zoo and save it in the pretrained folder.
  3. Follow the script gmflow_demo.sh in Scripts to run the model:
python main_flow.py \
--inference_dir demo/flow-davis \
--resume pretrained/gmflow-scale2-regrefine6-mixdata-train320x576-4e7b215d.pth \
--output_path output/gmflow-scale2-regrefine6-davis \
--padding_factor 16 \
--upsample_factor 4 \
--num_scales 2 \
--attn_splits_list 2 8 \
--corr_radius_list -1 4 \
--prop_radius_list -1 1 \
--reg_refine \
--num_reg_refine 2
  4. Add the OpenVINO conversion code and convert the model:
from pathlib import Path

import torch
import openvino as ov

# Trace and convert on the CPU; the saved IR is then benchmarked on the GPU
ov_opt_device = "cpu"
model_without_ddp = model_without_ddp.to(ov_opt_device)

FIG_H = 320
FIG_W = 576

dummy_input1 = torch.randn(2, 3, FIG_H, FIG_W)
dummy_input2 = torch.randn(2, 3, FIG_H, FIG_W)

example_inputs = (
    dummy_input1,
    dummy_input2,
)
inputs = {
    "img0": dummy_input1,
    "img1": dummy_input2,
}
input_info = [(name, list(inp.shape)) for name, inp in inputs.items()]
UNIMATCH_OV_PATH = Path("opt_unimatch.xml")
model_without_ddp.eval()

with torch.no_grad():
    ov_model = ov.convert_model(model_without_ddp, input=input_info, example_input=example_inputs)
    ov.save_model(ov_model, UNIMATCH_OV_PATH, compress_to_fp16=True)
  5. Profile the converted model with benchmark_app:
benchmark_app -m %converted_model%.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"
  6. Replace the F.grid_sample call in unimatch/matching.py with this implementation, and redo steps 4 and 5.

Issue submission checklist

  • I'm reporting a performance issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.
dnkurek commented Jan 15, 2025

Hi, do you also have the same issue with the iGPU or CPU in your system?

It could simply be that the grid_sample kernel was never optimized, since you are running the slow reference version. Fixing this would probably involve writing an optimized version instead.

schrodingho commented

Hi, I just ran benchmarks on the iGPU (UHD 770) and the CPU (i9-13900K). The iGPU has the same issue (the grid_sample_ref kernel is slow):

ori_unimatch
[screenshot: ori_unimatch counters on iGPU]

opt_unimatch
[screenshot: opt_unimatch counters on iGPU]

The CPU does not seem to have this issue (the original model is faster):

ori_unimatch
[screenshot: ori_unimatch counters on CPU]

opt_unimatch
[screenshot: opt_unimatch counters on CPU]

dnkurek commented Jan 16, 2025

Yeah, so it looks like grid_sample_ref needs to be optimized, perhaps by adding a grid_sample_opt version...

mlukasze commented Jan 24, 2025

ref ticket: CVS-161002

hey @schrodingho,
we've checked a few things: conversion fails because "attn_splits_list" should not be None, but when we trace the model you suggested, attn_splits_list is not set, which makes tracing fail.
Could you share how exactly the PyTorch model was created before being passed to convert_model(), or provide a working script?

schrodingho commented

Thank you for looking into this. You can refer to my forked repo, which includes a script named ov_convert.sh that converts the model directly.

pkowalc1 commented Feb 5, 2025

hi @schrodingho,
I have optimized GridSample a bit: it is now ~21-36x faster on the A770 for your model, according to benchmark_app (the op's latency is 3-5 ms). It is still a work in progress, but if you want, you can try it now by compiling OpenVINO from this branch:
gpu_grid_sample_opt [commit id: 3752b8a]

It would be great if you could confirm that it helps in your case/environment. The code should be correct at this point, and it should be as numerically stable as the reference version, so you shouldn't see any difference in output other than getting it faster.

I will try to optimize it further on this branch, so it may take a while before it is merged to master.
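For anyone else trying this, building from such a branch follows the standard OpenVINO CMake flow, roughly as below (the clone URL and branch location are assumptions; the branch may live on a contributor's fork):

git clone --recurse-submodules https://github.com/openvinotoolkit/openvino.git
cd openvino
git checkout gpu_grid_sample_opt   # or the commit id: 3752b8a
cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_PYTHON=ON
cmake --build build --parallel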

pkowalc1 commented Feb 7, 2025

hi @schrodingho,
Did you manage to compile and run OpenVINO from the specified branch? If you need any assistance, don't hesitate to ask!

schrodingho commented

Hi @pkowalc1,
I successfully compiled your optimized OpenVINO and observed a significant improvement in GridSample performance (20-43x) on this forked branch. Huge thanks for your efforts!

I have two more questions:

  1. I noticed that aten::grid_sampler/GridSample_3 (as shown in the figure below) takes much longer than the other two, even though it processes the same input size as GridSample_1 and GridSample_2. Do you have any insights into why this might be the case?
    [screenshot: per-op execution times showing GridSample_3]
  2. I’m also working on another optimized unimatch model. After converting it and running it on the dGPU (A770) with the optimized OpenVINO, I observed a noticeable latency reduction. However, when I checked the benchmark report, not all GridSample operations were using the optimized kernel. When I switched the device to the iGPU (UHD770), all GridSample operations used the optimized kernel grid_sample_opt_bilinear_zeros__f16. Do you have any insights into why this might be happening? (A programmatic check is sketched after this comment.)

[screenshot: A770 performance counters]

[screenshot: UHD770 performance counters]

Thanks again for your great work 👍
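As a side note, which kernel actually ran for each op can also be checked programmatically from per-request profiling info, without parsing the benchmark_app report. A minimal sketch, assuming the IR file and input shapes from this issue:

import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("opt_unimatch.xml")  # IR converted earlier in this issue
compiled = core.compile_model(model, "GPU.1", {"PERF_COUNT": True})

req = compiled.create_infer_request()
req.infer({
    "img0": np.random.rand(2, 3, 320, 576).astype(np.float32),
    "img1": np.random.rand(2, 3, 320, 576).astype(np.float32),
})

# exec_type names the kernel that ran, e.g. grid_sample_ref
# vs. grid_sample_opt_bilinear_zeros__f16
for info in req.get_profiling_info():
    if "GridSample" in info.node_name:
        print(info.node_name, info.exec_type, info.real_time)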

pkowalc1 commented

Hi @schrodingho,
Ok, great to hear that!

ad. 1: Basically, this op performs indirect memory access, so its performance depends on the contents of the grid input. If the access pattern is friendly (basically dense and linear), it will be faster; otherwise it may be slower. Currently, this op can be up to 3x slower in the worst case (random access) than in the best case. So I suspect the difference you see is caused by a 'less friendly' access pattern, and as such it will be hard to optimize. However, I am still working on that op, and it will be generally faster soon.
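(To make "friendly vs. unfriendly" concrete, here is a small PyTorch sketch with hypothetical shapes: the dense, near-identity grid reads memory almost linearly, while the random grid scatters its reads.)

import torch
import torch.nn.functional as F

N, C, H, W = 2, 128, 40, 72  # hypothetical feature-map shape
feat = torch.randn(N, C, H, W)

# Friendly grid: dense, near-identity sampling locations -> almost linear reads
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
friendly = torch.stack((xs, ys), dim=-1).expand(N, H, W, 2)

# Unfriendly grid: random sampling locations -> scattered reads
unfriendly = torch.rand(N, H, W, 2) * 2 - 1

for grid in (friendly, unfriendly):
    out = F.grid_sample(feat, grid, mode="bilinear",
                        padding_mode="zeros", align_corners=True)

Both calls produce the same output shape; only the memory access pattern (and thus the kernel's runtime) differs.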

ad. 2: Yes, basically I have optimized a very narrow case of this op: currently the optimized implementation is launched only when you have bilinear filtering with zeros padding. In all other cases, the reference implementation is launched at this point. But I am still working on this kernel, so I will try to make it faster for all possible cases.
Now, you mentioned that on the UHD all grid ops use the optimized version, while on the A770 there is one that does not. There are at least two possible explanations:
1) Something differs at the graph optimization level, so the A770 and the UHD run slightly different models, e.g. some data layouts are changed. As I said above, the optimized kernel currently supports only part of the GridSample parameter spectrum.
2) Both the UHD and the A770 fall back to the reference kernel for some grid op, but it hurts the A770 more than the UHD, since different hardware has different characteristics. In this case, the reference implementation also runs on the UHD, but profiling shows it in n-th place, while on the A770 it is in 1st place.
I suggest you wait until the work is done (basically, when #28670 is merged to master). If the problem still exists, please notify us and somebody will look at it.

Can you please send me the parameter(s) you set for the GridSample ops in your models? I will focus on those first.

schrodingho commented

Hi @pkowalc1,
Thank you for your detailed explanation! I really appreciate it.

Regarding the parameters of the GridSample ops in my model: they all use the same settings, mode="bilinear", padding_mode="zeros", align_corners=True.

Let me know if you need any additional details for the optimization :)
