[Performance]: GridSample in converted model runs very slowly on Arc770 dGPU #28448

Open
schrodingho opened this issue Jan 15, 2025 · 10 comments
Labels: performance (Performance related topics), support_request
schrodingho commented Jan 15, 2025

OpenVINO Version

Master Branch

Operating System

Windows System

Device used for inference

dGPU

OpenVINO installation

PyPi

Programming Language

Python

Hardware Architecture

x86 (64 bits)

Model used

https://github.com/autonomousvision/unimatch

Model quantization

No

Target Platform

OS Name: Microsoft Windows 11 Enterprise
OS Version: 10.0.22631 N/A Build 22631
CPU: 13th Gen Intel(R) Core(TM) i9-13900K
GPU.0: Intel(R) UHD Graphics 770
GPU.1: Intel(R) Arc(TM) A770 Graphics
OpenVINO version: 2024.6.0

Performance issue description

I used OpenVINO to accelerate Unimatch flow inference on a dGPU (Arc A770) and profiled the converted model using benchmark_app. The profiling report revealed that GridSample is the bottleneck, accounting for 80% of the total execution time.

To reduce latency, I replaced the PyTorch function F.grid_sample(input, grid, mode="bilinear", padding_mode="zeros", align_corners=True) with a decomposed version (from this implementation). After benchmarking, this modification reduced the latency from 458.70 ms to 215.41 ms without affecting the generated flows. I am curious why the original GridSample operator is slow on the Arc A770. Do you have any insights, or can you suggest other optimizations, such as a custom GridSample OpenCL kernel? I've attached the benchmark_app results and reports for reference (ori_unimatch for the original model and opt_unimatch for the modified one).
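For context, the decomposition replaces the single GridSample op with gathers and arithmetic. Below is a minimal illustrative sketch of that idea for bilinear sampling with zeros padding and align_corners=True (not necessarily the exact code linked above):

import torch

def grid_sample_decomposed(image, grid):
    # image: (N, C, H, W); grid: (N, Hg, Wg, 2) with coords in [-1, 1]
    N, C, H, W = image.shape
    _, Hg, Wg, _ = grid.shape

    # Unnormalize from [-1, 1] to pixel coordinates (align_corners=True)
    x = (grid[..., 0] + 1) * (W - 1) / 2
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0, y0 = torch.floor(x), torch.floor(y)
    x1, y1 = x0 + 1, y0 + 1

    def tap(ix, iy, w):
        # Zeros padding: mask out-of-bounds taps, clamp indices for a safe gather
        mask = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)
        ixc = ix.clamp(0, W - 1).long()
        iyc = iy.clamp(0, H - 1).long()
        idx = (iyc * W + ixc).view(N, 1, Hg * Wg).expand(-1, C, -1)
        vals = image.view(N, C, H * W).gather(2, idx).view(N, C, Hg, Wg)
        return vals * (w * mask).unsqueeze(1)

    # Four bilinear taps weighted by the fractional offsets
    return (tap(x0, y0, (x1 - x) * (y1 - y)) + tap(x1, y0, (x - x0) * (y1 - y)) +
            tap(x0, y1, (x1 - x) * (y - y0)) + tap(x1, y1, (x - x0) * (y - y0)))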

ori_unimatch:
[screenshot: ori_unimatch detailed performance counters]

benchmark_app -m ori_unimatch.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"

[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Turn on performance counters for GPU.1 device since report type is detailed_counters.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 75.63 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : f32 / [...] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : f32 / [...] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_7) : f32 / [...] / [2,2,320,576]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 2
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_7) : f32 / [...] / [2,2,320,576]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 3059.41 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: Model0
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]   PERF_COUNT: True
[ INFO ]   ENABLE_CPU_PINNING: False
[ INFO ]   MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ]   GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ]   GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ]   CACHE_DIR:
[ INFO ]   CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ]   PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   COMPILATION_NUM_THREADS: 32
[ INFO ]   NUM_STREAMS: 2
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   INFERENCE_PRECISION_HINT: f16
[ INFO ]   DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ]   ACTIVATIONS_SCALE_FACTOR: 0.0
[ INFO ]   DEVICE_ID: 1
[ INFO ]   EXECUTION_DEVICES: ['GPU.1']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'img0'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'img1'!. This input will be filled with random values!
[ INFO ] Fill input 'img0' with random values
[ INFO ] Fill input 'img1' with random values 
[Step 10/11] Measuring performance (Start inference synchronously, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 460.74 ms
[Step 11/11] Dumping statistics report
[ INFO ] Performance counters report is stored to 
[ INFO ] Statistics report is stored to 
[ INFO ] Execution Devices:['GPU.1']
[ INFO ] Count:            131 iterations
[ INFO ] Duration:         60207.90 ms
[ INFO ] Latency:
[ INFO ]    Median:        458.77 ms
[ INFO ]    Average:       458.70 ms
[ INFO ]    Min:           452.05 ms
[ INFO ]    Max:           465.72 ms
[ INFO ] Throughput:   4.35 FPS

opt_unimatch:
[screenshot: opt_unimatch detailed performance counters]

benchmark_app -m opt_unimatch.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"

[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Turn on performance counters for GPU.1 device since report type is detailed_counters.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 80.84 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : f32 / [...] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : f32 / [...] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_16) : f32 / [...] / [2,2,320,576]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 2
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_16) : f32 / [...] / [2,2,320,576]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 8530.97 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: Model0
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]   PERF_COUNT: True
[ INFO ]   ENABLE_CPU_PINNING: False
[ INFO ]   MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ]   GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ]   GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ]   CACHE_DIR:
[ INFO ]   CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ]   PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   COMPILATION_NUM_THREADS: 32
[ INFO ]   NUM_STREAMS: 2
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   INFERENCE_PRECISION_HINT: f16
[ INFO ]   DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ]   ACTIVATIONS_SCALE_FACTOR: 0.0
[ INFO ]   DEVICE_ID: 1
[ INFO ]   EXECUTION_DEVICES: ['GPU.1']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'img0'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'img1'!. This input will be filled with random values!
[ INFO ] Fill input 'img0' with random values
[ INFO ] Fill input 'img1' with random values 
[Step 10/11] Measuring performance (Start inference synchronously, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 242.54 ms
[Step 11/11] Dumping statistics report
[ INFO ] Performance counters report is stored to
[ INFO ] Statistics report is stored to 
[ INFO ] Execution Devices:['GPU.1']
[ INFO ] Count:            278 iterations
[ INFO ] Duration:         60109.22 ms
[ INFO ] Latency:
[ INFO ]    Median:        215.37 ms
[ INFO ]    Average:       215.41 ms
[ INFO ]    Min:           205.85 ms
[ INFO ]    Max:           229.31 ms
[ INFO ] Throughput:   9.25 FPS

Step-by-step reproduction

  1. Clone the Unimatch repository.
  2. Download the pretrained model GMFlow-scale2-regrefine6-mixdata from the Model_Zoo and save it in the pretrained folder.
  3. Follow the script gmflow_demo.sh in Scripts to run the model:
python main_flow.py \
--inference_dir demo/flow-davis \
--resume pretrained/gmflow-scale2-regrefine6-mixdata-train320x576-4e7b215d.pth \
--output_path output/gmflow-scale2-regrefine6-davis \
--padding_factor 16 \
--upsample_factor 4 \
--num_scales 2 \
--attn_splits_list 2 8 \
--corr_radius_list -1 4 \
--prop_radius_list -1 1 \
--reg_refine \
--num_reg_refine 2
  4. Add the OpenVINO conversion code and convert the model:
from pathlib import Path

import torch
import openvino as ov

# Trace and convert on the CPU; the saved IR is then benchmarked on the GPU
ov_opt_device = "cpu"
model_without_ddp = model_without_ddp.to(ov_opt_device)

FIG_H = 320
FIG_W = 576

dummy_input1 = torch.randn(2, 3, FIG_H, FIG_W)
dummy_input2 = torch.randn(2, 3, FIG_H, FIG_W)

example_inputs = (
    dummy_input1,
    dummy_input2,
)
inputs = {
    "img0": dummy_input1,
    "img1": dummy_input2,
}
input_info = [(name, list(inp.shape)) for name, inp in inputs.items()]
UNIMATCH_OV_PATH = Path("opt_unimatch.xml")
model_without_ddp.eval()

with torch.no_grad():
    ov_model = ov.convert_model(model_without_ddp, input=input_info, example_input=example_inputs)
    ov.save_model(ov_model, UNIMATCH_OV_PATH, compress_to_fp16=True)
  5. Profile the converted model with benchmark_app:
benchmark_app -m %converted_model%.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"
  6. Replace the F.grid_sample call in unimatch/matching.py with this implementation, and redo steps 4 and 5.

Issue submission checklist

  • I'm reporting a performance issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.
dnkurek commented Jan 15, 2025

Hi, do you also have the same issue with the iGPU or CPU in your system?

It could simply be that the grid_sample kernel was never optimized, since you are running the slow reference version. Fixing this would probably involve writing an optimized version instead.

schrodingho commented

Hi, I just ran benchmarks on the iGPU (UHD 770) and the CPU (i9-13900K). The iGPU has the same issue (the grid_sample_ref kernel is slow):

ori_unimatch
[screenshot: ori_unimatch counters on iGPU]

opt_unimatch
[screenshot: opt_unimatch counters on iGPU]

The CPU does not seem to have this issue (the original model is faster):

ori_unimatch
[screenshot: ori_unimatch counters on CPU]

opt_unimatch
[screenshot: opt_unimatch counters on CPU]

dnkurek commented Jan 16, 2025

Yeah, so it looks like grid_sample_ref needs to be optimized, perhaps by adding a grid_sample_opt version...

mlukasze commented Jan 24, 2025

ref ticket: CVS-161002

hey @schrodingho,
we've checked a few things: conversion fails because "attn_splits_list" should not be None, but when we trace the model you suggested, attn_splits_list is not set, which makes tracing fail.
Could you share how exactly the PyTorch model was created before being passed to convert_model(), or provide a working script?

schrodingho commented

Thank you for looking into this. You can refer to my forked repo, which includes a script named ov_convert.sh that converts the model directly.

pkowalc1 commented Feb 5, 2025

hi @schrodingho,
I have optimized GridSample a bit: it is now ~21-36x faster on the A770 for your model, according to benchmark_app (the op's latency is 3-5 ms). It is still a work in progress, but if you want, you can try it now by compiling OpenVINO from this branch:
gpu_grid_sample_opt [commit id: 3752b8a]

It would be great if you could confirm that it helps in your case/environment. The code should be correct at this point, and it should be as numerically stable as the reference version, so you shouldn't see any difference in output other than getting it faster.

I will try to optimize it further on this branch, so it may take a while before it is merged to master.
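For anyone else trying this, building from such a branch follows the standard OpenVINO CMake flow, roughly as below (the clone URL and branch location are assumptions; the branch may live on a contributor's fork):

git clone --recurse-submodules https://github.com/openvinotoolkit/openvino.git
cd openvino
git checkout gpu_grid_sample_opt   # or the commit id: 3752b8a
cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_PYTHON=ON
cmake --build build --parallel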

pkowalc1 commented Feb 7, 2025

hi @schrodingho,
Did you manage to compile and run OpenVINO from the specified branch? If you need any assistance, don't hesitate to ask!

schrodingho commented

Hi @pkowalc1,
I successfully compiled your optimized OpenVINO and observed a significant improvement in GridSample performance (20-43x) on this forked branch. Huge thanks for your efforts!

I have two more questions:

  1. I noticed that aten::grid_sampler/GridSample_3 (as shown in the figure below) takes much longer than the other two, even though it processes the same input size as GridSample_1 and GridSample_2. Do you have any insights into why this might be the case?
    [screenshot: per-op execution times showing GridSample_3]
  2. I’m also working on another optimized unimatch model. After converting it and running it on the dGPU (A770) with the optimized OpenVINO, I observed a noticeable latency reduction. However, when I checked the benchmark report, not all GridSample operations were using the optimized kernel. When I switched the device to the iGPU (UHD770), all GridSample operations used the optimized kernel grid_sample_opt_bilinear_zeros__f16. Do you have any insights into why this might be happening? (A programmatic check is sketched after this comment.)

[screenshot: A770 performance counters]

[screenshot: UHD770 performance counters]

Thanks again for your great work 👍
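As a side note, which kernel actually ran for each op can also be checked programmatically from per-request profiling info, without parsing the benchmark_app report. A minimal sketch, assuming the IR file and input shapes from this issue:

import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("opt_unimatch.xml")  # IR converted earlier in this issue
compiled = core.compile_model(model, "GPU.1", {"PERF_COUNT": True})

req = compiled.create_infer_request()
req.infer({
    "img0": np.random.rand(2, 3, 320, 576).astype(np.float32),
    "img1": np.random.rand(2, 3, 320, 576).astype(np.float32),
})

# exec_type names the kernel that ran, e.g. grid_sample_ref
# vs. grid_sample_opt_bilinear_zeros__f16
for info in req.get_profiling_info():
    if "GridSample" in info.node_name:
        print(info.node_name, info.exec_type, info.real_time)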

pkowalc1 commented

Hi @schrodingho,
Ok, great to hear that!

ad. 1: Basically, this op performs indirect memory access, so its performance depends on the contents of the grid input. If the access pattern is friendly (basically dense and linear), it will be faster; otherwise it may be slower. Currently, this op can be up to 3x slower in the worst case (random access) than in the best case. So I suspect the difference you see is caused by a 'less friendly' access pattern, and as such it will be hard to optimize. However, I am still working on that op, and it will be generally faster soon.
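(To make "friendly vs. unfriendly" concrete, here is a small PyTorch sketch with hypothetical shapes: the dense, near-identity grid reads memory almost linearly, while the random grid scatters its reads.)

import torch
import torch.nn.functional as F

N, C, H, W = 2, 128, 40, 72  # hypothetical feature-map shape
feat = torch.randn(N, C, H, W)

# Friendly grid: dense, near-identity sampling locations -> almost linear reads
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
friendly = torch.stack((xs, ys), dim=-1).expand(N, H, W, 2)

# Unfriendly grid: random sampling locations -> scattered reads
unfriendly = torch.rand(N, H, W, 2) * 2 - 1

for grid in (friendly, unfriendly):
    out = F.grid_sample(feat, grid, mode="bilinear",
                        padding_mode="zeros", align_corners=True)

Both calls produce the same output shape; only the memory access pattern (and thus the kernel's runtime) differs.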

ad. 2: Yes, basically I have optimized a very narrow case of this op: currently the optimized implementation is launched only when you have bilinear filtering with zeros padding. In all other cases, the reference implementation is launched at this point. But I am still working on this kernel, so I will try to make it faster for all possible cases.
Now, you mentioned that on the UHD all grid ops use the optimized version, while on the A770 there is one that does not. There are at least two possible explanations:
1) Something differs at the graph optimization level, so the A770 and the UHD run slightly different models, e.g. some data layouts are changed. As I said above, the optimized kernel currently supports only part of the GridSample parameter spectrum.
2) Both the UHD and the A770 fall back to the reference kernel for some grid op, but it hurts the A770 more than the UHD, since different hardware has different characteristics. In this case, the reference implementation also runs on the UHD, but profiling shows it in n-th place, while on the A770 it is in 1st place.
I suggest you wait until the work is done (basically, when #28670 is merged to master). If the problem still exists, please notify us and somebody will look at it.

Can you please send me the parameter(s) you set for the GridSample ops in your models? I will focus on those first.

schrodingho commented

Hi @pkowalc1,
Thank you for your detailed explanation! I really appreciate it.

Regarding the parameters of the GridSample ops in my model: they all use the same settings, mode="bilinear", padding_mode="zeros", align_corners=True.

Let me know if you need any additional details for the optimization :)
