Is PyopenCL slower than its counterpart C OpenCL #612

doetools · 2021-06-28T14:49:23Z

doetools
Jun 28, 2021

Hi PyOpenCL developers and experts,

I do not see a PyOpenCL forum online, so I post my question here.
I am working on a project to port my C OpenCL code to PyOpenCL. Is the speed of executing opencl kernels slowers in PyOpenCL than in C?

My tests show that my pyopencl is like 10-20% slower than c opencl. I read a post online saying pyopencl is like 28 times slower than c.

Thanks!

inducer · 2021-06-28T15:20:11Z

inducer
Jun 28, 2021
Maintainer

PyOpenCL is not involved in the execution of the kernels. The actual on-device kernel execution time should be exactly the same between a C program using OpenCL and a Python program using PyOpenCL. If that's not the case, something is very wrong.

This means that PyOpenCL is a good fit if your code spends most of its time executing kernel code (IMO: as it should!). If the kernel times are so short that kernels complete faster than Python can enqueue them, then there can be a performance impact.

0 replies

doetools · 2021-06-28T15:29:13Z

doetools
Jun 28, 2021
Author

@inducer

my application spends most of the time on device (it is called computational fluid dynamics, in case you wonder 😄 ). I only counted the time the kernel is executed. To be specific,

    kernels[1].set_args(...)
    # enqueue
    cl.enqueue_nd_range_kernel(queue, kernels[1],...)

I compiled all my kernels from program and save them into a list kernels.

Interestingly, I got same results between C OpenCL and PyOpenCL. So, I do not think I made any programming mistakes. But I might not use the PyOpenCL in the way it is supposed to.

Could you please comment on which part I should examine my code?

P.S. would it make any difference if i do the following

kernels[1](queue, group_size,...)

0 replies

inducer · 2021-06-28T15:35:48Z

inducer
Jun 28, 2021
Maintainer

Are you saying your kernel times are different? Switch your command queue to enable profiling and get kernel execution times in both settings. They should match pretty closely.

0 replies

doetools · 2021-06-28T15:38:07Z

doetools
Jun 28, 2021
Author

@inducer my bad. I think I confused kernel time and execution time. So, according to your experience, is the kernel time (time cost of set args, enqueue, and execution) of PyOpenCL slower (longer) than C OpenCL?

0 replies

inducer · 2021-06-28T15:53:57Z

inducer
Jun 28, 2021
Maintainer

The Python bits (setting and preparing arguments) is slower, but this time can (and should be) hidden by kernel execution, which occurs asynchronously on the device.

0 replies

doetools · 2021-06-28T16:11:18Z

doetools
Jun 28, 2021
Author

The Python bits (setting and preparing arguments) is slower, but this time can (and should be) hidden by kernel execution, which occurs asynchronously on the device.

Make sense!

0 replies

doetools · 2021-06-28T19:30:59Z

doetools
Jun 28, 2021
Author

@inducer consider the following code

queue = cl.Queue(...)
t1 = time.time()
kernels[1].set_args(...)
# enqueue
event = cl.enqueue_nd_range_kernel(queue, kernels[1],...)
event.wait()
t2 = time.time()

total_time= t2-t1
execution_time= (event.profile.end-event.profile.start)*1e-9

I searched online and found that the difference between total_time and execution_time is due to data transfer. What exactly does data transfer mean given that we have already created the buffers.

Is there any way we can reduce this data-transfer time? I have a while loop, in which I execute the above code for thousands of times.

Here is an example code. On my computer, the total time cost is about 0.004 seconds while the execution time cost is 0.0007 seconds. This seems indicating the most of the time is spent on knl.set_args and cl.enqueue_nd_range_kernel. Any idea how to speed up that?

import numpy as np
import pyopencl as cl
import time

# get platforms and devices
platforms = cl.get_platforms()
devices = platforms[0].get_devices()
# create context
context = cl.Context(
    devices=[devices[0]], properties=[(cl.context_properties.PLATFORM, platforms[0])],
)
program = cl.Program(
    context,
    """
__kernel void sum_matrices(
    __global const float *a_g, __global const float *b_g, __global float *res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid];
}
""",
).build()
# create queue
queue = cl.CommandQueue(
    context, properties=cl.command_queue_properties.PROFILING_ENABLE
)

a_np = np.random.rand(50000).astype(np.float32)
b_np = np.random.rand(50000).astype(np.float32)
mf = cl.mem_flags
a_g = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

res_g = cl.Buffer(context, mf.WRITE_ONLY, a_np.nbytes)
# knl = program.sum_matrices
knl = getattr(program, "sum_matrices")
index = 0
start = time.time()
execution_time = 0.0
while index < 100:
    knl.set_args(a_g, b_g, res_g)
    event = cl.enqueue_nd_range_kernel(queue, knl, a_np.shape, None)
    event.wait()
    execution_time += (event.profile.end - event.profile.start) * 1e-9
    # print(f"\t execution: {execution_time}")
    index += 1
end = time.time()
print(f"execution time cost: {execution_time}")
print(f"total time cost: {end - start}")

0 replies

isuruf · 2021-07-05T20:28:58Z

isuruf
Jul 5, 2021
Collaborator

This seems indicating the most of the time is spent on knl.set_args and cl.enqueue_nd_range_kernel. Any idea how to speed up that?

Not really. If you do the following, you'll see that there's a difference even without those.
I'd guess the difference is in the time it takes for the opencl implementation to schedule and run the kernel.
Do you have a equivalent C function that reproduces this?

start = time.time()
execution_time = 0.0
events = []
while index < 100:
    knl.set_args(a_g, b_g, res_g)
    event = cl.enqueue_nd_range_kernel(queue, knl, a_np.shape, None)
    events.append(event)
    index += 1

end = time.time()
print(f"total setup cost: {end - start}")

start = time.time()
for event in events:    
    event.wait()
    execution_time += (event.profile.end - event.profile.start) * 1e-9
end = time.time()
print(f"execution time cost: {execution_time}")
print(f"total time cost: {end - start}")

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is PyopenCL slower than its counterpart C OpenCL #612

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 8 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Is PyopenCL slower than its counterpart C OpenCL #612

doetools Jun 28, 2021

Replies: 8 comments

inducer Jun 28, 2021 Maintainer

doetools Jun 28, 2021 Author

inducer Jun 28, 2021 Maintainer

doetools Jun 28, 2021 Author

inducer Jun 28, 2021 Maintainer

doetools Jun 28, 2021 Author

doetools Jun 28, 2021 Author

isuruf Jul 5, 2021 Collaborator

doetools
Jun 28, 2021

inducer
Jun 28, 2021
Maintainer

doetools
Jun 28, 2021
Author

inducer
Jun 28, 2021
Maintainer

doetools
Jun 28, 2021
Author

inducer
Jun 28, 2021
Maintainer

doetools
Jun 28, 2021
Author

doetools
Jun 28, 2021
Author

isuruf
Jul 5, 2021
Collaborator