Replies: 8 comments
-
PyOpenCL is not involved in the execution of the kernels. The actual on-device kernel execution time should be exactly the same between a C program using OpenCL and a Python program using PyOpenCL. If that's not the case, something is very wrong. This means that PyOpenCL is a good fit if your code spends most of its time executing kernel code (IMO: as it should!). If the kernel times are so short that kernels complete faster than Python can enqueue them, then there can be a performance impact.
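As a minimal sketch of what that means in practice (assuming a built kernel `knl`, a profiling-capable `queue`, and buffers already exist; all names here are illustrative): enqueueing returns almost immediately, and the device executes the kernel asynchronously.

```python
import time

t0 = time.time()
event = cl.enqueue_nd_range_kernel(queue, knl, (50000,), None)
t1 = time.time()  # host-side enqueue cost only, typically microseconds
event.wait()      # block until the device has finished the kernel
t2 = time.time()  # enqueue cost plus on-device execution time
# If t1 - t0 approaches t2 - t0, the kernels are too short to hide
# the per-enqueue Python overhead.
```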
-
My application spends most of its time on the device (it is computational fluid dynamics, in case you wonder 😄). I only counted the time the kernel is executed. To be specific:

```python
kernels[1].set_args(...)
# enqueue
cl.enqueue_nd_range_kernel(queue, kernels[1], ...)
```

I compiled all my kernels from the program and saved them into a list. Interestingly, I got the same results between C OpenCL and PyOpenCL, so I do not think I made any programming mistakes. But I might not be using PyOpenCL the way it is supposed to be used. Could you please comment on which part of my code I should examine?

P.S. Would it make any difference if I did the following?

```python
kernels[1](queue, group_size, ...)
```
-
Are you saying your kernel times are different? Switch your command queue to enable profiling and get kernel execution times in both settings. They should match pretty closely.
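A minimal sketch of enabling profiling (assuming an existing `context`, kernel `knl`, and work size `global_size`; the names are illustrative):

```python
queue = cl.CommandQueue(
    context, properties=cl.command_queue_properties.PROFILING_ENABLE
)
event = cl.enqueue_nd_range_kernel(queue, knl, global_size, None)
event.wait()
# Device-side execution time in seconds, from the OpenCL profiling counters:
print((event.profile.end - event.profile.start) * 1e-9)
```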
-
@inducer my bad. I think I confused
-
The Python bits (setting and preparing arguments) are slower, but this time can (and should) be hidden by kernel execution, which occurs asynchronously on the device.
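A minimal sketch of that overlap, reusing the `knl`, `queue`, and buffer names from the longer example later in this thread (the loop body is illustrative): enqueue the whole batch without waiting, so Python prepares the next launch while the device is still busy.

```python
knl.set_args(a_g, b_g, res_g)
for _ in range(100):
    cl.enqueue_nd_range_kernel(queue, knl, a_np.shape, None)
queue.finish()  # synchronize once at the end, not after every kernel
```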
-
Makes sense!
-
@inducer Consider the following code:

```python
queue = cl.CommandQueue(...)
t1 = time.time()
kernels[1].set_args(...)
# enqueue
event = cl.enqueue_nd_range_kernel(queue, kernels[1], ...)
event.wait()
t2 = time.time()
total_time = t2 - t1
execution_time = (event.profile.end - event.profile.start) * 1e-9
```

I searched online and found that the difference between `total_time` and `execution_time` is the data-transfer time between the host and the device. Is there any way we can reduce this data-transfer time? I have a while loop in which I execute the above code thousands of times.

Here is an example. On my computer, the total time cost is about 0.004 seconds while the execution time cost is 0.0007 seconds, which seems to indicate that most of the time is spent outside of kernel execution.

```python
import numpy as np
import pyopencl as cl
import time
# get platforms and devices
platforms = cl.get_platforms()
devices = platforms[0].get_devices()
# create context
context = cl.Context(
devices=[devices[0]], properties=[(cl.context_properties.PLATFORM, platforms[0])],
)
program = cl.Program(
context,
"""
__kernel void sum_matrices(
__global const float *a_g, __global const float *b_g, __global float *res_g)
{
int gid = get_global_id(0);
res_g[gid] = a_g[gid] + b_g[gid];
}
""",
).build()
# create queue
queue = cl.CommandQueue(
context, properties=cl.command_queue_properties.PROFILING_ENABLE
)
a_np = np.random.rand(50000).astype(np.float32)
b_np = np.random.rand(50000).astype(np.float32)
mf = cl.mem_flags
a_g = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)
res_g = cl.Buffer(context, mf.WRITE_ONLY, a_np.nbytes)
# knl = program.sum_matrices
knl = getattr(program, "sum_matrices")
index = 0
start = time.time()
execution_time = 0.0
while index < 100:
knl.set_args(a_g, b_g, res_g)
event = cl.enqueue_nd_range_kernel(queue, knl, a_np.shape, None)
event.wait()
execution_time += (event.profile.end - event.profile.start) * 1e-9
# print(f"\t execution: {execution_time}")
index += 1
end = time.time()
print(f"execution time cost: {execution_time}")
print(f"total time cost: {end - start}") |
-
Not really. If you do the following, you'll see that there's a difference even without those.
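A hedged sketch of that kind of measurement (the exact snippet is not shown above; this reuses `knl`, `queue`, and the buffers from the earlier example, with argument setup hoisted out of the timed loop): the host/device time gap persists even without per-iteration `set_args`.

```python
knl.set_args(a_g, b_g, res_g)  # arguments set once, outside the timed loop
start = time.time()
for _ in range(100):
    event = cl.enqueue_nd_range_kernel(queue, knl, a_np.shape, None)
event.wait()  # in-order queue: waiting on the last event covers them all
end = time.time()
print(f"total time cost: {end - start}")
```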
-
Hi PyOpenCL developers and experts,

I do not see a PyOpenCL forum online, so I am posting my question here.

I am working on a project to port my C OpenCL code to PyOpenCL. Is executing OpenCL kernels slower in PyOpenCL than in C? My tests show that my PyOpenCL code is about 10-20% slower than the C OpenCL version. I read a post online claiming PyOpenCL is about 28 times slower than C.

Thanks!