Let me try to explain; maybe this will help you.
The CPU acts as a host, and the GPU acts as a device.
To launch work on the GPU, the CPU first copies everything needed (the kernel code plus the data it will operate on) into the GPU's memory. For small workloads this copy time often exceeds the calculation time itself: the arithmetic is just a handful of instructions executed in the ALUs (arithmetic logic units), while the transfer across the bus is comparatively slow.
By contrast, when you start a single thread on the CPU, all the data is already in the CPU's own memory hierarchy, and the core benefits from its caches, branch prediction, prefetching, and reordering of micro-operations. Compared to a single GPU core, a CPU core has roughly 10x faster L1 and L2 caches, can issue about 6x more instructions per cycle, and runs at around 4.6x the clock frequency.
But when you want to run a thread on the GPU, the data must first be copied into the GPU's memory, which takes additional time. Second, the GPU cores execute a whole grid of threads, and for that the data has to be partitioned so that each thread accesses one element of the array. In your example, those are the img and krl arrays.
There is also a profiler for NVIDIA GPUs (nvprof, or the newer Nsight tools). Remove any console output such as printf/print calls from your code, if there are any, and try profiling your executable. It will show you the copy time and the calculation time in milliseconds.
Loop parallelization: when you compute an image with two nested loops over image_width and image_height, each iteration costs extra clock cycles at the instruction level just to increment and test the loop counters. When you port this to the GPU, you replace the loops with threadIdx.x and threadIdx.y and launch a grid where groups of 16 or 32 threads execute in lockstep on the same GPU multiprocessor. That means 16 or 32 array elements can be computed per clock, because the GPU has many more ALUs (provided there are no dependencies and the data is partitioned well).
In your convolution algorithm, loops are fine on the CPU, but if you run the same loops inside a GPU kernel it does not help: one GPU thread executing the whole loop behaves just like one CPU thread, and you still pay the overhead of memory copies, caching, data partitioning, and so on.
Hope this makes it clearer...