Performance Issues: Single CPU Core vs. Single CUDA Core

I wanted to compare the speed of a single core of an Intel CPU with the speed of a single core of an NVIDIA GPU (i.e. one CUDA core running one thread). I implemented the following naive 2D image convolution algorithm:

#include <stdint.h>

void convolution_cpu(uint8_t* res, uint8_t* img, uint32_t img_width, uint32_t img_height,
                     uint8_t* krl, uint32_t krl_width, uint32_t krl_height)
{
    int32_t center_x = krl_width / 2;
    int32_t center_y = krl_height / 2;
    int32_t sum;
    int32_t fkx, fky;
    int32_t xx, yy;

    // normalization factor: 1 / (sum of all kernel weights)
    float krl_sum = 0;
    for (uint32_t i = 0; i < krl_width * krl_height; ++i)
        krl_sum += krl[i];
    float nc = 1.0f / krl_sum;

    for (int32_t y = 0; y < (int32_t)img_height; ++y)
    {
        for (int32_t x = 0; x < (int32_t)img_width; ++x)
        {
            sum = 0;
            for (int32_t ky = 0; ky < (int32_t)krl_height; ++ky)
            {
                fky = krl_height - 1 - ky;   // flipped kernel row
                for (int32_t kx = 0; kx < (int32_t)krl_width; ++kx)
                {
                    fkx = krl_width - 1 - kx;   // flipped kernel column
                    yy = y + (ky - center_y);
                    xx = x + (kx - center_x);
                    // skip taps that fall outside the image (zero padding)
                    if (yy >= 0 && yy < (int32_t)img_height && xx >= 0 && xx < (int32_t)img_width)
                    {
                        sum += img[yy * img_width + xx] * krl[fky * krl_width + fkx];
                    }
                }
            }
            res[y * img_width + x] = sum * nc;
        }
    }
}

The algorithm is the same for both the CPU and the GPU. I also made a second GPU version that is almost identical to the one above; the only difference is that it copies img and krl into shared memory before using them.
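Since the GPU code itself was not posted, here is a minimal sketch (my own, not the asker's) of what a single-thread, shared-memory version along those lines might look like. The kernel name, the fixed 52x52 shared buffer sizes, and the device pointer names in the launch comment are assumptions:

#include <stdint.h>

__global__ void convolution_gpu_smem(uint8_t* res, const uint8_t* img, uint32_t img_width, uint32_t img_height,
                                     const uint8_t* krl, uint32_t krl_width, uint32_t krl_height)
{
    // Assumed sizes: both inputs are at most 52x52, as in the question.
    __shared__ uint8_t s_img[52 * 52];
    __shared__ uint8_t s_krl[52 * 52];

    // Stage both inputs into shared memory before using them.
    for (uint32_t i = 0; i < img_width * img_height; ++i) s_img[i] = img[i];
    for (uint32_t i = 0; i < krl_width * krl_height; ++i) s_krl[i] = krl[i];

    // From here on, the same naive loop nest as convolution_cpu(), executed by this one thread.
    int32_t center_x = krl_width / 2;
    int32_t center_y = krl_height / 2;

    float krl_sum = 0.0f;
    for (uint32_t i = 0; i < krl_width * krl_height; ++i) krl_sum += s_krl[i];
    float nc = 1.0f / krl_sum;

    for (int32_t y = 0; y < (int32_t)img_height; ++y) {
        for (int32_t x = 0; x < (int32_t)img_width; ++x) {
            int32_t sum = 0;
            for (int32_t ky = 0; ky < (int32_t)krl_height; ++ky) {
                int32_t fky = krl_height - 1 - ky;
                for (int32_t kx = 0; kx < (int32_t)krl_width; ++kx) {
                    int32_t fkx = krl_width - 1 - kx;
                    int32_t yy = y + (ky - center_y);
                    int32_t xx = x + (kx - center_x);
                    if (yy >= 0 && yy < (int32_t)img_height && xx >= 0 && xx < (int32_t)img_width) {
                        sum += s_img[yy * img_width + xx] * s_krl[fky * krl_width + fkx];
                    }
                }
            }
            res[y * img_width + x] = sum * nc;
        }
    }
}

// Launched with one block of one thread to match the single-core comparison:
// convolution_gpu_smem<<<1, 1>>>(d_res, d_img, 52, 52, d_krl, krl_width, krl_height);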

I used two 52x52 images and got the following timings:

  • CPU: 10ms
  • GPU: 1338ms
  • GPU (smem): 1165ms

The CPU is an Intel Xeon X5650 @ 2.67 GHz and the GPU is an NVIDIA Tesla C2070.

Why am I getting such a performance difference? It looks like a single CUDA core is about 100 times slower for this code! Can someone explain why? The reasons I can think of are:

  • the CPU's higher clock frequency
  • the CPU performs branch prediction
  • the CPU may have better caching mechanisms

What do you think is the main problem that causes this huge performance difference?

Keep in mind that I want to compare the speed of one CPU thread against one GPU thread. I am not trying to evaluate the overall performance of the GPU; I know this is not the right way to do convolution on a GPU.

+4
2 answers

Let me try to explain; maybe this will help you.

The CPU acts as a host, and the GPU acts as a device.

To launch a thread on the GPU, the CPU first copies everything needed (the kernel to execute plus the data it will operate on) over to the GPU. For a small problem like this, that copy time can easily exceed the computation time: the computation itself is just a handful of instructions executed in the ALU (arithmetic logic unit), whereas copying across the bus takes much longer.
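To see that split for yourself, here is a small self-contained sketch (my own illustration, not code from the question) that times the host-to-device copy separately from a trivial single-thread kernel using CUDA events:

#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <cuda_runtime.h>

__global__ void dummy_kernel(uint8_t* p, int n)
{
    // one thread touches every element, mimicking the one-thread convolution
    for (int i = 0; i < n; ++i) p[i] = (uint8_t)(p[i] + 1);
}

int main()
{
    const int n = 52 * 52;                 // same size as the question's image
    uint8_t* h = (uint8_t*)malloc(n);
    uint8_t* d = nullptr;
    cudaMalloc(&d, n);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);   // copy time
    cudaEventRecord(t1);
    dummy_kernel<<<1, 1>>>(d, n);                  // compute time (one thread)
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float copy_ms = 0.0f, kern_ms = 0.0f;
    cudaEventElapsedTime(&copy_ms, t0, t1);
    cudaEventElapsedTime(&kern_ms, t1, t2);
    printf("copy: %.3f ms, kernel: %.3f ms\n", copy_ms, kern_ms);

    cudaFree(d); free(h);
    return 0;
}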

So when you run just one thread on the CPU, the CPU has all the data in its own memory, backed by its own caches, branch prediction, prefetching, and micro-op reordering, plus roughly 10x faster L1, roughly 10x faster L2, the ability to issue about 6 times more instructions per cycle, and about 4.6 times the core clock frequency of a single GPU core.

But when you want to run a thread on the GPU, the data first has to be copied into the GPU's memory, which costs extra time. Second, GPU cores are built to run a whole grid of threads at a time, and to exploit that you have to split the work so that each thread handles one element of the array — in your example, elements of the img and krl arrays.

There is also a profiler for NVIDIA GPUs (e.g. nvprof or the Visual Profiler). Remove any printouts from your code, if there are any, and profile your executable; it will show you the copy time and the computation time in milliseconds.

Loop parallelization: when you run the two outer loops over image_width and image_height on the CPU, every iteration spends extra clock cycles on the loop counters and branches at the instruction level. When you port them to the GPU, you replace those loops with threadIdx.x and threadIdx.y and a grid of thread blocks, where groups of 16 or 32 threads execute together on the same GPU multiprocessor. That means 16 or 32 array elements can be computed per instruction issue, because there are many more ALUs available (provided there are no dependencies and the data is partitioned well).
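As a rough illustration of that mapping (my own sketch, not the asker's code; the kernel name and the normalization factor nc precomputed on the host and passed as a parameter are assumptions), each thread computes exactly one output pixel and the two outer loops disappear:

#include <stdint.h>

__global__ void convolution_gpu_parallel(uint8_t* res, const uint8_t* img, uint32_t img_width, uint32_t img_height,
                                         const uint8_t* krl, uint32_t krl_width, uint32_t krl_height, float nc)
{
    // the two outer loops become the thread/block indices
    int32_t x = blockIdx.x * blockDim.x + threadIdx.x;
    int32_t y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= (int32_t)img_width || y >= (int32_t)img_height) return;

    int32_t center_x = krl_width / 2;
    int32_t center_y = krl_height / 2;
    int32_t sum = 0;

    for (int32_t ky = 0; ky < (int32_t)krl_height; ++ky) {
        int32_t fky = krl_height - 1 - ky;
        for (int32_t kx = 0; kx < (int32_t)krl_width; ++kx) {
            int32_t fkx = krl_width - 1 - kx;
            int32_t yy = y + (ky - center_y);
            int32_t xx = x + (kx - center_x);
            if (yy >= 0 && yy < (int32_t)img_height && xx >= 0 && xx < (int32_t)img_width) {
                sum += img[yy * img_width + xx] * krl[fky * krl_width + fkx];
            }
        }
    }
    res[y * img_width + x] = sum * nc;
}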

In your convolution algorithm you keep those loops, which is fine on the CPU, but running the same loops inside a single GPU thread gains you nothing: GPU thread 1 then behaves just like CPU thread 1, except that it also pays the overhead of memory copies, slower per-thread memory access, data partitioning, and so on.

Hope this helps you understand.

+5

Why would anyone try to do this? Sorry, but I don't understand... you can (and absolutely SHOULD) run thousands of GPU threads instead of one! If you still want to build a naive implementation, you can at least eliminate the two outermost for loops, as sketched below.
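For instance, a hedged sketch of a launch configuration that replaces those two outer loops with the thread grid (the kernel name, block size, and device pointer names are illustrative assumptions, not code from the question):

// one thread per output pixel instead of two outer loops
dim3 block(16, 16);                                // 256 threads per block
dim3 grid((img_width  + block.x - 1) / block.x,    // enough blocks to cover the image
          (img_height + block.y - 1) / block.y);

convolution_gpu_parallel<<<grid, block>>>(d_res, d_img, img_width, img_height,
                                          d_krl, krl_width, krl_height, nc);
cudaDeviceSynchronize();                           // wait for the kernel to finish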

What would the point of such a comparison be?

PS: By the way, if a single CPU thread were not faster than a single GPU thread, why would anyone still use CPUs for computation at all?

-6

Source: https://habr.com/ru/post/1485737/

