Why are GPU-based algorithms faster?

I just implemented an algorithm on a GPU that computes the difference between consecutive array elements. I compared it with a CPU-based implementation and noticed that, for a large array, the GPU-based implementation is faster.

I'm curious WHY the GPU implementation is actually faster. Please note that I know the superficial justification: the GPU has several cores and can therefore perform the operation in parallel, that is, instead of visiting each index serially, we can assign a thread to compute the difference at each index.

But can someone tell me a deeper reason why the GPU is faster? What is so different about its architecture that it can win over the CPU?

+4
3 answers

They are not faster, usually.

The point is that some algorithms are better suited to the CPU, and some are better suited to the GPU.

The GPU's execution model is different (see SIMD), its memory model is different, its instruction set is different... The whole architecture is different.

There is no obvious way to compare a CPU with a GPU. You can only discuss whether (and why) implementation A of an algorithm is faster or slower than implementation B of that algorithm.


That came out fuzzy, so the tip of the iceberg of concrete reasons would be: the CPU's strengths are random memory access, branch prediction, etc. The GPU excels when there is a large amount of computation with high data locality, so that your implementation can achieve a good ratio of computation to memory accesses. SIMD makes GPU implementations slower than CPU ones where there is a lot of unpredictable branching across many code paths.

+5

The real reason is that the GPU does not merely have several cores; it has many cores, usually hundreds of them! Each GPU core, however, is much slower than even a low-end CPU core.

But the programming model is not at all like that of multi-core CPUs. As a result, most programs cannot be ported to, or benefit from, GPUs.

+4

Although some answers have already been given here, and this is an old thread, I just thought I would add this for posterity and whatnot:

The main reason that CPUs and GPUs differ so much in performance on certain problems is the design decisions about how to allocate chip resources. CPUs devote much of their chip space to large caches, instruction decoders, peripheral and system control, etc. Their cores are much more complicated and run at much higher clock rates (which produces more heat per core that must be dissipated). By contrast, GPUs devote their chip space to packing as many floating-point ALUs onto the chip as they possibly can. The original purpose of the GPU was to multiply matrices as fast as possible (because that is the main kind of computation involved in graphics rendering). Since matrix multiplication is an embarrassingly parallel task (each output value is computed completely independently of every other output value), and the code path for each of those computations is identical, chip space can be saved by having several ALUs follow the instructions decoded by a single instruction decoder, since they all perform the same operations at the same time. By contrast, each CPU core must have its own separate instruction decoder, since the cores are not executing identical code paths, which makes each CPU core much larger on the die than a GPU core. Since the primary computations in matrix multiplication are floating-point multiplication and floating-point addition, GPUs are implemented so that each of these is a single-cycle operation; in fact, GPUs even have a fused multiply-add instruction that multiplies two numbers and adds the result to a third number in a single cycle. This is much faster than on a typical CPU, where floating-point multiplication is often a multi-cycle operation.
Again, the trade-off here is that chip space is dedicated to floating-point hardware, while other instructions (such as control flow) are often much slower per core than on a CPU, and sometimes do not exist on the GPU at all.

In addition, since GPU cores run at much lower clock rates than typical CPU cores and do not contain as much complicated circuitry, they do not produce as much heat per core (or use as much power per core). This allows more of them to be packed into the same space without overheating the chip, and also lets a GPU with 1000+ cores have power and cooling requirements similar to those of a CPU with 4 or 8 cores.

+2

Source: https://habr.com/ru/post/1395892/
