Optimal GPU Performance


I was asked to measure how "efficiently" my GPU code uses the hardware, i.e. what percentage of the card's maximum performance the algorithm achieves. I am not sure how to do this. So far I have mostly placed timers in my code and measured execution time. How can I compare that against optimal performance and find where the bottleneck is? (I heard about the Visual Profiler, but could not get it to work; it keeps giving me an "unable to load result" error.)

+4
3 answers

Each card has a maximum memory bandwidth and a maximum processing speed. For example, the peak memory bandwidth of the GTX 480 is 177.4 GB/s. You will need the specifications for your card.

The first thing to determine is whether your code is memory-bound or compute-bound. If it is clearly one or the other, that helps you focus on the right "performance" to measure. If your program is memory-bound, then you compare your achieved bandwidth against the card's maximum bandwidth.

You can calculate memory bandwidth by adding up the amount of memory you read and write and dividing by the runtime (I use CUDA events for timing). The parallel reduction example (see its accompanying whitepaper) is a good demonstration of calculating bandwidth efficiency and using it to benchmark a kernel.
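
As an illustration, here is a minimal sketch of timing a kernel with CUDA events and computing effective bandwidth. The kernel `copyKernel`, the problem size, and the launch configuration are placeholders, not from the original answer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: reads n floats and writes n floats.
__global__ void copyKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 24;                  // placeholder problem size
    const size_t bytes = n * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Effective bandwidth = (bytes read + bytes written) / runtime.
    double gbPerSec = (2.0 * bytes) / (ms / 1000.0) / 1e9;
    printf("%.3f ms, %.1f GB/s effective bandwidth\n", ms, gbPerSec);
    // Compare gbPerSec against the card's peak (e.g. 177.4 GB/s on a GTX 480).

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```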


  • I'm less sure how to determine performance if you are ALU-bound instead. You can probably count (or profile) the number of instructions executed, but what is the card's maximum? (A rough approach is sketched after this list.)

  • I'm also not sure what to do in the likely case where your kernel sits somewhere between memory-bound and ALU-bound.

Anyone ...?
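
For the ALU-bound case in the list above, one rough comparison point is the card's theoretical peak arithmetic throughput, which for Fermi-class cards is roughly CUDA cores × shader clock × 2 FLOPs (one fused multiply-add) per cycle. The sketch below uses placeholder spec and profiling numbers, not measurements; check your own card's specifications:

```cuda
#include <cstdio>

int main() {
    // Hypothetical spec numbers for a GTX 480; substitute your card's.
    double cudaCores = 480.0;
    double shaderClockGHz = 1.401;
    // Peak single precision: 2 FLOPs (one FMA) per core per cycle.
    double peakGflops = cudaCores * shaderClockGHz * 2.0;

    // Suppose profiling showed the kernel executes flopCount floating-point
    // operations in ms milliseconds (both placeholder values here).
    double flopCount = 5.0e9;
    double ms = 12.0;
    double achievedGflops = flopCount / (ms * 1e6);

    printf("Peak: %.0f GFLOP/s, achieved: %.1f GFLOP/s (%.1f%% of peak)\n",
           peakGflops, achievedGflops, 100.0 * achievedGflops / peakGflops);
    return 0;
}
```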

0

Typically, "effective" is likely to measure the number of memory and GPU cycles (average, min, max) of your program. Then the measure of efficiency will be avg (mem) / shared memory over a period of time, etc. With AVG (GPU) / Max GPU loops.

I would then compare these metrics against metrics from some GPU benchmark suites (which can be assumed to use the GPU fairly efficiently), or against a few well-known GPU programs of your choice. That is how I would approach it, though I have never actually tried it!

As for bottlenecks and "optimal" performance: those are probably NP-complete problems that nobody can solve for you in general. Break out the good old profiler and debugger and start working through your code.

0

I can't help with the profiler or with micro-optimization, but there is the CUDA Occupancy Calculator, http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls , which tries to estimate how well your CUDA code uses the hardware resources, based on these values:

  • Threads Per Block
  • Registers Per Thread
  • Shared Memory Per Block (bytes)
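
As an aside, CUDA toolkits newer than the spreadsheet era can produce a similar estimate programmatically through the runtime occupancy API. A minimal sketch, assuming a placeholder kernel `myKernel` and a block size of 256:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel whose occupancy we want to estimate.
__global__ void myKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int blockSize = 256;     // threads per block
    const size_t dynamicSMem = 0;  // dynamic shared memory per block (bytes)

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, blockSize, dynamicSMem);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy = active warps per SM / maximum warps per SM.
    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```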
0

Source: https://habr.com/ru/post/1340100/

