Does my GTX 680 really perform as it should?

I'm trying to test my GTX 680's compute performance, because I have doubts about how well it really performs. I was wondering whether someone could run the same test on their GTX 680 to see if they get the same results, or suggest what I could do differently to get more performance out of the card.

I wrote this little program

    #include <stdlib.h>
    #include <stdio.h>
    #include <float.h>
    #include "cuComplex.h"
    #include "time.h"
    #include "cuda_runtime.h"
    #include <iostream>

    using namespace std;

    __global__ void test(int loop, int *out)
    {
        register int a = 0;
        for (int x = 0; x < loop; x++) {
            a += x * loop;
        }
        if (out != NULL) *out = a;
    }

    int main(int argc, char *argv[])
    {
        float timestamp;
        cudaEvent_t event_start, event_stop;

        // Initialise
        cudaDeviceReset();
        cudaSetDevice(0);
        cudaThreadSetCacheConfig(cudaFuncCachePreferShared);

        // Allocate and generate buffers
        cudaEventCreate(&event_start);
        cudaEventCreate(&event_stop);
        cudaEventRecord(event_start, 0);

        dim3 threadsPerBlock;
        dim3 blocks;
        int b = 1000;
        threadsPerBlock.x = 32;
        threadsPerBlock.y = 32;
        threadsPerBlock.z = 1;
        blocks.x = 1;
        blocks.y = 1000;
        blocks.z = 1;

        test<<<blocks, threadsPerBlock, 0>>>(300, NULL);

        cudaEventRecord(event_stop, 0);
        cudaEventSynchronize(event_stop);
        cudaEventElapsedTime(&timestamp, event_start, event_stop);
        printf("Calculated in %f", timestamp);
    }

Compiling with nvcc I get this PTX

    //
    // Generated by NVIDIA NVVM Compiler
    // Compiler built on Sat Sep 22 02:35:14 2012 (1348274114)
    // Cuda compilation tools, release 5.0, V0.2.1221
    //

    .version 3.1
    .target sm_30
    .address_size 64

    .file 1 "/tmp/tmpxft_00000e7b_00000000-9_perf.cpp3.i"
    .file 2 "/opt/home/daniel/a/perf.cu"

    .visible .entry _Z4testiPi(
        .param .u32 _Z4testiPi_param_0,
        .param .u64 _Z4testiPi_param_1
    )
    {
        .reg .pred %p<4>;
        .reg .s32  %r<15>;
        .reg .s64  %rd<3>;

        ld.param.u32       %r6, [_Z4testiPi_param_0];
        ld.param.u64       %rd2, [_Z4testiPi_param_1];
        cvta.to.global.u64 %rd1, %rd2;
        mov.u32            %r13, 0;
        .loc 2 12 1
        setp.lt.s32        %p1, %r6, 1;
        mov.u32            %r14, %r13;
        mov.u32            %r11, %r13;
        @%p1 bra           BB0_2;

    BB0_1:
        .loc 2 14 1
        mad.lo.s32         %r14, %r11, %r6, %r14;
        .loc 2 12 20
        add.s32            %r11, %r11, 1;
        .loc 2 12 1
        setp.lt.s32        %p2, %r11, %r6;
        mov.u32            %r13, %r14;
        @%p2 bra           BB0_1;

    BB0_2:
        .loc 2 18 1
        setp.eq.s64        %p3, %rd2, 0;
        @%p3 bra           BB0_4;
        .loc 2 18 1
        st.global.u32      [%rd1], %r13;

    BB0_4:
        .loc 2 21 2
        ret;
    }

The kernel runs in 1.936 ms.

My calculations show a throughput of about 1.1 TFLOPS, only about a third of the card's theoretical 3 TFLOPS (ref: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680). Why is it so slow?

The details of my calculations are as follows

    mad.lo.s32 %r14, %r11, %r6, %r14;   // 2 FLOPS
    .loc 2 12 20
    add.s32 %r11, %r11, 1;              // 1 FLOP
    .loc 2 12 1
    setp.lt.s32 %p2, %r11, %r6;         // 1 FLOP
    mov.u32 %r13, %r14;                 // 1 FLOP
    @%p2 bra BB0_1;                     // 1 FLOP + 1 FLOP (extra as a buffer, since I don't know how much the branch costs)

Total FLOPS for one loop iteration: 7 FLOPS

Counting only the loop iterations:

We have 300 iterations per thread, 1024 threads per block, and 1000 blocks.

Total FLOPS across all iterations = 300 * 1024 * 1000 * 7 = 2.15 GFLOPS

The total kernel time is 1.936 ms.

Therefore, throughput = 2.15 GFLOPS / 1.936 ms ≈ 1.11 TFLOPS
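
For reference, the same arithmetic restated as a tiny host-side check (a sketch only; it uses exactly the figures above, nothing new is assumed):

    #include <stdio.h>

    int main(void)
    {
        // 300 iterations/thread * 1024 threads/block * 1000 blocks * 7 counted ops/iteration
        double ops  = 300.0 * 1024.0 * 1000.0 * 7.0;   // ~2.15e9 operations
        double time = 1.936e-3;                        // measured kernel time in seconds
        printf("ops = %.3e, throughput = %.2f TFLOPS\n", ops, ops / time / 1e12);  // ~1.11
        return 0;
    }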

Thank you in advance for your help.

Daniel

+4
4 answers

This sample program builds on @Robert Crovella's answer. Robert's kernel is limited by data dependencies. By reducing the data dependencies between the FMA instructions, this kernel should achieve 2.4-2.5 TFLOPS on a GTX 680.

The current implementation is limited by instruction fetch and data dependencies. The kernel could be tuned further to improve the achieved FLOPS by another 10%.

Nsight Visual Studio Edition 2.x and the new 3.0 RC (release candidate) provide the metrics needed to analyze this kernel.

In 2.x and 3.0, you should use the following experiments to analyze the kernel:

  • Instruction Statistics - SM Activity - make sure all SMs are close to 100% active.
  • Issue Efficiency - Eligible Warps - on Kepler, the number of eligible warps per active cycle needs to be greater than 4 so that each of the warp schedulers can issue an instruction every cycle.
  • Issue Efficiency - Issue Stall Reasons - Warp Issue Efficiency shows how often each warp scheduler failed to issue because there were not enough eligible warps. When this is high, the issue stall reasons help identify the limiter.
  • Achieved FLOPs - this shows the breakdown, by type and rate, of the single- and double-precision floating-point operations executed by the kernel.

In the case of Robert's kernel, execution dependencies were extremely high, since every instruction had a read-after-write dependency on the previous one. By increasing the instruction-level parallelism we tripled the performance. The kernel is now mostly instruction limited.

The new Nsight VSE 3.0 RC (available today) will also show the assembly or source code annotated with per-instruction statistics, such as the number of instructions executed and the number of active threads per instruction. In this example the tool can be used to identify the data dependencies and to verify that the compiler is generating the FMA instructions needed to reach more than 50% of the theoretical FLOPS.

    #include <stdio.h>
    #include "cuda_runtime.h"

    __global__ void test(float loop, float *out)
    {
        // Eight independent accumulators break the read-after-write dependency chain,
        // giving each warp scheduler enough independent FMAs to issue every cycle.
        register float a=1.0f;
        register float b=1.0f;
        register float c=1.0f;
        register float d=1.0f;
        register float e=1.0f;
        register float f=1.0f;
        register float g=1.0f;
        register float h=1.0f;

        for (float x=0; x<loop; x++)
        {
            // 5 groups of 8 independent FMAs = 40 FMAs per iteration
            a+=x*loop; b+=x*loop; c+=x*loop; d+=x*loop;
            e+=x*loop; f+=x*loop; g+=x*loop; h+=x*loop;

            a+=x*loop; b+=x*loop; c+=x*loop; d+=x*loop;
            e+=x*loop; f+=x*loop; g+=x*loop; h+=x*loop;

            a+=x*loop; b+=x*loop; c+=x*loop; d+=x*loop;
            e+=x*loop; f+=x*loop; g+=x*loop; h+=x*loop;

            a+=x*loop; b+=x*loop; c+=x*loop; d+=x*loop;
            e+=x*loop; f+=x*loop; g+=x*loop; h+=x*loop;

            a+=x*loop; b+=x*loop; c+=x*loop; d+=x*loop;
            e+=x*loop; f+=x*loop; g+=x*loop; h+=x*loop;
        }

        if (out!=NULL) *out = a+b+c+d+e+f+g+h;   // keeps the compiler from eliminating the loop
    }

    int main(int argc, char *argv[])
    {
        float timestamp;
        cudaEvent_t event_start, event_stop;

        // Initialise
        cudaDeviceReset();
        cudaSetDevice(0);
        cudaThreadSetCacheConfig(cudaFuncCachePreferShared);

        // Allocate and generate buffers
        cudaEventCreate(&event_start);
        cudaEventCreate(&event_stop);
        cudaEventRecord(event_start, 0);

        dim3 threadsPerBlock;
        dim3 blocks;
        threadsPerBlock.x=32;
        threadsPerBlock.y=32;
        threadsPerBlock.z=1;
        blocks.x=1;
        blocks.y=1000;
        blocks.z=1;

        test<<<blocks, threadsPerBlock, 0>>>(30, NULL);

        cudaEventRecord(event_stop, 0);
        cudaEventSynchronize(event_stop);
        cudaEventElapsedTime(&timestamp, event_start, event_stop);
        printf("Calculated in %f\n", timestamp);
    }
+3

I think the problem is that you are using integer multiplication. On the compute capability 3.0 architecture, 32-bit integer multiply has only 1/6 of the throughput of 32-bit floating-point operations (see the table below, taken from the CUDA C Programming Guide, version 5.5). Compare the 32-bit integer multiply row with the 32-bit floating-point row for compute capability 3.0.

Some other integer operations and type conversions that are commonly used in compute applications also have reduced throughput on compute capability 3.0.

[Table: Throughput of Native Arithmetic Instructions (operations per clock cycle per multiprocessor), CUDA C Programming Guide v5.5]
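
As a rough cross-check of that 1/6 figure (not part of the original answer): a minimal sketch using the compute capability 3.0 per-SMX throughputs from that table, together with the GTX 680's 8 SMX and ~1006 MHz base clock, which are assumptions taken from the spec page linked in the question.

    #include <stdio.h>

    int main(void)
    {
        // Per-SMX, per-clock throughputs for compute capability 3.0
        // (CUDA C Programming Guide v5.5 table shown above):
        const double fp32_fma_per_clk = 192.0;  // 32-bit float add / mul / FMA
        const double imad_per_clk     = 32.0;   // 32-bit integer multiply / multiply-add

        // GTX 680 figures assumed from the public spec page: 8 SMX at ~1006 MHz base clock
        const double smx = 8.0, clk_hz = 1.006e9;

        double fp32_peak = fp32_fma_per_clk * 2.0 * smx * clk_hz;  // 2 FLOPs per FMA -> ~3.09 TFLOPS
        double imad_peak = imad_per_clk * smx * clk_hz;            // ~258e9 integer MADs/s

        printf("FP32 peak : %.2f TFLOPS\n", fp32_peak / 1e12);
        printf("IMAD peak : %.0f GIMAD/s (FP32 lane rate / %.0f)\n",
               imad_peak / 1e9, fp32_fma_per_clk / imad_per_clk);
        return 0;
    }

The roughly 6:1 gap printed here is the ratio this answer refers to.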

+1

See if you get better results with this code. It is just an example; it does not do exactly the same thing as your code, and I think you will have to redo the FLOPs calculation for it.

    #include <stdio.h>

    using namespace std;

    __global__ void test(float loop, float *out)
    {
        register float a=1.0f;
        for (float x=0; x<loop; x++)
        {
            a+=x*loop;
            a+=x*loop;
            a+=x*loop;
            a+=x*loop;
            a+=x*loop;
            a+=x*loop;
            a+=x*loop;
            a+=x*loop;
            a+=x*loop;
            a+=x*loop;
        }
        if (out!=NULL) *out=a;
    }

    int main(int argc, char *argv[])
    {
        float timestamp;
        cudaEvent_t event_start, event_stop;

        // Initialise
        cudaDeviceReset();
        cudaSetDevice(0);
        cudaThreadSetCacheConfig(cudaFuncCachePreferShared);

        // Allocate and generate buffers
        cudaEventCreate(&event_start);
        cudaEventCreate(&event_stop);
        cudaEventRecord(event_start, 0);

        dim3 threadsPerBlock;
        dim3 blocks;
        threadsPerBlock.x=32;
        threadsPerBlock.y=32;
        threadsPerBlock.z=1;
        blocks.x=1;
        blocks.y=1000;
        blocks.z=1;

        test<<<blocks, threadsPerBlock, 0>>>(30, NULL);

        cudaEventRecord(event_stop, 0);
        cudaEventSynchronize(event_stop);
        cudaEventElapsedTime(&timestamp, event_start, event_stop);
        printf("Calculated in %f\n", timestamp);
    }

When I compile this with arch=sm_20 or sm_30, I get 10 fma instructions in the kernel loop, with no intervening code. I think it will run faster and closer to the peak theoretical FLOPS than your code. Yes, there is a difference between integer ops per second and floating-point ops per second. If you run this code, please comment and let me know what the measured performance works out to.

0

Your test kernel performs integer operations, not floating-point operations, so FLOPS is the wrong metric for this kernel.

 FLOPS = FLoating point Operations Per Second 

Getting back to the original question: your kernel is slow because GPUs are optimized for floating-point calculations, not integer calculations.

To do a proper test, try converting the test kernel to use floating-point numbers instead of integers.

Also, in the loop where you annotate the instructions with FLOPS, FLOPS again does not make sense, because it is a rate per second and these are integer operations. After converting the kernel, count them simply as individual floating-point operations, not floating-point operations per second.
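
A minimal sketch of that conversion (illustration only, not code from the answer): the original kernel rewritten with float arithmetic, keeping the same loop structure, so the counted operations really are floating-point operations:

    // Hypothetical float version of the original test kernel (sketch only).
    __global__ void test_fp(int loop, float *out)
    {
        float a = 0.0f;
        for (int x = 0; x < loop; x++) {
            a += (float)x * (float)loop;   // should compile to an FFMA instead of an IMAD
        }
        if (out != NULL)
            *out = a;                      // prevents the compiler from removing the loop
    }

It can be launched with the same <<<blocks, threadsPerBlock>>> configuration as in the question; only the operation count per iteration (2 FLOPs for the FMA) needs to be redone.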

0

Source: https://habr.com/ru/post/1442725/

