I'm trying to test my GTX 680's compute performance, as I have some doubts about how it really works. I was wondering whether someone could run the same test on their GTX 680 to see if they get the same results, or suggest what could be done to get more performance out of the card.
I wrote this little program:
    #include <stdlib.h>
    #include <stdio.h>
    #include <float.h>
    #include "cuComplex.h"
    #include "time.h"
    #include "cuda_runtime.h"
    #include <iostream>
    using namespace std;

    __global__ void test(int loop, int *out)
    {
        register int a = 0;
        for (int x = 0; x < loop; x++)
        {
            a += x * loop;
        }
        if (out != NULL)
            *out = a;
    }

    int main(int argc, char *argv[])
    {
        float timestamp;
        cudaEvent_t event_start, event_stop;

        // Initialise
        cudaDeviceReset();
        cudaSetDevice(0);
        cudaThreadSetCacheConfig(cudaFuncCachePreferShared);

        // Create events and record the start time
        cudaEventCreate(&event_start);
        cudaEventCreate(&event_stop);
        cudaEventRecord(event_start, 0);

        // 1024 threads per block (32 x 32), 1000 blocks
        dim3 threadsPerBlock;
        dim3 blocks;
        threadsPerBlock.x = 32;
        threadsPerBlock.y = 32;
        threadsPerBlock.z = 1;
        blocks.x = 1;
        blocks.y = 1000;
        blocks.z = 1;

        test<<<blocks, threadsPerBlock, 0>>>(300, NULL);

        // Record the stop time and report the elapsed kernel time
        cudaEventRecord(event_stop, 0);
        cudaEventSynchronize(event_stop);
        cudaEventElapsedTime(&timestamp, event_start, event_stop);
        printf("Calculated in %f\n", timestamp);
    }
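Note that none of the CUDA calls above are error-checked, so a failed launch would silently produce a bogus timing. A minimal checking helper I could add (a sketch only; CUDA_CHECK is my own hypothetical macro, not part of the CUDA API):

    #include <cstdio>
    #include <cstdlib>
    #include "cuda_runtime.h"

    // Hypothetical helper: abort with a message if a CUDA runtime call fails.
    #define CUDA_CHECK(call)                                             \
        do {                                                             \
            cudaError_t err_ = (call);                                   \
            if (err_ != cudaSuccess) {                                   \
                fprintf(stderr, "CUDA error '%s' at %s:%d\n",            \
                        cudaGetErrorString(err_), __FILE__, __LINE__);   \
                exit(EXIT_FAILURE);                                      \
            }                                                            \
        } while (0)

    // Example usage around the timing calls:
    // CUDA_CHECK(cudaEventRecord(event_stop, 0));
    // CUDA_CHECK(cudaEventSynchronize(event_stop));
    // CUDA_CHECK(cudaEventElapsedTime(&timestamp, event_start, event_stop));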
Compiling with nvcc, I get this PTX:
    //
    // Generated by NVIDIA NVVM Compiler
    // Compiler built on Sat Sep 22 02:35:14 2012 (1348274114)
    // Cuda compilation tools, release 5.0, V0.2.1221
    //

    .version 3.1
    .target sm_30
    .address_size 64

    .file 1 "/tmp/tmpxft_00000e7b_00000000-9_perf.cpp3.i"
    .file 2 "/opt/home/daniel/a/perf.cu"

    .visible .entry _Z4testiPi(
        .param .u32 _Z4testiPi_param_0,
        .param .u64 _Z4testiPi_param_1
    )
    {
        .reg .pred %p<4>;
        .reg .s32 %r<15>;
        .reg .s64 %rd<3>;

        ld.param.u32 %r6, [_Z4testiPi_param_0];
        ld.param.u64 %rd2, [_Z4testiPi_param_1];
        cvta.to.global.u64 %rd1, %rd2;
        mov.u32 %r13, 0;
        .loc 2 12 1
        setp.lt.s32 %p1, %r6, 1;
        mov.u32 %r14, %r13;
        mov.u32 %r11, %r13;
        @%p1 bra BB0_2;

    BB0_1:
        .loc 2 14 1
        mad.lo.s32 %r14, %r11, %r6, %r14;
        .loc 2 12 20
        add.s32 %r11, %r11, 1;
        .loc 2 12 1
        setp.lt.s32 %p2, %r11, %r6;
        mov.u32 %r13, %r14;
        @%p2 bra BB0_1;

    BB0_2:
        .loc 2 18 1
        setp.eq.s64 %p3, %rd2, 0;
        @%p3 bra BB0_4;
        .loc 2 18 1
        st.global.u32 [%rd1], %r13;

    BB0_4:
        .loc 2 21 2
        ret;
    }
The kernel runs in 1.936 ms.
My calculations put the throughput at about 1.1 TFLOPS, only a third of the card's theoretical 3 TFLOPS (ref: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680). Why is it so slow?
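For reference, I assume the 3 TFLOPS figure comes from one FMA (2 floating-point operations) per CUDA core per clock, with the GTX 680's 1536 cores at its 1006 MHz base clock:

    $$ 1536\ \text{cores} \times 2\ \tfrac{\text{FLOPs}}{\text{core}\cdot\text{cycle}} \times 1.006\ \text{GHz} \approx 3.09\ \text{TFLOPS} $$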
The details of my calculation are as follows:
    mad.lo.s32 %r14, %r11, %r6, %r14;  // 2 FLOPs (multiply-add)
    .loc 2 12 20
    add.s32 %r11, %r11, 1;             // 1 FLOP
    .loc 2 12 1
    setp.lt.s32 %p2, %r11, %r6;        // 1 FLOP
    mov.u32 %r13, %r14;                // 1 FLOP
    @%p2 bra BB0_1;                    // 1 FLOP + 1 FLOP extra as a buffer, since I don't know how much the branch costs
Total for one loop iteration: 7 FLOPs.
Counting only the loop iterations:

We have 300 iterations per thread, 1024 threads per block (32 × 32), and 1000 blocks.

Total operations = 300 × 1024 × 1000 × 7 = 2.15 × 10⁹ FLOPs
The total core time is 1.936 ms.
Therefore, throughput = 2.15 × 10⁹ FLOPs / 1.936 ms ≈ 1.11 TFLOPS.
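Spelled out as a small host-side sketch (the numbers are hard-coded from above; this is not part of the benchmark itself):

    #include <cstdio>

    int main()
    {
        const double ops_per_iter = 7.0;                  // my per-iteration count above
        const double iters        = 300.0;                // loop trip count per thread
        const double threads      = 32.0 * 32.0 * 1000.0; // 1024 threads/block * 1000 blocks
        const double total_ops    = ops_per_iter * iters * threads; // ~2.15e9 operations
        const double seconds      = 1.936e-3;             // measured kernel time
        printf("Throughput: %.2f TFLOPS\n", total_ops / seconds / 1e12); // ~1.11
        return 0;
    }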
Thank you in advance for your help.
Daniel