Warning: compiling with nvcc -O3 filename.cu will pass the -O3 option to host code only.
To optimize the CUDA kernel code, you must pass the optimization flags to the PTX compiler, for example:
nvcc -Xptxas -O3,-v filename.cu
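This requests optimization level 3 for the CUDA device code (which is already the default), while -v asks for verbose compilation, which reports per-kernel resource usage such as register counts (useful for the tuning steps described further down).
Another speed-oriented nvcc option is -use_fast_math, which replaces standard math functions with faster intrinsics at the cost of floating-point precision (see the nvcc options for steering GPU code generation).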
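To make the precision trade-off of -use_fast_math concrete, here is a minimal sketch that evaluates sine with both the standard sinf() and the hardware intrinsic __sinf() and reports the largest difference. The kernel name, the helper max_fast_math_error, and the use of managed memory are illustrative assumptions, not part of the original answer.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// Evaluates sine two ways per element: standard sinf() and the faster,
// less accurate intrinsic __sinf(). When compiled with -use_fast_math,
// sinf() is itself mapped to __sinf(), so the two results coincide.
__global__ void sine_both_ways(const float* x, float* precise, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        precise[i] = sinf(x[i]);
        fast[i]    = __sinf(x[i]);
    }
}

// Hypothetical host helper: returns the largest |sinf - __sinf| over n
// samples in [0, 1), or -1.0f if a CUDA call fails.
float max_fast_math_error(int n)
{
    float *x = nullptr, *precise = nullptr, *fast = nullptr;
    size_t bytes = n * sizeof(float);
    // Managed memory keeps the sketch short (an assumption, not a requirement).
    if (cudaMallocManaged(&x, bytes) != cudaSuccess ||
        cudaMallocManaged(&precise, bytes) != cudaSuccess ||
        cudaMallocManaged(&fast, bytes) != cudaSuccess) {
        cudaFree(x); cudaFree(precise); cudaFree(fast);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) x[i] = static_cast<float>(i) / n;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    sine_both_ways<<<blocks, threads>>>(x, precise, fast, n);

    float worst = -1.0f;
    if (cudaDeviceSynchronize() == cudaSuccess) {
        worst = 0.0f;
        for (int i = 0; i < n; ++i)
            worst = fmaxf(worst, fabsf(precise[i] - fast[i]));
    }
    cudaFree(x); cudaFree(precise); cudaFree(fast);
    return worst;
}
```

Calling max_fast_math_error(1024) from main() and printing the result gives a rough feel for how much accuracy the intrinsic gives up on this input range; other functions and ranges may behave differently.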
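Compiler flags alone usually only go so far, though. Larger gains tend to come from explicit changes to the code, such as: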
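- Instruction-Level Parallelism (ILP): have each CUDA thread work on more than one element. For example, to process the elements of an NxN tile, you can keep thread-level parallelism (TLP) while giving each thread 2 elements: launch an NxM block of threads (where M = N/2) and let each thread loop over two rows selected by threadIdx.y.
- Register usage: keep the number of registers used per kernel under control and experiment with -maxrregcount=N. The fewer registers a kernel needs, the more blocks can run concurrently (up to the point where register spilling sets in).
- Loop unrolling: add #pragma unroll N before loops with independent iterations, if there are any, inside your CUDA kernel. N can be 2, 3, 4. The best results come from balancing register pressure against the unroll factor. This is really another form of ILP.
- Data packing: sometimes related buffers, say float A[N],B[N], can be merged into a single float2 AB[N] buffer. This reduces the number of load/store operations.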
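The ILP and loop-unrolling items above can be sketched together as follows. This is a minimal illustration, assuming a 32x32 tile and a trivial per-element operation; the names scale_tile and scale_tile_demo are hypothetical.

```cuda
#include <cuda_runtime.h>

constexpr int N = 32;     // tile edge (assumed value for illustration)
constexpr int M = N / 2;  // block height: each thread covers 2 rows

// Scales an N x N tile in place. Launched with an N x M block, so every
// thread handles two rows: threadIdx.y and threadIdx.y + M. The two
// iterations touch different addresses and are independent, which lets
// the hardware overlap them (ILP).
__global__ void scale_tile(float* tile, float factor)
{
    int col = threadIdx.x;
    #pragma unroll 2  // unroll factor matches the 2 rows per thread
    for (int k = 0; k < 2; ++k) {
        int row = threadIdx.y + k * M;
        tile[row * N + col] *= factor;
    }
}

// Hypothetical host helper: runs the kernel on a tile of ones and checks
// that every element was scaled exactly once.
bool scale_tile_demo()
{
    float* tile = nullptr;
    if (cudaMallocManaged(&tile, N * N * sizeof(float)) != cudaSuccess)
        return false;
    for (int i = 0; i < N * N; ++i) tile[i] = 1.0f;

    dim3 block(N, M);  // 32 x 16 = 512 threads instead of 1024
    scale_tile<<<1, block>>>(tile, 3.0f);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < N * N; ++i) ok = (tile[i] == 3.0f);
    cudaFree(tile);
    return ok;
}
```

Building this with nvcc -Xptxas -v and again with different -maxrregcount=N values shows how the register count reported for scale_tile changes, which is one way to explore the balance described above.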
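Whatever you change, measure the effect: NVIDIA's profiling tools give deeper insight into where a kernel actually spends its time.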
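The data-packing item in the list above might look like this in practice. It is a sketch under stated assumptions: the kernels simply add the two values, and the names add_separate, add_packed and packed_matches_separate are hypothetical.

```cuda
#include <cuda_runtime.h>

// Separate buffers: each thread issues two 32-bit loads.
__global__ void add_separate(const float* A, const float* B, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = A[i] + B[i];
}

// Packed buffer: each thread issues one 64-bit load that fetches both values.
__global__ void add_packed(const float2* AB, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 ab = AB[i];
        out[i] = ab.x + ab.y;
    }
}

// Hypothetical host helper: runs both variants on the same data and
// returns true if they produce identical results.
bool packed_matches_separate(int n)
{
    float *A = nullptr, *B = nullptr, *outS = nullptr, *outP = nullptr;
    float2* AB = nullptr;
    bool ok = cudaMallocManaged(&A, n * sizeof(float)) == cudaSuccess
           && cudaMallocManaged(&B, n * sizeof(float)) == cudaSuccess
           && cudaMallocManaged(&outS, n * sizeof(float)) == cudaSuccess
           && cudaMallocManaged(&outP, n * sizeof(float)) == cudaSuccess
           && cudaMallocManaged(&AB, n * sizeof(float2)) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) {
            A[i] = static_cast<float>(i);
            B[i] = 2.0f * i;
            AB[i] = make_float2(A[i], B[i]);  // interleave the two buffers
        }
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        add_separate<<<blocks, threads>>>(A, B, outS, n);
        add_packed<<<blocks, threads>>>(AB, outP, n);
        ok = (cudaDeviceSynchronize() == cudaSuccess);
        for (int i = 0; ok && i < n; ++i) ok = (outS[i] == outP[i]);
    }
    cudaFree(A); cudaFree(B); cudaFree(outS); cudaFree(outP); cudaFree(AB);
    return ok;
}
```

Packing pays off mainly when the two values are always consumed together; if a kernel only needs A, the packed layout makes it load B as well.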