Calling MatAdd<<<numBlocks,threadsPerBlock>>>(pA,pB,pC); instead of MatAdd<<<numBlocks,threadsPerBlock>>>(A,B,C); solves the problem.
The reason is that A, B and C are allocated on the CPU (host), while pA, pB and pC are allocated on the GPU (device) using cudaMalloc(). Once pA, pB and pC have been allocated, the input values are copied from the CPU to the GPU with cudaMemcpy(pA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice); and likewise for B into pB.
The addition is then done on the GPU, i.e. on pA, pB and pC. To print the result with printf, the contents of pC are copied back from the GPU to the CPU via cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);
Think of it this way: the CPU cannot see the memory that pA points to, and the GPU cannot see A.
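The steps described above can be put together roughly as follows. This is only a minimal sketch: the original MatAdd kernel is not shown here, so the flat-index kernel below, the helper name addOnGpu, the value N = 4 and the 16x16 block size are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4  // matrix dimension (assumed; small so the output is easy to read)

// Stand-in kernel (assumption): each thread adds one element of an N x N
// matrix stored as a flat array. All three pointers must be device pointers.
__global__ void MatAdd(const int *a, const int *b, int *c)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
        c[row * N + col] = a[row * N + col] + b[row * N + col];
}

// Hypothetical helper: A, B and C live on the host; pA, pB and pC on the device.
cudaError_t addOnGpu(const int *A, const int *B, int *C)
{
    const size_t bytes = (N * N) * sizeof(int);
    int *pA = nullptr, *pB = nullptr, *pC = nullptr;

    // 1. Allocate device memory.
    cudaError_t err = cudaMalloc((void **)&pA, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&pB, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&pC, bytes);

    // 2. Copy the inputs from the host to the device.
    if (err == cudaSuccess) err = cudaMemcpy(pA, A, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(pB, B, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel with the device pointers, not A, B and C.
    if (err == cudaSuccess) {
        dim3 threadsPerBlock(16, 16);
        dim3 numBlocks((N + 15) / 16, (N + 15) / 16);
        MatAdd<<<numBlocks, threadsPerBlock>>>(pA, pB, pC);
        err = cudaGetLastError();  // reports launch errors
    }

    // 4. Copy the result back so the host can read it.
    if (err == cudaSuccess) err = cudaMemcpy(C, pC, bytes, cudaMemcpyDeviceToHost);

    // cudaFree on a null pointer is a no-op, so this is safe after a failure.
    cudaFree(pA);
    cudaFree(pB);
    cudaFree(pC);
    return err;
}

int main()
{
    int A[N * N], B[N * N], C[N * N];
    for (int i = 0; i < N * N; ++i) {
        A[i] = i;
        B[i] = 2 * i;
    }

    cudaError_t err = addOnGpu(A, B, C);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // C is host memory again, so printf can read it directly.
    for (int row = 0; row < N; ++row) {
        for (int col = 0; col < N; ++col)
            printf("%d ", C[row * N + col]);
        printf("\n");
    }
    return 0;
}
```

The blocking cudaMemcpy back to the host also waits for the kernel to finish, so no separate synchronization call is needed in this sketch.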