How to add matrices with CUDA C

I am writing a simplified program that adds the elements of two matrices A and B; the code is pretty simple and inspired by the example in Chapter 2 of the CUDA C Programming Guide.

#include <stdio.h>
#include <stdlib.h>

#define N 2

__global__ void MatAdd(int A[][N], int B[][N], int C[][N]){
    int i = threadIdx.x;
    int j = threadIdx.y;

    C[i][j] = A[i][j] + B[i][j];
}

int main(){
    int A[N][N] = {{1,2},{3,4}};
    int B[N][N] = {{5,6},{7,8}};
    int C[N][N] = {{0,0},{0,0}};

    int (*pA)[N], (*pB)[N], (*pC)[N];

    cudaMalloc((void**)&pA, (N*N)*sizeof(int));
    cudaMalloc((void**)&pB, (N*N)*sizeof(int));
    cudaMalloc((void**)&pC, (N*N)*sizeof(int));

    cudaMemcpy(pA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(pB, B, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(pC, C, (N*N)*sizeof(int), cudaMemcpyHostToDevice);

    int numBlocks = 1;
    dim3 threadsPerBlock(N,N);
    MatAdd<<<numBlocks,threadsPerBlock>>>(A,B,C);

    cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);

    int i, j;
    printf("C = \n");
    for(i=0;i<N;i++){
        for(j=0;j<N;j++){
            printf("%d ", C[i][j]);
        }
        printf("\n");
    }

    cudaFree(pA);
    cudaFree(pB);
    cudaFree(pC);

    printf("\n");

    return 0;
}

When I run it, I keep getting the initial matrix C = [0 0; 0 0] instead of the element-wise sum of A and B. I previously wrote a similar example that adds the elements of two arrays, and it seems to work fine; this time, however, I do not know why it does not work.

I believe that something is wrong with the cudaMalloc call; I do not know what else it could be.

Any ideas?

1 answer

Calling MatAdd<<<numBlocks,threadsPerBlock>>>(pA,pB,pC); instead of MatAdd<<<numBlocks,threadsPerBlock>>>(A,B,C); solves the problem.

The reason is that A, B and C are allocated on the CPU (host), while pA, pB and pC are allocated on the GPU (device) using cudaMalloc(). Once pA, pB and pC are allocated, the values are transferred from the CPU to the GPU with cudaMemcpy(pA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);

Then the addition is done on the GPU, i.e. with pA, pB and pC. To print the result with printf, the contents of pC are sent from the GPU back to the CPU via cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);
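Putting the steps above together, a corrected version of the program from the question might look like the sketch below. It keeps the question's names and only changes the kernel arguments to the device pointers; the `bytes` variable is introduced here for brevity and is not in the original. Return values are left unchecked to stay close to the original code.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 2

// Each thread of the N x N block adds one element.
__global__ void MatAdd(int A[][N], int B[][N], int C[][N]) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main() {
    // Host (CPU) matrices.
    int A[N][N] = {{1, 2}, {3, 4}};
    int B[N][N] = {{5, 6}, {7, 8}};
    int C[N][N] = {{0, 0}, {0, 0}};
    size_t bytes = N * N * sizeof(int);  // helper variable, not in the original

    // Device (GPU) buffers.
    int (*pA)[N], (*pB)[N], (*pC)[N];
    cudaMalloc((void**)&pA, bytes);
    cudaMalloc((void**)&pB, bytes);
    cudaMalloc((void**)&pC, bytes);

    // Copy the inputs from host to device.
    cudaMemcpy(pA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(pB, B, bytes, cudaMemcpyHostToDevice);

    // Launch with the DEVICE pointers, not the host arrays.
    dim3 threadsPerBlock(N, N);
    MatAdd<<<1, threadsPerBlock>>>(pA, pB, pC);

    // Copy the result back; this call also waits for the kernel to finish.
    cudaMemcpy(C, pC, bytes, cudaMemcpyDeviceToHost);

    printf("C = \n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            printf("%d ", C[i][j]);
        }
        printf("\n");
    }

    cudaFree(pA);
    cudaFree(pB);
    cudaFree(pC);
    return 0;
}
```

On a machine with a working CUDA device, this should print the rows 6 8 and 10 12. The initial copy of C to pC is dropped because the kernel overwrites every element of pC anyway.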

Keep in mind that the CPU cannot access the memory behind pA, and the GPU cannot access A.
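The code in the question never checks any return values, which is likely why the failure is silent and C simply stays at zero. As a rough sketch, checking the runtime's error codes after the launch usually surfaces this kind of mistake. The `check` helper below is hypothetical (it is not part of CUDA), and the launch deliberately repeats the mistake of passing host arrays. The exact error reported depends on the GPU, driver and operating system, and on some systems with unified host/device addressing the launch may even succeed.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 2

__global__ void MatAdd(int A[][N], int B[][N], int C[][N]) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

// Hypothetical helper: report a CUDA runtime error, return 1 on success.
static int check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return 0;
    }
    return 1;
}

int main() {
    // Host arrays only; no device memory is allocated in this demo.
    int A[N][N] = {{1, 2}, {3, 4}};
    int B[N][N] = {{5, 6}, {7, 8}};
    int C[N][N] = {{0, 0}, {0, 0}};

    // Deliberately wrong: host pointers are passed to the kernel.
    dim3 threadsPerBlock(N, N);
    MatAdd<<<1, threadsPerBlock>>>(A, B, C);

    // Errors in the launch configuration show up here.
    int ok = check(cudaGetLastError(), "kernel launch");

    // Errors raised while the kernel runs show up once the host waits for it.
    ok = check(cudaDeviceSynchronize(), "kernel execution") && ok;

    printf(ok ? "no error reported\n" : "error reported\n");
    return ok ? 0 : 1;
}
```

Because kernel launches are asynchronous, an error raised inside the kernel is only reported by a later runtime call such as cudaDeviceSynchronize() or cudaMemcpy(), so checking those return values matters as well.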


Source: https://habr.com/ru/post/1206098/

