I am running [this file] [1] in parallel on the GPU, using a [PTX file with MATLAB's parallel.gpu.CUDAKernel] [2]. My problem is with the [tensor product kron] [3] in my code. kron(a,b) should multiply each element of the first vector a=<32x1> by every element of the second vector b=<1x32>, so the output is k<32x32>=a.*b. I wrote it in C++ and it worked. Since I only care about the sum of all the elements of the 2D array, m=sum(sum(kron(a,b))), I thought I could simplify it to a 1D array. This is the code I'm working with:
    for (i = 0; i < 32; i++)
        for (j = 0; j < 32; j++)
            k[i*32 + j] = a[i] * b[j];
This means that each element a[i] is multiplied by every element of b. For the GPU version I launch 32 blocks of 32 threads each, and I thought the kernel should be:
    __global__ void myKrom(int* c, int* a, int* b)
    {
        int i = blockDim.x*blockIdx.x + threadIdx.x;
        while (i < 32)
        {
            c[i] = a[blockIdx.x] + b[blockDim.x*blockIdx.x + threadIdx.x];
        }
    }
I expected this to do the trick, since blockIdx.x plays the role of the outer loop, but it doesn't. Could someone tell me where the mistake is? I would also like to ask for a parallel way to compute the sum.
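For reference, here is a minimal CPU sketch of the result I expect the GPU version to reproduce. The function names kronSumDirect and kronSumFactored and the use of std::array are my own choices for illustration, and I am assuming int inputs of length 32 as described above. The second function relies on the identity that the sum over all products a[i]*b[j] equals sum(a) times sum(b), which may be a simpler target to parallelize.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 32;  // vector length taken from the question

// Direct form: visit every product a[i] * b[j] and accumulate it,
// like the double loop above but without storing the 32x32 array k.
long long kronSumDirect(const std::array<int, N>& a,
                        const std::array<int, N>& b) {
    long long total = 0;
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            total += static_cast<long long>(a[i]) * b[j];
        }
    }
    return total;
}

// Factored form: sum_i sum_j a[i]*b[j] == (sum_i a[i]) * (sum_j b[j]),
// so two independent vector sums are enough.
long long kronSumFactored(const std::array<int, N>& a,
                          const std::array<int, N>& b) {
    long long sumA = 0;
    long long sumB = 0;
    for (std::size_t i = 0; i < N; ++i) {
        sumA += a[i];
        sumB += b[i];
    }
    return sumA * sumB;
}
```

Both functions should return the same value for any inputs that do not overflow long long, so either can serve as a check for the kernel's output.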