Kronecker parallel tensor product on gpu using CUDA

Question


I am parallelizing [this file] [1] on the GPU using a [PTX file with matlab parallel.gpu.CUDAkernel] [2]. My problem is with the [tensor product kron] [3] in my code. It should compute kron(a,b) by multiplying each element of the first vector a=<32x1> by every element of the other vector b=<1x32>, so the output is k<32x32>=a.*b. I wrote it in C++ and it worked. Because I only care about the sum of all the elements of the 2D array, m=sum(sum(kron(a,b))), I thought I could simplify it to a 1D array. This is the code I'm working on:

 for(i=0;i<32;i++)
     for(j=0;j<32;j++)
         k[i*32+j]=a[i]*b[j];

This means that each element a[i] is multiplied by every element of b. For the GPU version I launch 32 blocks of 32 threads each, and I thought the kernel should be

 __global__ void myKrom(int* c,int* a, int*b)
 {
     int i=blockDim.x*blockIdx.x+threadIdx.x;
     while(i<32)
     {
         c[i]=a[blockIdx.x]+b[blockDim.x*blockIdx.x+threadIdx.x];
     }
 }

This should do the trick, since blockIdx.x plays the role of the outer loop, but it didn't work. Could someone tell me where I went wrong? Also, is there a parallel way to compute the sum?


parallel-processing matlab gpu cuda linear-algebra

pyCuda Nov 06 '12 at 10:57


1 answer

ahmad · Accepted Answer · 2012-11-06T12:45:05+0000

You can actually say something like this:

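 __global__ void myKrom(int* c,int* a, int*b)
 {
     // flat output index: blockIdx.x is the row (element of a), threadIdx.x the column (element of b)
     int i=blockDim.x*blockIdx.x+threadIdx.x;
     if(i<32*32)
     {
         // multiply, as in the CPU loop k[i*32+j]=a[i]*b[j]
         c[i]=a[blockIdx.x]*b[threadIdx.x];
     }
 }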

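when you call the kernel using myKrom<<<32, 32>>>(c, a, b); that is, with 32 blocks of 32 threads each. blockIdx.x then selects the element of a and threadIdx.x the element of b, so each thread writes exactly one element of c. In the question's kernel the while loop never changes i, so threads with i<32 never exit and all other threads do nothing.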
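The question also asks for a parallel way to get the sum. The following is a minimal end-to-end sketch, not a tuned implementation. It assumes plain int data and a standalone CUDA program rather than the MATLAB PTX route. The kernel `sumAll`, the constant `N`, and the test values in `main` are hypothetical names and data introduced only for illustration, and error checking is omitted for brevity. The sum uses `atomicAdd`, which is simple but usually slower than a shared-memory reduction.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 32  // vector length from the question (assumed fixed here)

// One thread per output element: the block index picks a[i],
// the thread index picks b[j], and the flat index is i*N + j.
__global__ void myKrom(int* c, const int* a, const int* b)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N * N) {
        c[i] = a[blockIdx.x] * b[threadIdx.x];
    }
}

// Hypothetical helper (not in the original answer): every thread
// adds its element of c into a single accumulator with atomicAdd.
__global__ void sumAll(const int* c, int* sum, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        atomicAdd(sum, c[i]);
    }
}

int main()
{
    // Illustrative host data (assumption): a = 1..32, b = all 2s.
    int ha[N], hb[N], hsum = 0;
    for (int i = 0; i < N; ++i) { ha[i] = i + 1; hb[i] = 2; }

    int *da, *db, *dc, *dsum;
    cudaMalloc(&da, N * sizeof(int));
    cudaMalloc(&db, N * sizeof(int));
    cudaMalloc(&dc, N * N * sizeof(int));
    cudaMalloc(&dsum, sizeof(int));

    cudaMemcpy(da, ha, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dsum, 0, sizeof(int));  // accumulator must start at zero

    myKrom<<<N, N>>>(dc, da, db);       // 32 blocks of 32 threads
    sumAll<<<N, N>>>(dc, dsum, N * N);  // same launch shape covers all N*N elements

    // This blocking copy also waits for both kernels to finish.
    cudaMemcpy(&hsum, dsum, sizeof(int), cudaMemcpyDeviceToHost);

    // CPU cross-check: the sum of all a[i]*b[j] equals sum(a)*sum(b).
    int sa = 0, sb = 0;
    for (int i = 0; i < N; ++i) { sa += ha[i]; sb += hb[i]; }
    printf("GPU sum = %d, sum(a)*sum(b) = %d\n", hsum, sa * sb);

    cudaFree(da); cudaFree(db); cudaFree(dc); cudaFree(dsum);
    return 0;
}
```

Because the sum over all products a[i]*b[j] factors into sum(a)*sum(b), the 32x32 array does not have to be formed at all when only the total m is needed. Two 32-element sums and one multiplication give the same result.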
