I didn't check your code in depth, but here are some tips.
I would advise using Thrust if you only use your GPU for such generic tasks, since once a problem gets difficult, most people have no idea how to program the GPU in parallel effectively.
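For reference, with Thrust the whole dot product is a one-liner (a sketch of my own; it assumes your values already live in two thrust::device_vector<double>s named a and b):

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

//multiplies the element pairs and sums the results entirely on the device
double dot = thrust::inner_product(a.begin(), a.end(), b.begin(), 0.0);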
Start a new parallel reduction kernel to sum up the dot product. Since the data is already on the device, you will not see a performance hit from launching a second kernel.
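On the host, that two-pass scheme could look like this (just a sketch; numBlocks, sumPartials and outResult_g are my names, not from your code, and sumPartials would be a plain sum reduction with the same pattern as the kernel below, minus the multiplication):

//first pass: every block writes one partial dot product to outDots_g
dotProduct<<<numBlocks, BLOCK_SIZE>>>(n, inValuesA_g, inValuesB_g, outDots_g);

//second pass: a single block reduces the numBlocks partial sums to the final value
//(if numBlocks > BLOCK_SIZE, repeat this pass instead)
sumPartials<<<1, BLOCK_SIZE>>>(numBlocks, outDots_g, outResult_g);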
Your kernel does not seem to scale to the maximum number of blocks possible on the newest GPUs. If it did, and it could calculate the dot product of millions of values, performance would drop dramatically because of the serialized atomic operation.
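The problematic pattern is every block ending in an atomic add to a single accumulator, roughly like this (a sketch; outDot_g would be a single double in global memory, and note that atomicAdd on double additionally requires compute capability 6.0 or an atomicCAS workaround):

if (threadIdx.x == 0)
    atomicAdd(outDot_g, dots_s[0]); //every block contends for the same address, so the adds execute one after another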
Beginner mistake: are your input and shared memory accesses range-checked? Or are you sure the input size is always a multiple of your block size? Otherwise you will read garbage. Most of my wrong results were caused by this mistake.
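In practice that means rounding the grid size up instead of truncating (a sketch, with n as the total element count):

//covers every element even when n is not a multiple of BLOCK_SIZE
const unsigned long long int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

The matching guard inside the kernel below (if(tId < inCount_s), else write 0) then pads the overhang of the last block with zeros.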
Optimize your parallel reduction: see my thesis or Mark Harris' parallel reduction optimization slides.
Unverified, I just wrote it down in Notepad:
/*
 * @param inCount_s unsigned long long int Length of both input arrays
 * @param inValuesA_g double* First value array
 * @param inValuesB_g double* Second value array
 * @param outDots_g double* Output dot products of each block, length equals the number of blocks
 */
__global__ void dotProduct(const unsigned long long int inCount_s,
                           const double* inValuesA_g,
                           const double* inValuesB_g,
                           double* outDots_g)
{
    //get unique block index in a possible 3D grid
    const unsigned long long int blockId = blockIdx.x //1D
        + blockIdx.y * gridDim.x //2D
        + gridDim.x * gridDim.y * blockIdx.z; //3D

    //block dimension uses only x-coordinate
    const unsigned long long int tId = blockId * blockDim.x + threadIdx.x;

    /*
     * shared array of value-pair products, where BLOCK_SIZE is a power of 2
     *
     * To improve performance, increase its size by a multiple of BLOCK_SIZE so that each thread loads more than 1 element!
     * (the length of outDots_g decreases by the same factor, and you need to range check and initialize the extra memory)
     * -> see Harris' GPU optimisation / parallel reduction slides for more information.
     */
    __shared__ double dots_s[BLOCK_SIZE];

    /*
     * initialize the shared memory array and calculate the product of a value pair;
     * shared memory always needs to be initialized, it is never 0 by default, else garbage is read later!
     */
    if (tId < inCount_s)
        dots_s[threadIdx.x] = inValuesA_g[tId] * inValuesB_g[tId];
    else
        dots_s[threadIdx.x] = 0;
    __syncthreads();

    //do parallel reduction on the shared memory array to sum up the values
    reductionAdd(dots_s, dots_s[0]); //see my thesis link (a minimal sketch follows below)

    //output value
    if (threadIdx.x == 0)
        outDots_g[blockId] = dots_s[0];

    //start a new parallel reduction kernel to sum up outDots_g!
}
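reductionAdd is not defined in the snippet; a minimal version in the spirit of the sequential-addressing variant from Harris' slides (again an unverified sketch of my own, assuming blockDim.x is a power of 2) could look like:

//tree reduction in shared memory: the number of active threads is halved each step
__device__ void reductionAdd(double* data_s, double& outSum_s)
{
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride)
            data_s[threadIdx.x] += data_s[threadIdx.x + stride];
        __syncthreads(); //all writes of this step must finish before the next step reads
    }
    if (threadIdx.x == 0)
        outSum_s = data_s[0]; //data_s[0] now holds the block's sum
}

Harris' slides go further (unrolling the last warp, several elements per thread), which is where most of the remaining speedup comes from.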
Edit: Removed unnecessary points.