Finding the maximum value in an array with CUDA

I tried to write a small CUDA program to find the maximum value in a given array:

int input_data[50] = {1, 2, 3, 4, 5, ..., 50};

max_value is initialized with the first element, input_data[0], and the final answer is stored in result[0]. The kernel returns 0 as the maximum value, and I do not know what the problem is. I launched the kernel with 1 block of 50 threads.

__device__ int lock = 0;

__global__ void max(float *input_data, float *result)
{
    float max_value = input_data[0];
    int tid = threadIdx.x;

    if (input_data[tid] > max_value)
    {
        do {} while (atomicCAS(&lock, 0, 1));
        max_value = input_data[tid];
        __threadfence();
        lock = 0;
    }
    __syncthreads();
    result[0] = max_value; // Final result of max value
}

I know there are built-in functions for this, but I prefer to work through small problems myself to learn.

4 answers

You are trying to set up a "critical section", but on CUDA this approach can hang your entire program; try to avoid it whenever possible.

Why is your code hanging?

Your kernel (the __global__ function) is executed by groups of 32 threads called warps. All threads within the same warp execute in lockstep. So the warp will spin in your do{} while(atomicCAS(&lock,0,1)) until every thread in the warp manages to acquire the lock. But the whole point of the lock is to prevent multiple threads from executing the critical section at the same time, so only one thread per warp can ever acquire it while its siblings spin forever. This is what causes the hang.

Alternative solution

What you need is a parallel reduction algorithm. You can start reading here:
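As a rough sketch (not the asker's code), a shared-memory max reduction for one block might look like the following. The kernel name, launch configuration, and the assumption that blockDim.x is a power of two are all mine:

```cuda
#include <cfloat>  // FLT_MAX

// Sketch of a tree-style max reduction within one block.
// Assumes blockDim.x is a power of two; launch with
// shared memory size = blockDim.x * sizeof(float).
__global__ void max_reduce(const float *input, float *block_max, int n)
{
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread; pad out-of-range threads with -FLT_MAX
    // so they never win a comparison.
    sdata[tid] = (i < n) ? input[i] : -FLT_MAX;
    __syncthreads();

    // Halve the number of active threads each step, keeping the larger value.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    // Thread 0 writes this block's maximum; a second kernel launch
    // (or a short host-side loop) reduces the per-block results.
    if (tid == 0)
        block_max[blockIdx.x] = sdata[0];
}
```

For the 50-element case in the question, a single block of 64 threads would suffice, and block_max[0] would hold the final answer after one launch.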


There is a potential race in your code. I'm not sure whether you intended the max_value variable to live in shared memory or not, but either way the code is wrong.

1) If max_value is only a local variable, then each thread holds its own copy, which is not the actual maximum (it is merely the maximum of input_data[0] and input_data[tid]). In the last line of the kernel, all threads write their own max_value to result[0], which results in undefined behavior.

2) If max_value is a shared variable, 49 threads enter the if block and try to update max_value one by one under the lock. But the execution order among those 49 threads is undefined, so some thread may overwrite the actual maximum with a smaller value. You would need to compare against max_value again inside the critical section.
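A hedged sketch of what "compare again inside the critical section" would mean, assuming max_value is a __shared__ float and lock is the __device__ int from the question (this fixes the race described in point 2, but the intra-warp deadlock described in the first answer still applies on lockstep hardware, so a proper reduction remains the better fix):

```cuda
// Sketch only: double-checked update of a shared max_value under a spin lock.
if (input_data[tid] > max_value) {           // first, unlocked check
    while (atomicCAS(&lock, 0, 1) != 0) {}   // acquire the lock
    if (input_data[tid] > max_value)         // re-check under the lock:
        max_value = input_data[tid];         // another thread may have won
    __threadfence_block();                   // make the write visible
    atomicExch(&lock, 0);                    // release the lock
}
```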


Max is a reduction; check out the reduction sample in the SDK and compute a max instead of a sum.

The white paper is a bit old, but still quite useful:

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf

The final optimization step is to use warp-synchronous coding to avoid unnecessary __syncthreads() calls.

This requires at least two kernel launches: one to write a set of intermediate max() values to global memory, then another to take the max() of that array.

If you want to do it in a single kernel launch, check out the threadfenceReduction SDK sample. It uses __threadfence() and atomicAdd() to track progress, and then one block performs the final reduction once all blocks have finished writing their intermediate results.


There are different variable scopes on the GPU. When you declare a variable with __device__, it is placed in the GPU's global memory and is accessible to all threads in the grid. __shared__ places the variable in the block's shared memory, where it is accessible only to the threads of that block. Finally, if you use no qualifier at all, as in float max_value, the variable is placed in a thread's registers and is accessible only by that thread. Your code gives each thread its own local max_value and never sees the values held by other threads.
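A small sketch of the three scopes described above (the names and sizes are illustrative, not from the question):

```cuda
// Global memory: one copy, visible to every thread in the grid.
__device__ int g_flag;

__global__ void scopes_demo(void)
{
    // Shared memory: one copy per block, visible only to this block's threads.
    __shared__ float s_buf[64];

    // Register / local: every thread gets its own private copy.
    float max_value = 0.0f;

    int tid = threadIdx.x;
    if (tid < 64)
        s_buf[tid] = max_value;  // each thread reads only its own max_value
}
```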


Source: https://habr.com/ru/post/1343284/
