CUDA: difference between __syncthreads() and __threadfence()

I copied the code below from the NVIDIA manual as an example of __threadfence(). Why did they use __threadfence() there? I think using __syncthreads() instead of __threadfence() would give the same result.

Can someone explain the difference between __syncthreads() and __threadfence() calls?

    __device__ unsigned int count = 0;
    __shared__ bool isLastBlockDone;

    __global__ void sum(const float* array, unsigned int N, float* result)
    {
        // Each block sums a subset of the input array
        float partialSum = calculatePartialSum(array, N);

        if (threadIdx.x == 0) {
            // Thread 0 of each block stores the partial sum
            // to global memory
            result[blockIdx.x] = partialSum;

            // Thread 0 makes sure its result is visible to
            // all other threads
            __threadfence();

            // Thread 0 of each block signals that it is done
            unsigned int value = atomicInc(&count, gridDim.x);

            // Thread 0 of each block determines if its block is
            // the last block to be done
            isLastBlockDone = (value == (gridDim.x - 1));
        }

        // Synchronize to make sure that each thread reads
        // the correct value of isLastBlockDone
        __syncthreads();

        if (isLastBlockDone) {
            // The last block sums the partial sums
            // stored in result[0 .. gridDim.x-1]
            float totalSum = calculateTotalSum(result);

            if (threadIdx.x == 0) {
                // Thread 0 of last block stores total sum
                // to global memory and resets count so that
                // next kernel call works properly
                result[0] = totalSum;
                count = 0;
            }
        }
    }
1 answer

As far as shared memory is concerned, __syncthreads() is simply stronger than __threadfence(). As far as global memory is concerned, they are two different things.

  • __threadfence_block() stalls the current thread until all its writes to shared memory are visible to other threads in the same block. It prevents the compiler from optimising by caching shared-memory writes in registers. It does not synchronise the threads, and not all threads have to actually reach this instruction.
  • __threadfence() stalls the current thread until all its writes to shared and global memory are visible to all other threads.
  • __syncthreads() must be reached by all threads of the block (e.g. with no divergent if statement) and guarantees that all code before the instruction has executed, for every thread of the block, before any thread executes the code that follows it.
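To illustrate the __syncthreads() rules above, here is a minimal sketch (not from the original post) of a single-block reduction: the barrier must sit outside the divergent if so that every thread reaches it, and each barrier guarantees that all shared-memory writes from the previous step are visible before the next step reads them.

```cuda
// Illustrative sketch; assumes blockDim.x == 256 and that "in" holds
// at least gridDim.x * 256 elements. CUDA device code; not runnable
// without a host launcher.
__global__ void blockSum(const float* in, float* out)
{
    __shared__ float buf[256];
    buf[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();  // all writes to buf complete before any read

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();  // placed outside the if: reached by ALL threads
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];  // thread 0 writes the block's sum
}
```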

In your particular case, __threadfence() is used to make sure that the writes to the global result array are visible to everyone. __syncthreads() would only synchronise the threads within the current block, without making the global-memory writes visible to the other blocks. Moreover, at that point in the code you are inside an if branch that only one thread executes; putting __syncthreads() there would lead to undefined behaviour of the GPU, most likely a complete desynchronisation of the kernel.
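The cross-block visibility that __threadfence() provides can be sketched as a publish pattern. This is an illustrative example, not from the original post: the names data and flag are made up, and it assumes both blocks are resident on the GPU at the same time, which real code must not rely on.

```cuda
// Hypothetical sketch of publishing a value from one block to another.
__device__ int data = 0;
__device__ volatile int flag = 0;

__global__ void publish(int* out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        data = 42;        // write the payload
        __threadfence();  // payload becomes visible before the flag does
        flag = 1;         // publish the ready flag
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (flag == 0) { }  // spin until block 0 publishes
        // Because of the fence, observing flag == 1 implies the
        // write to data is visible as well.
        *out = data;
    }
}
```

Without the fence, the flag write could become visible to block 1 before the data write, exactly the hazard the result[] write in the question's kernel avoids.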

Check out the chapters below in the CUDA C Programming Guide:

  • 3.2.2 "Shared Memory" - the matrix multiplication example
  • 5.4.3 "Synchronization Instruction"
  • B.2.5 "volatile"
  • B.5 "Memory Fence Functions"

Source: https://habr.com/ru/post/1342938/
