As far as shared memory is concerned, __syncthreads() is simply stronger than __threadfence(). As far as global memory is concerned, they are two different things.
- __threadfence_block() stalls the current thread until all of its writes to shared memory are visible to the other threads of the same block. It prevents the compiler from optimizing by caching shared memory writes in registers. It does not synchronize threads, and not all threads need to reach this instruction.
- __threadfence() stalls the current thread until all of its writes to shared and global memory are visible to all other threads.
- __syncthreads() must be reached by all threads of the block (for example, it must not sit inside a divergent if) and ensures that, for all threads in the block, the code preceding the instruction is executed before the instructions that follow it.
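The requirement that every thread of the block reach __syncthreads() can be sketched roughly as follows. This is only an illustration, not code from the question: the kernel name reverseInBlock, the host helper runReverse and the block size N are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 64  // hypothetical block size, chosen for illustration

// Each thread stores one element in shared memory, then reads an element
// that a different thread stored. The barrier sits outside any branch,
// so every thread of the block reaches it.
__global__ void reverseInBlock(int *data)
{
    __shared__ int tile[N];
    int t = threadIdx.x;

    tile[t] = data[t];
    __syncthreads();             // all writes to tile are done and visible
    data[t] = tile[N - 1 - t];   // safe: reads another thread's write
}

// Host helper (hypothetical): true if one block reversed 0..N-1 correctly.
bool runReverse()
{
    int h[N];
    for (int i = 0; i < N; ++i) h[i] = i;

    int *d = nullptr;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverseInBlock<<<1, N>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < N; ++i)
        if (h[i] != N - 1 - i) return false;
    return true;
}

int main()
{
    printf("%s\n", runReverse() ? "reversed" : "mismatch");
    return 0;
}
```

Replacing the barrier with __threadfence_block() would not be enough here: a fence orders the calling thread's own writes but does not wait for the other threads to have written their elements.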
In your particular case, the __threadfence() instruction is used to make sure that the writes to the global result array are visible to everyone. __syncthreads() would only synchronize the threads of the current block, without enforcing that the global memory writes are visible to other blocks. Moreover, at that point in the code you are inside an if branch and only one thread executes it; using __syncthreads() there would result in undefined behavior of the GPU, most likely leading to complete desynchronization of the kernel.
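A minimal sketch of that situation, assuming the code in question resembles the "last block done" pattern used in the Programming Guide's memory fence example. The names sumBlocks and runSum are hypothetical, and each block publishes a dummy value instead of a real partial sum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int count = 0;   // number of blocks that have published

// Hypothetical simplification: block b publishes the value b + 1.
__global__ void sumBlocks(volatile float *result)
{
    __shared__ bool isLastBlockDone;

    if (threadIdx.x == 0) {
        // Only one thread is in this branch, so __syncthreads() is not
        // allowed here. A fence is: it does not wait for other threads.
        result[blockIdx.x] = blockIdx.x + 1.0f;
        __threadfence();             // publish the write before signalling
        unsigned int ticket = atomicInc(&count, gridDim.x);
        isLastBlockDone = (ticket == gridDim.x - 1);
    }
    __syncthreads();                 // outside the branch: all threads arrive

    if (isLastBlockDone && threadIdx.x == 0) {
        // Last block to finish: every other block's write is visible now.
        float total = 0.0f;
        for (unsigned int i = 0; i < gridDim.x; ++i) total += result[i];
        result[0] = total;
        count = 0;                   // reset for a later launch
    }
}

// Host helper (hypothetical): launches `blocks` blocks, returns the total.
float runSum(int blocks)
{
    float *d = nullptr;
    cudaMalloc(&d, blocks * sizeof(float));
    sumBlocks<<<blocks, 32>>>(d);
    float total = 0.0f;
    cudaMemcpy(&total, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return total;
}

int main()
{
    printf("%.0f\n", runSum(8));     // 1 + 2 + ... + 8 = 36
    return 0;
}
```

Without the fence, the block that draws the last ticket could in principle read a result element before the write from another block has become visible to it.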
Check out the chapters below in the CUDA C Programming Guide:
- 3.2.2 "Shared memory" - an example of matrix multiplication
- 5.4.3 "Synchronization Instruction"
- B.2.5 "volatile"
- B.5 "Memory Fence Functions"