Streaming Multiprocessors (SMs) have caches, but they are relatively small and will not help with truly random access.
Instead, one of the ideas underlying GPUs is to hide memory-access latency: each SM is assigned more threads to execute than it has cores. On every clock cycle, it schedules some thread that is not blocked on a memory access. When the data a thread needs is not in the SM's cache, that thread stalls until the data arrives, and another thread is selected for execution.
Note that the working assumption is that you are doing heavy computation. If all you do is a little computation on a lot of data, for example just adding up a lot of 32-bit floats, it is very likely that the bottleneck will be memory-bus bandwidth, and most of the time your threads will be stalled waiting for their bits to arrive.
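A minimal sketch of that bandwidth-bound case (the kernel name and parameters are illustrative, not from the original text): each thread loads one float, does roughly one arithmetic operation, and stores one float, so the memory bus is saturated long before the arithmetic units are.

```cuda
// Hypothetical bandwidth-bound kernel: ~1 FLOP per 8 bytes of traffic,
// so threads spend most of their time stalled on global-memory loads,
// and extra resident threads cannot hide the latency.
__global__ void scale_add(const float* __restrict__ in,
                          float* __restrict__ out,
                          float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * scale + 1.0f;  // one load, trivial math, one store
}
```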
In practice, though, you often do heavy computation on the data. For instance, you fetch normals and material parameters, and then run a big lighting calculation on them. Here, while some threads are performing calculations, others are waiting for their data to arrive.
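A sketch of that compute-heavy case, with hypothetical names (`shade`, `Light`, the shading formula itself are assumptions, not the author's code): each thread fetches a normal and a material parameter once, then performs many arithmetic operations on them, so while one warp waits on its fetch, the SM has plenty of ready warps doing math.

```cuda
// Illustrative light structure; layout and shading model are assumptions.
struct Light { float3 dir; float3 color; };

// Compute-heavy kernel: a few global-memory fetches per thread,
// followed by a loop of arithmetic over all lights. The arithmetic
// gives the scheduler ready threads to run while other threads stall.
__global__ void shade(const float3* normals, const float* shininess,
                      const Light* lights, int numLights,
                      float3* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 nrm   = normals[i];    // one fetch...
    float  shiny = shininess[i];  // ...and one more
    float3 sum   = make_float3(0.0f, 0.0f, 0.0f);

    for (int l = 0; l < numLights; ++l) {  // ...then lots of math on them
        float d = fmaxf(0.0f, nrm.x * lights[l].dir.x
                            + nrm.y * lights[l].dir.y
                            + nrm.z * lights[l].dir.z);
        float s = powf(d, shiny);
        sum.x += lights[l].color.x * (d + s);
        sum.y += lights[l].color.y * (d + s);
        sum.z += lights[l].color.z * (d + s);
    }
    out[i] = sum;
}
```

With enough threads resident per SM, the memory stalls of some warps overlap with the arithmetic of others, which is exactly the latency hiding described above.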