Read-only memory is optimized for broadcast, that is, when all the threads in a warp read the same memory location. If they read different locations, the reads still work, but each distinct location referenced by the warp costs additional time. When a read is broadcast to the threads, read-only memory is MUCH faster than texture memory.
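A minimal sketch of that broadcast pattern, assuming the "read-only memory" here means CUDA `__constant__` memory; the names (`FILTER_SIZE`, `d_coeffs`, `apply_filter`) and sizes are invented for illustration and are not from the answer above:

```cuda
#include <cuda_runtime.h>

#define FILTER_SIZE 16

// Constant memory lives in device memory but is cached and broadcast-capable:
// one fetch can feed an entire warp when all threads read the same address.
__constant__ float d_coeffs[FILTER_SIZE];

__global__ void apply_filter(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < FILTER_SIZE; ++k) {
        // Broadcast case: every thread in the warp reads d_coeffs[k] for the
        // same k at the same time, so one constant-cache access serves them all.
        // A thread-dependent index such as d_coeffs[i % FILTER_SIZE] would make
        // the warp touch up to 16 distinct locations and serialize the reads.
        acc += d_coeffs[k] * in[min(i + k, n - 1)];
    }
    out[i] = acc;
}

int main()
{
    const int n = 1 << 20;
    float h_coeffs[FILTER_SIZE];
    for (int k = 0; k < FILTER_SIZE; ++k) h_coeffs[k] = 1.0f / FILTER_SIZE;

    // Copy the coefficients into constant memory before launching the kernel.
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    apply_filter<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```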
Texture memory has high latency, even for cache hits. You can think of it as a bandwidth aggregator: if there is reuse that can be serviced out of the texture cache, the GPU does not have to go out to external memory for those reads. For 2D and 3D textures, the addressing has two-dimensional and three-dimensional locality, so cache-line fills pull in 2D and 3D blocks of memory instead of rows.
Finally, the texture pipeline can perform "bonus" computations: handling boundary conditions ("texture addressing") and converting 8- and 16-bit values to normalized floats are examples of operations that can be done "for free". (They are part of the reason texture reads have high latency.)
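To make both points concrete, here is a sketch using the texture-object API (my own illustration, with assumed names and image size, not code from the answer): an 8-bit image read with cudaReadModeNormalizedFloat comes back as floats in [0,1], cudaAddressModeClamp handles out-of-range coordinates in hardware, and the overlapping 3x3 windows fetched by neighbouring threads are exactly the kind of 2D reuse the texture cache is built to serve:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void blur3x3(cudaTextureObject_t tex, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    // Neighbouring threads fetch overlapping 3x3 windows, so most of these
    // reads hit the 2D-local texture cache instead of going to DRAM.
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            // Out-of-range coordinates are clamped by the texture unit, and the
            // 8-bit texels arrive already converted to floats in [0, 1].
            sum += tex2D<float>(tex, x + dx, y + dy);

    out[y * width + x] = sum / 9.0f;
}

int main()
{
    const int width = 512, height = 512;

    // The 8-bit source image goes into a CUDA array for 2D texture fetches.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaArray_t cuArray;
    cudaMallocArray(&cuArray, &desc, width, height);

    unsigned char *h_img = (unsigned char *)calloc(width * height, 1);
    cudaMemcpy2DToArray(cuArray, 0, 0, h_img, width, width, height,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0]   = cudaAddressModeClamp;        // boundary handling "for free"
    texDesc.addressMode[1]   = cudaAddressModeClamp;
    texDesc.filterMode       = cudaFilterModePoint;
    texDesc.readMode         = cudaReadModeNormalizedFloat; // uchar -> float in [0, 1]
    texDesc.normalizedCoords = 0;                           // address by integer texel coords

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    float *d_out;
    cudaMalloc(&d_out, width * height * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    blur3x3<<<grid, block>>>(tex, d_out, width, height);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(tex);
    cudaFreeArray(cuArray);
    cudaFree(d_out);
    free(h_img);
    return 0;
}
```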