How is 3D texture memory cached?

I have an application in which 96% of the time is spent reading linearly interpolated values from 3D texture memory (red points in the diagram).

My kernels are designed to perform ~1000 memory reads along a line that arbitrarily crosses the texture memory, one thread per line (blue lines). These lines are tightly packed, very close to each other, and travel in almost parallel directions.

The image illustrates the concept. Imagine that the image is a single "slice" of the 3D texture memory, e.g. z=24, and that the same pattern repeats for all z.

At the moment I simply execute one thread per line, one after another, but I realized that I might be able to exploit texture memory locality by launching adjacent lines within the same block, reducing memory read time.
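
For illustration, a minimal sketch of the kind of kernel I mean (simplified, with made-up names; this is not my production code): one thread per line, hardware-interpolated fetches from the 3D texture.

    // One thread per line: ~1000 linearly interpolated reads along it.
    __global__ void sampleLines(cudaTextureObject_t tex,
                                const float3* start, const float3* step,
                                float* out, int nLines, int nSamples)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nLines) return;

        float3 p = start[i];
        float3 d = step[i];  // displacement between consecutive samples
        float acc = 0.0f;
        for (int s = 0; s < nSamples; ++s) {
            // With linear filtering, tex3D returns the trilinear blend
            // of the 8 voxels around (x, y, z).
            acc += tex3D<float>(tex, p.x, p.y, p.z);
            p.x += d.x; p.y += d.y; p.z += d.z;
        }
        out[i] = acc;
    }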

My questions

  • If I have a 3D texture with linear interpolation, how can I benefit from data locality? By running adjacent lines in the same block in 2D, or adjacent lines in 3D (3D neighbors, or just neighbors within a slice)?

  • How big is the cache (or how can I check this in the specs)? Does it load, for example, the requested voxel and +50 voxels around it in all directions? This will directly affect how many neighboring lines I should put in each block!

  • How does interpolation interact with the texture memory cache? Is the interpolation performed on data already in the cache, or does interpolating mean reads from the texture memory itself, increasing the latency?


I am working with an NVIDIA Tesla K40 and CUDA 7.5, if that helps.

1 answer

Since this question is getting old, and no answers exist for some of the things I asked, I will give a benchmark answer, based on my research for the TIGRE toolbox. You can get the source code in the Github repo.

As the answer is based on a specific application of the toolbox, computed tomography, my results are not necessarily true for all applications that use texture memory. In addition, my GPU (see above) is quite good, so your mileage may vary on different hardware.


The specifics

It is important to note that this is a cone beam computed tomography application. This means that:

  • The lines are more or less uniformly distributed across the image, covering most of it.
  • The lines are more or less parallel to adjacent lines, and will predominantly lie in the same plane. E.g. they are always more or less horizontal, never vertical.
  • The sample rate along the lines is the same, meaning that adjacent lines will always sample their next point very close to each other.

All of this information is important for memory locality.

In addition, as stated in the question, 96% of the kernel time is memory reads, so we can safely assume that any change in kernel time is caused by changes in memory read speed.


Questions

If I have a 3D texture with linear interpolation, how can I get the most out of data locality? By executing adjacent lines in the same block in 2D, or adjacent lines in 3D (3D neighbors, or just neighbors within a slice)?

Once you become a little more experienced with texture memory, the answer is straightforward: run as many adjacent lines together as you possibly can. The closer together the memory reads are, the better.

For tomography this effectively means launching square blocks of detector pixels, packing the rays (blue lines in the original image) together.
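
As a sketch (assuming a kernel where thread (u, v) computes the ray for detector pixel (u, v); projectKernel, detector, nu and nv are illustrative names, not the actual TIGRE code), the launch configuration is simply a 2D grid of square blocks:

    // Each thread handles one detector pixel, i.e. one ray; a 2D block
    // groups a square patch of adjacent rays so their samples stay close.
    dim3 block(8, 8);                        // block size benchmarked below
    dim3 grid((nu + block.x - 1) / block.x,  // round up to cover
              (nv + block.y - 1) / block.y); // the whole detector
    projectKernel<<<grid, block>>>(tex, detector, nu, nv);

Inside the kernel, u = blockIdx.x * blockDim.x + threadIdx.x and v = blockIdx.y * blockDim.y + threadIdx.y pick the ray, so the threads of one block sample neighboring lines at every step.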

How big is the cache (or how can I check this in the specs)? Does it load, for example, the requested voxel and +50 voxels around it in all directions? This will directly affect how many neighboring lines I should put in each block!

While it is impossible to say for certain, empirically I have found that it is better to work with smaller blocks. My results, for a 512^3 image with 512^2 rays at a sample rate of ~2 samples/voxel, per block size:

    32x32 -> [18~25] ms
    16x16 -> [14~18] ms
    8x8   -> [11~14] ms
    4x4   -> [25~29] ms

The block sizes are the sizes of the squares of adjacent rays that are computed together. E.g. 32x32 means that 1024 X-rays are computed in parallel, next to each other in a 32x32 square. Since exactly the same operations are performed along each line, at any given step the samples land in roughly a 32x32 patch of the image, covering approximately 32x32x1 indices.

Predictably, at some point decreasing the block size makes the code slower again, but that point is (at least for me) surprisingly small. I think this hints that the texture cache loads relatively small chunks of image data.

This test also yields some information that was not asked in the original question: what happens, speed-wise, to samples outside the image boundaries. Since adding any if condition to the kernel would slow it down significantly, I programmed the kernel so that each line starts sampling at a point that is guaranteed to be outside the image, and stops in a similar place. This was done by creating an imaginary "sphere" around the image and always taking the same number of samples, regardless of the angle between the lines and the image.

If you look at the per-kernel times I showed, you will notice that they all lie in [t, ~sqrt(2)*t], and I verified that the longer times occur precisely when the angle between the lines and the image is a multiple of 45 degrees, where more samples fall inside the image (texture).

This means that sampling outside the image indices (e.g. tex3D(tex, -5, -5, -5) ) costs no computation time: reads outside the volume are free. It is better to sample plenty of out-of-bounds points than to check whether each point falls inside the image, since the if condition slows the kernel down while fetching outside has zero cost.
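
A sketch of that branch-free sampling loop (assuming the bounding sphere has already supplied entry and exit points per ray; integrateRay and the parameter names are illustrative):

    // Every ray takes exactly nSamples steps between two points on an
    // imaginary sphere enclosing the volume, so no per-sample bounds
    // check is needed; the first and last samples fall outside on purpose.
    __device__ float integrateRay(cudaTextureObject_t tex,
                                  float3 entry, float3 exit, int nSamples)
    {
        float3 step = make_float3((exit.x - entry.x) / nSamples,
                                  (exit.y - entry.y) / nSamples,
                                  (exit.z - entry.z) / nSamples);
        float3 p = entry;
        float acc = 0.0f;
        for (int s = 0; s < nSamples; ++s) {   // fixed trip count, no ifs
            // Out-of-range coordinates are handled by the addressing mode
            // (e.g. cudaAddressModeBorder simply returns 0) at no cost.
            acc += tex3D<float>(tex, p.x, p.y, p.z);
            p.x += step.x; p.y += step.y; p.z += step.z;
        }
        return acc;
    }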

How does interpolation interact with the texture memory cache? Is the interpolation performed on data already in the cache, or does interpolating mean reads from the texture memory itself, increasing the latency?

To test this, I ran the same code with linear interpolation ( cudaFilterModeLinear ) and with nearest-neighbor interpolation ( cudaFilterModePoint ). As expected, the speed improves when nearest-neighbor interpolation is used. For 8x8 blocks and the image sizes stated before, on my machine:

    Linear  -> [11~14] ms
    Nearest -> [ 9~10] ms

The speedup is not massive, but it is significant. This hints, as expected, that the time taken to interpolate the data is measurable, so you need to be aware of it when designing applications.
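
For reference, a sketch of the texture setup being compared (texture object API; d_volumeArray stands for a cudaArray already filled with the volume) — the filterMode field is the only thing that changes between the two timings:

    // Resource: the 3D volume lives in a cudaArray.
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = d_volumeArray;

    // Texture description: border addressing returns 0 out of range,
    // which is what makes the out-of-bounds sampling above harmless.
    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeBorder;
    texDesc.addressMode[1] = cudaAddressModeBorder;
    texDesc.addressMode[2] = cudaAddressModeBorder;
    texDesc.filterMode     = cudaFilterModeLinear; // vs cudaFilterModePoint
    texDesc.readMode       = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);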
