There is no reason to maximize sequential use of a single cache line. The cache does not operate "one line at a time", and, in general, there is no advantage to using one particular cache line over any other line in the cache.
The useful goal is to maximize the number of accesses that are served by a line already in the L1 cache, instead of having to be fetched from a slower cache or from memory. As long as an access falls within a line that is in the cache, we do not care which line it is.
The i5-2500K is a Sandy Bridge processor. The Sandy Bridge L1 data cache is 32 KiB and eight-way associative, with 64-byte cache lines. This means the 32,768-byte cache holds 512 lines, organized into 64 sets of eight lines each. Each memory address maps to exactly one set, as shown below, and each set holds eight of the lines most recently used within that set. (The replacement policy is not true least-recently-used, but it is an approximation of one and may produce results similar to LRU.)
A cache search works as follows:
- Given the address of a byte x, let t = floor(x / 64) (because of the cache line size).
- Let s = t % 64 (to select the set).
- Check set s to see whether it contains the byte at address x.
Consider the effect of the row length on this cache lookup. With a row length of 65,536 bytes, the addresses of the array elements a[i][j] and a[i+1][j] differ by 65,536 bytes. That means their values of t in the lookup procedure above differ by exactly 1024, and their values of s are identical. Therefore, they map to the same set.
Once the algorithm moves up or down more than eight rows without moving to a column in a different cache line, the single cache set in use cannot hold the nine most recently used cache lines; one of them must be evicted. In effect, the cache behaves as if it were eight lines (512 bytes) instead of 512 lines (32,768 bytes).
An easy way to address this is to pad the array so that the rows are 65,536 + p bytes long, for some amount of padding p. The array must be allocated with extra space and defined with rows longer than normal. The extra columns can usually be ignored. There is no need to initialize them; we do not care about their contents, only about how they affect the addresses. (Alternatively, they can be used for auxiliary data if that is convenient for the program.)
With this padding, the distance between a[i][j] and a[i+1][j] is 65,536 + p bytes, so the difference in t is 1024 + p/64, and the difference in s is p/64 % 64. For example, if p is 64 or 320, the difference in s is 1 or 5, respectively.
I suggest trying 9*64 for p. Any value of 64 or more ensures that array elements in the same column of consecutive rows map to different cache sets. However, the algorithm described in the question travels through columns as well as rows, so, if p were small, our fix of mapping consecutive rows to different cache sets could be undone by column travel returning to the same cache set. Other values of p should be tried as well.
This is not intended to be a complete solution, as there are many other factors that affect performance.