There is no reason to maximize sequential use of a single cache line. The cache does not operate "one line at a time", and, in general, there is no advantage to using one particular cache line over any other line in the cache.
The useful goal is to maximize the number of accesses that are served by a line already in the L1 cache, instead of having to be fetched from a slower cache or from memory. As long as an access falls within a line that is in the cache, we do not care which line it is.
The i5-2500K is a Sandy Bridge processor. The Sandy Bridge L1 data cache is 32 KiB and eight-way associative, with 64-byte cache lines. This means the 32,768-byte cache holds 512 lines, organized into 64 sets of eight lines each. Each memory address maps to exactly one set, as shown below, and each set holds eight of the lines most recently used within that set. (The replacement policy is not true least-recently-used, but it is an approximation of one and may produce results similar to LRU.)
A cache search works as follows:
- Given the address of a byte x, let t = floor(x / 64) (because of the cache line size).
- Let s = t % 64 (to select the set).
- Check set s to see whether it contains the byte at address x.
Consider the effect of the row length on this cache lookup. With a row length of 65,536 bytes, the addresses of the array elements a[i][j] and a[i+1][j] differ by 65,536 bytes. That means their values of t in the lookup procedure above differ by exactly 1024, and their values of s are identical. Therefore, they map to the same set.
Once the algorithm moves up or down more than eight rows without moving to a column in a different cache line, the single cache set in use cannot hold the nine most recently used cache lines; one of them must be evicted. In effect, the cache behaves as if it were eight lines (512 bytes) instead of 512 lines (32,768 bytes).
An easy way to address this is to pad the array so that the rows are 65,536 + p bytes long, for some amount of padding p. The array must be allocated with extra space and defined with rows longer than normal. The extra columns can usually be ignored. There is no need to initialize them; we do not care about their contents, only about how they affect the addresses. (Alternatively, they can be used for auxiliary data if that is convenient for the program.)
With this padding, the distance between a[i][j] and a[i+1][j] is 65,536 + p bytes, so the difference in t is 1024 + p/64, and the difference in s is p/64 % 64. For example, if p is 64 or 320, the difference in s is 1 or 5, respectively.
I suggest trying 9*64 for p. Any value of 64 or more ensures that array elements in the same column of consecutive rows map to different cache sets. However, the algorithm described in the question travels through columns as well as rows, so, if p were small, our fix of mapping consecutive rows to different cache sets could be undone by column travel returning to the same cache set. Other values of p should be tried as well.
This is not intended to be a complete solution, as there are many other factors that affect performance.