How to avoid invalidation of a line in the cache from several threads writing to a common array?

Problem Context:

I write code that creates 32 threads, and establishes their proximity to each of the 32 cores in my multi-core multiprocessor system.

The threads simply execute the RDTSCP command, and the value is stored in the general array in a non-overlapping position, this is a common array:

uint64_t rdtscp_values[32]; 

So, each thread will write to a specific position of the array based on its main number.

So that everyone knows, everything works correctly, except that I know that I cannot use the correct data structure to avoid row bouncing .

PS: I already checked that my processor cache line is 64 bytes wide.

Since I use a simple uint64_t array, this means that one cache line will store 8 positions of this array due to reading.

Question:

Because of this simple array, although threads are written to different indexes, my understanding is that each write to this array will invalidate the cache to all other threads?

How to create a structure that is aligned with the line of the cache?

EDIT 1

My system: 2x Intel Xeon E5-2670 2.30GHz (8 cores, 16 threads)

+5
source share
2 answers

Yes, you definitely want to avoid the “false exchange” and ping-pong in the cash line. But this probably does not make sense: if these memory cells are thread-private more often than other streams collect them, they should be stored with other data in the stream so that you do not spend time on the 56 byte cache. See Also Cache method for collecting results from multiple threads . (There is no big answer; do not design a system that requires very fine-grained results if you can.)


But let's assume for a minute that the unused padding between the slots for different threads is what you want.

Yes, you need the step to be 64 bytes (1 cache line), but in fact you do not need the 8B that you use to be at the beginning of each cache line. Thus, you do not need any additional alignment if the uint64_t objects uint64_t naturally aligned (therefore, they do not split along the boundary of the cache line).

Well, if each thread writes to the third qword of its cache line instead of the 1st. OTOH matching 64B ensures that nothing else will share the cache line with the first element, and that’s easy, so we could as well.


Static Storage : Aligning static storage is very easy in ISO C11 using alignas() or with compiler-specific stuff.

Using a structure, padding is implicit to make the size a multiple of the desired alignment. Having one member requiring alignment implies that the entire structure requires at least that alignment. The compiler will take care of this for you with static and automatic storage, but you should use aligned_alloc or an alternative for verified dynamic allocation.

 #include <stdalign.h> // for #define alignas _Alignas for C++ compat #include <stdint.h> // for uint64_t // compiler knows the padding is just padding struct { alignas(64) uint64_t v; } rdtscp_values[32]; int foo(unsigned t) { rdtscp_values[t].v = 1; return sizeof(rdtscp_values[0]); // yes, this is 64 } 

Or with an array as @Eric Postpischil suggested :

 alignas(64) // optional, stride will still be 64B without this. uint64_t rdtscp_values_2d[32][8]; // 8 uint64_t per cache line void bar(unsigned t) { rdtscp_values_2d[t][0] = 1; } 

alignas() is optional unless you care that everything is consistent with 64B, just having a 64B step between the elements you use. You can also use __attribute__((aligned(64))) in GNU C or C ++ or __declspec(align(64)) for MSVC, using #ifdef to define an ALIGN macro that is portable through the main x86 compilers.


In any case, it turns out the same asm. We can check the compiler output to make sure that we got what we wanted. I placed it in the Godbolt compiler explorer . We get:

 foo: # and same for bar mov eax, edi # zero extend 32-bit to 64-bit shl rax, 6 # *64 is the same as <<6 mov qword ptr [rax + rdtscp_values], 1 # store 1 mov eax, 64 # return value = 64 = sizeof(struct) ret 

Both arrays are declared the same way, and the compiler requests 64B alignment from the assembler / linker with the 3rd argument to .comm

  .comm rdtscp_values_2d,2048,64 .comm rdtscp_values,2048,64 

Dynamic storage :

If the number of threads is not a compile time constant, then you can use the aligned allocation function to align dynamically allocated memory (especially if you want to support a very large number of threads). See How to fix 32-byte alignment for AVX load / store operations? but really just use C11 aligned_alloc . It is perfect for this and returns a pointer compatible with free() .

 struct { alignas(64) uint64_t v; } *dynamic_rdtscp_values; void init(unsigned nthreads) { size_t sz = sizeof(dynamic_rdtscp_values[0]); dynamic_rdtscp_values = aligned_alloc(nthreads*sz, sz); } void baz(unsigned t) { dynamic_rdtscp_values[t].v = 1; } baz: mov rax, qword ptr [rip + dynamic_rdtscp_values] mov ecx, edi # same code as before to scale by 64 bytes shl rcx, 6 mov qword ptr [rax + rcx], 1 ret 

The array address is no longer a reference time constant, so there is an additional level of indirection for accessing it. But the pointer is read-only after its initialization, so it will remain open in the cache in each core and reload it when necessary, very cheap.


Footnote: in i386 System V ABI, uint64_t by default only has 4B alignment inside structures (without alignas(8) or __attribute__((aligned(8))) ), so if you put an int before a uint64_t and didn "Do not alignment of the whole structure, it would be possible to get layouts in the cache line, but compilers align it to 8B whenever possible, so your structure with the addition is still beautiful.

+2
source

DECISION

So, I followed the comments here, and I have to say thanks for all the contributions.

Finally, I got what I expected: cache lines are used correctly for each thread.

Here is the general structure:

 typedef struct align_st { uint64_t v; uint64_t padding[7]; } align_st_t __attribute__ ((aligned (64))); 

I use uint64_t padding[7] inside the structure to fill in the remaining bytes in the cache line when this structure is loaded into L1 cache. However, I ask the compiler to use 64 byte alignment when compiling __attribute__ ((aligned (64))) .

So, I distribute this structure dynamically based on the number of cores, using memalign() to do this:

 align_st_t *al = (align_st_t*) memalign(64, n_cores * sizeof(align_st_t)); 

To compare this, I wrote one version of the code ( V1 ) that uses these consistent mechanisms and another version of the code ( V2 ) that uses a simple array method.

Performing with perf and I got these numbers:

  • V1 : 7.184 cache misses;
  • V2 : 2.621.347 caching.

PS : each stream writes 1 thousand times at the same address of the general structure only to increase the number

0
source

Source: https://habr.com/ru/post/1272846/


All Articles