Yes, you definitely want to avoid false sharing and cache-line ping-pong. But this probably doesn't make sense: if these memory locations are thread-private more often than they're collected by other threads, they should be stored with other per-thread data so you aren't wasting cache footprint on 56 bytes of padding. See also "Cache-friendly way to collect results from multiple threads". (There's no great answer; avoid designing a system that needs really fine-grained gathering of results if you can.)
But let's assume for a minute that the unused padding between the slots for different threads is what you want.
Yes, you need the stride to be 64 bytes (1 cache line), but you don't actually need the 8B you're using to be at the start of each cache line. Thus, you don't need any extra alignment as long as the uint64_t objects are naturally aligned (so they aren't split across a cache-line boundary).
It's fine if each thread is writing to the 3rd qword of its cache line instead of the 1st. OTOH, aligning to 64B makes sure nothing else is sharing a cache line with the first element, and it's easy, so we might as well.
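To illustrate the stride-versus-alignment point, here is a minimal sketch. The names `slots`, `slot_for` and `cache_line_of` are hypothetical, and 64-byte cache lines are assumed. The array has only natural 8-byte alignment, yet consecutive threads' slots are exactly 64 bytes apart, so no two of them can land in the same cache line.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines, as on current x86 */
#define NTHREADS   32

/* No alignas here: the array only has uint64_t's natural alignment. */
static uint64_t slots[NTHREADS][CACHE_LINE / sizeof(uint64_t)];

static_assert(sizeof(slots[0]) == CACHE_LINE, "stride must be one cache line");

/* Slot used by thread t: the first qword of its 64-byte row. */
static uint64_t *slot_for(unsigned t) { return &slots[t][0]; }

/* Index of the cache line that contains p. */
static uintptr_t cache_line_of(const void *p) {
    return (uintptr_t)p / CACHE_LINE;
}
```

Without 64B alignment of the whole array, something unrelated could still share a line with the first or last row, which is the reason to align anyway.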
Static storage: aligning static storage is very easy in ISO C11 using alignas(), or with compiler-specific stuff.
With a struct, padding is implicit to make the size a multiple of the required alignment. Having one member with an alignment requirement implies that the whole struct requires at least that much alignment. The compiler takes care of this for you with static and automatic storage, but you have to use aligned_alloc or an alternative for over-aligned dynamic allocation.
    #include <stdalign.h>   // for alignas
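A minimal sketch of the struct approach described above. The array name `rdtscp_values`, the function `foo` and the 32-slot size are assumptions, chosen to match the names and the 2048-byte size that appear in the compiler output further down.

```c
#include <assert.h>
#include <stdalign.h>   /* alignas */
#include <stdint.h>     /* uint64_t */

/* One over-aligned member makes the whole struct 64-byte aligned,
   and the compiler pads sizeof up to 64. */
struct { alignas(64) uint64_t v; } rdtscp_values[32];

static_assert(sizeof(rdtscp_values[0]) == 64, "one slot per cache line");
static_assert(sizeof(rdtscp_values) == 2048, "32 slots * 64 bytes");

/* Each thread t writes only to its own cache line. */
void foo(unsigned t) {
    rdtscp_values[t].v = 1;
}
```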
Or with an array, as @Eric Postpischil suggested:

    alignas(64)        // optional, stride will still be 64B without this.
    uint64_t rdtscp_values_2d[32][8];   // 8 uint64_t per cache line

    void bar(unsigned t) {
        rdtscp_values_2d[t][0] = 1;
    }

alignas() is optional if you don't care about the whole thing being 64B aligned and only need a 64B stride between the elements you use. You could also use __attribute__((aligned(64))) in GNU C or C++, or __declspec(align(64)) for MSVC, using #ifdef to define an ALIGN macro that's portable across the major x86 compilers.
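One way such a macro could look, as a sketch only: the name ALIGN comes from the text, but the exact #if chain, the `padded_counter` struct and the `counters` array are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portability macro: pick whichever alignment syntax
   the current compiler understands. */
#if defined(_MSC_VER)
#  define ALIGN(n) __declspec(align(n))
#elif defined(__GNUC__)            /* GCC and compatible compilers (clang) */
#  define ALIGN(n) __attribute__((aligned(n)))
#else
#  include <stdalign.h>
#  define ALIGN(n) alignas(n)      /* ISO C11 fallback */
#endif

struct padded_counter { ALIGN(64) uint64_t v; };

static struct padded_counter counters[8];

static_assert(sizeof(struct padded_counter) == 64, "padded to one cache line");
```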
Either way, it compiles to the same asm. We can check the compiler output to verify that we got what we wanted. I put it up on the Godbolt compiler explorer. We get:
foo:
Both arrays are declared the same way, with the compiler requesting 64B alignment from the assembler/linker with the 3rd arg to .comm:

    .comm   rdtscp_values_2d,2048,64
    .comm   rdtscp_values,2048,64

Dynamic storage:
If the number of threads isn't a compile-time constant, you can use an aligned allocation function to get aligned dynamically allocated memory (especially if you want to support a very high number of threads). See "How to solve the 32-byte-alignment issue for AVX load/store operations?", but really just use C11 aligned_alloc. It's perfect for this, and returns a pointer that's compatible with free().
    struct { alignas(64) uint64_t v; } *dynamic_rdtscp_values;

    void init(unsigned nthreads) {
        size_t sz = sizeof(dynamic_rdtscp_values[0]);
        // aligned_alloc takes (alignment, size), in that order
        dynamic_rdtscp_values = aligned_alloc(sz, nthreads*sz);
    }

    void baz(unsigned t) {
        dynamic_rdtscp_values[t].v = 1;
    }

Compiler output:

    baz:
        mov     rax, qword ptr [rip + dynamic_rdtscp_values]
        mov     ecx, edi            # same code as before to scale by 64 bytes
        shl     rcx, 6
        mov     qword ptr [rax + rcx], 1
        ret

The address of the array is no longer a link-time constant, so there's an extra level of indirection to access it. But the pointer is read-only after it's initialized, so it will stay shared in cache in each core, and reloading it when needed is very cheap.
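A slightly fuller sketch of the dynamic case, adding a failure check, zeroing and cleanup. The names `struct slot`, `dyn_slots`, `slots_init` and `slots_free` are hypothetical, and nthreads is assumed to be non-zero. Note that aligned_alloc takes the alignment first and the size second, and C11 requires the size to be a multiple of the alignment, which holds here because each slot is exactly 64 bytes.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct slot { alignas(64) uint64_t v; };
static_assert(sizeof(struct slot) == 64, "one slot per cache line");

static struct slot *dyn_slots;   /* written once at init, then read-only */

/* Allocate and zero one 64-byte slot per thread.
   Returns 0 on success, -1 if the allocation fails. */
int slots_init(unsigned nthreads) {
    size_t bytes = (size_t)nthreads * sizeof(struct slot);
    struct slot *p = aligned_alloc(alignof(struct slot), bytes);
    if (!p)
        return -1;
    memset(p, 0, bytes);
    dyn_slots = p;
    return 0;
}

/* Memory from aligned_alloc is released with plain free(). */
void slots_free(void) {
    free(dyn_slots);
    dyn_slots = NULL;
}
```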
Footnote: in the i386 System V ABI, uint64_t only has 4B alignment inside structs by default (without alignas(8) or __attribute__((aligned(8)))), so if you put an int before a uint64_t and didn't align the whole struct, it would be possible to get cache-line splits. But compilers align it to 8B whenever possible, so your struct-with-padding is still fine.
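The footnote can be checked with offsetof. This is a sketch with hypothetical struct names: `struct plain` may place `v` at offset 4 when compiled for the i386 System V ABI (and at offset 8 on x86-64), while the explicit alignas(8) in `struct aligned` pins it to offset 8 on either.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Default layout: on i386 System V, v may sit at offset 4. */
struct plain   { int tag; uint64_t v; };

/* Explicit alignment: v is at offset 8 regardless of the ABI default. */
struct aligned { int tag; alignas(8) uint64_t v; };

static_assert(offsetof(struct aligned, v) == 8, "v is 8-byte aligned");
static_assert(alignof(struct aligned) >= 8, "struct inherits member alignment");
```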