Intel's datasheets almost answer this question.
My first clue was a question on Intel forums: https://communities.intel.com/thread/110798
Jaehyuk.Lee, on 01-Feb-2017 09:27, asked almost the same question as me:
The second question is about simultaneous IMC requests and whether they are supported on newer processor models such as Skylake and Kaby Lake. http://www.intel.com/Assets/PDF/datasheet/323341.pdf According to the link above, "The memory controller can run up to 32 simultaneous requests (reads and writes)". I would like to know how many simultaneous requests are supported in the Skylake and Kaby Lake processors. I have already checked the 6th and 7th generation Intel processor specifications, but I cannot find any information.
The link is dead, but his figure of 32 sounds believable.
In response, an Intel employee quoted the 6th Generation Intel® Processor Family datasheet for S-Platforms, Volume 1:
The memory controller has an advanced command scheduler where all pending requests are examined simultaneously to determine the most efficient request to be issued next. The most efficient request is picked from all pending requests and issued to system memory just in time to make optimal use of command overlapping. Thus, instead of having all memory access requests go individually through an arbitration mechanism that forces requests to be executed one at a time, they can be started without interfering with the current request, allowing for concurrent issuing of requests. This allows for optimized bandwidth and reduced latency while maintaining appropriate command spacing to meet the system memory protocol.
Unfortunately, the datasheet for my Xeon E5-2670 v3 does not contain an equivalent section.
Another part of the answer is that the E5-2670 has 4 DDR channels. Memory is interleaved across the channels with 256-byte granularity to optimize bandwidth. In other words, if you read a 1024-byte block starting at address 0, the first 256 bytes come from DIMM 0, bytes 256-511 come from DIMM 1, and so on.
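To make the interleaving concrete, here is a tiny sketch of how addresses would map to channels under a plain linear 256-byte interleave across 4 channels (real controllers may hash the address bits, so take the mapping itself as an assumption):

#include <stdio.h>

/* Assumed mapping: plain 256-byte interleave across 4 channels.
   Real memory controllers may hash address bits, so this is illustrative only. */
#define INTERLEAVE_BYTES 256
#define NUM_CHANNELS     4

static unsigned channel_of(unsigned long addr) {
    return (addr / INTERLEAVE_BYTES) % NUM_CHANNELS;
}

int main(void) {
    /* A 1024-byte block starting at address 0 touches all 4 channels. */
    for (unsigned long addr = 0; addr < 1024; addr += INTERLEAVE_BYTES)
        printf("bytes %4lu-%4lu -> channel %u\n",
               addr, addr + INTERLEAVE_BYTES - 1, channel_of(addr));
    return 0;
}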
Putting these together, I suspect that the memory controller can perform 4 reads in parallel, and that it is smart enough that if 4 or more threads are waiting on reads that map to 4 different DIMMs, it will execute them in parallel. It also has enough hardware to keep up to 32 reads/writes in its scheduling table.
I can think of another possible way to achieve concurrency. Each DDR channel has its own data and address buses. When the memory controller requests a read, it uses the address lines plus some control lines to issue the request, and then waits for the response. For a random read there are typically two waits, the RAS-to-CAS delay and the CAS latency, roughly 15 cycles each. Instead of leaving the address lines idle, you could imagine the memory controller starting another read from a different DIMM (*) during those waiting periods. I have no idea whether this is actually done.
* In fact, according to this AnandTech article, DRAM hardware has more parallelism than just multiple DIMMs per channel: each DIMM can have several ranks, and each rank has many banks. I believe you can switch to a different rank or bank within a DIMM to perform another access in parallel.
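As a toy illustration of that footnote, an address can be thought of as decomposing into channel, rank, bank and row fields. The field widths below (4 channels, 2 ranks, 16 banks) are made-up assumptions, not the real E5-2670 v3 mapping, which is undocumented and probably hashed; the point is only that two accesses differing in the rank or bank bits could in principle overlap on the same channel:

#include <stdio.h>
#include <stdint.h>

/* Toy DRAM address decomposition. Field sizes are assumptions for
   illustration, not the actual (undocumented, likely hashed) mapping. */
typedef struct {
    unsigned byte_in_chunk;  /* offset within the 256-byte interleave unit */
    unsigned channel;        /* 4 channels -> 2 bits */
    unsigned bank;           /* 16 banks   -> 4 bits */
    unsigned rank;           /* 2 ranks    -> 1 bit  */
    unsigned row;            /* remaining high bits  */
} dram_addr;

static dram_addr decode(uint64_t addr) {
    dram_addr d;
    d.byte_in_chunk = addr & 0xFF;  addr >>= 8;
    d.channel       = addr & 0x3;   addr >>= 2;
    d.bank          = addr & 0xF;   addr >>= 4;
    d.rank          = addr & 0x1;   addr >>= 1;
    d.row           = (unsigned)addr;
    return d;
}

int main(void) {
    uint64_t addrs[] = { 0x0, 0x100, 0x12345678, 0x40000000 };
    for (int i = 0; i < 4; i++) {
        dram_addr d = decode(addrs[i]);
        printf("0x%08llx -> channel %u rank %u bank %u row %u\n",
               (unsigned long long)addrs[i], d.channel, d.rank, d.bank, d.row);
    }
    return 0;
}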
EDIT
I measured that my machine can perform at least 6 random accesses in parallel, despite having only 4 memory channels. So a single memory channel must be able to service 2 or more concurrent random accesses, possibly using the scheme described in the previous paragraph.
To get this number, I used tinymembench to measure the DRAM access latency on my machine; the result was 60 ns. I then wrote a small C program that performs 32-bit reads from a 1 GB table of random numbers and adds the result into a checksum. Pseudocode:
uint32_t checksum = 0;
for (int i = 0; i < 256 * 1024 * 1024; i++) {
    unsigned offset = rand32() & (TABLE_SIZE - 1);
    checksum += table_of_random_numbers[offset];
}
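For reference, a fleshed-out version of that benchmark might look like the following; the xorshift rand32(), the exact table initialization and the timing code are my own choices here, not the original program:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TABLE_SIZE (1u << 28)            /* 2^28 uint32_t entries = 1 GiB */
#define ITERS      (256u * 1024 * 1024)

/* Any fast PRNG will do; this xorshift32 is just one choice. */
static inline uint32_t rand32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return *state = x;
}

int main(void) {
    uint32_t *table = malloc((size_t)TABLE_SIZE * sizeof(uint32_t));
    if (!table) return 1;
    for (uint32_t i = 0; i < TABLE_SIZE; i++)
        table[i] = i * 2654435761u;      /* arbitrary "random" contents */

    uint32_t seed = 1, checksum = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint32_t i = 0; i < ITERS; i++) {
        uint32_t offset = rand32(&seed) & (TABLE_SIZE - 1);
        checksum += table[offset];       /* loads are independent of each other */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("checksum %u, %.2f ns per iteration\n", checksum, ns / ITERS);
    return 0;
}

Because nothing in one iteration depends on the load from the previous one, the CPU is free to keep several of these loads in flight at once.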
Each loop iteration took 10 ns on average. This is because the out-of-order and speculative execution machinery in my CPU was able to run about 6 iterations of this loop in parallel, i.e. 10 ns = 60 ns / 6.
If instead I replaced the code with:
unsigned offset = rand32() & (TABLE_SIZE - 1);
for (int i = 0; i < 256 * 1024 * 1024; i++) {
    offset = table_of_random_numbers[offset];
    offset &= (TABLE_SIZE - 1);
}
Then each iteration takes 60 ns, because the loop cannot be parallelized: the address of each access depends on the result of the previous read.
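A runnable sketch of this dependent-chain version, with the same caveat that the details are my own choices, might be:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TABLE_SIZE (1u << 28)
#define ITERS      (256u * 1024 * 1024)

int main(void) {
    uint32_t *table = malloc((size_t)TABLE_SIZE * sizeof(uint32_t));
    if (!table) return 1;
    for (uint32_t i = 0; i < TABLE_SIZE; i++)
        table[i] = i * 2654435761u;      /* arbitrary "random" contents */

    uint32_t offset = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint32_t i = 0; i < ITERS; i++) {
        offset = table[offset];          /* next address depends on this load */
        offset &= TABLE_SIZE - 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("final offset %u, %.2f ns per iteration\n", offset, ns / ITERS);
    return 0;
}

One caveat with this pattern is that the chain can fall into a short cycle that fits in cache; initializing the table as one long random permutation would avoid that, but I have kept the structure of the loop above.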
I also checked the assembly generated by the compiler to make sure it had not parallelized anything itself.
UPDATE 2
I decided to check what happens when I run several tests in parallel, each as a separate process. I used the program fragment above that includes the checksum (i.e. the one that appears to show a 10 ns access latency). Running 6 instances in parallel, I got an average apparent latency of 13.9 ns, which implies roughly 26 accesses in flight in parallel: (60 ns / 13.9 ns) × 6 = 25.9.
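For anyone wanting to reproduce this, a minimal harness for running N copies as separate processes could look like the following (./bench is a placeholder name for the checksum benchmark built from the code above):

#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork N children, each exec'ing an independent copy of the benchmark.
   "./bench" is a placeholder for the checksum program above. */
int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 6;
    for (int i = 0; i < n; i++) {
        if (fork() == 0) {
            execl("./bench", "./bench", (char *)NULL);
            _exit(1);                    /* only reached if exec fails */
        }
    }
    while (wait(NULL) > 0)               /* wait for all children to finish */
        ;
    return 0;
}

Each copy prints its own apparent latency; the concurrency estimate then follows as (single-process latency / apparent latency) × number of copies, as above.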
Six instances seemed to be about the optimum; with more than that, overall throughput decreased.
UPDATE 3 - Answering Peter Cordes' question about the RNG
I tried two different random number generators.
uint32_t g_seed = 12345;

uint32_t fastrand() {
    g_seed = 214013 * g_seed + 2531011;
    return g_seed;
}
and
They both performed roughly the same; I can't remember the exact numbers. The best single-threaded performance I saw was with the simpler RNG, which gave an amortized latency of 8.5 ns, implying 7 reads in parallel (60 ns / 8.5 ns ≈ 7). The assembly for the timed loop:
// Pseudo random number is in edx
// table is in rdi
// loop counter is in rcx
// checksum is in rax
.L8:
    imull   $214013, %edx, %edx
    addl    $2531011, %edx
    movl    %edx, %esi
    movl    %edx, g_seed(%rip)
    andl    $1073741823, %esi
    movzbl  (%rdi,%rsi), %esi
    addq    %rsi, %rax
    subq    $1, %rcx
    jne     .L8
    ret
I do not understand the "g_seed(%rip)" part. Is that a memory access? Why would the compiler do that?
UPDATE 4 - Removed global variable from random number generator
I removed the global variable from the random number generator, as Peter suggested. The generated code was indeed cleaner. I also switched to Intel syntax for the disassembly (thanks for the tip).
// Pseudo random number is in edx
// table is in rdi
// loop counter is in rcx
// checksum is in rax
.L8:
    imul    edx, edx, 214013
    add     edx, 2531011
    mov     esi, edx
    and     esi, 1073741823
    movzx   esi, BYTE PTR [rdi+rsi]
    add     rax, rsi
    sub     rcx, 1
    jne     .L8
    ret
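I have not shown the modified C source above; a version without the global, along the lines Peter suggested, might look like this (the seed now lives in a local variable that the compiler can keep in a register once fastrand() is inlined):

#include <stddef.h>
#include <stdint.h>

/* Same LCG, but the seed is passed in rather than stored in a global,
   so nothing forces a store back to memory on every iteration. */
static inline uint32_t fastrand(uint32_t *seed) {
    *seed = 214013 * *seed + 2531011;
    return *seed;
}

uint32_t run(const uint8_t *table, size_t table_size, size_t iters) {
    uint32_t seed = 12345;
    uint32_t checksum = 0;
    for (size_t i = 0; i < iters; i++) {
        uint32_t offset = fastrand(&seed) & (uint32_t)(table_size - 1);
        checksum += table[offset];       /* byte loads, matching the movzx above */
    }
    return checksum;
}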
However, performance was unchanged, in both the single-process and the multi-process cases.