I have a computer with an i7-3820 and 4 * 4 GB of DDR3-1600 memory, running Linux. According to the Intel specification, I should be able to scan memory at 51.2 GB/s (not GiB/s). Unfortunately, I only get 40 GB/s.
First of all, I wrote the load routine in assembly, using xmm registers. Suppose it is declared as
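For reference, here is a small sketch of where the 51.2 GB/s figure comes from. It assumes four memory channels, each 64 bits (8 bytes) wide and running at 1600 MT/s; the function name `peak_bandwidth_gbs` is made up for this illustration.

```cpp
#include <cassert>

// Theoretical peak DRAM bandwidth in GB/s (1 GB = 10^9 bytes).
// channels:       number of populated memory channels
// mega_transfers: transfer rate in MT/s, e.g. 1600 for DDR3-1600
// bus_bytes:      bytes moved per transfer per channel (assumed 8, i.e. a 64-bit channel)
inline double peak_bandwidth_gbs(int channels, int mega_transfers, int bus_bytes = 8) {
    assert(channels > 0 && mega_transfers > 0 && bus_bytes > 0);
    return channels * (mega_transfers * 1e6) * bus_bytes / 1e9;
}
```

Under these assumptions, `peak_bandwidth_gbs(4, 1600)` gives the 51.2 GB/s total and `peak_bandwidth_gbs(1, 1600)` gives 12.8 GB/s for a single channel.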
extern "C" { void load_mem_256b(int *start, int *end, int step, int *p_sum); }
The value written to *p_sum is the sum of the first int of every loaded block, so that the loads are not optimized away.
The routine loads 256 bits from the address start, then advances by step * 8 ints (8 * sizeof(int) = 256 bits) and repeats.
I tried two ways to read memory. The first is to start 4 threads and divide the memory into 4 segments. The second is to start 4 threads where thread i loads the i-th 256 bits of every 1024 bits, with the 4 threads properly synchronized.
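To make the access pattern concrete, here is a scalar C++ stand-in for the routine. This is only a sketch, not my assembly: the name `load_mem_256b_ref` is made up, it reads just the first int of each 256-bit block instead of doing a real 256-bit load, and it assumes the routine stops once it reaches end.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference for load_mem_256b: visits one 256-bit block
// (8 ints) every step * 8 ints and sums the first int of each visited block.
extern "C" void load_mem_256b_ref(const int *start, const int *end, int step, int *p_sum) {
    assert(step > 0);
    const std::ptrdiff_t n = end - start;                                  // ints available
    const std::ptrdiff_t stride = static_cast<std::ptrdiff_t>(step) * 8;   // ints per hop
    unsigned sum = 0;  // unsigned, so wrap-around on overflow is well defined
    for (std::ptrdiff_t i = 0; i < n; i += stride)
        sum += static_cast<unsigned>(start[i]);
    *p_sum = static_cast<int>(sum);  // store the result so the reads cannot be dropped
}
```

With step = 1 every block is visited; with step = 4 and a start offset of i * 8 ints, thread i visits every fourth block.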
The first method reaches 40 GB/s, as I mentioned earlier. The second method is slower.
In the first method, if the memory is in ganged mode, there will be many memory accesses to different rows. Since I have 2 ranks per DIMM * 4 DIMMs, I don't know whether this works without a performance drop. In the second method, I assume that the loads hit only the same row and that the streams from the different memory channels can proceed independently (unganged).
The first method is as follows:
    for (int i = 0; i < number_of_threads; ++i)
        threads[i] = std::thread(std::bind(load_mem_256b, start + i * 8, end,
                                           number_of_threads, &(sums[i])));
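For anyone who wants to reproduce the measurement, here is a self-contained sketch of the same launch pattern with timing added. It is not my actual benchmark: `load_ref` is a scalar stand-in for the assembly routine, and `scan_result` and `scan_interleaved` are names made up for this example. The bandwidth figure counts every byte of the buffer as scanned, which is only meaningful when the buffer is much larger than the caches.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical scalar stand-in for load_mem_256b (first int of every visited block).
static void load_ref(const int *start, const int *end, int step, int *p_sum) {
    unsigned sum = 0;
    const std::ptrdiff_t n = end - start, stride = static_cast<std::ptrdiff_t>(step) * 8;
    for (std::ptrdiff_t i = 0; i < n; i += stride) sum += static_cast<unsigned>(start[i]);
    *p_sum = static_cast<int>(sum);
}

struct scan_result {
    int sum;          // combined checksum of all threads
    double gb_per_s;  // nominal bandwidth, 1 GB = 10^9 bytes (0 if too fast to time)
};

// Thread i starts at block i and hops n_threads blocks at a time.
scan_result scan_interleaved(const std::vector<int> &buf, int n_threads) {
    assert(n_threads > 0);
    std::vector<int> sums(n_threads, 0);
    std::vector<std::thread> threads;
    const int *start = buf.data();
    const int *end = start + buf.size();

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n_threads; ++i) {
        // Clamp so that no pointer past the end of the buffer is ever formed.
        const std::size_t off = std::min<std::size_t>(static_cast<std::size_t>(i) * 8, buf.size());
        threads.emplace_back(load_ref, start + off, end, n_threads, &sums[i]);
    }
    for (auto &t : threads) t.join();
    const auto t1 = std::chrono::steady_clock::now();

    const double secs = std::chrono::duration<double>(t1 - t0).count();
    unsigned total = 0;
    for (int s : sums) total += static_cast<unsigned>(s);
    const double bytes = static_cast<double>(buf.size()) * sizeof(int);
    return {static_cast<int>(total), secs > 0 ? bytes / secs / 1e9 : 0.0};
}
```

Calling `scan_interleaved(buf, 4)` on a buffer of several hundred megabytes and reading `gb_per_s` gives a number comparable to the ones I quote, with the caveat that thread start-up time is included in the measurement.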
The second method is as follows:
    size_t amount = 32768;
    my::spin_barrier barrier(number_of_threads + 1);
    for (int i = 0; i < number_of_threads; ++i)
        threads[i] = std::thread(std::bind(load_mem_256b_barrier, start + i * 8, end,
                                           number_of_threads, &barrier, amount, &(sums[i])));
    threads[number_of_threads] = std::thread(std::bind(prefetch, start, end, amount, &barrier));
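The code above relies on my::spin_barrier, which is not shown. The following is a minimal sketch of what such a barrier could look like, assuming a simple generation-counting design built on std::atomic. The names `spin_barrier_sketch` and `barrier_demo` are made up, and this is not necessarily how my::spin_barrier is implemented.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical reusable spin barrier for a fixed number of threads.
class spin_barrier_sketch {
public:
    explicit spin_barrier_sketch(unsigned n) : n_(n), waiting_(0), generation_(0) {
        assert(n > 0);
    }
    void wait() {
        const unsigned gen = generation_.load(std::memory_order_acquire);
        if (waiting_.fetch_add(1, std::memory_order_acq_rel) + 1 == n_) {
            // Last thread to arrive: reset the count, then release the others.
            waiting_.store(0, std::memory_order_relaxed);
            generation_.fetch_add(1, std::memory_order_release);
        } else {
            // Busy-wait until the last thread bumps the generation.
            while (generation_.load(std::memory_order_acquire) == gen) {}
        }
    }
private:
    const unsigned n_;
    std::atomic<unsigned> waiting_;
    std::atomic<unsigned> generation_;
};

// Returns true if, in every round, all threads had arrived before any passed the barrier.
bool barrier_demo(unsigned n_threads, unsigned rounds) {
    spin_barrier_sketch barrier(n_threads);
    std::atomic<unsigned> counter(0);
    std::atomic<bool> ok(true);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < n_threads; ++t)
        threads.emplace_back([&] {
            for (unsigned r = 0; r < rounds; ++r) {
                counter.fetch_add(1);
                barrier.wait();  // every thread has incremented for this round
                if (counter.load() != (r + 1) * n_threads) ok = false;
                barrier.wait();  // every thread has checked before the next round
            }
        });
    for (auto &th : threads) th.join();
    return ok;
}
```

A barrier like this makes every thread wait for the slowest one at each synchronization point, so frequent synchronization adds waiting time on top of the loads themselves.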
Some additional data: in the first method, if I start only 1, 2, or 3 threads, I can load memory at 17 GB/s, 32 GB/s, and 39 GB/s respectively. These numbers puzzle me. If the memory works in unganged mode, why can 1 thread load memory at 17 GB/s? (One channel can only deliver 12.8 GB/s.) But if it works in ganged mode, why is the second method much slower than the first?
And finally, how can I actually load memory at the theoretical speed?