Memory Bandwidth Performance for Modern Machines

I am developing a real-time system that occasionally has to duplicate a large chunk of memory. The memory consists of reasonably sized regions, so I expect the copy performance to come fairly close to the maximum bandwidth the relevant components (CPU, RAM, motherboard) are capable of. That got me wondering: what kind of raw memory bandwidth can modern commodity hardware actually deliver?

My aging Core2Duo gives me 1.5 GB/s if I use a single thread to memcpy() (and, as expected, less per thread if both cores memcpy() at the same time). While 1.5 GB/s sounds like plenty of data, the real-time application I'm working on has only about 1/50 of a second to spare, which limits each copy to about 30 MB. Almost nothing, really. And perhaps worst of all: as I add cores, I can process much more data, yet the mandatory duplication step gets no faster.

But a low-end Core2Duo is hardly cutting-edge these days. Are there any sites with hard information, such as actual benchmarks, on raw memory bandwidth for current and near-future hardware?

Also, when duplicating large amounts of data in memory, are there any shortcuts, or is memcpy() as good as it gets?

Given a bunch of cores with nothing else to do, and as much memory as possible to duplicate in a short amount of time, what is the best I can do?

EDIT: I'm still looking for good information on raw memory performance. I just re-ran my old memcpy() test: the same machine and settings now give 2.5 GB/s ...

+4
2 answers

On newer processors, such as Intel's Nehalem, and on AMD's Opterons, memory is "local" to one CPU, where each CPU can have several cores. A core takes less time to access the memory attached to its own CPU and more time to access remote memory, i.e. memory that is local to another CPU. This is called Non-Uniform Memory Access, or NUMA. For the best memcpy performance, you want to configure the BIOS for NUMA mode, bind your threads to specific cores, and always access local memory. You can read more about NUMA on Wikipedia.

Unfortunately, I don't know of a site or document that tracks memcpy performance on the latest processors and chipsets. Your best bet is probably to benchmark it yourself.

Regarding memcpy() performance, there is wide variation between implementations. Intel's C library (or perhaps the compiler itself) ships a memcpy() that is much faster than the one provided with Visual Studio 2005, for example. At least on Intel machines.

The best memory copy you can achieve will depend on the alignment of your data, whether you can use vector instructions, page size, and so on. Implementing a good memcpy() is surprisingly involved, so I recommend finding and benchmarking as many implementations as possible before writing your own. If you know more about your specific copies, such as their alignment and size, you may be able to implement something faster than Intel's memcpy(). If you want the details, the Intel and AMD optimization guides or Agner Fog's software optimization pages are good places to start.

+2

I think you are approaching the problem from the wrong angle. I assume the goal is to export a consistent snapshot of your data without destroying real-time performance. Don't throw hardware at it; use an algorithm.

What you want is a logging system layered on top of your data. When you start the in-memory transfer, you have two threads: the original one, which keeps working and believes it is modifying the data (when in fact each write only produces a log entry), and a new thread that copies the old (unlogged) data to a separate place so it can be written out at leisure.

When the new thread finishes the copy, you bring the data set up to date by merging the log into it until the log is empty. Once that's done, the old thread can go back to reading and writing the data directly instead of going through the log-mediated view.

Finally, the new thread can walk the copied data and begin slowly transferring it to its remote destination.

With such a system in place, you can take effectively instantaneous snapshots of arbitrarily large amounts of data on a running system, provided the in-memory copy finishes before the log grows so large that the real-time system can no longer keep up with its processing requirements.

+1

Source: https://habr.com/ru/post/1304428/
