Multiprocessors versus multithreading in the context of PThreads

I have an application level question (PThreads) regarding the choice of hardware and its impact on software development.

I have multi-threaded code, well tested on a multi-core single processor box.

I am trying to decide what to buy for my next box:

  • A single 6-core processor
  • Two quad-core processors (dual socket)

My question is: if I go with the dual-processor unit, will porting my code be seriously affected? Or can I just create more threads and let the OS handle the rest?

In other words, is multi-processor programming different from (single-processor) multi-threaded programming in the context of a PThreads application?

I thought that at this level it would not matter, but while configuring a new box I noticed that I had to buy separate memory for each processor. That is when the cognitive dissonance hit.

More about the code (for those who are interested): I read a ton of data from disk into one huge chunk of memory (~24 GB, and it will be more soon), then I spawn my threads. This initial chunk of memory is "read-only" (by my own coding rules), so I do no locking on it. I got confused looking at the dual quad-core configurations: they seem to require separate memory per processor. In the context of my code, I have no idea what happens "under the hood" if I simply allocate a bunch of extra threads. Will the OS copy my chunk of memory from one CPU's memory bank to the other's? That would affect how much memory I have to buy (a cost increase for that configuration). The ideal situation (economical and easy to program) would be for the two processors to share one large memory bank, but if I understand correctly, that may not be possible on Intel's new dual-socket motherboards (for example, the HP ProLiant ML350e)?

+4
3 answers

Modern processors¹ handle their RAM locally and use a separate channel² to communicate with each other. This is the consumer-level version of the NUMA architecture created for supercomputers more than a decade ago.

The idea is to avoid a shared bus (the old FSB), which causes heavy contention because every core uses it to reach memory. As you add more NUMA cells, you get higher aggregate throughput. The downside is that memory becomes non-uniform from the processor's point of view: some RAM is faster to reach than other RAM.

Of course, modern OS schedulers are NUMA-aware, so they try to minimize migrating a task from one cell to another. Sometimes it is fine to move from one core to another within the same socket; sometimes there is a whole hierarchy describing which resources (L1/L2/L3 cache, RAM channel, IO, etc.) are shared and which are not, and that determines whether moving a task incurs a penalty. Sometimes the scheduler decides that waiting for the "right" core would be pointless and it is better to ship the work to another socket entirely.

In the vast majority of cases it is best to let the scheduler do what it knows best. If not, you can play with numactl.
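
If you want that kind of control from inside the program rather than from the command line, libnuma offers roughly the same knobs. The following is only a sketch under stated assumptions (the libnuma development headers are installed, the binary is linked with -lnuma, and node 0 is just an example):

    /* Sketch: run on one NUMA node and allocate RAM on that same node.
     * Assumes <numa.h> is available and the program is linked with -lnuma. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return 1;
        }

        int node = 0;                       /* example: the first NUMA node */
        numa_run_on_node(node);             /* schedule this thread on that node's cores */

        size_t sz = 1UL << 30;              /* 1 GiB, purely illustrative */
        void *buf = numa_alloc_onnode(sz, node);  /* physical RAM on that node */
        if (buf == NULL) {
            perror("numa_alloc_onnode");
            return 1;
        }

        /* ... work on buf ... */

        numa_free(buf, sz);
        return 0;
    }

The command-line equivalent would be something like numactl --cpunodebind=0 --membind=0 ./myprog, which binds both execution and allocation to node 0.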

As for the specific case of this program: the best architecture depends heavily on how much the threads share. If each thread has its own playground and mostly works inside it, a reasonably smart allocator will prefer local RAM, and it matters little which cell each thread ends up on.

If, on the other hand, objects are allocated by one thread, processed by another and consumed by a third, performance will suffer unless all three are on the same cell. You could try to create small groups of threads and keep the heavy data exchange within each group; then each group can comfortably fit on a single cell (see the sketch below).
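
A rough sketch of that grouping idea (Linux-specific, using the GNU pthread_setaffinity_np extension; the assumption that cores 0-3 belong to the same cell is made up, so query the real topology first):

    /* Sketch: confine a group of cooperating threads to the cores of one NUMA cell.
     * Linux-specific; needs _GNU_SOURCE for pthread_setaffinity_np.
     * Treating cores 0-3 as one cell is an assumption for illustration only. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define GROUP_SIZE 4

    static void *worker(void *arg)
    {
        /* ... heavy data exchange with the other threads of this group ... */
        return NULL;
    }

    int main(void)
    {
        cpu_set_t cell0;
        CPU_ZERO(&cell0);
        for (int cpu = 0; cpu < GROUP_SIZE; cpu++)
            CPU_SET(cpu, &cell0);          /* cores 0-3: assumed to share one cell */

        pthread_t group[GROUP_SIZE];
        for (int i = 0; i < GROUP_SIZE; i++) {
            pthread_create(&group[i], NULL, worker, NULL);
            pthread_setaffinity_np(group[i], sizeof(cell0), &cell0);
        }
        for (int i = 0; i < GROUP_SIZE; i++)
            pthread_join(group[i], NULL);
        return 0;
    }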

The worst case is when all the threads participate in one big data-exchange orgy. Even if all your locking is well tuned, there will be no way to optimize it to use more cores than a single cell provides. In that case it may actually be better to confine the whole process to the cores of one cell, effectively giving up on the rest.

¹ By modern I mean any 64-bit AMD chip, and Nehalem or better for Intel.

² AMD calls this channel HyperTransport; Intel's name is QuickPath Interconnect.

EDIT:

You mention that you initialize a "large chunk of read-only memory" and then spawn a lot of threads to work on it. If each thread works on its own part of that chunk, it would be much better to initialize that part inside the thread, after it has been spawned. That lets the threads spread across the available cores, and the allocator will pick local RAM for each one, a much more efficient layout. There may be a way to hint the scheduler to migrate the threads as soon as they are spawned, but I don't know the details.
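
A minimal sketch of that idea, relying on the first-touch policy that Linux uses by default (thread count, slice size and the buffer name are all illustrative):

    /* Sketch: each thread touches (initializes) its own slice of the big chunk,
     * so under the default first-touch policy the pages end up on the NUMA cell
     * where that thread actually runs. Sizes and thread count are examples. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define NTHREADS 8
    #define SLICE    (256UL * 1024 * 1024)   /* 256 MiB per thread, for example */

    static char *big_chunk;                  /* NTHREADS * SLICE bytes */

    static void *worker(void *arg)
    {
        long id = (long)arg;
        char *mine = big_chunk + id * SLICE;

        /* The first touch happens here, in the worker thread, not in main():
         * the kernel backs these pages with RAM local to this thread's cell. */
        memset(mine, 0, SLICE);
        /* ... read this thread's portion of the data and work on it ... */
        return NULL;
    }

    int main(void)
    {
        /* A large allocation like this typically only reserves address space;
         * physical pages are assigned when the memory is first written. */
        big_chunk = malloc(NTHREADS * SLICE);
        if (big_chunk == NULL)
            return 1;

        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        free(big_chunk);
        return 0;
    }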

EDIT 2:

If your data is read verbatim from disk, without any processing, it may be worthwhile to use mmap instead of allocating a big chunk and read()ing into it. There are some general advantages:

  • No need to pre-allocate RAM.
  • The mmap operation is almost instantaneous and you can start using the mapping right away. The data is read lazily as it is needed.
  • The OS can be smarter than you at deciding how to balance memory between the application, the mmaped RAM, buffers and cache.
  • It's less code!
  • Data that is never needed will not be read and will not use RAM.
  • You can map it read-only. Any bug that tries to write to it will cause a coredump.
  • Since the OS knows the mapping is read-only, the pages can never be dirty; if the RAM is needed elsewhere, it will simply drop the pages and re-read them when necessary.

but in this case you also get:

  • Because the data is read lazily, each RAM page is faulted in after the threads have already spread across all available cores; this lets the OS choose pages close to the thread that uses them.

So, I think that if two conditions are satisfied:

  • the data is not processed in any way between the disk and RAM, and
  • each piece of data is read (mostly) by one thread, not by all of them,

then with mmap you should be able to take advantage of machines of any size.
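
A minimal sketch of the mmap approach (the file name is an example, and error handling is kept short):

    /* Sketch: map a large data file read-only instead of read()ing it into a
     * pre-allocated buffer. Pages are faulted in lazily by whichever thread
     * touches them first. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("big_dataset.bin", O_RDONLY);   /* example file name */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* PROT_READ turns the "read-only by convention" rule into a hard one:
         * any accidental write now faults instead of silently corrupting data. */
        const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);    /* the mapping remains valid after closing the descriptor */

        /* ... spawn the pthreads; each one reads its own region of data[] ... */

        munmap((void *)data, st.st_size);
        return 0;
    }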

If each piece of data is read by more than one thread, perhaps you can determine which threads will (mostly) share the same pages, and try to hint the scheduler to keep them on the same NUMA cell.

+8

For the x86 boxes you're looking at, the fact that the memory is physically attached to different processor sockets is an implementation detail. Logically, the machine's total memory appears as one big pool; you won't have to change your application code for it to run correctly across both CPUs.

Performance, however, is a different matter. Cross-socket memory access carries a speed penalty, so an unmodified program may not run at full speed.

Unfortunately, it is hard to say whether your code will run faster on the 6-core single-socket box or the 8-core dual-socket one. Even if we could see your code, it would ultimately be an educated guess. A few things to consider:

  • The cross-socket memory access penalty only applies on a cache miss, so if your program has good cache behavior, NUMA will hardly hurt you;
  • If your threads each write to their own private memory regions and you are limited by memory write bandwidth, the dual-socket machine will help;
  • If you are compute-bound rather than bandwidth-bound, then 8 cores are probably better than 6;
  • If your performance is limited by cache misses, then the 6-core single-socket box starts to look better;
  • If you have a lot of lock contention or writes to shared data, that again points toward the single-socket box.

There are a lot of variables here, so the best thing is to ask your HP reseller to loan you machines matching the configurations you are considering. Then you can benchmark your application, see where it performs best, and order the hardware accordingly.

+2

Without detailed information it is hard to give a detailed answer. However, I hope the following helps you frame the problem.

If your threaded code is correct (e.g. you properly lock shared resources), you should not see new bugs caused by the change in hardware architecture. Incorrect threading code can, however, sometimes be masked by the specifics of how a particular platform handles things like CPU cache access and sharing.

You may see differences in per-core application performance, because single-chip multi-core parts and the multi-chip alternatives take different approaches to managing memory and cache.

In particular, if you are looking at hardware that has separate memory per processor, I would assume that each thread gets locked to the processor it is running on (otherwise the system would have to bear significant overhead moving the thread's memory over to the memory attached to another socket). That can reduce overall system efficiency, depending on your specific situation. However, separate per-processor memory also means that the two processors do not compete with each other for the same cache lines (the 4 cores within each of the two processors may still compete for cache lines, but that is less contention than 6 cores competing for the same cache lines).

This kind of cache-line contention is called false sharing. I suggest reading the following to see whether this might be the problem you are facing:

http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=3
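
To make the false-sharing point concrete, here is a small illustrative sketch (not taken from the article; the 64-byte cache-line size is an assumption typical of x86). Each thread updates its own counter, and the padding keeps the counters on separate cache lines so the cores do not keep invalidating each other's line:

    /* Sketch: avoid false sharing by padding per-thread data out to a cache line.
     * 64 bytes is a typical x86 cache-line size (an assumption, not a given). */
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64
    #define NTHREADS   4
    #define ITERS      10000000L

    /* Without the padding, several counters would share one cache line and every
     * increment would force the other cores to re-fetch that line. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    static struct padded_counter counters[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            counters[id].value++;      /* each thread writes only its own line */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        long total = 0;
        for (int i = 0; i < NTHREADS; i++)
            total += counters[i].value;
        printf("total = %ld\n", total);
        return 0;
    }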

The bottom line is that application behavior should be stable (apart from things that naturally depend on the details of thread scheduling) if you have followed proper threading practices, but performance could go either way depending on exactly what you are doing.

+1
