Modern processors¹ have RAM attached locally to each chip and use a separate channel² to communicate between chips. This is a consumer-level version of the NUMA architecture, created for supercomputers more than a decade ago.
The idea is to avoid a shared bus (the old FSB), which causes heavy contention because every core uses it to reach memory. Each NUMA cell you add gives you more aggregate throughput. The drawback is that memory becomes non-uniform from the processor's point of view: some RAM is faster to reach than other RAM.
Of course, modern OS schedulers are NUMA-aware, so they try to reduce the migration of tasks from one cell to another. Sometimes it is fine to move from one core to another in the same socket; sometimes there is a whole hierarchy specifying which resources (L1, L2, L3 cache, RAM channel, IO, etc.) are shared and which are not, and that determines whether moving a task carries a penalty. Sometimes the scheduler can decide that waiting for the "right" core is pointless, and that it is better to just shovel the work onto another socket.
In the vast majority of cases, it is better to let the scheduler do what it does best. If not, you can play with numactl.
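If you do want to override the scheduler from inside the program, numactl has an in-process counterpart in libnuma. A minimal sketch, assuming libnuma is installed (link with -lnuma); the node number and buffer size are made-up placeholders:

```c
/* Hedged sketch: explicit NUMA binding with libnuma (link with -lnuma).
 * The node number 0 and the 64 MiB size are arbitrary placeholders. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy not supported on this system\n");
        return 1;
    }

    int node = 0;                       /* placeholder: bind to cell 0 */
    numa_run_on_node(node);             /* restrict this thread to that node's CPUs */

    size_t size = 64 * 1024 * 1024;
    void *buf = numa_alloc_onnode(size, node);  /* allocate RAM on that node */
    if (buf) {
        memset(buf, 0, size);           /* touch it so the pages are actually placed */
        numa_free(buf, size);
    }
    return 0;
}
```

The same effect can usually be had from the command line without touching the code, e.g. numactl --cpunodebind=0 --membind=0 ./yourprogram.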
As for this specific program: the best architecture depends heavily on how much the threads share. If each thread has its own playground and mostly works inside it, a reasonably smart allocator will prefer local RAM, and it matters little which cell each thread ends up on.
If, on the other hand, objects are allocated by one thread, processed by another and consumed by a third, performance will suffer if they are not on the same cell. You can try to create small groups of threads and keep the intensive exchange within each group; then each group can be moved to another cell as a unit.
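One way to express such a group is sketched here with the GNU pthread_setaffinity_np extension. The CPU numbers are a made-up topology (check the real one with numactl --hardware) and the worker body is a placeholder:

```c
/* Hedged sketch: pin a group of cooperating threads to the CPUs of one NUMA
 * cell so their producer/consumer traffic stays local.  CPUs 0-3 standing for
 * cell 0 is an assumption; the real topology must be looked up. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *group_worker(void *arg)
{
    /* produce / process / consume objects shared only within this group */
    (void)arg;
    return NULL;
}

int main(void)
{
    cpu_set_t cell0;
    CPU_ZERO(&cell0);
    for (int cpu = 0; cpu < 4; cpu++)   /* assume CPUs 0-3 belong to cell 0 */
        CPU_SET(cpu, &cell0);

    pthread_t group[3];                 /* e.g. producer, worker, consumer */
    for (int i = 0; i < 3; i++) {
        pthread_create(&group[i], NULL, group_worker, NULL);
        pthread_setaffinity_np(group[i], sizeof(cell0), &cell0);
    }
    for (int i = 0; i < 3; i++)
        pthread_join(group[i], NULL);
    return 0;
}
```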
The worst case is when all the threads take part in one big orgy of data exchange. Even if all your locks and processes are well tuned, there will be no way to optimize it to use more cores than a single cell offers. It might even be better to restrict the whole process to the cores of one cell, effectively wasting the rest.
¹ By modern, I mean any 64-bit AMD chip, and Nehalem or better for Intel.
² AMD calls this channel HyperTransport; Intel's name for it is QuickPath Interconnect.
EDIT:
You note that you are initializing a "large chunk of read-only memory" and then spawning a lot of threads to work on it. If each thread works on its own part of that chunk, it would be much better to initialize each part in its thread, after spawning it. That lets the threads spread across the cells, and the allocator will pick local RAM for each one, a much more efficient layout. Maybe there is a way to hint to the scheduler to migrate the threads as soon as they are spawned, but I do not know the details.
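A minimal sketch of that idea with POSIX threads; the thread count and chunk size are arbitrary placeholders:

```c
/* Hedged sketch of "initialize in the thread that will use it": each worker
 * allocates and touches its own slice after it has been spawned, so the
 * first-touch policy places the pages on that worker's local cell. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 8                    /* placeholder values */
#define CHUNK    (16 * 1024 * 1024)

static void *worker(void *arg)
{
    (void)arg;
    /* Allocation and first write happen here, not in main(), so the pages
     * end up in RAM local to whatever cell the scheduler put this thread on. */
    char *mine = malloc(CHUNK);
    if (!mine)
        return NULL;
    memset(mine, 0, CHUNK);           /* first touch: pages are placed now */

    /* ... read-only work on `mine` ... */

    free(mine);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```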
EDIT 2:
If your data is read verbatim from disk, without any processing, it may pay to use mmap instead of allocating a big chunk and read()ing it in. There are some general advantages:
- No need to pre-allocate RAM.
- The mmap operation is almost instantaneous and you can start using the mapping right away. The data is read in lazily, as it is needed.
- The OS can be smarter than you at balancing your application's memory, the mmaped RAM, buffers and cache.
- It's less code!
- Data that is never needed will not be read and will not use RAM.
- You can map it read-only. Any bug that tries to write will cause a coredump.
- Since the OS knows the mapping is read-only, it can never be 'dirty'; if the RAM is needed elsewhere, it will simply drop the pages and re-read them when needed.
But in this case you also get:
- Since the data is read lazily, each RAM page is allocated after the threads have spread over all the available cores; this lets the OS place pages close to the thread that uses them.
So, I think that if two conditions hold:
- the data is not processed in any way between disk and RAM,
- each piece of data is read (mostly) by a single thread, not touched by all of them,
then, just by using mmap, you should be able to take advantage of machines of any size.
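A minimal sketch of the mmap approach, assuming a POSIX system; the file name is a placeholder:

```c
/* Hedged sketch: map the data file read-only instead of malloc()+read().
 * Pages are faulted in lazily by whichever thread touches them first,
 * so they tend to land on that thread's NUMA cell. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);          /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* PROT_READ: any stray write segfaults; MAP_PRIVATE is fine for read-only data */
    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);   /* the mapping stays valid after closing the descriptor */

    /* hand `data` and per-thread offsets to the worker threads here */
    printf("mapped %lld bytes\n", (long long)st.st_size);

    munmap((void *)data, st.st_size);
    return 0;
}
```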
If each piece of data is read by more than one thread, perhaps you can identify which threads will (mostly) share the same pages, and try to hint to the scheduler to keep those threads on the same NUMA cell.