Since you specifically mention many threads, I assume you have at least a multi-socket system (a NUMA system). Typically, memory banks are associated with processor sockets. That is, each processor is "closest" to its own memory banks and has to go through the other processors' memory controllers to access data in other banks. ("Processor" here means the physical thing in the socket.)
When allocating data, a first-write (first-touch) policy is usually used to determine which memory banks the data ends up in: the memory is placed in the banks of the processor that writes to it first, which means that processor can access it faster than the other processors.
So, at least for multiple processors (and not just multiple cores), there should be a performance improvement from allocating a separate copy for at least each processor. The data must be allocated and copied from within each processor's own thread, not from the main thread, so that the first-write policy places it in the right banks. You also need to make sure that threads do not migrate between processors, because then they are likely to lose the close connection to their memory.
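As a rough illustration of the point above, here is a minimal C++ sketch in which every worker thread makes its own copy, so the first write to each page of the copy happens on the thread that later reads it. The names `pin_current_thread` and `sum_with_private_copies` are made up for this example. Pinning uses the Linux-specific `pthread_setaffinity_np`, and "worker t goes on logical CPU t" is only an assumption for illustration: logical CPU numbers do not map to sockets in any fixed way, so real code would have to look up the topology first.

```cpp
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

#ifdef __linux__
#include <pthread.h>
#include <sched.h>
#endif

// Hypothetical helper: pin the calling thread to one logical CPU.
// Linux only; elsewhere it does nothing and returns false.
bool pin_current_thread(unsigned cpu) {
#ifdef __linux__
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
#else
    (void)cpu;
    return false;
#endif
}

// Each worker pins itself first and only then allocates and fills its own
// copy, so the copy is first written by the thread that will read it.
std::vector<long long> sum_with_private_copies(const std::vector<int>& source,
                                               unsigned num_threads) {
    assert(num_threads > 0 && "need at least one worker");
    std::vector<long long> sums(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&source, &sums, t] {
            // Assumption: logical CPU t is a sensible target for worker t.
            // If pinning fails, the thread simply keeps running unpinned.
            pin_current_thread(t);
            // Allocated and first written here, not in the main thread.
            std::vector<int> local(source.begin(), source.end());
            sums[t] = std::accumulate(local.begin(), local.end(), 0LL);
        });
    }
    for (auto& w : workers) w.join();
    return sums;
}
```

Calling `sum_with_private_copies(data, 4)` returns one sum per worker. Note that a small allocation may be served from pages the allocator has already touched on another thread, so the placement effect is only likely to show up with large buffers.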
I'm not sure how copying the data for each thread on a single processor would affect performance, but I would guess that not copying could make better use of the higher-level caches that are shared between cores.
In any case, benchmark both variants and decide based on actual measurements.
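One way to get such measurements is a small harness like the C++ sketch below. It times only the read loop in each thread, once with all threads reading the same shared buffer and once with each thread reading a copy it made itself, and reports the slowest thread for each case. The names `timed_read`, `Timings` and `compare_layouts` are hypothetical, and this is a rough starting point rather than a rigorous benchmark: threads are not pinned here, they do not start their read loops at exactly the same moment, and an optimizing compiler may simplify the loop.

```cpp
#include <algorithm>
#include <chrono>
#include <numeric>
#include <thread>
#include <vector>

// Sum the data `passes` times; return the seconds spent in the read loop only.
// The total is stored in `sink` so the work has a visible result.
double timed_read(const std::vector<int>& data, int passes, long long& sink) {
    auto start = std::chrono::steady_clock::now();
    long long total = 0;
    for (int p = 0; p < passes; ++p)
        total += std::accumulate(data.begin(), data.end(), 0LL);
    sink = total;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

struct Timings {
    double shared_s;   // slowest thread, all threads reading one buffer
    double private_s;  // slowest thread, each thread reading its own copy
};

Timings compare_layouts(const std::vector<int>& source,
                        unsigned num_threads, int passes) {
    auto run = [&](bool private_copy) {
        std::vector<double> secs(num_threads, 0.0);
        std::vector<long long> sinks(num_threads, 0);
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < num_threads; ++t) {
            workers.emplace_back([&, t] {
                std::vector<int> local;
                if (private_copy) local = source;  // copy made by the reader
                const std::vector<int>& data = private_copy ? local : source;
                secs[t] = timed_read(data, passes, sinks[t]);
            });
        }
        for (auto& w : workers) w.join();
        return secs.empty() ? 0.0 : *std::max_element(secs.begin(), secs.end());
    };
    return {run(false), run(true)};
}
```

For the numbers to say anything about memory placement, the buffer should be clearly larger than the last-level cache, and the run should be repeated several times with threads pinned the same way as in the real program.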