Actually, I am going to post these comments as an answer.
Spawning threads for trivial operations simply adds overhead.
First, make sure your compiler uses vector instructions to implement your loop. (If it doesn’t know how to do this, you may have to code the vector instructions yourself; try looking up the SSE “intrinsics”. But for this kind of simple vector addition, automatic vectorization should be possible.)
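For reference, a hand-written version of a simple vector addition might look like the sketch below. It assumes an x86 target with SSE and `float` elements; the function name `add_floats_sse` and its parameters are made up for illustration and do not come from the question.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: out[i] = a[i] + b[i] for n floats. */
void add_floats_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: four floats per iteration, using unaligned loads and stores
       so the arrays do not need any particular alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```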
Assuming your compiler is a fairly modern GCC, invoke it with
gcc -O3 -march=native ...
Add -ftree-vectorizer-verbose=2 to find out whether it auto-vectorizes your loop and, if not, why. (Newer GCC releases replace this option with -fopt-info-vec and -fopt-info-vec-missed.)
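As a rough illustration, this is the kind of loop GCC can usually auto-vectorize. The function name and the `float` element type are assumptions, since the original code is not shown here. The `restrict` qualifiers promise the compiler that the arrays do not overlap; without them GCC may still vectorize, but typically has to add a runtime overlap check first.

```c
#include <stddef.h>

/* Hypothetical function: element-wise sum of two float arrays.
   restrict tells the compiler that a, b and out never alias. */
void add_floats(const float *restrict a, const float *restrict b,
                float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Compiling this file with the flags above should produce a vectorizer report stating whether the loop was vectorized.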
If you are already using vector instructions, you may be saturating your memory bandwidth. Modern processor cores are pretty fast ... If so, you need to restructure at a higher level to get more operations into each loop iteration, finding ways to perform many operations on blocks that fit into the L1 cache.
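A minimal sketch of that restructuring, under assumptions: the three stages below are hypothetical stand-ins for whatever separate passes the real program makes, and the block size assumes a 32 KiB L1 data cache, which is common but not universal. Instead of running each stage over the whole array (three trips through main memory), all stages run on one small block while it is still in L1.

```c
#include <stddef.h>

/* Assumed block size: 2048 floats = 8 KiB per array, so the two arrays
   touched per block (16 KiB total) fit in an assumed 32 KiB L1 data cache. */
#define BLOCK 2048

/* Hypothetical stages standing in for separate passes over the data. */
static void scale(float *x, size_t n, float s)
{
    for (size_t i = 0; i < n; ++i) x[i] *= s;
}

static void add_to(float *x, const float *y, size_t n)
{
    for (size_t i = 0; i < n; ++i) x[i] += y[i];
}

static void square(float *x, size_t n)
{
    for (size_t i = 0; i < n; ++i) x[i] *= x[i];
}

/* Computes x[i] = (x[i] * s + y[i])^2, running all three stages
   block by block rather than array by array. */
void process_blocked(float *x, const float *y, size_t n, float s)
{
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t len = (n - start < BLOCK) ? (n - start) : BLOCK;
        scale(x + start, len, s);
        add_to(x + start, y + start, len);
        square(x + start, len);
    }
}
```

Whether this helps, and which block size works best, depends on the machine, so it is worth measuring rather than assuming.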