Why would using 8 threads be faster than 4 threads on a quad core Hyper Threaded processor?

I have a quad-core i7 920 processor. It's Hyperthreaded, so the computer thinks it has 8 cores.

From what I've read on the internet, when doing parallel tasks I should use the number of physical cores, not the number of hyper-threaded logical cores.

So, I did some timings and was surprised that using 8 threads in a parallel loop is faster than using 4 threads.

Why is this? My sample code is too long to post here, but you can find it in the example here: https://github.com/jsphon/MTVectorizer
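For context, a timing experiment like the one in the question can be sketched as below. This is my own stripped-down harness, not the linked repo's code; `busy` is a stand-in workload. Note that in CPython the GIL serializes pure-Python bytecode, so thread-level speedups show up mainly when the work releases the GIL (NumPy ops, I/O, C extensions).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def busy(n):
    # Stand-in CPU workload; replace with your real task.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(workers, tasks=8, n=200_000):
    # Run the same batch of tasks with `workers` threads; return wall time.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        start = time.perf_counter()
        list(ex.map(busy, [n] * tasks))
        return time.perf_counter() - start

for workers in (1, 2, 4, 8):
    print(f"{workers} threads: {timed(workers):.3f}s")
```

Plotting wall time against the worker count (1, 2, 4, 8) reproduces the shape of the graph below.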

The performance graph is shown below:

[performance graph]

2 answers

(Intel) hyper-threading cores act as (up to) two processors.

The observation is that a single CPU has a set of resources that would ideally be busy continuously, but in practice those resources often sit idle while the CPU waits for some external event, usually a memory read or write.

By adding a bit of extra state for another hardware thread (for example, another copy of the registers plus some extra bookkeeping), the "single" CPU can switch its attention to executing the other thread when the first one blocks. (One can generalize this to N hardware threads, and other architectures have done so; Intel settled on 2.)

If both hardware threads spend much of their time waiting on various events, the CPU can effectively do the corresponding amount of processing for both of them. 40 nanoseconds of memory wait is a long time. So if your program does a lot of memory accesses, I'd expect it to look as if both hardware threads were fully effective, i.e., you should get nearly 2x throughput.

If the two hardware threads are doing work that is highly local (for example, intensive computation using only registers), then internal waits become minimal and the single CPU cannot switch fast enough to service both hardware threads as fast as they generate work. In that case, per-thread performance degrades. I don't remember where I heard this, and it was long ago: under such circumstances the net effect is more like 1.3x than the idealized 2x. (Expecting the SO audience to correct me on this.)
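A back-of-envelope model (my own simplification, not part of the answer) makes both numbers plausible: suppose a hardware thread computes for `c` ns, then stalls for `w` ns on memory. One thread keeps the core busy `c / (c + w)` of the time; a second thread can run during the first's stalls, up to full utilization.

```python
def utilization(c, w, threads):
    # Fraction of time the core's execution resources are busy,
    # given `threads` hardware threads each computing c ns per w ns stall.
    return min(1.0, threads * c / (c + w))

# Memory-heavy workload: 10 ns of compute per 40 ns stall.
print(utilization(10, 40, 1))  # 0.2
print(utilization(10, 40, 2))  # 0.4 -> 2x throughput from the second thread
# Compute-heavy workload: 40 ns of compute per 10 ns stall.
print(utilization(40, 10, 1))  # 0.8
print(utilization(40, 10, 2))  # 1.0 -> only 1.25x, close to the ~1.3x above
```

The model ignores switch latency and shared caches, but it shows why hyper-threading pays off most on memory-bound code.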

Your application may switch back and forth between these behaviors depending on which part is currently running, so you get a blend of performance. I'd be happy with any speedup I can get.

Ira Baxter explained your question very well, but I want to add one more thing (I can't comment on his answer because I don't have enough reputation): there is overhead in moving from one thread to another. This process, called a context switch ( http://wiki.osdev.org/Context_Switching#Hardware_Context_Switching ), requires the CPU core to update its registers to reflect the state of the new thread. This cost is significant when you context-switch between processes, and somewhat cheaper when you switch between threads. This means two things:
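The switch cost is easy to feel in practice. Here is a rough micro-benchmark sketch (my own, not from the answer): two OS threads are forced to alternate strictly via events, so every round trip includes two scheduler hand-offs.

```python
import threading
import time

N = 10_000
a, b = threading.Event(), threading.Event()

def ping():
    # Wait for the main thread, then hand control back, N times.
    for _ in range(N):
        a.wait()
        a.clear()
        b.set()

t = threading.Thread(target=ping)
t.start()
start = time.perf_counter()
for _ in range(N):
    a.set()     # wake the other thread
    b.wait()    # block until it hands control back
    b.clear()
switch_time = (time.perf_counter() - start) / (2 * N)
t.join()
print(f"~{switch_time * 1e6:.1f} us per thread hand-off")
```

Each hand-off typically costs on the order of microseconds, which is thousands of instructions' worth of time; this is the overhead the answer is pointing at.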

1) Hyper-threading will never give you the theoretical 2x performance increase, since the cost of a context switch is non-trivial. This is also why running far more logical threads than cores degrades performance, per Ira's point: frequent context switching multiplies that cost.

2) Eight single-threaded processes will run slower than four two-threaded processes doing the same work. So if you plan to do multi-threaded work, you should use the Python threading library or the lightweight greenlet library ( https://greenlet.readthedocs.org/en/latest/ ).


Source: https://habr.com/ru/post/978569/

