C++ Multithreaded Independent Data Performance

Question

Consider a very simple C++ class with a single data member:

class Container {
public:
    std::vector<Element> elements;
    Container(int elemCount);
};

Now create N threads, each of which performs a very simple task:

1. Create a local container with a specific vector size.
2. Sweep through the vector and increment each element's val.
3. Repeat step 2 10,000 times (to get timings in seconds rather than milliseconds).
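The steps above can be sketched roughly as follows. This is not the Pastebin code, only a minimal illustration: the Element definition (an 8-byte struct with a single val member), the Result struct, and the helper names runWorker and runBenchmark are all assumptions made for this example.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: an 8-byte element with a single `val` member.
struct Element {
    std::int64_t val = 0;
};
static_assert(sizeof(Element) == 8, "this sketch assumes an 8-byte Element");

class Container {
public:
    std::vector<Element> elements;
    explicit Container(std::size_t elemCount) : elements(elemCount) {}
};

// Hypothetical per-thread result: elapsed time plus a checksum that keeps
// the compiler from optimizing the sweep away.
struct Result {
    long long ms = 0;
    std::int64_t checksum = 0;
};

// Work done by one thread: build a private container, sweep it `loops` times.
Result runWorker(std::size_t elemCount, int loops) {
    const auto start = std::chrono::steady_clock::now();
    Container c(elemCount);  // step 1: thread-local data
    for (int l = 0; l < loops; ++l) {  // step 3: repeat
        for (Element& e : c.elements) {  // step 2: sweep and increment
            ++e.val;
        }
    }
    Result r;
    for (const Element& e : c.elements) {
        r.checksum += e.val;
    }
    const auto end = std::chrono::steady_clock::now();
    r.ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    return r;
}

// Splits totalLoops evenly across threadCount threads and runs them concurrently.
std::vector<Result> runBenchmark(std::size_t threadCount, std::size_t elemCount, int totalLoops) {
    std::vector<Result> results(threadCount);
    std::vector<std::thread> threads;
    const int loopsPerThread = totalLoops / static_cast<int>(threadCount);
    for (std::size_t t = 0; t < threadCount; ++t) {
        threads.emplace_back([&results, t, elemCount, loopsPerThread] {
            results[t] = runWorker(elemCount, loopsPerThread);
        });
    }
    for (std::thread& th : threads) {
        th.join();
    }
    return results;
}
```

Calling runBenchmark(4, 300000, 10000) would correspond to the 4-thread, 300,000-element run, with results[i].ms holding each thread's time. Each thread touches only its own Container, so the threads share no data, only hardware.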

The complete code listing is available on Pastebin.

According to CoreInfo, my processor (Intel Core i5 2400) has 4 cores, each with its own L1 and L2 caches:

Logical to Physical Processor Map:
*---  Physical Processor 0
-*--  Physical Processor 1
--*-  Physical Processor 2
---*  Physical Processor 3

Logical Processor to Cache Map:
*---  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
*---  Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
*---  Unified Cache       0, Level 2,  256 KB, Assoc   8, LineSize  64
-*--  Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
-*--  Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
-*--  Unified Cache       1, Level 2,  256 KB, Assoc   8, LineSize  64
--*-  Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
--*-  Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
--*-  Unified Cache       2, Level 2,  256 KB, Assoc   8, LineSize  64
---*  Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
---*  Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
---*  Unified Cache       3, Level 2,  256 KB, Assoc   8, LineSize  64
****  Unified Cache       4, Level 3,    6 MB, Assoc  12, LineSize  64

For a vector size of up to 100,000 elements, timing is exactly as expected:

Elements count: 100.000

Threads: 1
loops: 10000 ms: 650

Threads: 4
loops: 2500 ms: 168
loops: 2500 ms: 169
loops: 2500 ms: 169
loops: 2500 ms: 171

However, for larger vector sizes, running on several cores is slower than running on one:

Elements count: 300.000

Threads: 1
loops: 10000 ms: 1968

Threads: 4
loops: 2500 ms: 3817
loops: 2500 ms: 3864
loops: 2500 ms: 3927
loops: 2500 ms: 4008

My questions:

- Why does this happen, given that each thread works on its own data and each core has its own L1/L2 caches?
( ) ?

EDIT:

@user2079303: Element has a val member; sizeof(Element) = 8. The code is on Pastebin.

@bku_drytt: (). , elemCount ( ).

@Jorge González Lorenzo: Regarding L3, here are single-threaded timings for increasing element counts:

Elements count: 50.000
Threads: 1
loops: 50000 ms: 1615

Elements count: 200.000 (4 times bigger)
Threads: 1
loops: 50000 ms: 1615 (slightly more than 4 times longer)

Elements count: 800.000 (4 times bigger again)
Threads: 1
loops: 50000 ms: 42181 (MUCH more than 4 times longer)

c++ performance multithreading

Marcel 16 . '15 10:52

Tim B · Answer 1 · 2015-11-24T10:39:20+0000

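The L3 cache is shared by all 4 cores, so with 4 threads the combined working set (x4 the single-threaded one) has to fit into it, and for the larger vectors it no longer does. L1 and L2 are per-core, but L3 is not. Running x4 threads needs 4 times the cache space.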
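A rough working-set calculation helps check the cache-capacity explanation against the timings in the question. The sketch below assumes sizeof(Element) = 8 (as stated in the question's edit) and the 6 MB shared L3 reported by CoreInfo; the constant and function names are made up for this example, and the check ignores everything else that competes for cache space.

```cpp
#include <cstddef>

// Assumptions taken from the question: 8-byte elements, 6 MB L3 shared by all cores.
constexpr std::size_t kElementBytes = 8;
constexpr std::size_t kL3Bytes = 6u * 1024u * 1024u;

// Total bytes touched when every thread sweeps its own vector of elemCount elements.
constexpr std::size_t workingSetBytes(std::size_t elemCount, std::size_t threads) {
    return elemCount * kElementBytes * threads;
}

// True if the combined working set of all threads fits in the shared L3.
constexpr bool fitsInL3(std::size_t elemCount, std::size_t threads) {
    return workingSetBytes(elemCount, threads) <= kL3Bytes;
}

// 100,000 elements x 4 threads = 3.2 MB: fits, and the speedup was linear.
static_assert(fitsInL3(100000, 4), "3.2 MB fits in 6 MB");
// 300,000 elements x 1 thread = 2.4 MB: fits.
static_assert(fitsInL3(300000, 1), "2.4 MB fits in 6 MB");
// 300,000 elements x 4 threads = 9.6 MB: does not fit, matching the slowdown.
static_assert(!fitsInL3(300000, 4), "9.6 MB exceeds 6 MB");
// 800,000 elements x 1 thread = 6.4 MB: does not fit even single-threaded.
static_assert(!fitsInL3(800000, 1), "6.4 MB exceeds 6 MB");
```

The cases that exceed 6 MB are the same ones that show the sharp slowdown in the question, which is consistent with the sweeps spilling out of L3 into main memory.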
