Why is std::pair faster than std::tuple?

Here is the code to test.

Tuple test:

    #include <tuple>
    #include <vector>

    using namespace std;

    int main()
    {
        vector<tuple<int, int>> v;
        for (int var = 0; var < 100000000; ++var) {
            v.push_back(make_tuple(var, var));
        }
    }

Pair test:

    #include <utility>
    #include <vector>

    using namespace std;

    int main()
    {
        vector<pair<int, int>> v;
        for (int var = 0; var < 100000000; ++var) {
            v.push_back(make_pair(var, var));
        }
    }

I took a time measurement using the Linux time command. Results:

|       |  -O0   |  -O2   |
|:------|:------:|:------:|
| Pair  | 8.9 s  | 1.60 s |
| Tuple | 19.8 s | 1.96 s |
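
For reference, a typical build-and-time invocation looks like this (the exact flags are an assumption, the question does not list them):

    # assumed commands; the flags actually used may differ
    g++ -std=c++11 -O0 pair.cpp  -o pair  && time ./pair
    g++ -std=c++11 -O2 pair.cpp  -o pair  && time ./pair
    g++ -std=c++11 -O0 tuple.cpp -o tuple && time ./tuple
    g++ -std=c++11 -O2 tuple.cpp -o tuple && time ./tuple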

I am wondering why there is such a big difference between the two data structures at -O0, since they should be very similar. There is only a slight difference at -O2.

Why is the difference at -O0 so big, and why is there any difference at all?

EDIT:

Code with v.resize():

Pair:

    #include <utility>
    #include <vector>

    using namespace std;

    int main()
    {
        vector<pair<int, int>> v;
        v.resize(100000000);
        for (int var = 0; var < 100000000; ++var) {
            v[var] = make_pair(var, var);
        }
    }

Tuple:

    #include <tuple>
    #include <vector>

    using namespace std;

    int main()
    {
        vector<tuple<int, int>> v;
        v.resize(100000000);
        for (int var = 0; var < 100000000; ++var) {
            v[var] = make_tuple(var, var);
        }
    }

Results:

|       |  -O0   |  -O2   |
|:------|:------:|:------:|
| Pair  | 5.01 s | 0.77 s |
| Tuple | 10.6 s | 0.87 s |

EDIT:

My system:

 g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7) GLIBCXX_3.4.19 
2 answers

You are missing some important information: which compiler do you use? What do you use to measure the performance of the microbenchmark? Which standard library implementation do you use?

My system:

 g++ (GCC) 4.9.1 20140903 (prerelease) GLIBCXX_3.4.20 

In any case, I ran your examples, but first reserved the required size of the vectors to get rid of the memory allocation overhead. There, funnily, I observe the opposite of what you see.
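
A minimal sketch of the pair test with that reservation added (the answer does not show the modified code, so this is an assumption):

    // Sketch only: the question's pair test with reserve() added, so that
    // vector reallocations are taken out of the measurement (assumed change).
    #include <utility>
    #include <vector>

    using namespace std;

    int main()
    {
        vector<pair<int, int>> v;
        v.reserve(100000000);   // pre-allocate once; push_back never reallocates
        for (int var = 0; var < 100000000; ++var) {
            v.push_back(make_pair(var, var));
        }
    }

With the vectors pre-sized, perf stat reports the following for the pair test: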

    g++ -std=c++11 -O2 pair.cpp -o pair
    perf stat -r 10 -d ./pair

    Performance counter stats for './pair' (10 runs):

          1647.045151 task-clock:HG (msec)        #   0.993 CPUs utilized           ( +-  1.94% )
                  346 context-switches:HG         #   0.210 K/sec                   ( +- 40.13% )
                    7 cpu-migrations:HG           #   0.004 K/sec                   ( +- 22.01% )
              182,978 page-faults:HG              #   0.111 M/sec                   ( +-  0.04% )
        3,394,685,602 cycles:HG                   #   2.061 GHz                     ( +-  2.24% ) [44.38%]
        2,478,474,676 stalled-cycles-frontend:HG  #  73.01% frontend cycles idle    ( +-  1.24% ) [44.55%]
        1,550,747,174 stalled-cycles-backend:HG   #  45.68% backend cycles idle     ( +-  1.60% ) [44.66%]
        2,837,484,461 instructions:HG             #   0.84  insns per cycle
                                                  #   0.87  stalled cycles per insn ( +-  4.86% ) [55.78%]
          526,077,681 branches:HG                 # 319.407 M/sec                   ( +-  4.52% ) [55.82%]
              829,623 branch-misses:HG            #   0.16% of all branches         ( +-  4.42% ) [55.74%]
          594,396,822 L1-dcache-loads:HG          # 360.887 M/sec                   ( +-  4.74% ) [55.59%]
           20,842,113 L1-dcache-load-misses:HG    #   3.51% of all L1-dcache hits   ( +-  0.68% ) [55.46%]
            5,474,166 LLC-loads:HG                #   3.324 M/sec                   ( +-  1.81% ) [44.23%]
      <not supported> LLC-load-misses:HG

          1.658671368 seconds time elapsed                                          ( +-  1.82% )

versus

    g++ -std=c++11 -O2 tuple.cpp -o tuple
    perf stat -r 10 -d ./tuple

    Performance counter stats for './tuple' (10 runs):

           996.090514 task-clock:HG (msec)        #   0.996 CPUs utilized           ( +-  2.41% )
                  102 context-switches:HG         #   0.102 K/sec                   ( +- 64.61% )
                    4 cpu-migrations:HG           #   0.004 K/sec                   ( +- 32.24% )
              181,701 page-faults:HG              #   0.182 M/sec                   ( +-  0.06% )
        2,052,505,223 cycles:HG                   #   2.061 GHz                     ( +-  2.22% ) [44.45%]
        1,212,930,513 stalled-cycles-frontend:HG  #  59.10% frontend cycles idle    ( +-  2.94% ) [44.56%]
          621,104,447 stalled-cycles-backend:HG   #  30.26% backend cycles idle     ( +-  3.48% ) [44.69%]
        2,700,410,991 instructions:HG             #   1.32  insns per cycle
                                                  #   0.45  stalled cycles per insn ( +-  1.66% ) [55.94%]
          486,476,408 branches:HG                 # 488.386 M/sec                   ( +-  1.70% ) [55.96%]
              959,651 branch-misses:HG            #   0.20% of all branches         ( +-  4.78% ) [55.82%]
          547,000,119 L1-dcache-loads:HG          # 549.147 M/sec                   ( +-  2.19% ) [55.67%]
           21,540,926 L1-dcache-load-misses:HG    #   3.94% of all L1-dcache hits   ( +-  2.73% ) [55.43%]
            5,751,650 LLC-loads:HG                #   5.774 M/sec                   ( +-  3.60% ) [44.21%]
      <not supported> LLC-load-misses:HG

          1.000126894 seconds time elapsed                                          ( +-  2.47% )

As you can see, in my case the reason is a much larger number of stalled cycles, both in the frontend and in the backend.

Now where does that come from? I bet it boils down to some failed inlining, similar to what is explained here: std::vector performance regression when enabling C++11.

Indeed, enabling -flto equalizes the results for me:

    Performance counter stats for './pair' (10 runs):

          1021.922944 task-clock:HG (msec)        #   0.997 CPUs utilized           ( +-  1.15% )
                   63 context-switches:HG         #   0.062 K/sec                   ( +- 77.23% )
                    5 cpu-migrations:HG           #   0.005 K/sec                   ( +- 34.21% )
              195,396 page-faults:HG              #   0.191 M/sec                   ( +-  0.00% )
        2,109,877,147 cycles:HG                   #   2.065 GHz                     ( +-  0.92% ) [44.33%]
        1,098,031,078 stalled-cycles-frontend:HG  #  52.04% frontend cycles idle    ( +-  0.93% ) [44.46%]
          701,553,535 stalled-cycles-backend:HG   #  33.25% backend cycles idle     ( +-  1.09% ) [44.68%]
        3,288,420,630 instructions:HG             #   1.56  insns per cycle
                                                  #   0.33  stalled cycles per insn ( +-  0.88% ) [55.89%]
          672,941,736 branches:HG                 # 658.505 M/sec                   ( +-  0.80% ) [56.00%]
              660,278 branch-misses:HG            #   0.10% of all branches         ( +-  2.05% ) [55.93%]
          474,314,267 L1-dcache-loads:HG          # 464.139 M/sec                   ( +-  1.32% ) [55.73%]
           19,481,787 L1-dcache-load-misses:HG    #   4.11% of all L1-dcache hits   ( +-  0.80% ) [55.51%]
            5,155,678 LLC-loads:HG                #   5.045 M/sec                   ( +-  1.69% ) [44.21%]
      <not supported> LLC-load-misses:HG

          1.025083895 seconds time elapsed                                          ( +-  1.03% )

and for the tuple:

    Performance counter stats for './tuple' (10 runs):

          1018.980969 task-clock:HG (msec)        #   0.999 CPUs utilized           ( +-  0.47% )
                    8 context-switches:HG         #   0.008 K/sec                   ( +- 29.74% )
                    3 cpu-migrations:HG           #   0.003 K/sec                   ( +- 42.64% )
              195,396 page-faults:HG              #   0.192 M/sec                   ( +-  0.00% )
        2,103,574,740 cycles:HG                   #   2.064 GHz                     ( +-  0.30% ) [44.28%]
        1,088,827,212 stalled-cycles-frontend:HG  #  51.76% frontend cycles idle    ( +-  0.47% ) [44.56%]
          697,438,071 stalled-cycles-backend:HG   #  33.15% backend cycles idle     ( +-  0.41% ) [44.76%]
        3,305,631,646 instructions:HG             #   1.57  insns per cycle
                                                  #   0.33  stalled cycles per insn ( +-  0.21% ) [55.94%]
          675,175,757 branches:HG                 # 662.599 M/sec                   ( +-  0.16% ) [56.02%]
              656,205 branch-misses:HG            #   0.10% of all branches         ( +-  0.98% ) [55.93%]
          475,532,976 L1-dcache-loads:HG          # 466.675 M/sec                   ( +-  0.13% ) [55.69%]
           19,430,992 L1-dcache-load-misses:HG    #   4.09% of all L1-dcache hits   ( +-  0.20% ) [55.49%]
            5,161,624 LLC-loads:HG                #   5.065 M/sec                   ( +-  0.47% ) [44.14%]
      <not supported> LLC-load-misses:HG

          1.020225388 seconds time elapsed                                          ( +-  0.48% )

So remember, -flto is your friend, and failed inlining can have drastic effects on heavily templated code. Use perf stat to find out what is going on.
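
For completeness, the LTO build and measurement would look roughly like this (the answer does not show the exact command lines, so the flags here are an assumption):

    # assumed command lines; adjust file names and flags to your setup
    g++ -std=c++11 -O2 -flto pair.cpp  -o pair
    g++ -std=c++11 -O2 -flto tuple.cpp -o tuple
    perf stat -r 10 -d ./pair
    perf stat -r 10 -d ./tuple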


milianw's answer did not address -O0 vs. -O2, so I would like to add an explanation for that.

It is expected that std::tuple will be slower than std::pair when not optimized, because it is a more complicated object. A pair has exactly two members, so its methods are straightforward to define. But a tuple has an arbitrary number of members, and the only way to iterate over a template argument list is with recursion. Hence most functions for tuple handle one member and then recurse to handle the rest, so for a 2-tuple you end up with roughly twice as many function calls.
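
To illustrate, here is a rough sketch of that recursion with a toy tuple (an illustration of the technique only, not the actual std::tuple implementation):

    // Toy recursive "tuple": each element adds one level of inheritance and
    // one extra constructor call when nothing is inlined (i.e. at -O0).
    template <typename... Ts>
    struct toy_tuple {};                                 // empty tuple: base case

    template <typename T, typename... Rest>
    struct toy_tuple<T, Rest...> : toy_tuple<Rest...>    // tail stored in the base
    {
        T head;

        toy_tuple() = default;

        toy_tuple(const T& h, const Rest&... rest)
            : toy_tuple<Rest...>(rest...),               // recurse over the remaining members
              head(h)
        {
        }
    };

    int main()
    {
        // At -O0 this runs two constructor bodies (one per element), whereas
        // a pair<int, int> runs just one.
        toy_tuple<int, int> t(1, 2);
        (void)t;
    }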

Now, when they are optimized, the compiler will inline that recursion and there should be no significant difference, which the tests clearly confirm. This applies to heavily templated code in general: templates can be written to provide abstraction with zero or near-zero runtime overhead, but that relies on the optimizer inlining all the trivial functions.
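
As an extra sanity check, not part of the original answer: on common implementations such as libstdc++, a 2-tuple of ints and a pair of ints occupy the same storage, so the unoptimized gap comes from the extra calls, not from the objects' layout. A minimal check, assuming such an implementation:

    #include <tuple>
    #include <utility>

    // Assumption: on common standard libraries (e.g. libstdc++) these two
    // types have the same size; the check fails to compile where they don't.
    static_assert(sizeof(std::tuple<int, int>) == sizeof(std::pair<int, int>),
                  "2-tuple of ints and pair of ints use the same storage here");

    int main() {}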



