Optimizing `std::vector operator[]` (vector access) when it becomes a bottleneck

gprof says my high-performance application spends 53% of its time inside `std::vector<...>::operator[](unsigned long)`, 32% of which goes to one heavily used vector. Even worse, I suspect my parallel code fails to scale beyond 3-6 cores because of the associated memory bottleneck. While my application does spend a lot of time accessing and writing memory, it seems I should be able to (or at least try to) do better than 53%. Should I try dynamic arrays instead (the size stays constant in most cases)? Might that help with the likely bottlenecks?

Actually, my preferred solution would be to fix the bottleneck and keep the vectors for their convenience. Given the above, are there any likely culprits or solutions (tcmalloc is not in use)?

+6
5 answers

Have you looked at your memory access pattern itself? It may be inefficient and cache-unfriendly.
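
For instance (a hypothetical sketch of what "cache-unfriendly" can mean, not from the original answer): if a vector backs a 2D grid stored row-major, the traversal order alone can make or break cache behavior.

    #include <cstddef>
    #include <vector>

    // Hypothetical example: a 2D grid stored row-major in one vector.
    void scale_grid(std::vector<double>& grid, std::size_t rows, std::size_t cols)
    {
        // Cache-friendly: consecutive iterations touch consecutive addresses.
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                grid[r * cols + c] *= 2.0;
    }

    void scale_grid_slow(std::vector<double>& grid, std::size_t rows, std::size_t cols)
    {
        // Cache-unfriendly: each iteration jumps a whole row ahead, so for
        // large grids almost every access can miss the cache.
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                grid[r * cols + c] *= 2.0;
    }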

+3

Have you tried using a raw pointer when accessing the array?

    // Typical usage:
    for (int i = 0; i < arr.size(); ++i)
        wcout << arr[i];

    // In the bottleneck, access through a raw pointer instead:
    int *pArr = &arr.front();
    for (int i = 0; i < arr.size(); ++i)
        wcout << pArr[i];
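
Worth noting: with optimizations enabled, operator[] is normally inlined down to exactly this pointer arithmetic, so the raw-pointer form tends to pay off mainly in unoptimized builds or when checked iterators (e.g. libstdc++'s _GLIBCXX_DEBUG mode) are active. Hoisting arr.size() out of the loop condition can be a similar cheap win.
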
+2

I suspect gprof prevents function inlining. Try a different profiling method. `std::vector operator []` is unlikely to be the bottleneck, because it is hardly different from raw array access. Here, for example, is SGI's implementation:

    reference operator[](size_type __n) { return *(begin() + __n); }
    iterator begin() { return _M_start; }
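
If you want to check this on your own toolchain, here is a minimal benchmark (a hypothetical sketch, not part of the answer); built with optimizations, the two loops usually take the same time:

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main()
    {
        std::vector<long long> v(1 << 22);
        std::iota(v.begin(), v.end(), 0LL);

        auto t0 = std::chrono::steady_clock::now();
        long long sum1 = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            sum1 += v[i];                      // through operator[]
        auto t1 = std::chrono::steady_clock::now();

        long long sum2 = 0;
        const long long* p = v.data();
        for (std::size_t i = 0; i < v.size(); ++i)
            sum2 += p[i];                      // through a raw pointer
        auto t2 = std::chrono::steady_clock::now();

        auto us = [](auto d) {
            return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
        };
        std::printf("operator[]: %lld us, raw pointer: %lld us (sums: %lld %lld)\n",
                    (long long)us(t1 - t0), (long long)us(t2 - t1), sum1, sum2);
    }
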
+2

You cannot trust gprof for profiling high-speed code; you should use a passive (sampling) profiler such as oprofile instead to get a true picture.

Alternatively, you can profile by modifying the code by hand (for example, run the computation 10 times instead of once and check how much the execution time grows). Note, however, that this will be affected by cache effects, so YMMV.
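
A minimal sketch of that manual method, with std::chrono doing the timing (suspect_computation and time_repeats are hypothetical stand-ins for your own code):

    #include <chrono>
    #include <cstdio>

    void suspect_computation();   // stand-in for the code under suspicion

    void time_repeats(int repeats)
    {
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < repeats; ++i)
            suspect_computation();
        auto stop = std::chrono::steady_clock::now();

        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        // Caveat from above: repeated runs keep data hot in the cache, so the
        // per-run figure can look better than it would in real use.
        std::printf("%d runs: %.3f ms total, %.3f ms per run\n",
                    repeats, ms, ms / repeats);
    }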

+1

The vector class is well liked and provides a certain amount of convenience at the expense of performance, which is fine when you do not really need performance.

If you really do need performance, it won't hurt too much to bypass the vector class and drop down to a plain old hand-made array, whether statically or dynamically allocated. Then 1) the time currently spent on indexing should largely disappear, speeding up your application by that amount, and 2) you can move on to whatever the "next big thing" is that takes time in your application.
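
As a rough sketch of what such a replacement can look like when the size is fixed after construction (FixedArray is a hypothetical name; a std::unique_ptr is assumed to keep ownership safe without per-access overhead):

    #include <cstddef>
    #include <memory>

    // Hypothetical replacement for a std::vector whose size never changes:
    // one allocation up front, no capacity bookkeeping, direct indexing.
    struct FixedArray {
        std::size_t n;
        std::unique_ptr<double[]> data;

        explicit FixedArray(std::size_t count)
            : n(count), data(new double[count]()) {}  // value-initialized to 0

        double&       operator[](std::size_t i)       { return data[i]; }
        const double& operator[](std::size_t i) const { return data[i]; }
        std::size_t   size() const                    { return n; }
    };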

EDIT: Most programs have much more room for speedup than you might expect. To illustrate this, I worked through a small end-to-end project. Summarized very briefly, it went like this:

  • The original time was 2.7 ms per "job" (the number of "jobs" can be varied to get enough running time to analyze).

  • The first pass showed that approximately 60% of the time was spent in vector operations, including indexing, adding, and removing. I substituted a similar array class from MFC, and the time dropped to 1.8 ms per job. (That is a speedup of 1.5x, or 50%.)

  • Even with that array class, approximately 40% of the time was spent in the indexing operator []. I saw that it could index directly, so I changed it to index directly rather than through the operator function. This reduced the time to 1.5 ms per job, a 1.2x speedup.

  • Now about 60% of the time went into adding/removing elements in arrays. An additional share was spent in "new" and "delete". I decided to drop the arrays and do two things: use do-it-yourself linked lists, and pool used objects (a sketch of such a pool follows this list). The first cut the time to 1.3 ms (1.15x). The second cut it to 0.44 ms (2.95x).

  • At that point, I found that about 60% of the time was in code I had written to index into the list (as if it were an array). I decided that could be done instead just by moving a pointer through the list. Result: 0.14 ms (3.14x).

  • Now I found that nearly all the time was going into a line of diagnostic I/O I was printing to the console. I decided to get rid of it: 0.0037 ms (38x).
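
For the "pool used objects" step referenced above, here is a minimal free-list pool in that spirit (Node and NodePool are hypothetical names; single-threaded use is assumed):

    #include <cstddef>

    // Freed nodes are chained into a free list and reused, instead of
    // going back through new/delete on every add/remove.
    struct Node {
        double value;
        Node*  next;   // doubles as the free-list link while pooled
    };

    class NodePool {
        Node* free_list = nullptr;
    public:
        Node* acquire() {
            if (free_list) {               // reuse a previously released node
                Node* n = free_list;
                free_list = n->next;
                return n;
            }
            return new Node{};             // fall back to the heap
        }
        void release(Node* n) {            // no delete: push onto the free list
            n->next = free_list;
            free_list = n;
        }
        ~NodePool() {                      // actually free everything at the end
            while (free_list) {
                Node* n = free_list;
                free_list = n->next;
                delete n;
            }
        }
    };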

I could have kept going, but I stopped. The total time per job had been reduced by a compounded factor of about 700x.

The point I want you to take away is that if you need performance badly enough to deviate from what might be considered the accepted way of doing things, you don't have to stop after one bottleneck. Just because you got a big speedup does not mean there are no more. In fact, the next bottleneck may be bigger than the first in terms of speedup factor. So raise your expectations of the speedup you can get, and go for broke.

0

Source: https://habr.com/ru/post/902434/

