Using SSE to speed up lower_bound


In the project I'm currently working on, I often need to find the smallest index in a sorted array at which an element can be inserted (what std::lower_bound does in C++). Using SSE to speed this up looks attractive to me, since I work with uint32 arrays whose size is usually that of a processor cache line. I have never used SSE instructions before, so I can't figure out what an SSE implementation of this function would look like. Please give me some tips on writing it optimally with SSE.

1 answer

Something like std::lower_bound is not going to scale well using SSE. SSE makes things faster because it lets you do several calculations at once. For example, a single SSE instruction can perform 4 multiplications at the same time. However, the way std::lower_bound works cannot be parallelized, since each step of the algorithm needs the result of the comparison from the previous step. In addition, it is O(log n), and as a result it is unlikely to be a bottleneck.
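To make the dependency between steps concrete, here is a minimal sketch of a scalar binary search equivalent to std::lower_bound for uint32 arrays. The name scalar_lower_bound is hypothetical and only for illustration; real standard library implementations may differ in detail.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical hand-written equivalent of std::lower_bound for a sorted
// array of uint32_t. Returns the index of the first element >= key.
inline std::size_t scalar_lower_bound(const std::uint32_t* data,
                                      std::size_t n,
                                      std::uint32_t key) {
    std::size_t first = 0;  // start of the range still under consideration
    std::size_t count = n;  // number of elements in that range
    while (count > 0) {
        const std::size_t half = count / 2;
        // Which half is examined next depends on this comparison, so the
        // next load cannot be issued until this one has been resolved.
        if (data[first + half] < key) {
            first += half + 1;
            count -= half + 1;
        } else {
            count = half;
        }
    }
    return first;
}
```

For a 16-element array the loop runs at most 5 times, each iteration waiting on the comparison from the one before.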

Also, before moving on to inline assembly, you should be aware that whenever you use inline assembly, you lose most of the compiler optimizations that could otherwise be applied to that section of your program, and often your program will end up slower. Compilers usually write better assembly than we humans do.

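If you do want to use SSE, use intrinsics rather than inline assembly: compiler-provided functions that map to SSE instructions. Both Microsoft Visual C++ and the GNU compiler support them.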
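As an illustration of what intrinsics look like, here is a hedged sketch for the cache-line case from the question (16 uint32 values). Note that it swaps in a different technique: instead of a binary search, it compares every element against the key and counts how many are smaller, which for a sorted array gives the same index as std::lower_bound. The name lower_bound16 and the HAVE_SSE2 macro are hypothetical. The intrinsics come from <emmintrin.h> (SSE2), and a plain std::lower_bound fallback is used where SSE2 is not available. Whether this is actually faster would have to be measured.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1     // hypothetical macro, used only in this sketch
#endif

// Hypothetical helper: index of the first element >= key in a sorted block
// of exactly 16 uint32_t values (one 64-byte cache line).
inline std::size_t lower_bound16(const std::uint32_t* block, std::uint32_t key) {
#ifdef HAVE_SSE2
    // SSE2 only has signed 32-bit comparisons. Flipping the top bit of both
    // operands maps unsigned ordering onto signed ordering.
    const __m128i bias = _mm_set1_epi32(std::numeric_limits<int>::min());
    const __m128i k = _mm_xor_si128(_mm_set1_epi32(static_cast<int>(key)), bias);
    __m128i count = _mm_setzero_si128();
    for (int i = 0; i < 16; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(block + i));
        v = _mm_xor_si128(v, bias);
        // Lanes where key > element become 0xFFFFFFFF, i.e. -1.
        const __m128i less = _mm_cmpgt_epi32(k, v);
        // Subtracting -1 adds 1 to the per-lane counter.
        count = _mm_sub_epi32(count, less);
    }
    alignas(16) std::uint32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), count);
    // The number of elements smaller than key is the lower_bound index.
    return static_cast<std::size_t>(lanes[0]) + lanes[1] + lanes[2] + lanes[3];
#else
    // Portable fallback when SSE2 is not available.
    return static_cast<std::size_t>(std::lower_bound(block, block + 16, key) - block);
#endif
}
```

The four comparisons in the loop are independent of each other, unlike the steps of a binary search, which is what makes this form suitable for SIMD.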

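As for std::lower_bound with SSE: if you are calling lower_bound to find the position for each element you insert, you are effectively performing an insertion sort. If you can add all the elements first and sort afterwards, that costs O(n lg n) in total. Alternatively, consider a container such as std::set, where an insertion is O(lg n), instead of the O(n + lg n) needed to search and re-insert into a sorted array.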
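The options mentioned here (lower_bound before every insert, a single O(n lg n) sort, and std::set) can be sketched as follows. The function names sorted_insert, build_sorted and build_set are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: insert x into an already sorted vector, keeping it
// sorted. The search is O(lg n), but shifting the tail makes it O(n + lg n).
inline void sorted_insert(std::vector<std::uint32_t>& v, std::uint32_t x) {
    const auto pos = std::lower_bound(v.begin(), v.end(), x);
    v.insert(pos, x);
}

// Hypothetical helper: copy everything first, then sort once, O(n lg n) total.
inline std::vector<std::uint32_t> build_sorted(const std::vector<std::uint32_t>& input) {
    std::vector<std::uint32_t> v(input);
    std::sort(v.begin(), v.end());
    return v;
}

// Hypothetical helper: std::set keeps its elements ordered with O(lg n)
// insertion. Note that it drops duplicates; std::multiset would keep them.
inline std::set<std::uint32_t> build_set(const std::vector<std::uint32_t>& input) {
    std::set<std::uint32_t> s;
    for (const std::uint32_t x : input) {
        s.insert(x);
    }
    return s;
}
```

For arrays as small as a single cache line the asymptotic difference may not matter much, so it is worth profiling before switching containers.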



Source: https://habr.com/ru/post/1787190/