Nothing like std::lower_bound will scale well using SSE. SSE makes things faster because it lets you do multiple calculations at once; for example, a single SSE instruction can perform 4 multiplications simultaneously. std::lower_bound, however, cannot be parallelized that way, since each step of the algorithm depends on the result of the comparison made in the previous step. In addition, it is O(log n), so it is unlikely to be the bottleneck.
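To make the dependency concrete, here is a minimal sketch of a binary search written out by hand. The name lower_bound_index is hypothetical and exists only to expose the loop; it is not how any particular standard library implements std::lower_bound, and in real code you would simply call std::lower_bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, written out only to show the loop structure.
// Returns the index of the first element that is not less than value.
std::size_t lower_bound_index(const std::vector<int>& sorted, int value) {
    // Precondition: the input must already be sorted.
    assert(std::is_sorted(sorted.begin(), sorted.end()));

    std::size_t first = 0;
    std::size_t count = sorted.size();
    while (count > 0) {
        std::size_t half = count / 2;
        std::size_t mid = first + half;
        // The next values of first and count depend on this comparison,
        // so the next iteration cannot start before this one has finished.
        if (sorted[mid] < value) {
            first = mid + 1;
            count -= half + 1;
        } else {
            count = half;
        }
    }
    return first;
}
```

Every pass through the loop needs the outcome of the previous comparison before it knows which element to look at next, so there are no four independent operations to hand to one SSE instruction.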
Also, before moving on to inline assembly, be aware that whenever you use inline assembly you defeat most of the compiler optimizations that would otherwise apply to that section of your program, and often your program will end up slower - compilers usually write better assembler than we humans do.
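If you do want to use SSE, use intrinsics instead: special "functions" that the compiler turns into SSE instructions. They are available in both Microsoft Visual C++ and the GNU compilers.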
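As a rough illustration of what intrinsics look like, here is a sketch that multiplies two float arrays four elements at a time. The function name multiply_arrays and the EXAMPLE_HAS_SSE macro are made up for this example; the intrinsics themselves (_mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps) come from xmmintrin.h. The preprocessor check is an assumption about how to detect SSE support, and the code falls back to a plain loop when it fails.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps
#define EXAMPLE_HAS_SSE 1
#else
#define EXAMPLE_HAS_SSE 0
#endif

// Hypothetical helper: out[i] = a[i] * b[i] for every i in [0, n).
void multiply_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    std::size_t i = 0;
#if EXAMPLE_HAS_SSE
    // Process 4 floats per iteration; the unaligned load/store variants
    // are used so the pointers need no special alignment.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }
#endif
    // Scalar loop for the leftover elements (or for everything without SSE).
    for (; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
}
```

This works because the four multiplications are independent of each other, which is exactly what a binary search lacks. The compiler still gets to schedule and optimize the surrounding code, unlike with inline assembly.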
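Finally, if std::lower_bound really is your bottleneck, SSE is not the answer. If you are calling lower_bound over and over, for example to implement something like an insertion sort, consider a sorting algorithm that is O(n lg n) overall. Or, if you are keeping data sorted as you go, consider something like std::set, which has O(lg n) insertion rather than O(n + lg n).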
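For comparison, here is a small sketch of the two ways of keeping data sorted as elements arrive. Both function names, insert_sorted and build_set, are hypothetical and exist only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Keep a vector sorted: an O(lg n) search with std::lower_bound,
// followed by an O(n) shift of the elements after the insertion point.
void insert_sorted(std::vector<int>& sorted, int value) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    auto pos = std::lower_bound(sorted.begin(), sorted.end(), value);
    sorted.insert(pos, value);
}

// std::set keeps its elements ordered with O(lg n) insertion.
// Unlike the vector version it drops duplicates; std::multiset keeps them.
std::set<int> build_set(const std::vector<int>& input) {
    std::set<int> result;
    for (int v : input) {
        result.insert(v);
    }
    return result;
}
```

Which one is faster in practice depends on the data size and access pattern, so measure before switching; the point is only that the fix for a slow lower_bound loop is usually a different algorithm or container, not SSE.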