I profiled my program and found out that the hottest point was levenshtein_distance , called recursively. I decided to try and optimize it.
lvh_distance levenshtein_distance( const std::string & s1, const std::string & s2 ) { const size_t len1 = s1.size(), len2 = s2.size(); std::vector<unsigned int> col( len2+1 ), prevCol( len2+1 ); const size_t prevColSize = prevCol.size(); for( unsigned int i = 0; i < prevColSize; i++ ) prevCol[i] = i; for( unsigned int i = 0, j; i < len1; ++i ) { col[0] = i+1; const char s1i = s1[i]; for( j = 0; j < len2; ++j ) { const auto minPrev = 1 + std::min( col[j], prevCol[1 + j] ); col[j+1] = std::min( minPrev, prevCol[j] + ( static_cast<unsigned int>( s1i != s2[j] ) ) ); } col.swap( prevCol ); } return prevCol[len2]; }
TL DR: I changed std::string β std::array
War story: And after running vtune on it, I found that the line updating col[j+1] was the one who slowed everything down (90% of the time spent on it). I thought: βWell, maybe this is a smoothing problem, maybe the compiler cannot determine that the character arrays inside the string objects are unexplored, because they are masked by the string interface and spend 90% of their time checking that no other part of the program has changed them.
So, I changed my string to a static array because there is no dynamic memory there, and the next step would be to use restrict to help the compiler. But in the meantime, I decided to check if I achieved any performance by doing this.
lvh_distance levenshtein_distance( const std::string & s1, const std::string & s2 ) { const size_t len1 = s1.size(), len2 = s2.size(); static constexpr unsigned MAX_STRING_SIZE = 512; assert(len1 < MAX_STRING_SIZE && len2 < MAX_STRING_SIZE); static std::array<unsigned int, MAX_STRING_SIZE> col, prevCol; for( unsigned int i = 0; i < len2+1; ++i ) prevCol[i] = i;
TL DR : now it is slow.
It so happened that I lost performance. A lot of. Instead of working after ~ 6 seconds, my sample program now works after 44 seconds. Reusing vtune for a profile shows that the function is being called again and again: std::swap (for you, gcc, this is the /move.h bit), which in turn is called std::swap_ranges (bits / stl_algobase. H) .
I assume that std::min is implemented using quicksort , which explains why swapping occurs, but I donβt understand why the exchange in this case takes so long.
EDIT : Compiler Options: I am using gcc with the parameters "-O2 -g -DNDEBUG" and a set of warning specifications.