What can I do to speed up this code (string similarity)?

Question

What can I do to speed up this code (string similarity)?

This is code written in C ++ using standard libraries to find String similarity of string S with each of these suffixes.

Although it gives the correct conclusion, large lines require a lot of time. Here is the code:

#include <iostream> #include <string> using namespace std; int sim(string a, string b){ int count=0; int sa=a.size(); int sb=b.size(); int iter; if(sa>sb) iter=sb; else iter=sa; for(int i=0; i<iter; i++){ if (a[i]!=b[i]) break; else count++; } return count; } int strsim(string a){ int sum=0; int s=a.size(); for(int i=0; i<s; i++){ sum=sum+sim(a,a.substr(i)); } return sum; } int main(){ int n; cin >> n; string a[n]; for(int i=0; i<n; i++){ cin >> a[i]; } for(int i=0; i<n; i++){ cout << strsim(a[i]) << '\n'; } }

Restrictions: The length of each line is not more than 100,000 and contains only lowercase characters, and the number of test boxes "n" cannot exceed 10.

I / O Example:

Input:

 1 ababaa

Output:

ie, 6 + 0 + 3 + 0 + 1 + 1 = 11

+6

c ++ string

Ranveer Jul 25 '13 at 18:06

source share

5 answers

The biggest cost you have here is that you pass lines by value - this means that each call to "sim" creates two new lines and copies of them. You must eliminate this and replace it with a step-by-step link.

 #include <iostream> #include <string> #include <vector> using namespace std; size_t compareSubstring(const std::string& str, size_t offset) { size_t count = 0; std::string::const_iterator lhsIt = str.begin(); std::string::const_iterator rhsIt = str.begin() + offset; for ( ; rhsIt != str.end() ; ++lhsIt, ++rhsIt ) { if (*lhsIt != *rhsIt) break; ++count; } return count; } int compareString(const string& str) { size_t count = 0; const size_t length = str.size(); for(size_t i = 0; i < length; ++i) { count += compareSubstring(str, i); } return count; } int main() { size_t numStrings = 0; std::cin >> numStrings; std::vector<std::string> strings; strings.resize(numStrings); for(size_t i = 0; i < numStrings; ++i) { std::cin >> strings[i]; } for(size_t i = 0; i < numStrings; ++i) { std::cout << compareString(strings[i]) << std::endl; } }

+3

kfsone Jul 25 '13 at 18:34

source share

This is as effective as you can do it (without moving to the FFT territory):

sum_i = 0 ^ j sum_j = 0 ^ s f_i, j

 int strsim(const string &a){ int s=a.size(); int sum=0; for(int i=0; i<s; i++){ for (int j=i;j<s;++j){ if (a[ji]!=a[i]) break; sum ++; } } return sum; }

I know that it is different from your code, but (if I don't have an error), it should return the same.

Give it a try and let me know!

0

rabensky Jul 25 '13 at 19:23

source share

First you have to check if there is a better algorithm. Then, depending on how much you are willing to invest and lose portability, you may need to enter the vector code manually if the compiler was unable to do this. Using gcc intrinsics (see for example This tutorial ) I managed to get x6 acceleration by the same code.

0

hivert Jul 25 '13 at 19:42

source share

Adding another line to your strsim () function reduces some of the additional functional calls to your sim () function. As we know, the cost of a function call (consumes additional memory and processing for the prolog and epilogue mechanisms) is quite significant when we talk about performance.

Thus, the sim call function only works if the first character of the two lines in your case, "a" and "a.substring" are equal.

  int strsim(string a){ int sum=0; int s=a.size(); for(int i=0; i<s; i++){ if(a[i] == a[0]) //add this extra line sum=sum+sim(a,a.substr(i)); } return sum; }

-1

Mahesh Jul 25 '13 at 19:01

source share

nims · Accepted Answer · 2013-07-25T18:22:37+0000

Your current code computes for one line for length L, in O(L^3) (substr takes linear runtime). Not to mention the high fixed costs of the aforementioned complexity due to inefficient passage around the lines.

Your algorithm can simply be reduced to searching for the longest common line prefixes with all its suffixes. This can be easily done using Suffix Aray . This concept cannot be explained as an answer, so I highly recommend that you read this .

Suffix Array's optimal and easy-to-code solution is O(Llg^2(L)) (L = line length) build time and O(1) time to request the longest common prefix of 2 suffixes using the Minimum range request . Note that the entire line itself is its own suffix. In your case, you need to make L queries for each row. Thus, the total complexity for one line will be O(Llg^2(L)) + O(L) .

If you want to improve even further, you can reduce the construction time to O(Llg(L)) using radix sorting or to O(L) ( Read )

What can I do to speed up this code (string similarity)?

More articles: