Approximate string list search

So yes, I read about how edit distance can be used to decide how close two strings are to each other. This algorithm, implemented with dynamic programming, takes O(mn) time, where m and n are the lengths of the text and the pattern, respectively. So if I need to match one string against 5,000-odd other strings, it will take far too long, which is simply unacceptable in my application. Is there a faster approach I could implement? I do not mind trading space for time.
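For reference, this is the O(mn) dynamic program being described (the classic Wagner–Fischer algorithm); a minimal Java sketch, using two rolling rows to keep memory at O(n):

```java
// Classic Wagner-Fischer dynamic programming for Levenshtein (edit) distance.
// Runs in O(m*n) time; two rolling rows keep the space at O(n).
public class EditDistance {
    public static int distance(String a, String b) {
        int m = a.length(), n = b.length();
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int j = 0; j <= n; j++) prev[j] = j; // distance from the empty prefix
        for (int i = 1; i <= m; i++) {
            curr[0] = i;
            for (int j = 1; j <= n; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // match/substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp; // roll the rows
        }
        return prev[n];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```

Running this once per dictionary entry is exactly the cost the question complains about: 5,000 entries means 5,000 full tables.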

I saw an application called "Swype" on Android that does something similar. It matches your input against its own dictionary and suggests results. How does it work so fast?

Note: do not suggest frameworks like Lucene, because they will not run on J2ME.

+6
4 answers

splix's answer is good. As another option (for very large sets of strings), you might consider an n-gram representation:

http://en.wikipedia.org/wiki/N-gram

They are used for approximate pattern matching in several database packages because they are fast and easy to implement with common indexing methods.
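A minimal sketch of the idea with trigrams (the class name, padding scheme, and threshold are illustrative choices, not from any particular library): index every dictionary string under each of its trigrams, then fetch only the strings that share enough trigrams with the query, and run the expensive edit-distance check on that small candidate set.

```java
import java.util.*;

// Minimal trigram inverted index (a sketch, not a production library):
// candidates are fetched by shared trigrams, so only a small subset of the
// dictionary needs an exact edit-distance check afterwards.
public class TrigramIndex {
    private final Map<String, List<String>> index = new HashMap<>();

    private static List<String> trigrams(String s) {
        String padded = "  " + s + "  "; // pad so short strings still yield trigrams
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= padded.length(); i++)
            grams.add(padded.substring(i, i + 3));
        return grams;
    }

    public void add(String s) {
        for (String g : new HashSet<>(trigrams(s))) // dedupe repeated trigrams
            index.computeIfAbsent(g, k -> new ArrayList<>()).add(s);
    }

    // Return dictionary strings sharing at least minShared trigrams with the query.
    public Set<String> candidates(String query, int minShared) {
        Map<String, Integer> counts = new HashMap<>();
        for (String g : new HashSet<>(trigrams(query)))
            for (String s : index.getOrDefault(g, Collections.emptyList()))
                counts.merge(s, 1, Integer::sum);
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= minShared) result.add(e.getKey());
        return result;
    }

    public static void main(String[] args) {
        TrigramIndex idx = new TrigramIndex();
        idx.add("apple"); idx.add("apply"); idx.add("banana");
        // "appel" shares trigrams with "apple" and "apply" but not "banana"
        System.out.println(idx.candidates("appel", 2));
    }
}
```

The index costs extra memory, which matches the asker's willingness to trade space for time.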

+2

We used the Aho–Corasick string matching algorithm (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) for almost the same task, and it worked well for us.

There are several Java implementations; you can find them online.
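A compact sketch of such an implementation (it uses Java SE collections, so it would need adapting for J2ME): build a trie over all patterns, wire up failure links with a BFS, then scan the text once, reporting every pattern occurrence.

```java
import java.util.*;

// Compact Aho-Corasick automaton: O(total pattern length) to build,
// then one pass over the text finds all occurrences of all patterns.
public class AhoCorasick {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;
        List<String> out = new ArrayList<>(); // patterns ending at this node
    }

    private final Node root = new Node();

    public void addPattern(String p) {
        Node cur = root;
        for (char c : p.toCharArray())
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        cur.out.add(p);
    }

    public void build() {
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) { child.fail = root; queue.add(child); }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey(); Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                child.out.addAll(child.fail.out); // inherit matches via suffix link
                queue.add(child);
            }
        }
    }

    public List<String> search(String text) {
        List<String> found = new ArrayList<>();
        Node cur = root;
        for (char c : text.toCharArray()) {
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            cur = cur.next.getOrDefault(c, root);
            found.addAll(cur.out);
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick();
        ac.addPattern("he"); ac.addPattern("she"); ac.addPattern("his");
        ac.build();
        System.out.println(ac.search("ushers")); // [she, he]
    }
}
```

Note that Aho–Corasick finds exact occurrences of many patterns at once; it does not by itself give approximate matches, so it fits the "match against 5,000 strings" part of the problem rather than the edit-distance part.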

P.S. You can also look at other string matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm

+1

It really depends on the texts you are comparing. Below are two ways to speed up matching while keeping the original edit distance.

We once had a similar task in which we matched a short sequence of words (about 10–30 characters) against a dictionary of more than 300,000 short sentences (also 10–30 characters each). In that case, the following approach saved us a lot of time:

  • sort the dictionary of target strings (this needs to be done only once)
  • when you build the n×m table for string i, you can reuse the table from string i-1, since after sorting most neighbouring strings share a prefix.

E.g. if you have the string "list of strings" followed by "list of words", the two share their first 8 characters, so you can reuse the first 8 rows of your table and only recompute 5. This saved us up to 70–80% of the runtime with minor changes to our code.
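A sketch of the row-reuse trick described above (the class and method names are my own): sort the dictionary, keep the full DP table between entries, and for each entry recompute only the rows beyond the prefix it shares with the previous entry.

```java
import java.util.*;

// Sketch of the row-reuse trick: the dictionary is sorted, so consecutive
// entries share long prefixes, and the DP rows computed for that shared
// prefix on the previous entry are still valid and can be skipped.
public class PrefixReuseMatcher {
    // Edit distance from the query to every dictionary word (in sorted order).
    public static int[] distances(String query, List<String> dict) {
        List<String> sorted = new ArrayList<>(dict);
        Collections.sort(sorted); // neighbours now share prefixes
        int n = query.length();
        int maxLen = 0;
        for (String w : sorted) maxLen = Math.max(maxLen, w.length());
        int[][] table = new int[maxLen + 1][n + 1];
        for (int j = 0; j <= n; j++) table[0][j] = j;
        int[] result = new int[sorted.size()];
        String prev = "";
        for (int w = 0; w < sorted.size(); w++) {
            String word = sorted.get(w);
            int common = commonPrefix(prev, word);
            for (int i = common + 1; i <= word.length(); i++) { // only new rows
                table[i][0] = i;
                for (int j = 1; j <= n; j++) {
                    int cost = (word.charAt(i - 1) == query.charAt(j - 1)) ? 0 : 1;
                    table[i][j] = Math.min(Math.min(table[i][j - 1] + 1,
                                                    table[i - 1][j] + 1),
                                           table[i - 1][j - 1] + cost);
                }
            }
            result[w] = table[word.length()][n];
            prev = word;
        }
        return result;
    }

    private static int commonPrefix(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        int[] d = distances("list of words",
                Arrays.asList("list of strings", "list of words"));
        System.out.println(Arrays.toString(d)); // [5, 0]
    }
}
```

For the second entry only 5 rows are recomputed, matching the 70–80% savings the answer reports on a sorted dictionary.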

If instead you have a few long texts, the first approach will not help. But in that case you expect only a few entries to have a small edit distance, while all the others have a huge one. Since the n×m table is monotonic in a useful sense (the minimum of each row never decreases from row to row, and likewise for columns), you can stop the calculation as soon as that minimum exceeds a given threshold. You can even save the intermediate results and restart the calculation (with a higher limit) if you do not find a match within your initial threshold.
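A sketch of that cutoff (names are illustrative): track the minimum of the current row and abandon the computation once it exceeds the threshold, since no later row can come back below it.

```java
// Sketch of the threshold cutoff: DP cell values never drop below the minimum
// of an earlier row, so once a whole row exceeds the threshold, the final
// distance must exceed it too and we can stop early.
public class BoundedEditDistance {
    // Returns the edit distance, or -1 if it provably exceeds maxDist.
    public static int distanceWithin(String a, String b, int maxDist) {
        int m = a.length(), n = b.length();
        if (Math.abs(m - n) > maxDist) return -1; // lengths alone rule it out
        int[] prev = new int[n + 1], curr = new int[n + 1];
        for (int j = 0; j <= n; j++) prev[j] = j;
        for (int i = 1; i <= m; i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= n; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            if (rowMin > maxDist) return -1; // every completion only grows
            int[] t = prev; prev = curr; curr = t;
        }
        return prev[n] <= maxDist ? prev[n] : -1;
    }

    public static void main(String[] args) {
        System.out.println(distanceWithin("kitten", "sitting", 3)); // 3
        System.out.println(distanceWithin("kitten", "algorithm", 3)); // -1
    }
}
```

In the restart scheme the answer mentions, a -1 result would trigger a retry with a larger maxDist, ideally resuming from the saved rows rather than from scratch.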

+1

It is also a question of how you define "close". If you do not insist on exact spelling and a phonetic match would work too, I can suggest Soundex. It is a very fast algorithm for checking whether two words are phonetically close.
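For illustration, a sketch of the classic (American) Soundex encoding: each word maps to a letter plus three digits, so phonetically similar English words collide, and comparing two words costs only two cheap encodings plus a string equality check.

```java
// Classic (American) Soundex: maps a word to a letter plus three digits so
// that similar-sounding English words get the same code, e.g. "Robert" and
// "Rupert" both encode to R163. Comparing codes is O(word length).
public class Soundex {
    // Digit for each letter a..z; '0' marks vowels and h/w/y, which are dropped.
    private static final String CODES = "01230120022455012623010202";

    public static String encode(String word) {
        String s = word.toUpperCase().replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        StringBuilder out = new StringBuilder();
        out.append(s.charAt(0));
        char prevCode = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char code = CODES.charAt(s.charAt(i) - 'A');
            if (code != '0' && code != prevCode) out.append(code);
            // H and W are "transparent": same codes separated by them still merge.
            if (code != '0' || (s.charAt(i) != 'H' && s.charAt(i) != 'W'))
                prevCode = code;
        }
        while (out.length() < 4) out.append('0'); // pad to letter + 3 digits
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert")); // R163
        System.out.println(encode("Rupert")); // R163
    }
}
```

Indexing the dictionary by Soundex code turns the approximate lookup into a hash lookup, at the cost of English-specific and fairly coarse notions of similarity.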

0

Source: https://habr.com/ru/post/891422/
