It really depends on the strings you are comparing. Below are two ways to speed things up while still computing the plain edit distance.
We once had the same task, matching a short string of words (about 10-30 characters) against a dictionary of more than 300,000 short sentences (also 10-30 characters each). In that case, the following approach saved us a lot of time:
- sort the dictionary of target strings (this needs to be done only once)
- when you build the n * m table for dictionary entry i, you can reuse the table from entry i-1, since consecutive entries share most of their prefix (see the sketch after the example below)
For example, if one entry is "list of strings" and the next is "list of words", the two share their first 8 characters ("list of "), so you can reuse the first 8 rows of the table and only recompute the 5 rows for "words". This saved us 70-80% of the runtime with only minor changes to our code.
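Here is a minimal Python sketch of that idea (the function names `edit_distances_sorted` and `common_prefix_len` and the test data are mine, not from our original code):

```python
def common_prefix_len(a, b):
    """Length of the prefix shared by two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def edit_distances_sorted(query, sorted_dictionary):
    """Edit distance from `query` to every entry of an already-sorted
    dictionary, reusing the DP rows that cover the prefix shared with
    the previous entry."""
    m = len(query)
    rows = [list(range(m + 1))]  # row 0: distances from the empty prefix
    prev_word = ""
    results = []
    for word in sorted_dictionary:
        keep = common_prefix_len(prev_word, word)
        del rows[keep + 1:]               # rows 0..keep are still valid
        for i in range(keep, len(word)):  # rebuild only the changed rows
            above = rows[-1]
            cur = [i + 1]
            for j in range(1, m + 1):
                cost = 0 if word[i] == query[j - 1] else 1
                cur.append(min(above[j] + 1,          # deletion
                               cur[j - 1] + 1,        # insertion
                               above[j - 1] + cost))  # substitution
            rows.append(cur)
        results.append((word, rows[-1][m]))
        prev_word = word
    return results


words = sorted(["list of strings", "list of words", "lists of wards"])
print(edit_distances_sorted("list of stings", words))
```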
If instead you have a few long texts, this first approach will not help. In that case, however, you typically expect only a few entries to have a small edit distance, while all the others are far away. Since the n * m table is monotone in each direction (the minimum of each row never decreases from one row to the next, and likewise for the columns), you can stop the computation as soon as you exceed a given threshold. You can even save the intermediate results and restart the computation (with a higher threshold) if you do not find a match within your initial threshold. A sketch of such a bounded computation is shown below.
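A minimal sketch of the early-termination idea, assuming a single threshold `max_dist` (the name `bounded_levenshtein` is mine, and the bookkeeping for saving rows and restarting with a higher threshold is left out):

```python
def bounded_levenshtein(a, b, max_dist):
    """Edit distance between `a` and `b`, or None once it is known to
    exceed `max_dist`. Because the minimum of each DP row can only grow
    from one row to the next, we may stop as soon as it passes the bound."""
    if abs(len(a) - len(b)) > max_dist:
        return None                      # length gap alone exceeds the bound
    m = len(b)
    prev = list(range(m + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        if min(cur) > max_dist:          # every later row is at least as large
            return None
        prev = cur
    return prev[m] if prev[m] <= max_dist else None
```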