How tokenize a striped line based on a list of patterns

Question

How tokenize a striped line based on a list of patterns

For the string S and the list of L patterns [L ₁ , ..., L _n ], how would you find a list of all tokens in S by matching the pattern in L and so that the maximum number of matching letters in S is maximized?

An example would be S = "thenuke", L = {"the", "then", "nuke"}, and we would like to get ["the", "nuke"], as if we started with matching "then" , we don’t get a solution maximizing the total number of letters in S that match.

I looked at other SO questions, string matching algorithms , but did not find anything to efficiently resolve part of maximizing the problem. This should be studied, for example, in bioinformatics, but I'm not in this area, so any help (including a link to scientific articles) is deeply appreciated!

+4

string algorithm

marchelbling Sep 11 '12 at 20:14

source share

2 answers

You can solve this in time O (| L || S |) using dynamic programming : create an iterative table that gives the best match for each initial substring S = s ₁ s ₂ ... s _n :

B (0), the best match for the original zero-length substring in S is an empty match.
Suppose that we have already calculated the best fit B (i) for each i <k, and we now want to calculate B (k). Let p be a sample in L with length | p |, and let j = k - | p | + 1. If p = s _j ... s _k , then there is a correspondence for s ₁ s ₂ ... s _k , which consists of B (j), followed by p. Let B (k) be the best such match found after considering all the patterns in L.
B (n) is the best match for all S.

+1

Gareth rees Sep 11 '12 at 22:25

source share

Running wild · Accepted Answer · 2012-09-11T23:06:38+0000

This can be resolved in O (| S | + | L | + k) time, where k is the total number of matches of all lines from L to S. There are two main steps:

Launch Aho-Corasick . This will give you all matches of any string from L to S. This is done at the same time as above.
Initialize an array, A, of integers of length | S | + 1 to all zeros. March through the array, in position I set A [i] to A [i-1], if it is greater, then for each match M from L to S in position I set A [i + | M |] in max A [i + | M |] and A [i] + | M |.

Here is the code in Go that does just that. It uses the package I wrote, which has a convenient wrapper for calling Aho-Corasick.

package main import ( "fmt" "github.com/runningwild/stringz" ) func main() { input := []byte("thenuke") patterns := [][]byte{[]byte("hen"), []byte("thenu"), []byte("uke")} // This runs Aho-Corasick on the input string and patterns, it returns a // map, matches, such that matches[i] is a list of indices into input where // patterns[i] matches the input string. The running time of this is // O(|input| + |patterns| + k) and uses O(|patterns| + k) auxillary storage, // where k is the total number of matches found. find := stringz.FindSet(patterns) matches := find.In(input) // We want to invert the map so that it maps from index to all strings that // match at that index. at_pos := make([][]int, len(input)+1) for index, hits := range matches { for _, hit := range hits { at_pos[hit] = append(at_pos[hit], index) } } // Now we do a single pass through the string, at every position we will // keep track of how many characters in the input string we can match up to // that point. max := make([]int, len(input)+1) for i := range max { // If this position isn't as good as the previous position, then we'll use // the previous position. It just means that there is a character that we // were unable to match anything to. if i > 0 && max[i-1] > max[i] { max[i] = max[i-1] } // Look through any string that matches at this position, if its length is // L, then L positions later in the string we can have matched L more // character if we use this string now. This might mean that we don't use // another string that earlier we thought we'd be matching right now, we'll // find out later which one was better. for _, hit := range at_pos[i] { L := len(patterns[hit]) if i+L < len(max) && max[i+L] < max[i]+L { max[i+L] = max[i] + L } } } fmt.Printf("%v\n", max) }

How tokenize a striped line based on a list of patterns

More articles: