How to calculate possible subsequences of words matching a pattern?

Suppose I have a sequence:

Seq = 'hello my name' 

and a string:

  Str = 'hello hello my friend, my awesome name is John, oh my god!' 

I then look for matches of each word of the sequence within the string and store the word indices of the matches in a cell array: the first cell contains the matches for 'hello', the second the matches for 'my', and the third the matches for 'name'.

  Match = {[1 2];     %'hello' matches
           [3 5 11];  %'my' matches
           [7]}       %'name' matches
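For reference, a cell array like this could be built roughly as follows. This is only a sketch under two assumptions that the question does not state: words are separated by whitespace, and punctuation attached to a word is ignored when matching. The names seqWords and strWords are illustrative.

```matlab
Seq = 'hello my name';
Str = 'hello hello my friend, my awesome name is John, oh my god!';

% Split both inputs into words on whitespace
seqWords = strsplit(Seq);
strWords = strsplit(Str);

% Assumption: punctuation should not affect matching, so strip non-word characters
strWords = regexprep(strWords, '\W', '');

% For each word of the sequence, collect the word positions where it occurs
Match = cell(numel(seqWords), 1);
for k = 1:numel(seqWords)
    Match{k} = find(strcmp(strWords, seqWords{k}));
end
```

Here strcmp is case-sensitive; strcmpi could be used instead if case should be ignored.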

I need code that returns all possible subsequence matches:

  Answer = [1 3 7;   %[hello my name]
            1 5 7;   %[hello my name]
            2 3 7;   %[hello my name]
            2 5 7]   %[hello my name]

Thus, "Answer" contains all possible ordered sequences. This is why the 'my' at word 11 never appears in "Answer": a 'name' would have to occur after position 11.

NOTE: The length of "Seq" and the number of matches for each word may vary.

matlab sequence word distance
Feb 18 '14 at 23:21
1 answer

Since the number of cells in Match can vary, you need to use comma-separated lists along with ndgrid to generate all combinations (the approach is similar to that used in this other answer). Then filter out the combinations whose indices are not strictly increasing, using diff and logical indexing:

 cc = cell(1,numel(Match)); %// pre-shape to be used for ndgrid output
 [cc{end:-1:1}] = ndgrid(Match{end:-1:1}); %// output is a comma-separated list
 cc = cellfun(@(v) v(:), cc, 'uni', 0); %// linearize each cell
 combs = [cc{:}]; %// concatenate into a matrix, one combination per row
 ind = all(diff(combs,1,2)>0, 2); %// rows with strictly increasing indices (explicit dims, so two-word sequences also work)
 combs = combs(ind,:); %// remove unwanted combinations

The desired result is in the combs variable. For your example:

 combs =
      1     3     7
      1     5     7
      2     3     7
      2     5     7
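If the full set of combinations produced by ndgrid becomes too large, one possible alternative is to build the combinations one word at a time and discard partial combinations as soon as the order is violated. This is a different technique from the one above, shown only as a sketch; the variable name cand is illustrative, and Match is assumed to hold numeric index vectors as in the question.

```matlab
Match = {[1 2]; [3 5 11]; 7};   % example data from the question

% Start with one partial combination per match of the first word
combs = Match{1}(:);
for k = 2:numel(Match)
    cand = Match{k}(:).';                        % candidate indices for word k, as a row
    % Pairs (partial combination, candidate) where the candidate comes later in the string
    [r, c] = find(bsxfun(@lt, combs(:,end), cand));
    % Extend each surviving partial combination by its candidate
    combs = [combs(r(:),:), reshape(cand(c), [], 1)];
end
combs = sortrows(combs);                         % same row order as the ndgrid approach
```

For the example data this yields the same four rows as above. The intermediate matrices only ever hold combinations that are still valid, which may matter when words have many matches.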
Feb 18 '14 at 23:31