The first step is to create a function that can generate n-grams for a given string. One way to do this in a vectorized way is with some smart indexing.
function [subStrings, counts] = n_gram(fullString, N)
if (N == 1)
[subStrings, ~, index] = unique(cellstr(fullString.')); %.'
else
nString = numel(fullString);
index = hankel(1:(nString-N+1), (nString-N+1):nString);
[subStrings, ~, index] = unique(cellstr(fullString(index)));
end
counts = accumarray(index, 1);
end
HANKEL, , N- . N . CELLSTR . UNIQUE , ACCUMARRAY ( ).
n-, , INTERSECT:
subStrings1 = n_gram('tool',2);
subStrings2 = n_gram('fool',2);
sharedStrings = intersect(subStrings1,subStrings2);
nShared = numel(sharedStrings);