The following snippet, the core of a word2ngrams function, extracts the character 3-grams from a word:

    >>> x = 'foobar'
    >>> n = 3
    >>> [x[i:i+n] for i in range(len(x)-n+1)]
    ['foo', 'oob', 'oba', 'bar']
This post shows how to extract character n-grams for a single word: Quick implementation of character n-grams using python.
But what if I have sentences and want to extract character n-grams from them? Is there a faster method than calling word2ngrams() on each word?
What would the regular-expression version of word2ngrams and sent2ngrams look like, and would it be faster?
I tried:
    import string, random, time
    from itertools import chain

    def word2ngrams(text, n=3):
        """ Convert word into character ngrams. """
        return [text[i:i+n] for i in range(len(text)-n+1)]

    def sent2ngrams(text, n=3):
        # Split on whitespace, then collect the n-grams of each word.
        return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

    def sent2ngrams_simple(text, n=3):
        # Slide over the whole sentence and drop any window containing a space.
        text = text.lower()
        return [text[i:i+n] for i in range(len(text)-n+1) if " " not in text[i:i+n]]
[output]:
    0.0205280780792
    0.0271739959717
    True
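To make the comparison concrete, here is a minimal sketch that runs both functions on a small sample sentence. The sentence is a hypothetical input chosen for illustration, not the text that produced the timings above.

```python
from itertools import chain

def word2ngrams(text, n=3):
    """Convert a word into character n-grams."""
    return [text[i:i+n] for i in range(len(text) - n + 1)]

def sent2ngrams(text, n=3):
    # Per-word approach: split on whitespace, then n-gram each word.
    return list(chain.from_iterable(word2ngrams(w, n) for w in text.lower().split()))

def sent2ngrams_simple(text, n=3):
    # Sliding-window approach: skip windows that contain a space.
    text = text.lower()
    return [text[i:i+n] for i in range(len(text) - n + 1) if " " not in text[i:i+n]]

sample = "Foo bar quux"  # hypothetical sample sentence
per_word = sent2ngrams(sample)
sliding = sent2ngrams_simple(sample)

print(per_word)             # ['foo', 'bar', 'quu', 'uux']
print(per_word == sliding)  # True
```

Note that the two functions only agree when words are separated by spaces: split() treats tabs and newlines as separators too, while sent2ngrams_simple filters only the space character.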
EDITED
The regex method looks elegant, but it runs slower than iteratively calling word2ngrams():
    import string, random, time, re
    from itertools import chain

    def word2ngrams(text, n=3):
        """ Convert word into character ngrams. """
        return [text[i:i+n] for i in range(len(text)-n+1)]

    def sent2ngrams(text, n=3):
        return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

    def sent2ngrams_simple(text, n=3):
        text = text.lower()
        return [text[i:i+n] for i in range(len(text)-n+1) if " " not in text[i:i+n]]

    def sent2ngrams_regex(text, n=3):
        # Lookahead with a capture group yields overlapping matches;
        # raw strings avoid the invalid-escape warning for \S.
        rgx = r'(?=(' + r'\S'*n + r'))'
        return re.findall(rgx, text)
[output]:
    0.0211708545685
    0.0284190177917
    0.0303599834442
    True
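The timing code that produced these numbers is not shown, so the following is only a sketch of how the three variants could be compared with timeit. The random lowercase test text, the seed, and the repetition count are all assumptions for illustration, and absolute timings will vary by machine and Python version.

```python
import random
import re
import string
import timeit
from itertools import chain

def word2ngrams(text, n=3):
    return [text[i:i+n] for i in range(len(text) - n + 1)]

def sent2ngrams(text, n=3):
    return list(chain.from_iterable(word2ngrams(w, n) for w in text.lower().split()))

def sent2ngrams_simple(text, n=3):
    text = text.lower()
    return [text[i:i+n] for i in range(len(text) - n + 1) if " " not in text[i:i+n]]

def sent2ngrams_regex(text, n=3):
    # Zero-width lookahead with a capture group returns overlapping n-grams.
    rgx = r'(?=(' + r'\S' * n + r'))'
    return re.findall(rgx, text)

# Hypothetical benchmark input: 1000 random lowercase words joined by spaces.
random.seed(0)
words = ["".join(random.choice(string.ascii_lowercase)
                 for _ in range(random.randint(1, 10)))
         for _ in range(1000)]
text = " ".join(words)

for fn in (sent2ngrams, sent2ngrams_simple, sent2ngrams_regex):
    seconds = timeit.timeit(lambda: fn(text), number=100)
    print(fn.__name__, seconds)

# All three should produce the same n-grams on lowercase, space-separated text.
print(sent2ngrams(text) == sent2ngrams_simple(text) == sent2ngrams_regex(text))
```

The regex variant does not lowercase its input, so the equality check holds here only because the generated text is already lowercase.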