How to extract character n-grams from sentences? - python

The following expression, the core of a word2ngrams function, extracts the character 3-grams from a word:

    >>> x = 'foobar'
    >>> n = 3
    >>> [x[i:i+n] for i in range(len(x)-n+1)]
    ['foo', 'oob', 'oba', 'bar']

This is the approach shown for a single word in the post "Quick implementation of character n-grams using python".

But what if I have sentences and want to extract character n-grams from them? Is there a faster method than iteratively calling word2ngrams() on each word?

What would a regular-expression version of word2ngrams and sent2ngrams look like? Would it be faster?

I tried:

    import string, random, time
    from itertools import chain

    def word2ngrams(text, n=3):
        """ Convert word into character ngrams. """
        return [text[i:i+n] for i in range(len(text)-n+1)]

    def sent2ngrams(text, n=3):
        return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

    def sent2ngrams_simple(text, n=3):
        text = text.lower()
        return [text[i:i+n] for i in range(len(text)-n+1) if not " " in text[i:i+n]]

    # Generate 100 random sentences, each made of 100 words of 10 letters.
    sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

    start = time.time()
    x = [sent2ngrams(i) for i in sents]
    print(time.time() - start)

    start = time.time()
    y = [sent2ngrams_simple(i) for i in sents]
    print(time.time() - start)

    print(x==y)

[output]:

    0.0205280780792
    0.0271739959717
    True

EDITED

The regex method looks elegant, but it runs slower than iteratively calling word2ngrams():

    import string, random, time, re
    from itertools import chain

    def word2ngrams(text, n=3):
        """ Convert word into character ngrams. """
        return [text[i:i+n] for i in range(len(text)-n+1)]

    def sent2ngrams(text, n=3):
        return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

    def sent2ngrams_simple(text, n=3):
        text = text.lower()
        return [text[i:i+n] for i in range(len(text)-n+1) if not " " in text[i:i+n]]

    def sent2ngrams_regex(text, n=3):
        # Raw string so \S reaches the regex engine unchanged.
        rgx = '(?=(' + r'\S'*n + '))'
        # Lowercase like the other two functions so the results compare equal.
        return re.findall(rgx, text.lower())

    # Generate 100 random sentences, each made of 100 words of 10 letters.
    sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

    start = time.time()
    x = [sent2ngrams(i) for i in sents]
    print(time.time() - start)

    start = time.time()
    y = [sent2ngrams_simple(i) for i in sents]
    print(time.time() - start)

    start = time.time()
    z = [sent2ngrams_regex(i) for i in sents]
    print(time.time() - start)

    print(x==y==z)

[output]:

    0.0211708545685
    0.0284190177917
    0.0303599834442
    True
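
Single time.time() measurements like these are noisy, so the differences between the three timings may not be stable from run to run. A rough sketch of the same comparison using the standard timeit module, keeping the best of several repeats, could look like the following. The helper name best_time, the repeat counts, and the lowercasing inside the regex version are illustrative assumptions, not part of the original post.

```python
import random
import re
import string
import timeit
from itertools import chain


def word2ngrams(text, n=3):
    """Convert a word into its character n-grams."""
    return [text[i:i+n] for i in range(len(text)-n+1)]


def sent2ngrams(text, n=3):
    """Character n-grams of every whitespace-separated word in text."""
    return list(chain.from_iterable(word2ngrams(w, n) for w in text.lower().split()))


def sent2ngrams_regex(text, n=3):
    # Assumption: lowercase here so both functions return identical lists.
    return re.findall(r'(?=(\S{%d}))' % n, text.lower())


def best_time(func, sents, repeat=5, number=3):
    """Hypothetical helper: best average time for one pass of func over sents."""
    timer = timeit.Timer(lambda: [func(s) for s in sents])
    return min(timer.repeat(repeat=repeat, number=number)) / number


random.seed(0)  # fixed seed so every run times the same input
sents = [" ".join("".join(random.choice(string.ascii_uppercase) for _ in range(10))
                  for _ in range(100))
         for _ in range(100)]

for func in (sent2ngrams, sent2ngrams_regex):
    print(func.__name__, best_time(func, sents))
```

Taking the minimum over repeats is the usual timeit convention, since slower runs mostly reflect interference from other processes rather than the code being measured.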
1 answer

Why not just (?=(...))

Edit: the same thing, but without spaces: (?=(\S\S\S))
Edit 2: you can restrict it to whatever characters you want. For example, alphanumerics only: (?=([^\W_]{3}))
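
To make the difference between the three patterns concrete, here is a small sketch that runs each of them through re.findall on the same string. The sample text and the variable names are made up for illustration.

```python
import re

text = "foo_bar baz!"

# Any three characters (except newline), spaces included.
any_chars = re.findall(r'(?=(...))', text)

# Three non-whitespace characters: no n-gram spans a space.
no_spaces = re.findall(r'(?=(\S\S\S))', text)

# Three alphanumeric characters: [^\W_] is \w minus the underscore.
alnum_only = re.findall(r'(?=([^\W_]{3}))', text)

print(any_chars)
print(no_spaces)
print(alnum_only)  # ['foo', 'bar', 'baz']
```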

This uses a lookahead to capture 3 characters. Since the lookahead consumes nothing, the engine bumps its position up by 1 after each match,
then captures the next 3.

The result for foobar is:
foo
oob
oba
bar
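
The stepping behaviour can be observed directly with re.finditer, which exposes the position of each match. This is only an illustrative sketch; the variable names are not from the original answer.

```python
import re

word = "foobar"

# Each entry: (start, end, whole match, captured group 1).
matches = [(m.start(), m.end(), m.group(0), m.group(1))
           for m in re.finditer(r'(?=(...))', word)]

for start, end, whole, trigram in matches:
    # start == end and the whole match is empty: the lookahead consumes
    # nothing, so only group 1 carries the three characters.
    print(start, end, repr(whole), trigram)
```

Every match is zero-width, which is why overlapping 3-grams are found at positions 0, 1, 2 and 3 instead of only the non-overlapping ones.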

    # Compressed regex
    # (?=(...))

    # Expanded regex
    (?=          # Start Lookahead assertion
      (          # Capture group 1 start
        .        # dot - metachar, matches any character except newline
        .        # dot - metachar
        .        # dot - metachar
      )          # Capture group 1 end
    )            # End Lookahead assertion
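
In Python the expanded form can be used as written by compiling it with the re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. A minimal sketch, where the name trigram is just an illustrative choice:

```python
import re

trigram = re.compile(r"""
    (?=         # start lookahead assertion
      (         # capture group 1 start
        .       # any character except newline
        .
        .
      )         # capture group 1 end
    )           # end lookahead assertion
""", re.VERBOSE)

print(trigram.findall("foobar"))  # ['foo', 'oob', 'oba', 'bar']
```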

Source: https://habr.com/ru/post/984295/

