N-grams in python, four, five, six grams?

I am looking for a way to split text into n-grams. Usually I would do something like:

    import nltk
    from nltk import bigrams
    string = "I really like python, it's pretty awesome."
    string_bigrams = bigrams(string)
    print string_bigrams

I know that nltk only offers bigrams and trigrams, but is there a way to divide my text into four grams, five grams, or even one hundred grams?

Thanks!

+87
13 answers

Other users have provided great pure-Python answers. But here is the nltk approach, in case the OP gets penalized for reinventing something that already exists in the nltk library.

There is an ngrams function in nltk that people rarely use. This is not because reading ngrams is difficult, but because training a model on ngrams with n > 3 leads to significant data sparsity.

    from nltk import ngrams
    sentence = 'this is a foo bar sentences and i want to ngramize it'
    n = 6
    sixgrams = ngrams(sentence.split(), n)
    for grams in sixgrams:
        print grams
+143

I'm surprised this hasn't shown up yet:

    In [34]: sentence = "I really like python, it's pretty awesome.".split()

    In [35]: N = 4

    In [36]: grams = [sentence[i:i+N] for i in xrange(len(sentence)-N+1)]

    In [37]: for gram in grams: print gram
    ['I', 'really', 'like', 'python,']
    ['really', 'like', 'python,', "it's"]
    ['like', 'python,', "it's", 'pretty']
    ['python,', "it's", 'pretty', 'awesome.']
+45

Here is another easy way to make n-grams:

    >>> import nltk
    >>> from nltk.util import ngrams
    >>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
    >>> tokenize = nltk.word_tokenize(text)
    >>> tokenize
    ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    >>> bigrams = ngrams(tokenize, 2)
    >>> bigrams
    [('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
    >>> trigrams = ngrams(tokenize, 3)
    >>> trigrams
    [('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
    >>> fourgrams = ngrams(tokenize, 4)
    >>> fourgrams
    [('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]
+9

Using only nltk tools

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def get_ngrams(text, n):
        n_grams = ngrams(word_tokenize(text), n)
        return [' '.join(grams) for grams in n_grams]

Output example

    get_ngrams('This is the simplest text i could think of', 3)
    ['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

To keep the ngrams as tuples instead of joined strings, just remove the ' '.join.
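For example, a pure-Python sketch of that tuple-returning variant (the helper name get_ngram_tuples is hypothetical, not from the answer above):

```python
def get_ngram_tuples(text, n):
    # zip over n shifted views of the token list; each resulting tuple is one n-gram
    tokens = text.split()
    return list(zip(*[tokens[i:] for i in range(n)]))

print(get_ngram_tuples('This is the simplest text i could think of', 3)[0])
# ('This', 'is', 'the')
```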

+7

You can easily hack together your own function to do this using itertools:

    from itertools import izip, islice, tee
    s = 'spam and eggs'
    N = 3
    trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
    list(trigrams)
    # [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
    #  ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
    #  ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
    #  ('e', 'g', 'g'), ('g', 'g', 's')]
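Note that izip was removed in Python 3, where the built-in zip is already lazy; a Python 3 equivalent of the same tee/islice trick might look like this:

```python
from itertools import islice, tee

s = 'spam and eggs'
N = 3
# tee makes N independent copies of the character stream;
# islice shifts copy i forward by i positions; zip stitches them back together
trigrams = list(zip(*(islice(seq, i, None) for i, seq in enumerate(tee(s, N)))))
print(trigrams[:3])  # [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' ')]
```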
+3

I have never used nltk, but I did N-grams as part of a small project. If you want to find the frequency of all N-grams occurring in a string, here is a way to do it. D will give you a histogram of your N-grams.

    D = dict()
    string = 'whatever string...'
    strparts = string.split()
    N = 2  # pick your n here
    for i in range(len(strparts) - N + 1):  # N-grams; +1 so the last one is included
        try:
            D[tuple(strparts[i:i+N])] += 1
        except KeyError:
            D[tuple(strparts[i:i+N])] = 1
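For reference, the same histogram can be built with collections.Counter, which sidesteps the try/except entirely (a sketch, not the original author's code; the helper name ngram_counts is made up):

```python
from collections import Counter

def ngram_counts(text, n):
    # Count how often each word-level n-gram occurs in the string
    words = text.split()
    return Counter(tuple(words[i:i+n]) for i in range(len(words) - n + 1))

counts = ngram_counts('to be or not to be', 2)
print(counts[('to', 'be')])  # 2
```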
+2

For four-grams, it is already in NLTK; here is a piece of code that can help you:

    from nltk.collocations import *
    import nltk

    # You should tokenize your text
    text = "I do not like green eggs and ham, I do not like them Sam I am!"
    tokens = nltk.wordpunct_tokenize(text)
    fourgrams = nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
    for fourgram, freq in fourgrams.ngram_fd.items():
        print fourgram, freq

Hope this helps.

+2

A more elegant approach builds bigrams with Python's built-in zip(). Just convert the original string into a list with split(), then zip the list once as usual and once shifted by one element.

    string = "I really like python, it's pretty awesome."

    def find_bigrams(s):
        input_list = s.split(" ")
        return zip(input_list, input_list[1:])

    def find_ngrams(s, n):
        input_list = s.split(" ")
        return zip(*[input_list[i:] for i in range(n)])

    find_bigrams(string)
    [('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]
+2

You can use sklearn.feature_extraction.text.CountVectorizer :

    import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html
    ngram_size = 4
    string = ["I really like python, it's pretty awesome."]
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
    vect.fit(string)
    print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

 4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it'] 

You can set ngram_size to any positive integer. That is, you can split the text into four-grams, five-grams, or even hundred-grams.

0

Nltk is great, but sometimes it is overkill for small projects:

    import re

    def tokenize(text, ngrams=1):
        text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
        text = re.sub(r'\s+', ' ', text)
        tokens = text.split()
        return [tuple(tokens[i:i+ngrams]) for i in xrange(len(tokens)-ngrams+1)]

Usage example:

    >> text = "This is an example text"
    >> tokenize(text, 2)
    [('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
    >> tokenize(text, 3)
    [('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]
0

If efficiency is an issue and you need to build several different n-grams (up to a hundred, as you say) in pure Python, I would do:

    from itertools import chain

    def n_grams(seq, n=1):
        """Returns an iterator over the n-grams given a list of tokens."""
        shiftToken = lambda i: (el for j, el in enumerate(seq) if j >= i)
        shiftedTokens = (shiftToken(i) for i in range(n))
        tupleNGrams = zip(*shiftedTokens)
        return tupleNGrams  # if joining in the generator: (" ".join(i) for i in tupleNGrams)

    def range_ngrams(listTokens, ngramRange=(1, 2)):
        """Returns an iterator over all n-grams for n in range(ngramRange) given a list of tokens."""
        return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

Usage:

    >>> input_list = 'test the ngrams generator'.split()
    >>> list(range_ngrams(input_list, ngramRange=(1, 3)))
    [('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

Roughly the same speed as NLTK:

    import nltk

    %%timeit
    input_list = 'test the ngrams iterator vs nltk ' * 10**6
    nltk.ngrams(input_list, n=5)
    # 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams iterator vs nltk ' * 10**6
    n_grams(input_list, n=5)
    # 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams iterator vs nltk ' * 10**6
    nltk.ngrams(input_list, n=1)
    nltk.ngrams(input_list, n=2)
    nltk.ngrams(input_list, n=3)
    nltk.ngrams(input_list, n=4)
    nltk.ngrams(input_list, n=5)
    # 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams iterator vs nltk ' * 10**6
    range_ngrams(input_list, ngramRange=(1, 6))
    # 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Disregard my previous answer.

0

You can get all 4-grams through 6-grams using the code below, without any extra package:

    from itertools import chain

    def get_m_2_ngrams(input_list, min, max):
        for s in chain(*[get_ngrams(input_list, k) for k in range(min, max+1)]):
            yield ' '.join(s)

    def get_ngrams(input_list, n):
        return zip(*[input_list[i:] for i in range(n)])

    if __name__ == '__main__':
        input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
        for s in get_m_2_ngrams(input_list, 4, 6):
            print(s)

output below:

    I am aware that
    am aware that nltk
    aware that nltk only
    that nltk only offers
    nltk only offers bigrams
    only offers bigrams and
    offers bigrams and trigrams
    bigrams and trigrams ,
    and trigrams , but
    trigrams , but is
    , but is there
    but is there a
    is there a way
    there a way to
    a way to split
    way to split my
    to split my text
    split my text in
    my text in four-grams
    text in four-grams ,
    in four-grams , five-grams
    four-grams , five-grams or
    , five-grams or even
    five-grams or even hundred-grams
    I am aware that nltk
    am aware that nltk only
    aware that nltk only offers
    that nltk only offers bigrams
    nltk only offers bigrams and
    only offers bigrams and trigrams
    offers bigrams and trigrams ,
    bigrams and trigrams , but
    and trigrams , but is
    trigrams , but is there
    , but is there a
    but is there a way
    is there a way to
    there a way to split
    a way to split my
    way to split my text
    to split my text in
    split my text in four-grams
    my text in four-grams ,
    text in four-grams , five-grams
    in four-grams , five-grams or
    four-grams , five-grams or even
    , five-grams or even hundred-grams
    I am aware that nltk only
    am aware that nltk only offers
    aware that nltk only offers bigrams
    that nltk only offers bigrams and
    nltk only offers bigrams and trigrams
    only offers bigrams and trigrams ,
    offers bigrams and trigrams , but
    bigrams and trigrams , but is
    and trigrams , but is there
    trigrams , but is there a
    , but is there a way
    but is there a way to
    is there a way to split
    there a way to split my
    a way to split my text
    way to split my text in
    to split my text in four-grams
    split my text in four-grams ,
    my text in four-grams , five-grams
    text in four-grams , five-grams or
    in four-grams , five-grams or even
    four-grams , five-grams or even hundred-grams

You can find more information in this blog post.

0

People have already given good answers for the scenario where you need bigrams or trigrams, but if you need every gram for a sentence, you can use nltk.util.everygrams:

    >>> from nltk.util import everygrams
    >>> message = "who let the dogs out"
    >>> msg_split = message.split()
    >>> list(everygrams(msg_split))
    [('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]

If you have a limit, as in the case of trigrams where the maximum length should be 3, you can use the max_len parameter to specify it.

    >>> list(everygrams(msg_split, max_len=2))
    [('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]

You can simply change the max_len parameter to get any gram: four grams, five grams, six, or even one hundred grams.
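There is also a min_len parameter, so you can bound the n-gram sizes on both ends, for example keeping only the four- and five-grams (sorted here so the output order is deterministic across nltk versions):

```python
from nltk.util import everygrams

msg_split = "who let the dogs out".split()
# min_len/max_len bound the n-gram sizes from below and above
grams = sorted(everygrams(msg_split, min_len=4, max_len=5))
print(grams)
# [('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'),
#  ('who', 'let', 'the', 'dogs', 'out')]
```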

The solutions mentioned earlier can be modified to achieve this, but this one is much simpler.

For further reading, see the nltk documentation for everygrams.

And when you just need a specific gram, such as a bigram or a trigram, you can use nltk.util.ngrams, as mentioned in MAHassan's answer.

0

Source: https://habr.com/ru/post/984294/

