How to get n-gram collocations and associations in python nltk?

Question

In this documentation, there is an example of using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures() and TrigramCollocationFinder.

It shows how to get the n best collocations by PMI with nbest() for bigrams and trigrams. Example:

>>> finder = BigramCollocationFinder.from_words(
...     nltk.corpus.genesis.words('english-web.txt'))
>>> finder.nbest(bigram_measures.pmi, 10)

I know that BigramCollocationFinder and TrigramCollocationFinder inherit from AbstractCollocationFinder, while BigramAssocMeasures() and TrigramAssocMeasures() inherit from NgramAssocMeasures.

How can I use the methods (e.g. nbest()) in AbstractCollocationFinder and NgramAssocMeasures for 4-grams, 5-grams, 6-grams, ..., n-grams, the same way they can easily be used for bigrams and trigrams?

Should I create a class that inherits from AbstractCollocationFinder?

Thanks.

+6

python nlp nltk n-gram collocation

Fahmi rizal Sep 7 '13 at 9:58

2

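If you want n-grams beyond 2- or 3-grams, you can use the scikit package together with FreqDist to get counts for them. I tried doing this with nltk.collocations, but I don't think it can score anything longer than 3-grams, so I decided to go with n-gram counts instead. Hope this helps a bit. Thanks.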

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.collocations import *
from nltk.probability import FreqDist
import nltk

query = "This document gives a very short introduction to machine learning problems"
vect = CountVectorizer(ngram_range=(1,4))
analyzer = vect.build_analyzer()
listNgramQuery = analyzer(query)
listNgramQuery.reverse()
print "listNgramQuery=", listNgramQuery
NgramQueryWeights = nltk.FreqDist(listNgramQuery)
print "\nNgramQueryWeights=", NgramQueryWeights

listNgramQuery= [u'to machine learning problems', u'introduction to machine learning', u'short introduction to machine', u'very short introduction to', u'gives very short introduction', u'document gives very short', u'this document gives very', u'machine learning problems', u'to machine learning', u'introduction to machine', u'short introduction to', u'very short introduction', u'gives very short', u'document gives very', u'this document gives', u'learning problems', u'machine learning', u'to machine', u'introduction to', u'short introduction', u'very short', u'gives very', u'document gives', u'this document', u'problems', u'learning', u'machine', u'to', u'introduction', u'short', u'very', u'gives', u'document', u'this']

NgramQueryWeights= <FreqDist: u'document': 1, u'document gives': 1, u'document gives very': 1, u'document gives very short': 1, u'gives': 1, u'gives very': 1, u'gives very short': 1, u'gives very short introduction': 1, u'introduction': 1, u'introduction to': 1, ...>
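The same counts can be reproduced with the standard library alone, which may help when scikit-learn is not available. This is a minimal sketch, not the code from the answer above: count_ngrams is a hypothetical helper, and the token filter only approximates CountVectorizer's default behaviour (lowercasing and dropping one-character tokens such as "a").

```python
from collections import Counter

def count_ngrams(tokens, min_n=1, max_n=4):
    """Count every contiguous n-gram of order min_n..max_n in a token list."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

query = "This document gives a very short introduction to machine learning problems"
# Assumption: lowercase and drop one-character tokens, which is why "a"
# is absent from the n-gram listing above.
tokens = [t for t in query.lower().split() if len(t) > 1]
counts = count_ngrams(tokens)

print(counts["machine learning"])
print(len(counts))
```

Because the counts live in a plain Counter, counts.most_common(k) gives the k most frequent n-grams of any order without going through a collocation finder.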

+8

Gunjan 02 . '13 9:14

alvas · Accepted Answer · 2013-09-10T12:53:32+0000

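Edited

The current NLTK has a hardcoded QuadgramCollocationFinder, but the reason you cannot simply write a generic NgramCollocationFinder still stands: the counting done in from_words() has to change for each order of ngram.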

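Short answer: no, you cannot simply use AbstractCollocationFinder (ACF) and call nbest() if you want collocations beyond 2- and 3-grams.

This is because from_words() is different for different ngrams. Only the subclasses of ACF (i.e. BigramCF and TrigramCF) have a from_words() method.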

>>> finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
>>> finder = AbstractCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt',5))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'AbstractCollocationFinder' has no attribute 'from_words'
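The AttributeError above follows from how the classes are laid out: the ranking logic sits in the base class, while the counting logic is a classmethod defined only on each subclass. The sketch below reproduces that layout with hypothetical stand-in classes (AbstractFinder and BigramFinder are not NLTK names, and ranking here is by raw frequency rather than by an association measure).

```python
from collections import Counter

class AbstractFinder:
    """Hypothetical base class: ranks counts but does not know how to build them."""

    def __init__(self, ngram_counts):
        self.ngram_counts = ngram_counts

    def nbest(self, k):
        # Rank by raw frequency, ties broken alphabetically.
        ranked = sorted(self.ngram_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [gram for gram, _ in ranked[:k]]

class BigramFinder(AbstractFinder):
    """Hypothetical subclass: the only place that knows how to count bigrams."""

    @classmethod
    def from_words(cls, words):
        words = list(words)
        return cls(Counter(zip(words, words[1:])))

finder = BigramFinder.from_words("to be or not to be".split())
print(finder.nbest(1))

# The base class has nbest() but no constructor from raw words.
print(hasattr(AbstractFinder, "from_words"))
```

Supporting another n-gram order in this layout means writing another subclass with its own from_words(), which is the constraint the rest of this answer describes.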

So, given this from_words() in TrigramCF:

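# NLTK 2.x API: ingrams() and FreqDist.inc() became ngrams() and fd[key] += 1 in NLTK 3.
from nltk.probability import FreqDist
from nltk.util import ingrams

# Method excerpt; it belongs inside the finder class body.
@classmethod
def from_words(cls, words):
    # Four separate distributions; (FreqDist(),)*4 would alias a single object.
    wfd, wildfd, bfd, tfd = (FreqDist() for _ in range(4))

    for w1, w2, w3 in ingrams(words, 3, pad_right=True):
        wfd.inc(w1)            # unigram count

        if w2 is None:
            continue
        bfd.inc((w1, w2))      # bigram count

        if w3 is None:
            continue
        wildfd.inc((w1, w3))   # pair with a one-word gap (w1 _ w3)
        tfd.inc((w1, w2, w3))  # trigram count

    return cls(wfd, bfd, wildfd, tfd)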

To build a 4-gram collocation finder, you would have to hardcode it like this:

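@classmethod
def from_words(cls, words):
    # One separate distribution per count table (no shared objects).
    wfd, wildfd, bfd, tfd, ffd = (FreqDist() for _ in range(5))

    # A 4-gram finder slides a window of four words.
    for w1, w2, w3, w4 in ingrams(words, 4, pad_right=True):
        wfd.inc(w1)

        if w2 is None:
            continue
        bfd.inc((w1, w2))

        if w3 is None:
            continue
        wildfd.inc((w1, w3))
        tfd.inc((w1, w2, w3))

        if w4 is None:
            continue
        # Remaining pair counts inside the window; (w1, w3) is already
        # counted above. Which pairs are needed depends on the association
        # measure, so treat this list as illustrative.
        wildfd.inc((w1, w4))
        wildfd.inc((w2, w4))
        wildfd.inc((w3, w4))
        wildfd.inc((w2, w3))
        wildfd.inc((w1, w2))
        ffd.inc((w1, w2, w3, w4))

    return cls(wfd, bfd, wildfd, tfd, ffd)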

Then you would also have to change every part of the code that uses the cls returned from from_words accordingly.
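If PMI is the only score needed, one way around the per-order finder classes is to count n-grams directly and apply a generalized PMI, since that score needs only the n-gram count, the unigram counts and the corpus size. The following is a minimal standard-library sketch under that assumption; ngram_pmi and nbest are hypothetical helpers, not NLTK functions, and measures that need a full contingency table would require more counts than this keeps.

```python
import math
from collections import Counter

def ngram_pmi(tokens, n, min_freq=1):
    """Score each contiguous n-gram with a generalized PMI.

    score = log2( c(w1..wn) * N**(n-1) / (c(w1) * ... * c(wn)) )
    where N is the number of tokens. For n == 2 this is ordinary PMI.
    """
    total = len(tokens)
    unigrams = Counter(tokens)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(total - n + 1))
    scores = {}
    for gram, freq in ngrams.items():
        if freq < min_freq:
            continue  # rare n-grams get unreliable, inflated scores
        joint = math.log2(freq) + (n - 1) * math.log2(total)
        independent = sum(math.log2(unigrams[w]) for w in gram)
        scores[gram] = joint - independent
    return scores

def nbest(scores, k):
    """Return the k highest-scoring n-grams, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]

tokens = "a b c d a b c d x y".split()
scores = ngram_pmi(tokens, 4, min_freq=2)
print(nbest(scores, 1))
```

The min_freq filter plays the same role as a frequency filter on a finder: without it, n-grams seen once tend to dominate a PMI ranking.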

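So you have to ask: what is the ultimate purpose of finding the collocations?

If you are retrieving words within collocation windows larger than 2- or 3-grams, you will end up with a lot of noise in what you retrieve.
If you are building a model based on collocations from 2- or 3-gram windows, you will also run into sparsity problems.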
