Problems getting trigrams using Gensim

I want to get bigrams and trigrams from the example sentences below.

My code works fine for bigrams. However, it does not capture the trigrams in the data (for example, "human computer interaction", which appears in 5 of my sentences).

Approach 1: Below is my code using Phrases in Gensim.

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=1, delimiter=b' ')
# Train the second model on the bigram-transformed corpus
trigram = Phrases(bigram[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

Approach 2: I also tried using Phraser together with Phrases, but that didn't work either.

from gensim.models import Phrases
from gensim.models.phrases import Phraser
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
trigram = Phrases(bigram_phraser[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram_phraser[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

Please help me fix this problem of getting trigrams.

I am following the example in the Gensim documentation.


With a few modifications to your code, I was able to get both bigrams and trigrams:

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')

for sent in sentence_stream:
    bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1]
    trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2]

    print(bigrams_)
    print(trigrams_)

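I removed the threshold = 1 parameter from the bigram Phrases, because otherwise it seems to form odd bigrams, which in turn allow odd trigrams to be built (note that bigram is used to build the trigram Phrases); this parameter would probably be more useful with more data. For the trigrams, min_count also has to be specified, because it defaults to 5 when it is not given.

To retrieve the bigrams and trigrams of each document, you can use the list comprehensions above to filter out elements that are not made up of two or three words, respectively.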


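Edit - a few details about the threshold parameter:

This parameter is used to decide whether two words a and b form a phrase, which happens only if: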

(count(a followed by b) - min_count) * N/(count(a) * count(b)) > threshold

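where N is the total vocabulary size. The default value of threshold is 10 (see the docs). So the higher the threshold, the stricter the condition words must meet to form a phrase.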
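To make the condition concrete, here is a minimal pure-Python sketch that evaluates the left-hand side for two word pairs from the question's sentences. It does not call Gensim; `score` is a hypothetical helper, and taking N as the number of distinct unigrams plus distinct adjacent word pairs is an assumption about how the vocabulary is counted, so the library's own scores may differ.

```python
from collections import Counter

documents = [
    "the mayor of new york was there",
    "human computer interaction and machine learning has now become a trending research area",
    "human computer interaction is interesting",
    "human computer interaction is a pretty interesting subject",
    "human computer interaction is a great and new subject",
    "machine learning can be useful sometimes",
    "new york mayor was present",
    "I love machine learning because it is a new subject area",
    "human computer interaction helps people to get user friendly applications",
]
sentences = [doc.split(" ") for doc in documents]

# Count single words and adjacent word pairs over the whole corpus.
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))

# Assumption: N counts every distinct unigram and every distinct bigram.
N = len(unigrams) + len(bigrams)

def score(a, b, min_count=1):
    """Left-hand side of the condition: (count(ab) - min_count) * N / (count(a) * count(b))."""
    return (bigrams[(a, b)] - min_count) * N / (unigrams[a] * unigrams[b])

hc = score("human", "computer")
ii = score("interaction", "is")
print("human computer:", hc)
print("interaction is:", ii)
```

Under these assumptions "human computer" scores above the default threshold of 10, while "interaction is" falls between 1 and 10, so it only becomes a phrase once the threshold is lowered to 1.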

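For example, in your first approach you used threshold = 1, so you get ['human computer', 'interaction is'] as bigrams for 3 of your 5 sentences that begin with "human computer interaction"; that odd second bigram is a result of the more relaxed threshold.

Then, when you build trigrams with the default threshold = 10, you only get ['human computer interaction is'] for those 3 sentences, and nothing for the other two (they are filtered out by the threshold); and because that is a 4-gram rather than a trigram, it is also filtered out by if t.count(' ') == 2. If you instead lower the trigram threshold to 1, for example, you get ['human computer interaction'] as the trigram for the two remaining sentences. A good combination of parameters does not seem easy to find.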
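The interaction between the two passes can be reproduced with a small pure-Python stand-in for Phrases. This is only a sketch: `learn` and `apply_phrases` are hypothetical helpers, not Gensim functions, the greedy left-to-right merge is a simplification, and N is again assumed to be the number of distinct unigrams plus distinct bigrams.

```python
from collections import Counter

def learn(sentences):
    """Count tokens and adjacent token pairs (hypothetical stand-in for training)."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        unigrams.update(s)
        bigrams.update(zip(s, s[1:]))
    return unigrams, bigrams

def apply_phrases(sentence, model, min_count, threshold):
    """Greedily join adjacent tokens whose score exceeds the threshold."""
    unigrams, bigrams = model
    n = len(unigrams) + len(bigrams)  # assumed vocabulary size
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence):
            a, b = sentence[i], sentence[i + 1]
            seen = bigrams[(a, b)]
            if seen and (seen - min_count) * n / (unigrams[a] * unigrams[b]) > threshold:
                out.append(a + " " + b)
                i += 2
                continue
        out.append(sentence[i])
        i += 1
    return out

documents = [
    "the mayor of new york was there",
    "human computer interaction and machine learning has now become a trending research area",
    "human computer interaction is interesting",
    "human computer interaction is a pretty interesting subject",
    "human computer interaction is a great and new subject",
    "machine learning can be useful sometimes",
    "new york mayor was present",
    "I love machine learning because it is a new subject area",
    "human computer interaction helps people to get user friendly applications",
]
sentences = [doc.split(" ") for doc in documents]

# Pass 1: bigrams with the relaxed threshold from the question.
pass1 = [apply_phrases(s, learn(sentences), min_count=1, threshold=1) for s in sentences]

# Pass 2: train on the pass-1 output, then compare a strict and a relaxed threshold.
tri_model = learn(pass1)
strict = [apply_phrases(s, tri_model, min_count=1, threshold=10) for s in pass1]
loose = [apply_phrases(s, tri_model, min_count=1, threshold=1) for s in pass1]

print(pass1[2])
print(strict[2])
print(strict[8])
print(loose[8])
```

In this sketch the relaxed first pass produces "interaction is", the strict second pass then joins it with "human computer" into a 4-word phrase, and only the relaxed second pass yields "human computer interaction" for the sentences where "interaction" was left on its own.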

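One caveat, so take it with a grain of salt: I think it is better to get good bigram results first (not ones like "interaction is") before moving on, since odd bigrams carry their confusion over into the trigrams, 4-grams...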


Source: https://habr.com/ru/post/1685414/

