How to predict the next word in a sentence using an n-gram model in R

I have pre-processed text data in a corpus. Now I would like to build a prediction model based on the previous two words (which I think makes it a 3-gram model?). Based on my understanding of the articles I have read, here is how I think about it:

Step 1: enter the two-word phrase we want to predict the next word for

    # phrase our word prediction will be based on
    phrase <- "I love"

Step 2: calculate the 3-gram frequencies

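    library(tm)     # DocumentTermMatrix
    library(RWeka)  # NGramTokenizer, Weka_control

    # tokenizer that emits only 3-grams
    threegramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    dtm_threegram <- DocumentTermMatrix(corpus, control = list(tokenize = threegramTokenizer))
    # total count of each 3-gram, most frequent first
    threegram_freq <- sort(colSums(as.matrix(dtm_threegram)), decreasing = TRUE)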

The next step is where I get stuck. Conceptually, I think I should subset my 3-grams to include only the three-word combinations that start with "I love". Then I should keep only the 3-gram with the highest frequency. For example, if "I love you" appeared 12 times in my corpus and "I love beer" appeared 15 times, then the probability of "beer" being the next word is higher than that of "you", so the model should return "beer". Is this the right approach, and if so, how can I create something like this programmatically? My threegram_freq object appears to be of class numeric with a character attribute, and I don't quite understand what that is. Is it possible to use a regular expression to keep only the elements starting with "I love" and then extract the third word from the 3-gram with the highest frequency?
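The subsetting idea described above can be sketched in base R as follows. This is only a minimal sketch, not a definitive implementation. It assumes that threegram_freq is a named numeric vector (the 3-grams are the names, the counts are the values), which is what colSums() on a matrix normally returns, and that the 3-gram names are lower-case. The toy counts and the helper name predict_next_word are made up for illustration.

```r
# Toy stand-in for threegram_freq: a named numeric vector whose names are the
# 3-grams and whose values are their counts (made-up numbers for illustration).
threegram_freq <- c("i love beer" = 15, "i love you" = 12,
                    "you love me" = 7, "i loved it" = 3)

# Hypothetical helper: return the third word of the most frequent 3-gram
# that starts with the two-word phrase, or NA if no 3-gram matches.
predict_next_word <- function(phrase, freq) {
  # Assumption: the 3-gram names are lower-case, so lower-case the phrase too.
  # The trailing space keeps "i love" from also matching "i loved it".
  prefix <- paste0(tolower(trimws(phrase)), " ")

  # Keep only the 3-grams whose name starts with the prefix.
  hits <- freq[startsWith(names(freq), prefix)]
  if (length(hits) == 0) return(NA_character_)

  # Pick the most frequent match and strip the prefix to get the third word.
  best <- names(hits)[which.max(hits)]
  substring(best, nchar(prefix) + 1)
}

predict_next_word("I love", threegram_freq)
# [1] "beer"
```

A regular expression would work as well: grepl("^i love ", names(freq)) selects the same elements as startsWith() here. Dividing the matching counts by their sum, hits / sum(hits), would give the relative frequency of each candidate next word if a probability is wanted rather than only the top word.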

Thanks!

r text-processing nlp n-gram prediction

heyydrien Jan 08 '17 at 20:07

No one has answered this question yet.
