How to predict the next word in a sentence using an n-gram model in R

I have pre-processed my text data into a corpus. Now I would like to build a predictive model based on the previous two words (so I think that makes it a 3-gram model?). Based on my understanding of the articles I have read, here is how I am thinking of approaching it:
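For reference, the usual maximum-likelihood estimate behind a 3-gram model can be written as below. The notation is introduced here only for illustration: $C(\cdot)$ is a count in the corpus and $w_1, w_2$ are the two words of the input phrase.

$$P(w_3 \mid w_1, w_2) = \frac{C(w_1\, w_2\, w_3)}{C(w_1\, w_2)}$$

The denominator is the same for every candidate $w_3$, so picking the 3-gram with the highest count among those that start with $w_1\, w_2$ gives the same answer as picking the highest probability.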

Step 1: enter the two-word phrase we want to predict the next word for

    # phrase our word prediction will be based on
    phrase <- "I love"

Step 2: calculate the 3-gram frequencies

    library(tm)     # DocumentTermMatrix
    library(RWeka)  # NGramTokenizer, Weka_control
    threegramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    dtm_threegram <- DocumentTermMatrix(corpus, control = list(tokenize = threegramTokenizer))
    threegram_freq <- sort(colSums(as.matrix(dtm_threegram)), decreasing = TRUE)

The next step is where I get stuck. Conceptually, I think I should subset my 3-grams to include only the three-word combinations that start with "I love". Then I should keep only the 3-gram with the highest frequency. For example, if "I love you" appeared 12 times in my corpus and "I love beer" appeared 15 times, then the probability of "beer" being the next word is higher than that of "you", so the model should return the former. Is this the right approach, and if so, how can I create something like this programmatically? My threegram_freq object appears to be of class numeric with a character attribute, and I don't quite understand what that is. Is it possible to use a regular expression to keep only the elements starting with "I love" and then extract the third word from the 3-gram with the highest frequency?
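As a minimal sketch of the idea described above: the result of `sort(colSums(...))` is a named numeric vector, so the "character attribute" is its `names()`, which hold the 3-grams, while the values hold the counts. The toy vector and the `predict_next_word` helper below are hypothetical stand-ins made up for illustration, and the code assumes the 3-grams are lowercase and separated by single spaces.

```r
# Toy stand-in for threegram_freq: names are 3-grams, values are made-up counts.
threegram_freq <- c("i love beer" = 15, "i love you" = 12,
                    "we love you" = 7, "i loved it" = 3)

# Hypothetical helper: most frequent third word following a two-word phrase.
predict_next_word <- function(phrase, freq) {
  # Assumption: 3-grams were lowercased during pre-processing,
  # so the phrase is normalised the same way.
  prefix <- paste0(tolower(trimws(phrase)), " ")
  # Literal prefix match; the trailing space stops "i love" matching "i loved".
  hits <- freq[startsWith(names(freq), prefix)]
  if (length(hits) == 0) return(NA_character_)  # phrase never seen
  best <- names(hits)[which.max(hits)]
  # Drop the prefix so that only the third word remains.
  substring(best, nchar(prefix) + 1)
}

predict_next_word("I love", threegram_freq)
```

A regular expression would also work for the filtering step, for example `grepl("^i love ", names(freq))`. `startsWith()` is used here only because it avoids having to escape special characters in the phrase.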

Thanks!


Source: https://habr.com/ru/post/1262498/

