I have preprocessed text data in a corpus. I would now like to build a prediction model based on the previous two words (so a 3-gram model, I think?). Based on my understanding of the articles I have read, here is how I am thinking of doing it:
Step 1: input the two-word phrase for which we want to predict the next word
Step 2: calculate the 3-gram frequencies
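library(tm)     # provides DocumentTermMatrix
library(RWeka)  # provides NGramTokenizer and Weka_control

# Tokenizer that emits only 3-grams
threegramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# One column per distinct 3-gram, one row per document
dtm_threegram <- DocumentTermMatrix(corpus, control = list(tokenize = threegramTokenizer))

# Total count of each 3-gram across all documents, most frequent first
threegram_freq <- sort(colSums(as.matrix(dtm_threegram)), decreasing = TRUE)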
The next step is where I get stuck. Conceptually, I think I should subset my 3-grams to include only the three-word combinations that start with "I love". Then I should keep only the 3-gram with the highest frequency. For example, if "I love you" appeared 12 times in my corpus and "I love beer" appeared 15 times, then the probability of "beer" being the next word is higher than that of "you", so the model should return "beer". Is this the right approach, and if so, how can I create something like this programmatically? My threegram_freq object appears to be of class numeric with a character attribute (the names, I assume), which I don't quite understand. Is it possible to use a regular expression to keep only the elements starting with "I love" and then extract the third word from the 3-gram with the highest frequency?
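To make the idea concrete, here is a minimal sketch of the step I have in mind. It uses a toy named numeric vector with made-up counts in place of my real threegram_freq, and predict_next_word is just a hypothetical helper name, so I am not sure this is the right way to do it. I am also assuming the 3-gram names are lower-cased.

```r
# Toy stand-in for threegram_freq: a named numeric vector.
# The names are the 3-grams, the values are made-up counts.
threegram_freq <- c("i love beer" = 15, "i love you" = 12, "we love beer" = 7)

# Hypothetical helper: return the most frequent third word after a two-word prefix.
predict_next_word <- function(freq, prefix) {
  # Keep only the 3-grams whose name starts with the prefix plus a space
  hits <- freq[startsWith(names(freq), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)  # prefix never seen
  # Name of the 3-gram with the highest count
  best <- names(hits)[which.max(hits)]
  # Drop everything up to and including the last space, leaving the third word
  sub("^.* ", "", best)
}

predict_next_word(threegram_freq, "i love")  # "beer"
```

The same filter could presumably be written with a regular expression, e.g. grepl("^i love ", names(threegram_freq)), but startsWith avoids having to escape any special characters in the prefix.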
Thanks!