Why does the ngrams() function give only distinct bigrams?

I am writing an R script and using library(ngram).

Suppose I have a string,

"good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

and want to find bigrams.

The ngram library gives me bigrams as follows:

"appreci product" "process meat" "food product" "food bought" "qualiti dog" "product found" "product look" "look like" "like stew" "good qualiti" "labrador finicki" "bought sever" "qualiti product" "better labrador" "dog food" "smell better" "vital can" "meat smell" "found good" "sever vital" "stew process" "can dog" "finicki appreci" "product better"

Since the sentence contains "dog food" twice, I want this bigram to appear twice, but I only get it once.

Is there an option in the ngram library, or in any other R library, that returns all the bigrams of my sentence, duplicates included?

+5
5 answers

You can use the stylo package, which keeps duplicates:

    library(stylo)
    a = "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
    b = txt.to.words(a)
    c = make.ngrams(b, ngram.size = 2)
    print(c)

Result:

     [1] "good qualiti" "qualiti dog" "dog food" "food bought" "bought sever" "sever vital" "vital can" "can dog" "dog food"
    [10] "food product" "product found" "found good" "good qualiti" "qualiti product" "product look" "look like" "like stew" "stew process"
    [19] "process meat" "meat smell" "smell better" "better labrador" "labrador finicki" "finicki appreci" "appreci product" "product better"
+5

The development version of ngram has a get.phrasetable method:

    devtools::install_github("wrathematics/ngram")
    library(ngram)

    text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
    ng <- ngram(text)

    head(get.phrasetable(ng))
    #            ngrams freq       prop
    # 1    good qualiti    2 0.07692308
    # 2        dog food    2 0.07692308
    # 3 appreci product    1 0.03846154
    # 4    process meat    1 0.03846154
    # 5    food product    1 0.03846154
    # 6     food bought    1 0.03846154

Alternatively, you can use the print() method and specify output = "full", i.e.:

    print(ng, output = "full")

    # NOTE: more output not shown...
    better labrador | 1
    finicki {1} |

    dog food | 2
    product {1} | bought {1}

    # NOTE: more output not shown...
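For comparison, the same kind of frequency table can be sketched in base R. This is only an illustration of what the phrase table reports, not how the ngram package computes it; `phrase_table` and the other variable names are made up for this example, and the order of rows with equal counts may differ from the package's output.

```r
# Base-R sketch of a bigram frequency table (NOT the ngram package's implementation).
txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

words <- strsplit(txt, " ", fixed = TRUE)[[1]]

# Pair each word with its successor: words 1..(n-1) with words 2..n.
bigrams <- paste(head(words, -1), tail(words, -1))

# Count each distinct bigram and sort by descending frequency.
counts <- sort(table(bigrams), decreasing = TRUE)

# Hypothetical table with the same columns as the phrase table shown above.
phrase_table <- data.frame(
  ngrams = names(counts),
  freq   = as.integer(counts),
  prop   = as.integer(counts) / length(bigrams),
  stringsAsFactors = FALSE
)

head(phrase_table)
```

Here `freq` is the number of occurrences and `prop` is that count divided by the total number of bigrams in the sentence.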
+6

You can use RWeka. In the result, "dog food" and "good qualiti" each appear twice:

    txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

    library(RWeka)
    RWEKABigramTokenizer <- function(x) {
      NGramTokenizer(x, Weka_control(min = 2, max = 2))
    }

    RWEKABigramTokenizer(txt)
     [1] "good qualiti" "qualiti dog" "dog food" "food bought" "bought sever" "sever vital" "vital can"
     [8] "can dog" "dog food" "food product" "product found" "found good" "good qualiti" "qualiti product"
    [15] "product look" "look like" "like stew" "stew process" "process meat" "meat smell" "smell better"
    [22] "better labrador" "labrador finicki" "finicki appreci" "appreci product" "product better"

Or use the tm package in conjunction with RWeka:

    library(tm)
    library(RWeka)

    my_corp <- Corpus(VectorSource(txt))
    tdm_RWEKA <- TermDocumentMatrix(my_corp, control = list(tokenize = RWEKABigramTokenizer))

    # show the 2 bigrams that occur at least twice
    findFreqTerms(tdm_RWEKA, lowfreq = 2)
    [1] "dog food" "good qualiti"

    # turn into a matrix with frequency counts
    tdm_matrix <- as.matrix(tdm_RWEKA)
+3

You do not need a special package to create bigrams like these. Split the text into words, then paste each word together with the one that follows it.

    t <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
    ug <- strsplit(t, " ")[[1]]
    # Pair words 1..(n-1) with words 2..n. Both vectors must have the same length,
    # otherwise paste() recycles the shorter one and appends a bogus last bigram.
    bg <- paste(ug[-length(ug)], ug[-1])

The bg will be:

     [1] "good qualiti" "qualiti dog" "dog food"
     [4] "food bought" "bought sever" "sever vital"
     [7] "vital can" "can dog" "dog food"
    [10] "food product" "product found" "found good"
    [13] "good qualiti" "qualiti product" "product look"
    [16] "look like" "like stew" "stew process"
    [19] "process meat" "meat smell" "smell better"
    [22] "better labrador" "labrador finicki" "finicki appreci"
    [25] "appreci product" "product better"
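The same split-and-paste idea extends to any n. Below is a minimal sketch under that assumption; `make_ngrams_base` is a hypothetical helper name chosen for this example and does not come from any package.

```r
# Hypothetical helper (not from any package): all n-grams of a word vector,
# in order of appearance, with duplicates kept.
make_ngrams_base <- function(words, n = 2L) {
  # Fewer than n words means there is no n-gram at all.
  if (length(words) < n) return(character(0))
  # Each n-gram starts at position i and covers words i..(i + n - 1).
  starts <- seq_len(length(words) - n + 1L)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1L)], collapse = " "),
    character(1)
  )
}

txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
words <- strsplit(txt, " ", fixed = TRUE)[[1]]

bigrams  <- make_ngrams_base(words, 2L)
trigrams <- make_ngrams_base(words, 3L)

# Number of times "dog food" occurs among the bigrams.
sum(bigrams == "dog food")
```

Because the helper walks over start positions instead of pasting two shifted vectors, it cannot run into the recycling problem that mismatched vector lengths cause with paste().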
+3

Try the quanteda package:

    > quanteda::tokenize(txt, ngrams = 2, concatenator = " ")
    [[1]]
     [1] "good qualiti" "qualiti dog" "dog food" "food bought" "bought sever" "sever vital"
     [7] "vital can" "can dog" "dog food" "food product" "product found" "found good"
    [13] "good qualiti" "qualiti product" "product look" "look like" "like stew" "stew process"
    [19] "process meat" "meat smell" "smell better" "better labrador" "labrador finicki" "finicki appreci"
    [25] "appreci product" "product better"

Many additional options are available through the ngrams argument, including combining several n sizes or allowing skips.
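To make the idea of skips concrete, here is a small base-R sketch. It only illustrates the concept of pairing each word with one further along; it is not quanteda's implementation, `skip_bigrams` is a hypothetical helper name, and quanteda may order its results differently.

```r
# Hypothetical helper: bigrams that pair each word with the word
# `skip + 1` positions later (skip = 0 gives ordinary adjacent bigrams).
skip_bigrams <- function(words, skip = 0L) {
  gap <- skip + 1L
  # Not enough words to span the gap: no bigrams.
  if (length(words) <= gap) return(character(0))
  paste(head(words, -gap), tail(words, -gap))
}

words <- c("dog", "food", "product", "found")

adjacent <- skip_bigrams(words, 0L)  # adjacent pairs
skipped  <- skip_bigrams(words, 1L)  # pairs with one word skipped in between

# Combine several skip values into one vector.
combined <- unlist(lapply(0:1, function(s) skip_bigrams(words, s)))
print(combined)
```

With the four example words, skip = 0 yields three adjacent pairs and skip = 1 yields two pairs ("dog product" and "food found").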

+1

Source: https://habr.com/ru/post/1232553/

