Creating N-grams with tm & RWeka - works with VCorpus, but not with Corpus

Following the numerous tutorials on creating bigrams with the tm and RWeka packages, I was frustrated that only 1-grams were returned in the TDM. After a lot of trial and error, I found that it works correctly with 'VCorpus', but not with 'Corpus'. By the way, I am fairly sure this worked with 'Corpus' about a month ago, but it no longer does.

R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest version.

I would appreciate any insight into why this does not work with 'Corpus', and whether others have the same problem.

    # A reproducible example
    #
    # Weka bi-gram test
    #
    library(tm)
    library(RWeka)

    someCleanText <- c("Congress shall make no law respecting an establishment of",
                       "religion, or prohibiting the free exercise thereof or",
                       "abridging the freedom of speech or of the press or the",
                       "right of the people peaceably to assemble and to petition",
                       "the Government for a redress of grievances")

    aCorpus <- Corpus(VectorSource(someCleanText))    # With this, only 1-grams are created
    #aCorpus <- VCorpus(VectorSource(someCleanText))  # With this, bigrams are created as desired

    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))

    aTDM <- TermDocumentMatrix(aCorpus, control=list(tokenize=BigramTokenizer))

    print(aTDM$dimnames$Terms)

Result with 'Corpus'

     [1] "congress" "establishment" "law" "make"
     [5] "respecting" "shall" "exercise" "free"
     [9] "prohibiting" "religion" "the" "thereof"
    [13] "abridging" "freedom" "press" "speech"
    [17] "and" "assemble" "peaceably" "people"
    [21] "petition" "right" "for" "government"
    [25] "grievances" "redress"

Result with 'VCorpus'

     [1] "a redress" "abridging the" "an establishment" "and to"
     [5] "assemble and" "congress shall" "establishment of" "exercise thereof"
     [9] "for a" "free exercise" "freedom of" "government for"
    [13] "law respecting" "make no" "no law" "of grievances"
    [17] "of speech" "of the" "or of" "or prohibiting"
    [21] "or the" "peaceably to" "people peaceably" "press or"
    [25] "prohibiting the" "redress of" "religion or" "respecting an"
    [29] "right of" "shall make" "speech or" "the free"
    [33] "the freedom" "the government" "the people" "the press"
    [37] "thereof or" "to assemble" "to petition"
1 answer

I was working with R 3.4.1 and switched to R 3.3.3; the VCorpus solution now works for me. tm and RWeka both create bigrams correctly.

    sessionInfo()
    R version 3.3.3 (2017-03-06)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows >= 8 x64 (build 9200)
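
A possible explanation, based on my reading of the tm documentation rather than anything confirmed in this thread: from tm 0.7 on, Corpus() applied to a VectorSource may return a SimpleCorpus, and TermDocumentMatrix() on a SimpleCorpus is documented as ignoring custom tokenizer functions. The sketch below prints the class that each constructor returns so you can check this on your own installation, then builds bigrams from the VCorpus. The tokenizer here follows the NLP-based one from the tm FAQ instead of RWeka's NGramTokenizer, so it needs no Java; the names someText, plainCorpus and vc are placeholders of mine.

```r
library(tm)  # attaches NLP, which provides words() and ngrams()

someText <- c("Congress shall make no law respecting an establishment of",
              "religion, or prohibiting the free exercise thereof or")

# Inspect which class each constructor actually returns on this installation
plainCorpus <- Corpus(VectorSource(someText))
vc <- VCorpus(VectorSource(someText))
print(class(plainCorpus))
print(class(vc))

# Bigram tokenizer in the style of the tm FAQ (swapped in for RWeka here)
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

# A VCorpus passes each document through the custom tokenizer
tdm <- TermDocumentMatrix(vc, control = list(tokenize = BigramTokenizer))
print(Terms(tdm))
```

If the first print() shows a class other than VCorpus, that would be consistent with the 1-gram output in the question, and calling VCorpus() explicitly is the simple way around it.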

Source: https://habr.com/ru/post/1265367/

