Quickly matching dictionary words against a text vector in R

I have a very long vector of short texts in R (for example, a length of 10 million). The first five items of the list are as follows:

  • "I am an angry tiger."
  • "I am unhappy clam."
  • "I am an angry and unhappy tiger."
  • "I am an angry, angry, tiger."
  • "Beep boop."

I have a dictionary that, let's say, consists of the words "angry" and "unhappy".

What is the fastest way to count, for each text, the number of matches from this dictionary? In this case, the correct answer would be the vector [1, 1, 2, 2, 0].

I tried solutions involving quanteda and tm, and basically all of them fail because I cannot store such a large document-feature matrix in memory. Bonus points for any solution using qdap, dplyr, or termco.

2 answers

Using the stringi package:

library(stringi)
stri_count_regex(v1, paste(v2, collapse = '|'))
#[1] 1 1 2 2 0
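If the dictionary should match whole words only, the pattern can be wrapped in word boundaries. This is a sketch assuming the dictionary entries are plain words; entries containing regex metacharacters would additionally need `\\Q...\\E` quoting before being pasted together:

```r
library(stringi)

v1 <- c("I am an angry tiger.", "I am unhappy clam.",
        "I am an angry and unhappy tiger.",
        "I am an angry, angry, tiger.", "Beep boop.")
v2 <- c("angry", "unhappy")

# \b anchors keep "angry" from also counting inside words like "angrily"
pattern <- paste0("\\b(", paste(v2, collapse = "|"), ")\\b")
stri_count_regex(v1, pattern)
#[1] 1 1 2 2 0
```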

DATA

dput(v1)
c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.", 
"I am an angry, angry, tiger.", "Beep boop.")
dput(v2)
c("angry", "unhappy")

We can use base R methods with gregexpr and Reduce:

Reduce(`+`, lapply(dict, function(x) lengths(regmatches(txt, gregexpr(x, txt)))))
#[1] 1 1 2 2 0

Or a faster approach would be:

Reduce(`+`, lapply(dict, function(x) vapply(gregexpr(x, txt),
          function(y) sum(attr(y, "match.length")>0), 0)))
#[1] 1 1 2 2 0
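Since the dictionary entries here are literal strings rather than regular expressions, passing `fixed = TRUE` to gregexpr skips the regex engine entirely, which is usually faster for plain-word dictionaries. A sketch of that variant:

```r
txt <- c("I am an angry tiger.", "I am unhappy clam.",
         "I am an angry and unhappy tiger.",
         "I am an angry, angry, tiger.", "Beep boop.")
dict <- c("angry", "unhappy")

# fixed = TRUE treats each dictionary entry as a literal string,
# so metacharacters in entries need no escaping
Reduce(`+`, lapply(dict, function(x)
  vapply(gregexpr(x, txt, fixed = TRUE),
         function(y) sum(attr(y, "match.length") > 0), 0)))
#[1] 1 1 2 2 0
```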

Note: this approach never materializes a document-feature matrix, so even with large datasets and a large number of dictionary entries it does not run into memory limitations.
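For a vector of 10 million texts, the counting can also be done in blocks so that the intermediate match lists for only one block are alive at a time. A hypothetical helper sketching this (the name `count_matches` and the default `chunk_size` are arbitrary choices, not from the answer above):

```r
# Count dictionary matches per text, processing the vector in blocks
count_matches <- function(txt, dict, chunk_size = 1e5L) {
  chunks <- split(seq_along(txt), ceiling(seq_along(txt) / chunk_size))
  unlist(lapply(chunks, function(i)
    Reduce(`+`, lapply(dict, function(x)
      lengths(regmatches(txt[i], gregexpr(x, txt[i])))))),
    use.names = FALSE)
}

count_matches(c("I am an angry tiger.", "Beep boop."), c("angry", "unhappy"))
#[1] 1 0
```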

data

txt <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.", 
          "I am an angry, angry, tiger." ,"Beep boop.") 
dict <- c("angry", "unhappy")

Source: https://habr.com/ru/post/1665499/
