Quickly matching dictionary words against a text vector in R

I have a very long vector of short texts in R (for example, a length of 10 million). The first five items of the list are as follows:

  • "I am an angry tiger."
  • "I am unhappy clam."
  • "I am an angry and unhappy tiger."
  • "I am an angry, angry, tiger."
  • "Beep boop."

I have a dictionary that, let's say, consists of the words "angry" and "unhappy".

What is the fastest way to count, for each text, the number of matches from this dictionary? In this case, the correct answer would be the vector [1, 1, 2, 2, 0].

I tried solutions involving quanteda and tm, and basically all of them fail because I cannot store such a large document-feature matrix in memory. Bonus points for any solution using qdap, dplyr, or termco.

2 answers

Using the stringi package:

library(stringi)
stri_count_regex(v1, paste(v2, collapse = '|'))
#[1] 1 1 2 2 0
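If the dictionary should match whole words only, the pattern can be wrapped in word boundaries. This is a sketch assuming the dictionary entries are plain words; entries containing regex metacharacters would additionally need `\\Q...\\E` quoting before being pasted together:

```r
library(stringi)

v1 <- c("I am an angry tiger.", "I am unhappy clam.",
        "I am an angry and unhappy tiger.",
        "I am an angry, angry, tiger.", "Beep boop.")
v2 <- c("angry", "unhappy")

# \b anchors keep "angry" from also counting inside words like "angrily"
pattern <- paste0("\\b(", paste(v2, collapse = "|"), ")\\b")
stri_count_regex(v1, pattern)
#[1] 1 1 2 2 0
```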

DATA

dput(v1)
c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.", 
"I am an angry, angry, tiger.", "Beep boop.")
dput(v2)
c("angry", "unhappy")

We can use base R methods with gregexpr and Reduce:

Reduce(`+`, lapply(dict, function(x) lengths(regmatches(txt, gregexpr(x, txt)))))
#[1] 1 1 2 2 0

Or a faster approach would be:

Reduce(`+`, lapply(dict, function(x) vapply(gregexpr(x, txt),
          function(y) sum(attr(y, "match.length")>0), 0)))
#[1] 1 1 2 2 0
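Since the dictionary entries here are literal strings rather than regular expressions, passing `fixed = TRUE` to gregexpr skips the regex engine entirely, which is usually faster for plain-word dictionaries. A sketch of that variant:

```r
txt <- c("I am an angry tiger.", "I am unhappy clam.",
         "I am an angry and unhappy tiger.",
         "I am an angry, angry, tiger.", "Beep boop.")
dict <- c("angry", "unhappy")

# fixed = TRUE treats each dictionary entry as a literal string,
# so metacharacters in entries need no escaping
Reduce(`+`, lapply(dict, function(x)
  vapply(gregexpr(x, txt, fixed = TRUE),
         function(y) sum(attr(y, "match.length") > 0), 0)))
#[1] 1 1 2 2 0
```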

Note: this approach never materializes a document-feature matrix, so even with large datasets and a large number of dictionary entries it does not run into memory limitations.
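For a vector of 10 million texts, the counting can also be done in blocks so that the intermediate match lists for only one block are alive at a time. A hypothetical helper sketching this (the name `count_matches` and the default `chunk_size` are arbitrary choices, not from the answer above):

```r
# Count dictionary matches per text, processing the vector in blocks
count_matches <- function(txt, dict, chunk_size = 1e5L) {
  chunks <- split(seq_along(txt), ceiling(seq_along(txt) / chunk_size))
  unlist(lapply(chunks, function(i)
    Reduce(`+`, lapply(dict, function(x)
      lengths(regmatches(txt[i], gregexpr(x, txt[i])))))),
    use.names = FALSE)
}

count_matches(c("I am an angry tiger.", "Beep boop."), c("angry", "unhappy"))
#[1] 1 0
```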

data

txt <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.", 
          "I am an angry, angry, tiger." ,"Beep boop.") 
dict <- c("angry", "unhappy")

Source: https://habr.com/ru/post/1665499/
