Substring + get the words around the keyword

If I have a line:

moon <- "The cow jumped over the moon with a silver plate in its mouth" 

Is there a way to extract the words in the neighborhood of "moon"? The neighborhood may be 2 or 3 words on either side of "moon".

So if my line is:

 "The cow jumped over the moon with a silver plate in its mouth" 

I want my output to be only:

 "jumped over the moon with a silver" 

I know that I can use str_locate if I want to extract characters, but I'm not sure how to do this with words. Can this be done in R?

Thanks and Regards, Saymak

3 answers

Use strsplit:

 str <- "The cow jumped over the moon with a silver plate in its mouth"
 x <- strsplit(str, " ")[[1]]
 i <- which(x == "moon")
 paste(x[seq(max(1, i - 2), min(i + 2, length(x)))], collapse = " ")
 # [1] "over the moon with a"
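A note on edge cases: `which(x == "moon")` returns every index at which the word occurs, so the `paste` call above assumes a single occurrence. A small sketch that handles zero or multiple occurrences (the function name `words_around` is my own, not from the answer):

```r
# Sketch: return the n-word neighborhood for every occurrence of a keyword.
words_around <- function(str, keyword, n = 2) {
  x <- strsplit(str, " ")[[1]]
  idx <- which(x == keyword)   # all positions of the keyword (may be none)
  sapply(idx, function(i) {
    # clip the window to the start/end of the sentence
    paste(x[max(1, i - n):min(i + n, length(x))], collapse = " ")
  })
}

str <- "The cow jumped over the moon with a silver plate in its mouth"
words_around(str, "moon", 2)
# [1] "over the moon with a"
words_around(str, "cow", 2)   # keyword near the start: window clipped
# [1] "The cow jumped over"
```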

Here's how I would do it:

 keyword <- "moon"
 lookaround <- 2
 pattern <- paste0("([[:alpha:]]+ ){0,", lookaround, "}", keyword,
                   "( [[:alpha:]]+){0,", lookaround, "}")
 regmatches(str, regexpr(pattern, str))[[1]]
 # [1] "over the moon with a"

Idea: match a run of word characters followed by a space, repeated between 0 and "lookaround" (here 2) times, then the "keyword" (here "moon"), then a space followed by a run of word characters, again repeated between 0 and "lookaround" times. The regexpr function gives the start and end of this pattern; regmatches, which wraps it, then extracts the substring at those positions.

Note: regexpr can be replaced with gregexpr if you want to find more than one occurrence of the same pattern.
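For instance, a sketch of the gregexpr variant on a string with two occurrences; the `\\b` word boundaries are my addition, so that the keyword does not match inside longer words such as "moonlight":

```r
str2 <- "The moon rose as the cow jumped over the moon again"
keyword <- "moon"
lookaround <- 2
# \\b anchors keep the keyword from matching inside longer words (my addition)
pattern <- paste0("([[:alpha:]]+ ){0,", lookaround, "}\\b", keyword,
                  "\\b( [[:alpha:]]+){0,", lookaround, "}")
regmatches(str2, gregexpr(pattern, str2))[[1]]
# [1] "The moon rose as"    "over the moon again"
```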


Here's a benchmark on larger data comparing the strsplit answer (hong) with this one (arun):

 str <- "The cow jumped over the moon with a silver plate in its mouth"
 ll <- rep(str, 1e5)
 hong <- function(str) {
   str <- strsplit(str, " ")
   sapply(str, function(y) {
     i <- which(y == "moon")
     paste(y[seq(max(1, i - 2), min(i + 2, length(y)))], collapse = " ")
   })
 }
 arun <- function(str) {
   keyword <- "moon"
   lookaround <- 2
   pattern <- paste0("([[:alpha:]]+ ){0,", lookaround, "}", keyword,
                     "( [[:alpha:]]+){0,", lookaround, "}")
   regmatches(str, regexpr(pattern, str))
 }
 require(microbenchmark)
 microbenchmark(t1 <- hong(ll), t2 <- arun(ll), times = 10)
 # Unit: seconds
 #            expr      min       lq   median       uq      max neval
 #  t1 <- hong(ll) 6.172986 6.384981 6.478317 6.654690 7.193329    10
 #  t2 <- arun(ll) 1.175950 1.192455 1.200674 1.227279 1.326755    10
 identical(t1, t2)
 # [1] TRUE

Here's the tm package approach (when all you have is a hammer...):

 moon <- "The cow jumped over the moon with a silver plate in its mouth"
 require(tm)
 my.corpus <- Corpus(VectorSource(moon))

 # Tokenizer for n-grams, passed on to the term-document matrix constructor
 library(RWeka)
 neighborhood <- 3                       # words on either side of the word of interest
 neighborhood1 <- 1 + neighborhood * 2   # the keyword plus its neighbors
 ngramTokenizer <- function(x)
   NGramTokenizer(x, Weka_control(min = neighborhood1, max = neighborhood1))
 dtm <- TermDocumentMatrix(my.corpus, control = list(tokenize = ngramTokenizer))
 inspect(dtm)

 # find ngrams that contain the word of interest
 word <- "moon"
 subset_ngrams <- dtm$dimnames$Terms[grep(word, dtm$dimnames$Terms)]

 # keep only ngrams with the word of interest in the middle. This
 # removes duplicates and shows what sits on either side of the word
 subset_ngrams <- subset_ngrams[sapply(subset_ngrams, function(i) {
   tmp <- unlist(strsplit(i, split = " "))
   tmp[neighborhood + 1] == word
 })]

 # inspect output
 subset_ngrams
 # [1] "jumped over the moon with a silver"

Source: https://habr.com/ru/post/1494609/

