Extract a sample of words around a specific word using stringr in R

I saw a couple of similar questions posted on SO on this topic, but they seem to be spelled incorrectly ( example ) or in another language ( example ).

In my script, I consider everything that is surrounded by white space to be a word. Emoticons, numbers, lines of letters that are not really words, I don't care. I just want to get some context around the found line without having to read the entire file to find out if it really matches.

I tried using the following, but it takes a long time to start if you have a long text file:

text <- "He served both as Attorney General and Lord Chancellor of England. After his death, he remained extremely influential through his works, especially as philosophical advocate and practitioner of the scientific method during the scientific revolution. Bacon has been called the father of empiricism.[6] His works argued for the possibility of scientific knowledge based only upon inductive and careful observation of events in nature. Most importantly, he argued this could be achieved by use of a skeptical and methodical approach whereby scientists aim to avoid misleading themselves. While his own practical ideas about such a method, the Baconian method, did not have a long lasting influence, the general idea of the importance and possibility of a skeptical methodology makes Bacon the father of scientific method. This marked a new turn in the rhetorical and theoretical framework for science, the practical details of which are still central in debates about science and methodology today. Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621;[3][b] as he died without heirs, both titles became extinct upon his death. Bacon died of pneumonia in 1626, with one account by John Aubrey stating he contracted the condition while studying the effects of freezing on the preservation of meat."

stringr::str_extract(text, "(.*?\\s){1,10}Verulam(\\s.*?){1,10}")

I guess there is a much, much faster / more efficient way to do this, right?

+4
source share
3 answers

Try the following:

stringr::str_extract(text, "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}")
# alternately, if you like " " more than \\s:
# stringr::str_extract(text, "(?:[^ ]+ ){3}Verulam(?: [^ ]+){3}")

#[1] "and created Baron Verulam in 1618[4] and"

Change the inside number {}to suit your needs.

You can use non-capture groups (?:)as well, although I'm still not sure if the speed will improve.

stringr::str_extract(text, "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}")
+3
source

I would use unlist(strsplit)and then index the resulting vector. You can make this a function so that the number of words to extract pre and post is a flexible parameter:

getContext <- function(text, look_for, pre = 3, post=pre) {
  # create vector of words (anything separated by a space)
  t_vec <- unlist(strsplit(text, '\\s'))

  # find position of matches
  matches <- which(t_vec==look_for)

  # return words before & after if any matches
  if(length(matches) > 0) {
    out <- 
      list(before = ifelse(m-pre < 1, NA, 
                           sapply(matches, function(m) t_vec[(m - pre):(m - 1)])), 
           after = sapply(matches, function(m) t_vec[(m + 1):(m + post)]))

    return(out)
  } else {
    warning('No matches')
  }
}

Works for one match.

getContext(text, 'Verulam')

# $before
#      [,1]     
# [1,] "and"    
# [2,] "created"
# [3,] "Baron"  
# 
# $after
#      [,1]     
# [1,] "in"     
# [2,] "1618[4]"
# [3,] "and"   

Also works if there is more than one match.

getContext(text, 'he')

# $before
#      [,1]     [,2]           [,3]          [,4]     
# [1,] "After"  "nature."      "in"          "John"   
# [2,] "his"    "Most"         "1621;[3][b]" "Aubrey" 
# [3,] "death," "importantly," "as"          "stating"
# 
# $after
#      [,1]          [,2]     [,3]      [,4]        
# [1,] "remained"    "argued" "died"    "contracted"
# [2,] "extremely"   "this"   "without" "the"       
# [3,] "influential" "could"  "heirs,"  "condition" 

getContext(text, 'fruitloops')
# Warning message:
#   In getContext(text, "fruitloops") : No matches
+3
source

, , data.frame, R.

context <- function(text){
  splittedText <- strsplit(text, ' ', T)[[1]]
  print(splittedText)

  data.frame(
    words  = splittedText,
    before = head(c('', splittedText), -1), 
    after  = tail(c(splittedText, ''), -1)
  )
}

Much cleaner IMO:

info <- context(text)

print(subset(info, words == 'Verulam'))

print(subset(info, before == 'Lord'))

print(subset(info, grepl('[[:digit:]]', words)))

#       words before #after
# 161 Verulam  Baron    in
#        words before after
# 9 Chancellor   Lord    of
#             words before after
# 43  empiricism.[6]     of   His
# 157           1603     in   and
# 163        1618[4]     in   and
# 169    1621;[3][b]     in    as
# 187          1626,     in  with
0
source

Source: https://habr.com/ru/post/1621153/


All Articles