How can I improve the performance of summing word scores over a large vector of strings?

I have character strings that look like this:

 [1] "What can we learn from the Mahabharata "                                                                
 [2] "What are the most iconic songs associated with the Vietnam War "                                        
 [3] "What are some major social faux pas to avoid when visiting Malta "                                      
 [4] "Will Ready Boost technology contribute to CFD software usage "                                          
 [5] "Who is Jon Snow " ...

and a data frame that assigns each word a score:

   word score
   the    11
    to     9
  What     9
     I     7
     a     6
   are     6

I want to assign to each of my strings the sum of the scores of the words it contains. My current solution is the following function:

 score_fun <- function(x) {
   # obtain the vector of words in the string
   z <- unlist(strsplit(x, ' '))
   # return the sum of the scores of the words found in it
   sum(word_scores$score[word_scores$word %in% z])
 }

 # apply the function over all strings with sapply()
 scores <- sapply(my_strings, score_fun, USE.NAMES = FALSE)

 # the output will look like 
 scores
 [1] 20 26 24  9  0  0 38 32 30  0

The problem I ran into is performance: I have about 500 thousand strings and more than a million words, and the function takes over an hour on my i7, 16 GB machine. Besides that, the solution just feels inelegant and awkward.

Is there a better (more efficient) solution?

Toy data:

 my_strings <- c("What can we learn from the Mahabharata ",
                 "What are the most iconic songs associated with the Vietnam War ",
                 "What are some major social faux pas to avoid when visiting Malta ",
                 "Will Ready Boost technology contribute to CFD software usage ",
                 "Who is Jon Snow ",
                 "Do weighing scales measure mass or weight ",
                 "What will happen to the money in foreign banks after demonetizing 500 and 1000 rupee notes ",
                 "Is it mandatory to stay for 11 months in a rented house if the rental agreement was made for 11 months ",
                 "What are some really good positive comments to say on a cricket field to your teammates ",
                 "Is Donald Trump fact free ")

 word_scores <- data.frame(word = c("the", "to", "What", "I", "a", "are", "in", "of", "and", "do"),
                           score = c(11L, 9L, 9L, 7L, 6L, 6L, 6L, 6L, 3L, 3L),
                           stringsAsFactors = FALSE)

tidytext::unnest_tokens makes this straightforward:

library(tidyverse)
library(tidytext)

data_frame(string = my_strings, id = seq_along(string)) %>% 
    unnest_tokens(word, string, 'words', to_lower = FALSE) %>%  # one row per (id, word) pair
    distinct() %>%                                # count each word at most once per string
    left_join(word_scores) %>%                    # attach scores, joining by "word"
    group_by(id) %>%
    summarise(score = sum(score, na.rm = TRUE))   # unmatched words are NA and drop out of the sum

#> # A tibble: 10 × 2
#>       id score
#>    <int> <int>
#> 1      1    20
#> 2      2    26
#> 3      3    24
#> 4      4     9
#> 5      5     0
#> 6      6     0
#> 7      7    38
#> 8      8    32
#> 9      9    30
#> 10    10     0

Note that distinct() counts each word at most once per string, matching the %in% semantics of score_fun, and left_join() keeps every id, so strings with no scored words still get a 0 thanks to na.rm = TRUE.

It also scales well. Benchmarking the three approaches in this thread with my_strings replicated to 10,000 strings:

Unit: milliseconds
     expr        min         lq      mean    median        uq       max neval
   Reduce 5440.03300 5656.41350 5815.2094 5814.0406 5944.9969 6206.2502   100
   sapply  460.75930  486.94336  511.2762  503.4932  532.2363  746.8376   100
 tidytext   86.92182   94.65745  101.7064  100.1487  107.3289  134.7276   100
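The benchmarking code itself is not shown in the thread; here is a minimal sketch that could produce a comparable table, assuming score_fun from the question is defined (the Reduce/aggregate expression from the answer below can be added as a third entry in the same way):

library(microbenchmark)

big_strings <- rep(my_strings, 1000)  # replicate the toy data to 10,000 strings

microbenchmark(
    sapply   = sapply(big_strings, score_fun, USE.NAMES = FALSE),
    tidytext = data_frame(string = big_strings, id = seq_along(string)) %>%
        unnest_tokens(word, string, 'words', to_lower = FALSE) %>%
        distinct() %>%
        left_join(word_scores, by = "word") %>%
        group_by(id) %>%
        summarise(score = sum(score, na.rm = TRUE)),
    times = 100
)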

An alternative base-R approach: split the strings into a long data frame of (id, word) pairs, merge it with word_scores, and aggregate the score by id. As the benchmarks below show, though, this is slower than sapply on the toy data.

 # one vector of unique words per string
 list_strings <- lapply(my_strings, function(i) unique(unlist(strsplit(i, split=" "))))

 # a parallel list of string ids, one id per word
 ids <- lapply(seq(length(list_strings)), function(i) rep(i, length(list_strings[[i]])))

 # long data frame of (id, word) pairs; merge attaches the scores, aggregate sums them by id
 phrases_df <- data.frame(id=Reduce(append, ids), word=Reduce(append, list_strings))
 aggdf <- aggregate(score~id, merge(phrases_df, word_scores, by="word"), FUN=sum)

aggdf 
#   id score
# 1  1    20
# 2  2    26
# 3  3    24
# 4  4     9
# 5  7    38
# 6  8    32
# 7  9    30
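Note that ids 5, 6, and 10 are missing from aggdf: merge() performs an inner join, so strings containing none of the scored words drop out entirely, whereas score_fun returns 0 for them. A minimal sketch for restoring the zero-score entries, using the objects defined above:

 scores_full <- rep(0L, length(my_strings))
 scores_full[aggdf$id] <- aggdf$score
 scores_full
 # [1] 20 26 24  9  0  0 38 32 30  0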

According to microbenchmark, this is actually slower than the sapply approach on the toy data; note the units (1 millisecond = 1000 microseconds).

library(microbenchmark)

microbenchmark({
   list_strings <- lapply(my_strings, function(i) unique(unlist(strsplit(i, split=" "))))

   ids <- lapply(seq(length(list_strings)), function(i) rep(i, length(list_strings[[i]])))

   phrases_df <- data.frame(id=Reduce(append, ids), word=Reduce(append, list_strings))      
   aggdf <- aggregate(score~id, merge(phrases_df, word_scores, by="word"), FUN=sum)

})

# Unit: milliseconds
#      min       lq     mean   median       uq      max neval
# 5.623328 5.808831 6.177336 5.964018 6.252019 10.09706   100

microbenchmark({
  score_fun<- function(x) {
     z <- unlist(strsplit(x,' '))
     return(sum(word_scores$score[word_scores$word %in% z]))
  } 
  scores <- sapply(my_strings, score_fun, USE.NAMES = F)
})

# Unit: microseconds
#       min      lq     mean  median       uq     max neval
# 809.382 843.307 1005.366 865.442 1209.983 1873.32   100
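As an aside, Reduce(append, ids) concatenates pairwise and copies the growing vector on every step; base R can build the long data frame directly with rep() and unlist(). A hedged alternative sketch that should produce the same phrases_df and aggdf as above:

 list_strings <- lapply(my_strings, function(i) unique(unlist(strsplit(i, split=" "))))
 phrases_df <- data.frame(id   = rep(seq_along(list_strings), lengths(list_strings)),
                          word = unlist(list_strings))
 aggdf <- aggregate(score~id, merge(phrases_df, word_scores, by="word"), FUN=sum)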

Source: https://habr.com/ru/post/1675422/

