I have character strings that look like this:
[1] "What can we learn from the Mahabharata "
[2] "What are the most iconic songs associated with the Vietnam War "
[3] "What are some major social faux pas to avoid when visiting Malta "
[4] "Will Ready Boost technology contribute to CFD software usage "
[5] "Who is Jon Snow " ...
and a data frame that assigns each word a score:
word score
the 11
to 9
What 9
I 7
a 6
are 6
I want to assign each of my lines the sum of the scores of the words it contains. My current solution is the following function:
score_fun <- function(x) {
  # split the line on spaces, then sum the scores of the table words that occur in it
  z <- unlist(strsplit(x, " "))
  sum(word_scores$score[word_scores$word %in% z])
}
scores <- sapply(my_strings, score_fun, USE.NAMES = FALSE)
scores
[1] 20 26 24 9 0 0 38 32 30 0
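To make the scoring rule concrete, here is a small sketch of what the function above does for a single line. The names `demo_scores`, `demo_words` and `demo_total` and the sample sentence are made up for this illustration. It shows two properties of the `%in%` approach: a word that occurs twice in a line is counted only once, and matching is case-sensitive, which is why "Do weighing scales ..." scores 0 against the table entry "do".

```r
# Hypothetical two-row score table, only for this illustration
demo_scores <- data.frame(word = c("to", "do"), score = c(9L, 3L),
                          stringsAsFactors = FALSE)

# "to" appears twice, "Do" appears capitalised
demo_words <- unlist(strsplit("Do not go to and fro to ", " "))

# %in% asks, for each table word, whether it occurs in the line at all,
# so "to" contributes 9 once and "do" does not match "Do"
demo_total <- sum(demo_scores$score[demo_scores$word %in% demo_words])
demo_total
# [1] 9
```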
The problem I ran into is performance: I have about 500 thousand lines and more than a million words, and the function takes more than an hour on my i7 machine with 16 GB of RAM. In addition, the solution just feels inelegant and awkward.
Is there a better (more efficient) solution?
Data to play with:
my_strings <- c("What can we learn from the Mahabharata ", "What are the most iconic songs associated with the Vietnam War ",
"What are some major social faux pas to avoid when visiting Malta ",
"Will Ready Boost technology contribute to CFD software usage ",
"Who is Jon Snow ", "Do weighing scales measure mass or weight ",
"What will happen to the money in foreign banks after demonetizing 500 and 1000 rupee notes ",
"Is it mandatory to stay for 11 months in a rented house if the rental agreement was made for 11 months ",
"What are some really good positive comments to say on a cricket field to your teammates ",
"Is Donald Trump fact free ")
word_scores <- data.frame(word = c("the", "to", "What", "I", "a", "are", "in", "of", "and", "do"),
                          score = c(11L, 9L, 9L, 7L, 6L, 6L, 6L, 6L, 3L, 3L),
                          stringsAsFactors = FALSE)
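On the performance question, one direction that might help is to avoid scanning the whole score table once per line. The sketch below flattens all tokens into one vector, looks them up with a single `match()` call, and then sums per line with `rowsum()`. It has not been benchmarked on the full data, and it assumes each word appears only once in the score table. The names `strings`, `scores_df`, `line_id` and `totals` are hypothetical stand-ins so that the snippet runs on its own.

```r
# Small stand-in inputs (assumption: same shape as my_strings / word_scores)
strings <- c("What can we learn from the Mahabharata ",
             "Who is Jon Snow ",
             "Will Ready Boost technology contribute to CFD software usage ")
scores_df <- data.frame(word = c("the", "to", "What"),
                        score = c(11L, 9L, 9L),
                        stringsAsFactors = FALSE)

# Split every line once; fixed = TRUE treats " " as a literal, not a regex.
# unique() keeps the "each word counts once per line" behaviour of %in%.
tokens <- lapply(strsplit(strings, " ", fixed = TRUE), unique)

# Flatten to one long vector and remember which line each token came from
line_id <- rep(seq_along(tokens), lengths(tokens))
words   <- unlist(tokens)

# A single lookup for all tokens; NA means the word has no score
idx <- match(words, scores_df$word)
hit <- !is.na(idx)

# Sum the matched scores per line; lines without any match stay at 0
totals <- numeric(length(strings))
sums   <- rowsum(scores_df$score[idx[hit]], line_id[hit])
totals[as.integer(rownames(sums))] <- sums[, 1]
totals
# [1] 20  0  9
```

The design choice here is to pay for the table lookup once over all tokens instead of once per line. Whether that is fast enough at 500 thousand lines would need to be measured.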