How to replace a string with a number based on a position in a data frame?

I have a character vector in the following format:

    strings <- c("UUDBK", "KUVEB", "YVCYE")

I also have a data frame like this:

    replacewith <- c(8, 4, 2)
    searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")
    dataframe <- data.frame(replacewith, searchhere)

I want each element of the vector replaced with the value in the "replacewith" column of the row whose "searchhere" entry contains it. I am currently doing this:

    final <- sapply(as.character(strings), function(x) as.numeric(dataframe[grep(x, dataframe$searchhere), 1]))

However, this is far too slow for a character vector of length 10^9.

What is the best way to do this?

Thanks!

2 answers

This is similar to @AntoniosK's idea, but uses the hashmap package to map strings to their values. hashmap is implemented internally with Rcpp, so it is very fast:

    library(hashmap)
    library(tidyr)

    search_replace = separate_rows(dataframe, searchhere)
    search_hash = hashmap(search_replace[,2], search_replace[,1])
    search_hash[[strings]]

Results:

    > search_hash
    ## (character) => (numeric)
    ##     [KHUDN] => [+2.000000]
    ##     [KUEBN] => [+2.000000]
    ##     [UGEVB] => [+4.000000]
    ##     [KUVEB] => [+4.000000]
    ##     [IYVEK] => [+8.000000]
    ##     [IHVYV] => [+8.000000]
    ##       [...] => [...]
    > search_hash[[strings]]
    [1] 8 4 8

Benchmarks:

    > OP_func = function(){sapply(as.character(strings), function(x) as.numeric(dataframe[grep(x,dataframe$searchhere), 1]))}

    Unit: microseconds
                               expr     min       lq      mean   median      uq      max neval
                          OP_func() 121.191 124.9410 190.36472 129.8760 151.193 3370.047   100
     d[d$searchhere %in% strings, ]  36.714  40.6605  52.85093  43.8185  61.583  147.246   100
             search_hash[[strings]]  14.212  18.1590  25.05212  21.5150  29.608   58.820   100

Also note that @AntoniosK's solution does not work if there are duplicates in strings, whereas hashmap returns the correct mapping for each element in the correct position.

Example:

    > strings_large = sample(search_replace$searchhere, 100, replace = TRUE)
    > strings_large
      [1] "YVCYE" "KUVEB" "KUYVE" "KHUDN" "KUYVE" "KHUDN" "KUEBN" "UUDBK" "KHUDN" "YVCYE" "IYVEK"
     [12] "KUEBN" "KHUDN" "IHBEJ" "YVCYE" "KHUDN" "KUEBN" "UGEVB" "UUDBK" "KUYVE" "KHUDN" "IHBEJ"
     [23] "IHVYV" "KUVEB" "IYVEK" "KHUDN" "KHUDN" "KUYVE" "YVCYE" "UUDBK" "KUYVE" "IHVYV" "KUYVE"
     [34] "KUEBN" "KUYVE" "UUDBK" "KUYVE" "KUVEB" "KUVEB" "YVCYE" "KUYVE" "KHUDN" "KUVEB" "YVCYE"
     [45] "IHBEJ" "YVCYE" "KHUDN" "UUDBK" "KUEBN" "IYVEK" "IHVYV" "UUDBK" "KUYVE" "KUEBN" "YVCYE"
     [56] "UGEVB" "YVCYE" "KUYVE" "IHVYV" "KUEBN" "IHVYV" "IHBEJ" "KUVEB" "IHVYV" "KUYVE" "KUEBN"
     [67] "IYVEK" "KUVEB" "KUEBN" "UGEVB" "KUEBN" "KUVEB" "IHBEJ" "KUYVE" "YVCYE" "YVCYE" "IHVYV"
     [78] "YVCYE" "KHUDN" "KHUDN" "YVCYE" "IYVEK" "KUYVE" "KHUDN" "UGEVB" "YVCYE" "IHVYV" "KUVEB"
     [89] "IYVEK" "KUEBN" "UGEVB" "UUDBK" "IYVEK" "IHBEJ" "IHBEJ" "UUDBK" "KUVEB" "UGEVB" "IYVEK"
    [100] "IYVEK"
    > search_hash[[strings_large]]
      [1] 8 4 8 2 8 2 2 8 2 8 8 2 2 2 8 2 2 4 8 8 2 2 8 4 8 2 2 8 8 8 8 8 8 2 8 8 8 4 4 8 8 2 4 8
     [45] 2 8 2 8 2 8 8 8 8 2 8 4 8 8 8 2 8 2 4 8 8 2 8 4 2 4 2 4 2 8 8 8 8 8 2 2 8 8 8 2 4 8 8 4
     [89] 8 2 4 8 8 2 2 8 4 4 8 8
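The difference between filtering and per-element lookup can also be seen without any packages. The following is a minimal base R sketch, not part of either answer: it builds the lookup table with strsplit() instead of separate_rows(), and the names keys_per_row, lookup and strings_dup are made up for illustration.

```r
# Base R sketch (no packages): contrast %in% filtering with match() lookup.
replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK",
                "KUVEB, UGEVB",
                "KUEBN, IHBEJ, KHUDN")

# Split each comma-separated cell into individual keys
# (keys_per_row and lookup are hypothetical names).
keys_per_row <- strsplit(searchhere, ",\\s*")
lookup <- data.frame(
  key   = unlist(keys_per_row),
  value = rep(replacewith, lengths(keys_per_row)),
  stringsAsFactors = FALSE
)

# Input with a repeated element.
strings_dup <- c("UUDBK", "KUVEB", "UUDBK", "YVCYE")

# %in% filters the table: one value per distinct key, in table order.
filtered <- lookup$value[lookup$key %in% strings_dup]

# match() gives one table position per input element, in input order.
matched <- lookup$value[match(strings_dup, lookup$key)]

print(filtered)  # 3 values, ordered as in the lookup table
print(matched)   # 4 values, aligned with strings_dup
```

Elements that do not occur in the table come back as NA from match(), so missing keys stay visible in the result instead of silently disappearing.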
    library(tidyr)

    strings <- c("UUDBK", "KUVEB", "YVCYE")

    replacewith <- c(8, 4, 2)
    searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")
    dataframe <- data.frame(replacewith, searchhere, stringsAsFactors = FALSE)

    # split strings to one row each
    # like a look up table
    d = separate_rows(dataframe, searchhere)

    # get the number based on the look up table
    d[d$searchhere %in% strings,]

    #   replacewith searchhere
    # 1           8      UUDBK
    # 2           8      YVCYE
    # 6           4      KUVEB

Not sure if you like this format, but you can always change it.
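One possible way to change the format into a plain numeric vector aligned with strings is to turn the lookup table into a named vector and index it by name. This is only a sketch in base R under the question's sample data: strsplit() stands in for separate_rows() so that it runs without tidyr, and keys_per_row and lookup_vec are hypothetical names.

```r
# Base R sketch: named-vector lookup that returns values in the order of `strings`.
strings <- c("UUDBK", "KUVEB", "YVCYE")
replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK",
                "KUVEB, UGEVB",
                "KUEBN, IHBEJ, KHUDN")

# One key per element; each key carries the replacewith value of its row.
keys_per_row <- strsplit(searchhere, ",\\s*")
lookup_vec <- setNames(rep(replacewith, lengths(keys_per_row)),
                       unlist(keys_per_row))

# Indexing by name uses exact matching; unname() drops the key labels.
final <- unname(lookup_vec[strings])
print(final)
```

With this data, final has one value per element of strings, in the same order. Whether this is fast enough for very long vectors has not been measured here.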


Source: https://habr.com/ru/post/1273249/

