Replace letters in pattern string in R

Given a “UK postal code” pattern, for example “A9 9AA”, where “A” is a letter placeholder and “9” is a digit placeholder, I want to generate random postcode strings like “H8 4GB”. The letters can be any uppercase letter and the digits anything from 0 to 9.

So, if the pattern is "AA9A 9AA", then I need strings like "WC1A 9LK". I am not trying to create "real" postcodes, so I don't care whether "WC1A" is a valid outward code.

I tried to get the functions from the stringi package to work, but the problem is that replacing "A" in the template applies only the first replacement to every occurrence, for example:

    stri_replace_all_fixed("A9 9AA", c("A","A","A"), c("X","Y","Z"), vectorize_all=FALSE)
    [1] "X9 9XX"

So it does not replace each "A" with the corresponding element of the replacement vector (but this is by design).

Maybe there is something in stringi or base R that I missed. I would like to stick to these packages so that I don't bloat what I'm working on.

The brute-force method is to split the pattern, make the substitutions, and paste the result back together, but I would like to see if there is a faster, naturally vectorized solution.

So, to summarize:

    foo("A9 9AA")                        # returns something like "B6 5DE"
    foo(c("A9 9AA","A9 9AA","A9A 9AA"))  # returns c("Y6 5TH","D4 8JH","W0Z 3KQ")

Here's a non-vectorized version that relies on building an expression and evaluating it:

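    random_pc <- function(fmt){
        # turn each character of the format into a small R expression
        cc = gsub(" ",'c(" ")',gsub("9","sample(0:9,1)",gsub("A","sample(LETTERS,1)",strsplit(fmt,"")[[1]])))
        # build one c(...) call, evaluate it, and paste the pieces together
        paste(eval(parse(text=paste0("c(",paste(cc,collapse=","),")"))),collapse="")
    }
    > random_pc("AA9 9AA")
    [1] "KO6 1AY"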
+5
3 answers

As I understand it, the OP wants to randomly generate UK postcodes in a specified format. I think sprintf might help:

    sprintf("%s%s %d%d%s",
            sample(LETTERS,1), sample(LETTERS,1),
            sample(0:9,1), sample(0:9,1),
            sample(LETTERS,1))
    #[1] "BC 59D"

Now, if the goal is to supply the format using 9 and A, then the first step is to replace 9 with %d and A with %s.
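The conversion step described above can be sketched as follows. This is only an illustration under stated assumptions: `pattern_to_postcode` is a hypothetical helper name that does not appear in the answer, and it handles a single pattern string at a time.

```r
# Hypothetical helper (not part of the original answer): convert an "A"/"9"
# pattern into a sprintf() format and fill it with random values.
pattern_to_postcode <- function(fmt) {
  chars <- strsplit(fmt, "")[[1]]
  is_letter <- chars == "A"
  is_digit  <- chars == "9"

  # Escape literal percent signs so sprintf() does not read them as specifiers
  chars[chars == "%"] <- "%%"
  chars[is_letter] <- "%s"
  chars[is_digit]  <- "%d"
  sprintf_fmt <- paste(chars, collapse = "")

  # Draw one random value per placeholder, in order of appearance
  placeholders <- chars[is_letter | is_digit]
  args <- lapply(placeholders, function(p) {
    if (p == "%s") sample(LETTERS, 1) else sample(0:9, 1)
  })

  do.call(sprintf, c(list(sprintf_fmt), args))
}

pattern_to_postcode("AA9A 9AA")
```

For a vector of patterns, one option is to wrap the call, for example `vapply(patterns, pattern_to_postcode, character(1))`; this loops over the patterns rather than being truly vectorized.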

OPTION # 2

Another option can be achieved with paste0 and sapply using a custom function like:

    # custom function to generate random characters
    # (defined first so that the call below can find it)
    getCodeText <- function(x){
      retVal = x
      for(i in seq_along(x)){
        if(x[i] == "A"){
          retVal[i] = sample(LETTERS,1)
        }else if(x[i] == "9"){
          retVal[i] = as.character(sample(0:9,1))
        }
      }
      retVal
    }

    fmt <- "AA 9AA A"
    paste0(sapply(strsplit(fmt,""), getCodeText), collapse = "")
    #"YF 7OP Z"
+4

Here's a solution (vectorized in a lazy way) that splits the format into characters and then replaces each one depending on whether it is a letter or a digit:

    randpc <- Vectorize(function(s){
        s = strsplit(s,"")[[1]]
        NUMS = as.character(0:9)
        # count the letter and digit positions in the format
        nLet = sum(s %in% LETTERS)
        nDig = sum(s %in% NUMS)
        # fill each kind of position with random draws
        s[s %in% LETTERS] = sample(LETTERS, nLet, replace=TRUE)
        s[s %in% NUMS] = sample(NUMS, nDig, replace=TRUE)
        paste0(s, collapse="")
    })

It has the useful side effect of returning a named vector whose names show the format strings:

    > randpc(c("AA9 9AA","A9 9AA"))
      AA9 9AA    A9 9AA
    "QS4 4LW"  "S9 7EU"

It is also flexible: because it treats any letter or digit in the format string as a placeholder, it can generate postcodes modelled on an existing postcode:

    > randpc(rep("LA1 4YF",3))
      LA1 4YF   LA1 4YF   LA1 4YF
    "OL2 5OJ" "YK3 3YB" "FV0 1LW"
+1

I'm not sure what counts as brute force, since the split-replace-combine workflow on strings seemed most intuitive to me. However, my first attempts were rather slow with a very large number of templates. I also hoped that something like stri_replace_all(replacement = sample(LETTERS, 1)) would work, but it replaced every match with the same letter.

This is a slightly different approach, using stri_replace_first to replace the first instance of a placeholder repeatedly until no placeholder characters are left. To make this work, I changed the pattern to use lowercase l for letters and n for numbers, so that characters already substituted are never mistaken for placeholders (postcodes contain only capital letters and digits, as far as I know). I think the runtime is much more reasonable (about 10 seconds for 100,000 templates), and this also uses only stringi.

    library(stringi)

    make_postcodes <- function(templates){
      postcodes <- templates
      # keep going until no placeholder characters remain
      while (any(stri_detect_regex(postcodes, "l|n"))){
        for (i in 1:length(templates)){
          postcodes[i] <- stri_replace_first_fixed(
            str = postcodes[i],
            pattern = "l",
            replacement = sample(LETTERS, 1)
          )
          postcodes[i] <- stri_replace_first_fixed(
            str = postcodes[i],
            pattern = "n",
            replacement = sample(0:9, 1)
          )
        }
      }
      postcodes
    }

    make_postcodes("ln nll")
    #> [1] "W8 3MX"
    make_postcodes(c("ln nll", "ln nll", "lnl nll"))
    #> [1] "H1 6TN"  "C5 6YI"  "A3I 2DB"

    test = stri_trim_both(stri_rand_strings(100000, sample(5:9, 1), pattern = "[nl\\ ]"))
    tictoc::tic("Time to convert 100,000 templates")
    x <- make_postcodes(test)
    tictoc::toc()
    #> Time to convert 100,000 templates: 12.03 sec elapsed
    head(test)
    #> [1] "lnnl" "ll l" "nl n" "ll l" "ll l" "ll n"
    head(x)
    #> [1] "G91U" "HU N" "2Q 7" "EU Z" "PD I" "SM 4"

Created on 2018-04-06 by the reprex package (v0.2.0).
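On the observation that passing sample(LETTERS, 1) as the replacement reuses one letter: R evaluates that argument once, so the replacement function only ever receives a single string. As a rough sketch of an alternative that stays in base R (it is not taken from any of the answers, and `fill_placeholders` is a hypothetical name), `regmatches<-` accepts one replacement per match, which allows a separate random draw for every placeholder across a whole vector of patterns. It assumes the question's convention of "A" for letters and "9" for digits.

```r
# Hypothetical helper (an assumption, not from the thread): fill every "A"
# with a random letter and every "9" with a random digit, one draw per match.
fill_placeholders <- function(patterns) {
  out <- patterns

  # Letters: locate all "A" placeholders, then supply one letter per match
  m <- gregexpr("A", out, fixed = TRUE)
  regmatches(out, m) <- lapply(regmatches(out, m), function(hits) {
    sample(LETTERS, length(hits), replace = TRUE)
  })

  # Digits: the random letters cannot contain "9", so a second pass is safe
  m <- gregexpr("9", out, fixed = TRUE)
  regmatches(out, m) <- lapply(regmatches(out, m), function(hits) {
    as.character(sample(0:9, length(hits), replace = TRUE))
  })

  out
}

fill_placeholders(c("A9 9AA", "A9 9AA", "A9A 9AA"))
```

The letter pass runs first on purpose: if digits were filled first, a second pass over "9" could not tell a placeholder from a digit that had already been drawn.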

0

Source: https://habr.com/ru/post/1276285/

