Extract numbers from strings, including '|'

I have data where some elements are numbers separated by a "|", for example:

 head(mintimes)
 [1] "3121|3151" "1171"      "1351|1381" "1050"      ""          "122"
 head(minvalues)
 [1]  14  10  11  31 Inf  22

What I would like to do is extract all the times and match them to the minimum values. The result should look something like this:

 times values
  3121     14
  3151     14
  1171     10
  1351     11
  1381     11
  1050     31
   122     22

I tried strsplit(mintimes, "|") and str_extract(mintimes, "[0-9]+"), but neither seems to work. Any ideas?

+6
8 answers

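| is a regular expression metacharacter (it means alternation). To match such a character literally, you must escape it, either by putting it in a character class ([|]) or with a backslash (written \\| in an R string); some functions also accept fixed = TRUE. Unescaped, the pattern "|" matches the empty string, which is why your strsplit() call broke every element into single characters. So your call to strsplit() should be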

 strsplit(mintimes, "[|]") 

or

 strsplit(mintimes, "\\|") 

or

 strsplit(mintimes, "|", fixed = TRUE) 
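
To make the difference concrete, here is a small base R check on a single element of the sample data. It is only a sketch; the variable names are made up for illustration.

```r
x <- "3121|3151"

# Unescaped: "|" is read as regex alternation between two empty patterns,
# so it matches the empty string and the input is cut into single characters.
bad <- strsplit(x, "|")[[1]]

# Three equivalent ways to treat "|" as a literal character.
by_class  <- strsplit(x, "[|]")[[1]]
by_escape <- strsplit(x, "\\|")[[1]]
by_fixed  <- strsplit(x, "|", fixed = TRUE)[[1]]

print(bad)       # single characters, including the "|" itself
print(by_class)  # the two times as separate strings
```

All three escaped variants return the same two pieces, so the choice between them is mostly a matter of readability.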

As for your other attempt with stringr functions, str_extract() returns only the first match in each string; str_extract_all() returns every match and seems to do the trick.

 library(stringr)
 str_extract_all(mintimes, "[0-9]+")
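
If you would rather not load stringr, the same first-match versus all-matches distinction exists in base R. The following is a rough equivalent, not the stringr implementation: regexpr() finds the first match per element and gregexpr() finds all of them, with regmatches() pulling out the matched text.

```r
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")

# First match only, comparable to str_extract().
# Note: regmatches() drops elements with no match instead of returning NA.
first_only <- regmatches(mintimes, regexpr("[0-9]+", mintimes))

# Every match, comparable to str_extract_all(): a list with one
# character vector per input element (empty for the "" element).
all_matches <- regmatches(mintimes, gregexpr("[0-9]+", mintimes))

print(first_only)
print(all_matches[[1]])
```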

To get the desired result,

 > mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
 > minvalues <- c(14, 10, 11, 31, Inf, 22)
 > s <- strsplit(mintimes, "[|]")
 > data.frame(times = as.numeric(unlist(s)), values = rep(minvalues, sapply(s, length)))
 #   times values
 # 1  3121     14
 # 2  3151     14
 # 3  1171     10
 # 4  1351     11
 # 5  1381     11
 # 6  1050     31
 # 7   122     22
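
The key step is rep() with a vector of counts: each value is repeated once per time found in its string, and zero times for the empty string. Wrapped up as a small helper, it might look like this. The name expand_times and its sep argument are hypothetical, and lengths() assumes R 3.2.0 or later.

```r
mintimes  <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)

# Hypothetical helper: one output row per time, value repeated to match.
expand_times <- function(times, values, sep = "|") {
  parts <- strsplit(times, sep, fixed = TRUE)  # literal split, no regex
  n <- lengths(parts)                          # pieces per element, 0 for ""
  data.frame(times  = as.numeric(unlist(parts)),
             values = rep(values, times = n))
}

res <- expand_times(mintimes, minvalues)
print(res)
```

Because the empty string yields zero pieces, its Inf value is dropped without any extra filtering.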
+6

By default, strsplit() splits using a regular expression, and "|" is a special character in regular expression syntax. You can either escape it

 strsplit(mintimes,"\\|") 

or just set fixed = TRUE so that the pattern is not treated as a regular expression

 strsplit(mintimes, "|", fixed = TRUE)
+4

I wrote a function called cSplit that is useful for these types of things. You can get it from my Gist: https://gist.github.com/mrdwab/11380733

Usage:

 library(data.table)  # cSplit() itself comes from the Gist above
 cSplit(data.table(mintimes, minvalues), "mintimes", "|", "long")
 #    mintimes minvalues
 # 1:     3121        14
 # 2:     3151        14
 # 3:     1171        10
 # 4:     1351        11
 # 5:     1381        11
 # 6:     1050        31
 # 7:      122        22

It also has a "wide" setting, if that is at all useful to you:

 cSplit(data.table(mintimes, minvalues), "mintimes", "|", "wide")
 #    minvalues mintimes_1 mintimes_2
 # 1:        14       3121       3151
 # 2:        10       1171         NA
 # 3:        11       1351       1381
 # 4:        31       1050         NA
 # 5:       Inf         NA         NA
 # 6:        22        122         NA

Note: the output is a data.table.

+3

As mentioned, you need to escape | to include it literally in a regular expression. As always, there is more than one way to skin this cat, and here is one way to do it with stringr:

 x <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
 library(stringr)
 unlist(str_extract_all(x, "\\d+"))
 # [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"

This will not work as expected if the numbers in the strings contain decimal points, so the following pattern (which matches any run of characters other than |) may be safer:

 unlist(str_extract_all(x, '[^|]+'))
 # [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"

In any case, you can wrap the result in as.numeric().
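
To illustrate the decimal-point caveat, here is a base R sketch on made-up input (the question's data has no decimals, so the vector below is purely hypothetical). It uses regmatches() with gregexpr() as a stand-in for str_extract_all().

```r
# Hypothetical input with decimal values.
x <- c("12.5|13", "7.25")

# Digits only: the decimal point ends a match, so "12.5" is cut in two.
digits_only <- unlist(regmatches(x, gregexpr("[0-9]+", x)))

# Anything except "|": each number stays in one piece.
not_pipe <- unlist(regmatches(x, gregexpr("[^|]+", x)))

print(digits_only)
print(as.numeric(not_pipe))
```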

+2

Here is another solution, using stri_split_fixed() from the stringi package. As a bonus, we also play with mapply() and do.call().

Input data:

 mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
 minvalues <- c(14, 10, 11, 31, Inf, 22)

Split mintimes on | and convert to numeric:

 library("stringi")
 mintimes <- lapply(stri_split_fixed(mintimes, "|"), as.numeric)
 ## [[1]]
 ## [1] 3121 3151
 ##
 ## [[2]]
 ## [1] 1171
 ##
 ## [[3]]
 ## [1] 1351 1381
 ##
 ## [[4]]
 ## [1] 1050
 ##
 ## [[5]]
 ## [1] NA
 ##
 ## [[6]]
 ## [1] 122

Column-bind each element of minvalues with the corresponding element of mintimes:

 tmp <- mapply(cbind, mintimes, minvalues)
 ## [[1]]
 ##      [,1] [,2]
 ## [1,] 3121   14
 ## [2,] 3151   14
 ##
 ## [[2]]
 ##      [,1] [,2]
 ## [1,] 1171   10
 ##
 ## [[3]]
 ##      [,1] [,2]
 ## [1,] 1351   11
 ## [2,] 1381   11
 ##
 ## [[4]]
 ##      [,1] [,2]
 ## [1,] 1050   31
 ##
 ## [[5]]
 ##      [,1] [,2]
 ## [1,]   NA  Inf
 ##
 ## [[6]]
 ##      [,1] [,2]
 ## [1,]  122   22

Row-bind all 6 matrices and remove the NA rows:

 res <- do.call(rbind, tmp)
 res[!is.na(res[,1]),]
 ##      [,1] [,2]
 ## [1,] 3121   14
 ## [2,] 3151   14
 ## [3,] 1171   10
 ## [4,] 1351   11
 ## [5,] 1381   11
 ## [6,] 1050   31
 ## [7,]  122   22
+2

To get the result you want, try something like this:

 library(dplyr)

 # Split one group's "a|b" string into a numeric column of times
 Split.Times <- function(x) {
   mintimes <- as.numeric(unlist(strsplit(as.character(x$mintimes), "\\|")))
   return(data.frame(mintimes = mintimes, minvalues = x$minvalues, stringsAsFactors=FALSE))
 }

 df <- data.frame(mintimes, minvalues, stringsAsFactors=FALSE)
 df %>%
   filter(mintimes != "") %>%
   group_by(mintimes) %>%
   do(Split.Times(.))

This gives:

   mintimes minvalues
 1     1050        31
 2     1171        10
 3      122        22
 4     1351        11
 5     1381        11
 6     3121        14
 7     3151        14

(I borrowed from my answer here; this is almost the same question/problem.)

+1

Here is a qdap package approach:

 mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
 minvalues <- c(14, 10, 11, 31, Inf, 22)
 library(qdap)
 list2df(setNames(strsplit(mintimes, "\\|"), minvalues), "times", "values")
 ##   times values
 ## 1  3121     14
 ## 2  3151     14
 ## 3  1171     10
 ## 4  1351     11
 ## 5  1381     11
 ## 6  1050     31
 ## 7   122     22
+1

You can use the [:punct:] character class:

 strsplit(mintimes, "[[:punct:]]") 
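
One thing to keep in mind with this pattern: [[:punct:]] matches any punctuation character, not just "|". A quick base R comparison on made-up input (the second element is hypothetical, added only to show the difference):

```r
x <- c("3121|3151", "12.5|13")

# Splits on every punctuation character, so the decimal point is lost too.
by_punct <- strsplit(x, "[[:punct:]]")

# Splits only on a literal "|".
by_pipe <- strsplit(x, "|", fixed = TRUE)

print(by_punct[[2]])
print(by_pipe[[2]])
```

For data like the question's, which contains only digits and "|", both calls give the same result.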
0

Source: https://habr.com/ru/post/970914/