R regexp - extracting a 5-digit number

I have a string a similar to this:

stundenwerte_FF_00691_19260101_20131231_hist.zip

and I'd like to extract the 5-digit number "00691" from it.

I tried using gregexpr and regmatches as well as stringr::str_extract but couldn't figure out the correct regexp. I got as far as:

 gregexpr("[:digits{5}:]", a)

which should return 5-digit numbers, but it doesn't, and I don't understand how to fix it.
This does not work :(

 m <- gregexpr("[:digits{5}:]", a)
 regmatches(a, m)

Thanks for your help in advance!

+5
4 answers

You can just use sub to capture the digits; IMO regmatches is not needed for this simple case.

 x <- 'stundenwerte_FF_00691_19260101_20131231_hist.zip'
 sub('\\D*(\\d{5}).*', '\\1', x)
 # [1] "00691"

Edit: If you have other strings that contain digits before the 5-digit number, modify the expression slightly:

 sub('.*_(\\d{5})_.*', '\\1', x) 
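
A small sketch of why the second pattern is the safer one. The strings y and z are made up for illustration and are not from the question. It also shows that sub returns its input unchanged when the pattern does not match, so a grepl check beforehand may be worth having.

```r
# Hypothetical file name with a stray digit before the 5-digit field (assumption)
y <- "data2_FF_00691_19260101_20131231_hist.zip"

# First pattern: the match can only start after the stray "2", so the
# unmatched prefix "data2" stays in the result
first <- sub("\\D*(\\d{5}).*", "\\1", y)

# Second pattern: requires exactly 5 digits between two underscores
second <- sub(".*_(\\d{5})_.*", "\\1", y)

# Hypothetical name without any 5-digit field; guard with grepl() because
# sub() would otherwise hand back the whole string
z <- "no_number_here.zip"
third <- if (grepl("_\\d{5}_", z)) sub(".*_(\\d{5})_.*", "\\1", z) else NA_character_

print(first)   # "data200691"
print(second)  # "00691"
print(third)   # NA
```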
+8

1) sub

 sub(".*_(\\d{5})_.*", "\\1", x)
 ## [1] "00691"

2) gsubfn::strapplyc The regular expression can be slightly simplified if we use strapplyc:

 library(gsubfn)
 strapplyc(x, "_(\\d{5})_", simplify = TRUE)
 ## [1] "00691"

3) read.table If we know that the number is the third underscore-separated field:

 read.table(text = x, sep = "_", colClasses = "character")$V3
 ## [1] "00691"

3a) or, using strsplit:

 strsplit(x, "_")[[1]][3]
 ## [1] "00691"
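
If there are many file names rather than one, the same ideas should carry over to a whole character vector. A minimal sketch, assuming every name follows the layout in the question; the second file name is invented for illustration.

```r
# Hypothetical file names following the layout in the question (assumptions)
files <- c("stundenwerte_FF_00691_19260101_20131231_hist.zip",
           "stundenwerte_FF_01234_19500101_20131231_hist.zip")

# strsplit() returns one character vector per name; take the third field of each
ids <- vapply(strsplit(files, "_", fixed = TRUE), `[`, character(1), 3)

# sub() is vectorised too, so approach 1) works on the whole vector as-is
ids2 <- sub(".*_(\\d{5})_.*", "\\1", files)

print(ids)
print(ids2)
```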
+5

You can try the following regex, which uses negative lookarounds. We cannot use word boundaries here, as in \\b\\d{5}\\b, because the preceding and following _ characters are matched by \w, so there is no word boundary between an underscore and a digit.

 > x <- "stundenwerte_FF_00691_19260101_20131231_hist.zip"
 > m <- regexpr("(?<!\\d)\\d{5}(?!\\d)", x, perl=TRUE)
 > regmatches(x, m)
 [1] "00691"
 > m <- gregexpr("(?<!\\d)\\d{5}(?!\\d)", x, perl=TRUE)
 > regmatches(x, m)[[1]]
 [1] "00691"

Explanation:

  • (?<!\\d) Negative lookbehind: asserts that the match is not immediately preceded by a digit.
  • \\d{5} Matches exactly 5 digits.
  • (?!\\d) Negative lookahead: asserts that the match is not immediately followed by a digit.
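
A short sketch of both points, using only base R. It checks that the word-boundary version finds nothing, and it writes the same lookarounds with the POSIX digit class, which is spelled [[:digit:]] (double brackets, quantifier outside) rather than the [:digits{5}:] tried in the question. The variable names wb and la are just for illustration.

```r
x <- "stundenwerte_FF_00691_19260101_20131231_hist.zip"

# "_" counts as a word character, so \b never sits between "_" and a digit
wb <- regmatches(x, regexpr("\\b\\d{5}\\b", x, perl = TRUE))

# Same lookaround idea, written with the POSIX class instead of \d
la <- regmatches(
  x,
  regexpr("(?<![[:digit:]])[[:digit:]]{5}(?![[:digit:]])", x, perl = TRUE)
)

print(length(wb))  # 0, i.e. no match
print(la)          # "00691"
```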
+4

Let the string be:

 ss = "stundenwerte_FF_00691_19260101_20131231_hist.zip"

You can split the string on underscores and flatten the result into a character vector:

 ll = unlist(strsplit(ss, '_'))

Then build a logical index that is TRUE for the substrings that are 5 characters long:

 idx = sapply(ll, nchar) == 5

And select those substrings:

 ll[idx]
 [1] "00691"
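
Note that the length test keeps any 5-character piece, not only digits. A possible tightening, sketched on a made-up string (ss2 is an assumption, not from the question), is to test each piece against a regex instead:

```r
# Hypothetical name with a 5-letter piece next to the 5-digit one (assumption)
ss2 <- "stundenwerte_FF_00691_abcde_hist.zip"
ll2 <- unlist(strsplit(ss2, "_"))

# Length check alone keeps both 5-character pieces
# (nchar() is already vectorised, so sapply() is not needed)
by_length <- ll2[nchar(ll2) == 5]

# Requiring five digits from start to end keeps only the number
by_digits <- ll2[grepl("^\\d{5}$", ll2)]

print(by_length)  # "00691" "abcde"
print(by_digits)  # "00691"
```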
+1

Source: https://habr.com/ru/post/1205483/

