R: extracting part of the file name

I am trying to extract part of a file name using R. I have a vague idea of how to do this from the question "extract part of the file name in R", but I cannot get it to work on my list of file names.

example file names:

"Species Count (2011-12-15-07-09-39).xls" "Species Count 0511.xls" "Species Count 151112.xls" "Species Count1011.xls" "Species Count2012-01.xls" "Species Count201207.xls" "Species Count2013-01-15.xls" 

Some file names have a space between "Count" and the date and some do not; the dates have different lengths, and some are enclosed in parentheses. I just want to extract the numeric part of each file name and keep it. So, for the above data, I would want:

Expected Result:

 2011-12-15-07-09-39 , 0511 , 151112 , 1011 , 2012-01 , 201207 , 2013-01-15 
4 answers

Here is one way:

 regmatches(tt, regexpr("[0-9].*[0-9]", tt)) 

I assume there are no other numbers in the file names. So we just look for the first digit and use the greedy operator .* to capture everything up to the last digit. This is done with regexpr, which returns the position of the match. Then we use regmatches to extract the substrings at those matched positions.


where tt is:

 tt <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls", "Species Count 151112.xls", "Species Count1011.xls", "Species Count2012-01.xls", "Species Count201207.xls", "Species Count2013-01-15.xls") 
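To make the two steps described above visible, here is a minimal sketch that runs regexpr and regmatches separately on a small sample. The variable names (files, m, starts, lens, dates, short, kept) and the "Count 7.xls" / "Count 42.xls" names are made up for illustration; only the pattern and the first three file names come from the thread.

```r
# Three of the file names from the question.
files <- c("Species Count (2011-12-15-07-09-39).xls",
           "Species Count 0511.xls",
           "Species Count2012-01.xls")

# Step 1: regexpr() returns the start position of the first match in each
# string; the match lengths are stored in the "match.length" attribute.
m <- regexpr("[0-9].*[0-9]", files)
starts <- as.integer(m)
lens <- attr(m, "match.length")

# Step 2: regmatches() cuts the matched substrings out of the input.
dates <- regmatches(files, m)
print(dates)

# Caveat: strings without a match are dropped, so the result can be shorter
# than the input. "Count 7.xls" (hypothetical) has a single digit, and the
# pattern needs at least two.
short <- c("Count 7.xls", "Count 42.xls")
kept <- regmatches(short, regexpr("[0-9].*[0-9]", short))
print(kept)
```

Because non-matching elements are dropped, it may be worth comparing length(dates) with length(files) before pairing the results back up with the file names.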

Benchmarking:

Note: benchmark results may vary between Windows machines and *nix (as @Hansi notes in the comments below).

There are some pretty good answers here. So, it is time for benchmarking :)

 tt <- rep(tt, 1e5) # tt is from above
 require(microbenchmark)
 require(stringr)

 aa <- function() regmatches(tt, regexpr("[0-9].*[0-9]", tt))
 bb <- function() gsub("[A-z \\.\\(\\)]", "", tt)
 cc <- function() str_extract(tt, '([0-9]|[0-9][-])+')

 microbenchmark(arun <- aa(), agstudy <- cc(), Jean <- bb(), times=25)
 Unit: seconds
             expr      min       lq   median       uq       max neval
     arun <- aa() 1.951362 2.064055 2.198644 2.397724  3.236296    25
  agstudy <- cc() 2.489993 2.685285 2.991796 3.198133  3.762166    25
     Jean <- bb() 7.824638 8.026595 9.145490 9.788539 10.926665    25

 identical(arun, agstudy) # TRUE
 identical(arun, Jean) # TRUE

Use the gsub() function to remove all letters, spaces, periods, and parentheses. You will then be left with only digits and hyphens. For instance:

 x <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls",
        "Species Count 151112.xls", "Species Count1011.xls",
        "Species Count2012-01.xls", "Species Count201207.xls",
        "Species Count2013-01-15.xls")
 gsub("[A-z \\.\\(\\)]", "", x)
 [1] "2011-12-15-07-09-39" "0511"                "151112"
 [4] "1011"                "2012-01"             "201207"
 [7] "2013-01-15"
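The removal approach needs a list of every character that might appear around the date. A possible variation, not part of the answer above, is to invert the character class and delete everything that is not a digit or a hyphen. The sketch below assumes that hyphens only occur inside the date; "Species-Count 0511.xls" is a made-up name showing what happens when that assumption fails.

```r
# Three of the file names from the question.
files <- c("Species Count (2011-12-15-07-09-39).xls",
           "Species Count 0511.xls",
           "Species Count2012-01.xls")

# Delete every character that is neither a digit nor a hyphen.
# The trailing "-" inside the brackets is a literal hyphen.
kept <- gsub("[^0-9-]", "", files)
print(kept)

# Caveat: a hyphen elsewhere in the name survives as well
# (hypothetical file name, for illustration only).
stray <- gsub("[^0-9-]", "", "Species-Count 0511.xls")
print(stray)
```

The trade-off is the same in both directions: deleting a fixed set of characters breaks on unexpected characters, while keeping a fixed set breaks on unexpected hyphens or digits outside the date.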

If you are concerned about speed, you can use sub with backreferences to extract the parts you need. Also note that perl=TRUE is often faster (according to ?grep).

 jj <- function() sub("[^0-9]*([0-9].*[0-9])[^0-9]*", "\\1", tt, perl=TRUE)
 aa <- function() regmatches(tt, regexpr("[0-9].*[0-9]", tt, perl=TRUE))

 # Run on R-2.15.2 on 32-bit Windows
 microbenchmark(arun <- aa(), josh <- jj(), times=25)
 # Unit: milliseconds
 #           expr       min        lq    median        uq       max
 # 1 arun <- aa() 2156.5024 2189.5168 2191.9972 2195.4176 2410.3255
 # 2 josh <- jj()  390.0142  390.8956  391.6431  394.5439  493.2545
 identical(arun, josh) # TRUE

 # Run on R-3.0.1 on 64-bit Ubuntu
 microbenchmark(arun <- aa(), josh <- jj(), times=25)
 # Unit: seconds
 #         expr      min       lq   median       uq      max neval
 # arun <- aa() 1.794522 1.839044 1.858556 1.894946 2.207016    25
 # josh <- jj() 1.003365 1.008424 1.009742 1.059129 1.074057    25
 identical(arun, josh) # still TRUE
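One difference between the two functions is worth keeping in mind: sub returns a string unchanged when the pattern does not match, whereas regmatches drops it. The sketch below illustrates this on a small sample. The name "notes.xls" and the variables files, pattern, raw, matched and clean are made up for illustration, and replacing non-matches with NA is just one possible way to flag them.

```r
files <- c("Species Count (2011-12-15-07-09-39).xls",
           "Species Count 0511.xls",
           "notes.xls")  # hypothetical name without any digits

# The capture group ( ... ) holds the first digit through the last digit;
# "\\1" replaces the whole match with the contents of that group.
pattern <- "[^0-9]*([0-9].*[0-9])[^0-9]*"
raw <- sub(pattern, "\\1", files, perl = TRUE)
print(raw)  # the third element comes back untouched

# Flag inputs that never matched instead of silently keeping them.
matched <- grepl("[0-9].*[0-9]", files)
clean <- raw
clean[!matched] <- NA_character_
print(clean)
```

The result always has the same length as the input, which makes it easy to store next to the file names in a data frame.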

Using the stringr package to extract, from each string, the first run of digits in which each digit may be followed by a hyphen:

 library(stringr)
 str_extract(ll, '([0-9]|[0-9][-])+') # ll is the vector of file names
 [1] "2011-12-15-07-09-39" "0511"                "151112"              "1011"                "2012-01"
 [6] "201207"              "2013-01-15"

Source: https://habr.com/ru/post/1495540/

