Extract part of the file name in R

I am trying to write code that opens every data file in a folder and applies a function (or set of functions) to pull out the data I need. So far, so good. The problem is that I want to rename one of the columns extracted from each file using a single element of the file name, and I cannot work out how to extract that element.

I have a bunch of files named "YYYY-MM-DD geneName data copy.txt" and would like to extract the "geneName" part of each file name. (For example, one file is "2012-05-31 PMA1 data copy.txt".)

The date format is always the same (YYYY-MM-DD), and all file names end with "data copy.txt".

In addition, some file names contain an extra annotation of the experiment, either "E (number)" or "Expt (number)", between the date and geneName (for example, "2012-05-21 E7 PMA1 data copy.txt"); others have an "SDM" between geneName and "data copy.txt".

Here are some file names and my desired result:

  • 2012-05-31 CTN1 data copy.txt (want "CTN1")
  • 2012-05-21 E7 PMA1 data copy.txt (want "PMA1")
  • 2011-11-29 TDH3 SDM data copy.txt (want "TDH3")
  • 2012-01-04 POX1 data copy.txt (want "POX1")

Any thoughts on how I can do this without manually deleting the experiment number or "SDM" from some files?

Thanks!

1 answer

So what you have is a date, an optional "E" plus digits or "Expt" plus digits that you don't want, the word you do want, then an optional "SDM" that you don't want, followed by "data copy.txt"...

Here are my test data:

    > names
    [1] "2012-05-31 CTN1 data copy.txt"
    [2] "2012-05-21 E7 PMA1 data copy.txt"
    [3] "2011-11-29 TDH3 SDM data copy.txt"
    [4] "2012-01-04 POX1 data copy.txt"
    [5] "2011-11-29 ECHO data copy.txt"
    [6] "2011-11-29 E8 ECHO data copy.txt"
    [7] "2011-11-29 ECHO SDM data copy.txt"
    [8] "2011-11-29 Expt2 ECHO SDM data copy.txt"

and here is my sub() call:

    > sub(pattern="^....-..-.. (E\\d+ |Expt\\d+ )*(\\w+) (SDM )*data copy.txt","\\2",names)
    [1] "CTN1" "PMA1" "TDH3" "POX1" "ECHO" "ECHO" "ECHO" "ECHO"

This also works if your E-prefixes have more than one digit. I added a few names to my test data whose gene name starts with E (the ECHO entries) to make sure they are handled correctly, including the case with both a prefix and SDM.
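
As a minimal sketch of how this could be wrapped up for reuse (not part of the original answer): the helper name extract_gene, the stricter pattern (digits only in the date, an escaped dot, a trailing $ anchor) and the column name value are all assumptions made for illustration. Because sub() returns its input unchanged when the pattern does not match, the sketch marks non-matching names as NA instead of passing them through.

```r
# Hypothetical helper: pull the gene name out of file names shaped like
# "YYYY-MM-DD [E7 |Expt2 ]GENE [SDM ]data copy.txt".
extract_gene <- function(file_names) {
  pattern <- "^\\d{4}-\\d{2}-\\d{2} (E\\d+ |Expt\\d+ )?(\\w+) (SDM )?data copy\\.txt$"
  base <- basename(file_names)           # drop any folder part of the path
  genes <- sub(pattern, "\\2", base)     # keep only the second capture group
  genes[!grepl(pattern, base)] <- NA_character_  # flag names that did not match
  genes
}

file_names <- c("2012-05-31 CTN1 data copy.txt",
                "2012-05-21 E7 PMA1 data copy.txt",
                "2011-11-29 TDH3 SDM data copy.txt",
                "2011-11-29 Expt2 ECHO SDM data copy.txt",
                "notes.txt")
extract_gene(file_names)
# [1] "CTN1" "PMA1" "TDH3" "ECHO" NA

# Assumed usage: rename an extracted column after the gene.
# "value" is a made-up column name standing in for the real one.
df <- data.frame(value = c(1.2, 3.4))
names(df)[names(df) == "value"] <- extract_gene("2012-05-31 CTN1 data copy.txt")
names(df)
# [1] "CTN1"
```

In a real run the vector of names would presumably come from list.files() on the data folder, with the same helper applied to each file as it is read.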


Source: https://habr.com/ru/post/1495543/

