Find and replace characters before ":"

I have a file containing a certain number of lines. Each line looks like this:

TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1 

I would like to delete everything before the ":" symbol so that only PKMYT1, the name of the gene, is kept. Since I'm no regex expert, can anyone help me do this with Unix tools (sed or awk) or in R?

9 answers

Here are two ways to do this in R:

 foo <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
 # Remove everything up to and including ":":
 gsub(".*:", "", foo)
 # Extract everything after ":":
 regmatches(foo, gregexpr("(?<=:).*", foo, perl = TRUE))

A simple regex with gsub():

 x <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
 gsub(".*:", "", x)
 # [1] "PKMYT1"

See ?regex or ?gsub for more details.


In R. There are, of course, more than two ways. Here is another.

 unlist(lapply(strsplit(foo, ':', fixed = TRUE), '[', 2)) 

If the strings all have the same length, I suppose substr will be faster than this or the regex methods.


Using sed:

 sed 's/.*://' < your_input_file > output_file 

This replaces anything followed by a colon with nothing, so it removes everything up to and including the last colon on each line (because * is greedy by default).

As per Josh O'Brien's comment, if you only want to remove everything up to and including the first colon, use this:

 sed "s/[^:]*://" 

This matches any run of characters that are not a colon, followed by a single colon, and replaces it with nothing.
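The difference between the two patterns only shows up on lines with more than one colon. A minimal sketch, assuming a made-up sample line (the variable names are illustrative only):

```shell
# Hypothetical sample line containing several colons.
line='FG:Content:with:colons'

# Greedy .* runs up to the LAST colon, so only the final field survives.
greedy=$(printf '%s\n' "$line" | sed 's/.*://')

# [^:]* cannot cross a colon, so only the FIRST field is removed.
first=$(printf '%s\n' "$line" | sed 's/[^:]*://')

printf '%s\n' "$greedy" "$first"
```

For the gene-name lines in the question, which contain a single colon, both commands give the same result.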

Note that both of these commands stop after the first match on each line. If you want the replacement to be performed for every match on the line, add the g (global) flag to the end of the substitution, e.g. s/[^:]*://g.
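To see what the g flag changes, here is a small sketch on an assumed three-field input (the input string and variable names are hypothetical):

```shell
# Hypothetical input with three colon-separated fields.
line='a:b:c'

# Without g: one substitution per line, so only "a:" is removed.
once=$(printf '%s\n' "$line" | sed 's/[^:]*://')

# With g: the substitution repeats for every match, removing "a:" and "b:".
every=$(printf '%s\n' "$line" | sed 's/[^:]*://g')

printf '%s\n' "$once" "$every"
```

With the greedy pattern s/.*:// the g flag makes no visible difference, since the first match already extends to the last colon.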

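Also note that with GNU sed (the default on Linux) you can edit the file in place using -i. The BSD sed shipped with OS X requires a backup-suffix argument after -i (an empty one, -i '', keeps no backup). For example, with GNU sed: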

 sed -i 's/.*://' your_file 
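If the same command has to run on both Linux and OS X, one option is to attach a backup suffix directly to -i, a form that, to my knowledge, both GNU and BSD sed accept. A sketch on a scratch file (the temporary file and variable names are assumptions for illustration):

```shell
# Hypothetical scratch file so no real data is touched.
tmp=$(mktemp)
printf '%s\n' 'TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1' > "$tmp"

# -i.bak (suffix attached, no space) edits in place and keeps
# the original content in "$tmp.bak".
sed -i.bak 's/.*://' "$tmp"

result=$(cat "$tmp")
backup=$(cat "$tmp.bak")
printf '%s\n' "$result"

# Clean up the scratch files.
rm -f "$tmp" "$tmp.bak"
```

The backup file also gives you a way back if the pattern removes more than intended.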

You can use awk as follows:

 awk -F: '{print $2}' /your/file 
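Printing $2 returns only the second field, which is fine for the lines in the question but truncates values that themselves contain colons. A sketch of the alternatives, assuming a made-up sample line:

```shell
# Hypothetical sample line containing several colons.
line='FG:Content:with:colons'

# $2 is just the second colon-separated field.
second=$(printf '%s\n' "$line" | awk -F: '{print $2}')

# $NF is the last field, i.e. everything after the last colon.
last=$(printf '%s\n' "$line" | awk -F: '{print $NF}')

# sub() on the whole record removes everything up to and including the first colon.
rest=$(printf '%s\n' "$line" | awk '{sub(/^[^:]*:/, ""); print}')

printf '%s\n' "$second" "$last" "$rest"
```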

If you are using GNU coreutils, use cut:

 cut -d: -f2 infile 
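As with the other field-based approaches, -f2 selects exactly one field. A brief sketch, on an assumed sample line, of how an open-ended field range keeps everything after the first colon, and how -s handles lines that have no colon at all:

```shell
# Hypothetical sample line containing several colons.
line='FG:Content:with:colons'

# -f2 keeps only the second field.
second=$(printf '%s\n' "$line" | cut -d: -f2)

# -f2- keeps the second field and everything after it.
rest=$(printf '%s\n' "$line" | cut -d: -f2-)

# Lines without the delimiter are passed through unchanged unless -s is given.
passed=$(printf '%s\n' 'nocolon' | cut -d: -f2)
dropped=$(printf '%s\n' 'nocolon' | cut -s -d: -f2)

printf '%s\n' "$second" "$rest" "$passed" "[$dropped]"
```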

Here are two Perl solutions that are equivalent when each line contains exactly one colon:

The first uses Perl's autosplit switch -a to split each line into fields on : (set with -F:), filling the field array @F, and prints the second field, $F[1] (indices start at 0).

 perl -F: -lane 'print $F[1]' file 

The second uses the substitution operator s/// to replace, starting at the beginning of the line (^), any characters up to and including a colon (.*:) with nothing.

 perl -pe 's/^.*://' file 

I was working on a similar problem. The advice from John and Josh O'Brien did the trick. I started with this tibble:

 library(dplyr)
 my_tibble <- tibble(Col1 = c("ABC:Content", "BCDE:MoreContent", "FG:Content:with:colons"))

It looks like this:

   | Col1
 1 | ABC:Content
 2 | BCDE:MoreContent
 3 | FG:Content:with:colons

I needed to create this tibble:

   | Col1                   | Col2 | Col3
 1 | ABC:Content            | ABC  | Content
 2 | BCDE:MoreContent       | BCDE | MoreContent
 3 | FG:Content:with:colons | FG   | Content:with:colons

And I did it with this code (R version 3.4.2).

 my_tibble2 <- mutate(
   my_tibble,
   # Col2: the part before the first ":"
   Col2 = unlist(lapply(strsplit(Col1, ':', fixed = TRUE), '[', 1)),
   # Col3: everything after the first ":"
   Col3 = gsub("^[^:]*:", "", Col1)
 )

One very simple trick that I missed in @Sacha Epskamp's top answer is that the sub function (gsub in the code below) can also be used to keep everything before the ":" instead of removing it:

 foo <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
 # 1st, as in that answer, to remove everything up to and including ":":
 gsub(".*:", "", foo)
 # 2nd, to keep everything before ":":
 gsub(":.*", "", foo)

It is basically the same thing; just change the position of ":" in the pattern argument. Hope this helps.


Source: https://habr.com/ru/post/948062/

