R: splitting a string that contains character blocks denoting nucleotide variants

My problem is that I need to find positions in a string that contains blocks of characters, each of which should count as a single position. I work with nucleotide sequences in which I need to keep track of positions, but at some positions there are alternatives written as [A/T], where either A or T may be present depending on which sequence I care about (these are two similar DNA sequences that differ at a few positions along the sequence). As a result, each of these variant sites makes the string four characters / positions longer than the sequence really is.

I know I could get around this by inventing a new code, for example converting [A/T] to X and [T/A] to Y, but that would be confusing because standard codes already exist, and they do not track which nucleotide comes from which strain (for me, the one before the / is from strain A, and the one after the / is from strain B). I would like to index this DNA sequence somehow, and I was thinking of something like this:

If I have a string like:

dna <- "ATC[A/T]G[G/C]ATTACAATCG" 

I would like to get the table / data.frame:

    pos nuc
      1 A
      2 T
      3 C
      4 [A/T]
      5 G
      6 [G/C]
    ... and so on

I feel like I could use strsplit somehow if I knew regex better. Is it possible to add a condition so that the string is split at every character unless the characters are enclosed in square brackets, in which case they should be kept together as a block?

4 answers
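    library('stringr')

    # Replace each bracketed block with a placeholder, then split into single characters
    df <- as.data.frame(strsplit(gsub("\\[./.\\]", '_', dna), ''), stringsAsFactors = FALSE)
    # Put the original blocks back in place of the placeholders
    df[, 1][df[, 1] == '_'] <- str_extract_all(dna, "\\[./.\\]")[[1]]
    names(df) <- 'nuc'
    df
    #      nuc
    # 1      A
    # 2      T
    # 3      C
    # 4  [A/T]
    # 5      G
    # 6  [G/C]
    # 7      A
    # 8      T
    # 9      T
    # 10     A
    # 11     C
    # 12     A
    # 13     A
    # 14     T
    # 15     C
    # 16     G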

I'm the type of person who likes to keep things simple, so here is a short trick:

    x <- 'ATC[A/T]G[G/C]ATTACAATCG'
    data.frame(nuc = regmatches(x, gregexpr('\\[[^]]*]|.', x))[[1]])
    #      nuc
    # 1      A
    # 2      T
    # 3      C
    # 4  [A/T]
    # 5      G
    # 6  [G/C]
    # 7      A
    # 8      T
    # 9      T
    # 10     A
    # 11     C
    # 12     A
    # 13     A
    # 14     T
    # 15     C
    # 16     G

The regular expression above uses alternation: the left-hand side matches a substring enclosed in square brackets, and the right-hand side uses . to match any single character.
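As a minimal sketch building on the same pattern, the tokens can be given an explicit position column, and each bracketed block can be split into the strain A and strain B alleles described in the question. The column names `strain_a` and `strain_b` and the helper vector `is_variant` are assumptions made for illustration; they are not part of the original answer.

```r
x <- 'ATC[A/T]G[G/C]ATTACAATCG'

# Tokenise: a bracketed block counts as one position, anything else is one character
nuc <- regmatches(x, gregexpr('\\[[^]]*]|.', x))[[1]]

# Hypothetical columns: the allele before "/" is strain A, the one after is strain B
is_variant <- grepl('^\\[', nuc)
strain_a <- ifelse(is_variant, sub('^\\[(.*)/.*\\]$', '\\1', nuc), nuc)
strain_b <- ifelse(is_variant, sub('^\\[.*/(.*)\\]$', '\\1', nuc), nuc)

df <- data.frame(pos = seq_along(nuc), nuc, strain_a, strain_b,
                 stringsAsFactors = FALSE)
head(df)
```

Collapsing `df$strain_a` or `df$strain_b` with `paste(..., collapse = "")` should then give the plain sequence for either strain, while `pos` keeps the shared coordinate.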


Here is another approach:

    dna <- "ATC[A/T]G[G/C]ATTACAATCG"
    (tmp <- gsub('(\\w)(\\w)', '~\\1~\\2~', dna))
    # [1] "~A~T~C[A/T]G[G/C]~A~T~~T~A~~C~A~~A~T~~C~G~"
    (nuc <- Filter(nzchar, strsplit(gsub("(\\[.+?\\])", "~\\1~", tmp), '~')[[1]]))
    #  [1] "A"     "T"     "C"     "[A/T]" "G"     "[G/C]" "A"     "T"     "T"
    # [10] "A"     "C"     "A"     "A"     "T"     "C"     "G"
    data.frame(nuc)
    #      nuc
    # 1      A
    # 2      T
    # 3      C
    # 4  [A/T]
    # 5      G
    # 6  [G/C]
    # 7      A
    # 8      T
    # 9      T
    # 10     A
    # 11     C
    # 12     A
    # 13     A
    # 14     T
    # 15     C
    # 16     G

An easy way to get everything apart from the characters in square brackets:

    strsplit(dna, '\\[[A-Z]/[A-Z]\\]')
    # [[1]]
    # [1] "ATC"        "G"          "ATTACAATCG"

Perhaps negating that pattern will give you what is inside the brackets, or you can reuse the regular expression from the argument above to extract it.

EDIT: Here is the code that will give you what is between the brackets:

    lbracket <- as.numeric(unlist(gregexpr('\\[', dna)))
    rbracket <- as.numeric(unlist(gregexpr('\\]', dna)))
    mapply(function(x, y) substr(dna, start = x, stop = y), lbracket, rbracket)
    # [1] "[A/T]" "[G/C]"

That should work.
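
The two steps of this answer can also be sketched with base R's `regmatches()`, which returns the matches themselves or, with `invert = TRUE`, the pieces between them. This is only an alternative illustration; the names `pat`, `m`, `blocks`, and `between` are assumptions and do not come from the original answer.

```r
dna <- "ATC[A/T]G[G/C]ATTACAATCG"

# Pattern for one bracketed variant block such as [A/T]
pat <- '\\[[A-Z]/[A-Z]\\]'
m <- gregexpr(pat, dna)

# The bracketed blocks themselves
blocks <- regmatches(dna, m)[[1]]

# Everything around the blocks (the same pieces strsplit() returns)
between <- regmatches(dna, m, invert = TRUE)[[1]]
```

Using a single `gregexpr()` result for both calls keeps the blocks and the surrounding pieces in a consistent order.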


Source: https://habr.com/ru/post/990049/

