R with character blocks denoting nucleotide variants

Question

My problem is that I need to find positions in a string where a block of characters should count as a single position. I work with nucleotide sequences where I need to keep track of positions, but some positions are variant sites written as [A/T], where either A or T may be present depending on which sequence I care about (these are two similar DNA sequences that differ at a number of positions along the sequence). As a result, each of these variant sites makes the string four characters longer than the actual sequence.

I know I can get around this by inventing a new code, for example converting [A/T] to X and [T/A] to Y, but that would be confusing because there is already a standard code, and that standard does not track which nucleotide comes from which strain (for me, the one before the / is from strain A, and the one after the / is from strain B). I would like to index this DNA sequence somehow. I was thinking of something like this:

If I have a string like:

dna <- "ATC[A/T]G[G/C]ATTACAATCG"

I would like to get a table / data.frame like this:

    pos nuc
      1 A
      2 T
      3 C
      4 [A/T]
      5 G
      6 [G/C]
    ... and so on

I feel like I could use strsplit somehow if I knew regex better. Is it possible to split on every character unless it is enclosed in square brackets, in which case the whole bracketed group should be kept as one block?

+6

string regex r bioinformatics

Gregs Jun 30 '15 at 19:38

4 answers

I'm the type of person who likes to keep things simple, so here is a short trick:

    x <- 'ATC[A/T]G[G/C]ATTACAATCG'
    data.frame(nuc = regmatches(x, gregexpr('\\[[^]]*]|.', x))[[1]])
    #      nuc
    # 1      A
    # 2      T
    # 3      C
    # 4  [A/T]
    # 5      G
    # 6  [G/C]
    # 7      A
    # 8      T
    # 9      T
    # 10     A
    # 11     C
    # 12     A
    # 13     A
    # 14     T
    # 15     C
    # 16     G

The regular expression above uses alternation: the left side matches a substring enclosed in square brackets, and the right side uses . to match any single character.
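The question also asks for a pos column and for a way to tell the strain A allele from the strain B allele. A minimal sketch of one way to extend the same tokenising regex follows; the column names strain_a and strain_b are made up for illustration, and it assumes every variant site has the form [X/Y].

```r
x <- "ATC[A/T]G[G/C]ATTACAATCG"

# Tokenise: a bracketed block is one token, anything else is one character.
nuc <- regmatches(x, gregexpr("\\[[^]]*]|.", x))[[1]]

# One row per sequence position, as requested in the question.
res <- data.frame(pos = seq_along(nuc), nuc = nuc, stringsAsFactors = FALSE)

# Hypothetical extra columns: the allele before "/" (strain A) and the
# allele after "/" (strain B). Non-variant positions do not match the
# pattern, so sub() returns them unchanged in both columns.
res$strain_a <- sub("^\\[(.*)/(.*)\\]$", "\\1", res$nuc)
res$strain_b <- sub("^\\[(.*)/(.*)\\]$", "\\2", res$nuc)

head(res)
```

Pasting either strain column back together, for example paste(res$strain_a, collapse = ""), should give that strain's plain sequence, with row numbers matching the positions in the table.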

+5

hwnd Jun 30 '15 at 21:10

Here is another way:

    dna <- "ATC[A/T]G[G/C]ATTACAATCG"
    (tmp <- gsub('(\\w)(\\w)','~\\1~\\2~', dna))
    # [1] "~A~T~C[A/T]G[G/C]~A~T~~T~A~~C~A~~A~T~~C~G~"
    (nuc <- Filter(nzchar, strsplit(gsub("(\\[.+?\\])","~\\1~", tmp), '~')[[1]]))
    #  [1] "A"     "T"     "C"     "[A/T]" "G"     "[G/C]" "A"     "T"     "T"
    # [10] "A"     "C"     "A"     "A"     "T"     "C"     "G"
    data.frame(nuc)
    #      nuc
    # 1      A
    # 2      T
    # 3      C
    # 4  [A/T]
    # 5      G
    # 6  [G/C]
    # 7      A
    # 8      T
    # 9      T
    # 10     A
    # 11     C
    # 12     A
    # 13     A
    # 14     T
    # 15     C
    # 16     G

+3

rawr Jun 30 '15 at 20:22

An easy way to get everything apart from the characters in square brackets:

    strsplit(dna, '\\[[A-Z]/[A-Z]\\]')
    # [[1]]
    # [1] "ATC"        "G"          "ATTACAATCG"

Negating that pattern might give you what is inside the brackets, or you could reuse the regular expression I passed as the split argument.

EDIT: Here is the code that will give you what is between the brackets:

    lbracket <- as.numeric(unlist(gregexpr('\\[', dna)))
    rbracket <- as.numeric(unlist(gregexpr('\\]', dna)))
    mapply(function(x, y) substr(dna, start=x, stop=y), lbracket, rbracket)
    # [1] "[A/T]" "[G/C]"

That should work.
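The bracket offsets above are character offsets in the string, not sequence positions. As a rough sketch (not part of the original answer), the sequence position of each block can be recovered by subtracting the extra characters taken up by earlier blocks. It assumes the string contains at least one block of the form [X/Y]; the variable names are illustrative.

```r
dna <- "ATC[A/T]G[G/C]ATTACAATCG"

# Locate every bracketed block: start offset and length in characters.
m      <- gregexpr("\\[[A-Z]/[A-Z]\\]", dna)[[1]]
start  <- as.integer(m)
width  <- attr(m, "match.length")
blocks <- substring(dna, start, start + width - 1)

# Each earlier block occupies `width` characters but only one sequence
# position, so it shifts later offsets by (width - 1).
extra <- cumsum(c(0, head(width - 1, -1)))
pos   <- start - extra

data.frame(pos = pos, nuc = blocks)
```

For the example string this should report the two variant sites at sequence positions 4 and 6, which agrees with the table asked for in the question.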

+1

Chris watson Jun 30 '15 at 19:51

Pierre lafortune · Accepted Answer · 2015-06-30T20:06:21+0000

    library('stringr')
    df <- as.data.frame(strsplit(gsub("\\[./.\\]", '_', dna), ''), stringsAsFactors=F)
    df[,1][df[,1] == '_'] <- str_extract_all(dna, "\\[./.\\]")[[1]]
    names(df) <- 'nuc'
    df
    #      nuc
    # 1      A
    # 2      T
    # 3      C
    # 4  [A/T]
    # 5      G
    # 6  [G/C]
    # 7      A
    # 8      T
    # 9      T
    # 10     A
    # 11     C
    # 12     A
    # 13     A
    # 14     T
    # 15     C
    # 16     G
