My problem is that I need to find positions in a string where some blocks of characters should count as a single position. I work with nucleotide sequences and need to track positions within them, but some positions are written as options such as [A/T], meaning that either A or T may be present depending on which sequence I am looking at (these are two similar DNA sequences that differ only at these sites). As a result, each of these sites makes the string four characters longer than the real sequence, because [A/T] takes five characters for one position.
I know I could get around this by recoding the options as single letters, for example [A/T] as X and [T/A] as Y. That would be confusing, though, because a standard code already exists, and it would not keep track of which nucleotide comes from which strain (for me, the one before the / is from strain A and the one after the / is from strain B). I would like to index the DNA sequence somehow, along these lines:
If I have a string like:
dna <- "ATC[A/T]G[G/C]ATTACAATCG"
I would like to get this table / data.frame:
pos nuc
1   A
2   T
3   C
4   [A/T]
5   G
6   [G/C]
... and so on
I feel like I could use strsplit for this if I knew regex better. Is it possible to split the string into single characters, except where characters are enclosed in square brackets, which should be kept together as one block?
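To make the goal concrete, here is a minimal sketch of one possible approach. It matches tokens instead of splitting: base R's gregexpr and regmatches pull out either a whole bracketed block or a single character. The pattern and the names tokens and idx are only illustrative assumptions, and it assumes brackets are never nested and always closed.

```r
dna <- "ATC[A/T]G[G/C]ATTACAATCG"

# Pattern (an assumption, not a standard): a "[...]" block first,
# otherwise any single character.
pattern <- "\\[[^]]*\\]|."

# gregexpr finds every match; regmatches extracts the matched substrings.
tokens <- regmatches(dna, gregexpr(pattern, dna))[[1]]

# One row per sequence position, with bracketed options kept whole.
idx <- data.frame(pos = seq_along(tokens), nuc = tokens,
                  stringsAsFactors = FALSE)

print(head(idx))
```

With this layout, a bracketed site occupies a single row, so pos follows the real sequence coordinates and not the character offsets in the string.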
Gregs