I am trying to build a simple Markov model on data with the following structure:
pos  M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block  S1  S2  S3  S4  S5  S6  S7  S8
1    A   T   T   A   A   G   A   C   A|C           C   G   C   T   T   A   G   A
2    T   G   C   T   G   T   T   G   T|A           A   T   A   T   C   A   A   T
3    C   A   A   C   A   G   T   C   C|G           G   A   C   G   C   G   C   G
4    G   T   G   T   A   T   C   T   G|T           C   T   T   T   A   T   C   T
Block M represents data from one set of categories, and so does block S.
The data are strings, formed by concatenating the letters down the position rows. So the string value for M1 is ATCG, and likewise for every other column.
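To make the "reading down a column" idea concrete, here is a minimal sketch. It assumes the letters of block M have already been split into one list per position row; the names `m_rows` and `m_strings` are only illustrative.

```python
# Rows 1-4 of block M (columns M1..M8), copied from the example data.
m_rows = [
    list("ATTAAGAC"),
    list("TGCTGTTG"),
    list("CAACAGTC"),
    list("GTGTATCT"),
]

# zip(*rows) transposes rows into columns; joining a column gives its string.
m_strings = ["".join(column) for column in zip(*m_rows)]

print(m_strings[0])  # the string for column M1
```

The same transposition would apply to block S and to the two sides of hybrid_block.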
There is also one hybrid block that holds two strings, read the same way. What I want to find is which string in the hybrid block most likely came from which block (M vs. S).
I am trying to create a Markov model that can help me determine which string in the hybrid block came from which block. In this example, I can say that in the hybrid block, ATCG came from block M and CAGT came from block S.
I am breaking the problem into different parts in order to read and process the data:
Problem Level 01:
- First I read the first row (the header) and create unique keys for all columns.
- Then I read the 2nd line (pos with a value of 1) and create another key. On the same line, I read the value from hybrid_block and split the string in it. The pipe | is just a separator, so the two letters are at index 0 and 2, as A and C.
So all I want from this line is:
defaultdict(<class 'dict'>, {'M1': ['A'], 'M2': ['T'], 'M3': ['T']...., 'hybrid_block': ['A'], ['C']...}
As I progress through the rows, I want to append each row's value to its column and finally create:
defaultdict(<class 'dict'>, {'M1': ['A', 'T', 'C', 'G'], 'M2': ['T', 'G', 'A', 'T'], 'M3': ['T', 'C', 'A', 'G']...., 'hybrid_block': ['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']...}
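One possible way to build this structure is sketched below. It assumes the file is whitespace-delimited with one letter per M/S column and a pipe-separated pair in hybrid_block. Because a dict key can hold only one value, the two hybrid strings are stored under two hypothetical keys, `hybrid_left` and `hybrid_right`, rather than under a single `hybrid_block` key; the function name `read_columns` is also just illustrative.

```python
from collections import defaultdict
import io

# Assumption: whitespace-delimited layout; hybrid_block looks like "A|C".
RAW = """\
pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
"""


def read_columns(handle):
    """Return {column name: [values down the rows]} from an open text handle."""
    header = handle.readline().split()
    columns = defaultdict(list)
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        for name, value in zip(header, line.split()):
            if name == "hybrid_block":
                # The pipe is only a separator between the two hybrid letters.
                left, right = value.split("|")
                columns["hybrid_left"].append(left)
                columns["hybrid_right"].append(right)
            else:
                columns[name].append(value)
    return columns


columns = read_columns(io.StringIO(RAW))
print(columns["M1"])
print(columns["hybrid_left"], columns["hybrid_right"])
```

With a real file, `io.StringIO(RAW)` would be replaced by `open(path)`. Using `defaultdict(list)` avoids having to create the keys up front, since each list is created the first time a column name is seen.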
Problem Level 02:
I read the data in hybrid_block for the first row, which are A and C.
Now I want to create keys, but unlike the fixed keys, these keys will be generated while reading the data from hybrid_block. For the first line, since there is no preceding line, the keys will simply be AgA and CgC, which mean (A given A, and C given C), and for the values I count the number of A (and likewise C) in block M and block S. Thus, the data will be saved as:
defaultdict(<class 'dict'>, {'M': {'AgA': [4], 'CgC': [1]}, 'S': {'AgA': 2, 'CgC': 2}}
As I read the other lines, I want to create new keys based on the letters in the hybrid block and count how often each letter occurs in block M vs. block S, given the letter on the previous line. This means that the keys when reading line 2 will be TgA, which means (T given A), and AgC. For the values inside these keys, I count the number of times I found T on this line after A on the previous line, and the same for AgC.
The defaultdict after reading 3 lines will be:
defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'TgA':3, 'CgT':2}, {'CgC': [1], 'AgC':0, 'GgA':0}, 'S': {'AgA': 2, 'TgA':1, 'CgT':0}, {'CgC': 2, 'AgC':2, 'GgA':2}}
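A minimal sketch of this counting step follows. It assumes each block is available as one string per position row and that the two hybrid strings have already been extracted; the names `M_ROWS`, `S_ROWS`, `HYBRID`, and `transition_counts` are illustrative. The counts are returned as a list of (key, count) pairs, one per position, instead of a dict, because the same key (for example `AgA`) could occur at more than one position in longer data.

```python
from collections import defaultdict

# One string per position row for each block, copied from the example data.
M_ROWS = ["ATTAAGAC", "TGCTGTTG", "CAACAGTC", "GTGTATCT"]
S_ROWS = ["CGCTTAGA", "ATATCAAT", "GACGCGCG", "CTTTATCT"]
HYBRID = ["ATCG", "CAGT"]  # the two strings read down hybrid_block


def transition_counts(rows, hybrid_string):
    """For each position, count the columns of `rows` that show the same
    (previous letter -> current letter) step as `hybrid_string`."""
    counts = []
    for i, cur in enumerate(hybrid_string):
        # The first position has no predecessor, so it is conditioned on itself.
        prev = hybrid_string[i - 1] if i > 0 else cur
        prev_row = rows[i - 1] if i > 0 else rows[i]
        n = sum(1 for p, c in zip(prev_row, rows[i]) if p == prev and c == cur)
        counts.append((f"{cur}g{prev}", n))
    return counts


result = defaultdict(dict)
for block, rows in (("M", M_ROWS), ("S", S_ROWS)):
    for hybrid_string in HYBRID:
        result[block][hybrid_string] = transition_counts(rows, hybrid_string)

for block, per_string in result.items():
    for hybrid_string, counts in per_string.items():
        print(block, hybrid_string, counts)
```

For the first three positions this reproduces the counts listed above (for example AgA 4, TgA 3, CgT 2 for ATCG against block M). If keys never repeat, `dict(counts)` converts a list of pairs into the nested-dict form.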
I understand that this looks complicated. I looked at several dictionary and defaultdict approaches, but couldn't find a way to do this.
A solution to either part, if not both, would be highly appreciated.