How to read two lines from a file and create dynamic keys in a loop?

I am trying to run a simple Markov model on the following data.

Say I have data with the following structure:

    pos  M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block  S1  S2  S3  S4  S5  S6  S7  S8
    1    A   T   T   A   A   G   A   C   A|C           C   G   C   T   T   A   G   A
    2    T   G   C   T   G   T   T   G   T|A           A   T   A   T   C   A   A   T
    3    C   A   A   C   A   G   T   C   C|G           G   A   C   G   C   G   C   G
    4    G   T   G   T   A   T   C   T   G|T           C   T   T   T   A   T   C   T

Block M represents data from one set of categories, and block S from another.

The data are strings, created by connecting the letters down the pos lines. So the string value for M1 is ATCG, and likewise for every other column.
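
To make that concrete, here is a rough sketch (plain Python, no libraries; sample.txt is a placeholder filename) of how those per-column strings come out of the file:

    # Rough sketch: join each column's letters down the rows of a
    # whitespace-delimited file shaped like the table above.
    with open("sample.txt") as fh:
        header = fh.readline().split()
        rows = [line.split() for line in fh if line.strip()]

    columns = dict(zip(header, zip(*rows)))   # column name -> letters down the file
    strings = {name: "".join(vals) for name, vals in columns.items() if name != "pos"}

    print(strings["M1"])   # 'ATCG'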

There is also one hybrid block that contains two lines, which are read the same way. What I want to find out is which line in the hybrid block most likely came from which block (M vs S).

I am trying to create a Markov model that helps me determine which line of the hybrid block came from which block. In this example, I can say that in the hybrid block, ATCG came from block M and CAGT came from block S.
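
For context, once I have conditional counts for each block, my rough plan is to compare smoothed log-likelihoods of each hybrid line under block M versus block S (only a sketch; counts stands for a per-block dict of "current given previous" counts as described below):

    from math import log

    def log_likelihood(hybrid_line, counts, pseudo=1.0):
        # counts: e.g. {'AgA': 4, 'TgA': 3, ...} for one block (placeholder structure)
        # the first position is conditioned on itself ('AgA'), later ones on the previous letter
        keys = [hybrid_line[0] + "g" + hybrid_line[0]]
        keys += [curr + "g" + prev for prev, curr in zip(hybrid_line, hybrid_line[1:])]
        score = 0.0
        for key in keys:
            prev = key[-1]
            # add-one smoothing over the 4-letter alphabet
            denom = sum(v for k, v in counts.items() if k.endswith("g" + prev)) + 4 * pseudo
            score += log((counts.get(key, 0) + pseudo) / denom)
        return score

    # assign a hybrid line to whichever block scores higher, e.g.
    # 'M' if log_likelihood('ATCG', m_counts) > log_likelihood('ATCG', s_counts) else 'S'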

I am breaking the problem into different parts in order to read and process the data:

Problem Level 01:

  • First I read the first row (heading) and create unique keys for all columns.
  • Then I read the 2nd line (the one with pos value 1) and collect that line's value under each key. On the same line, I read the value from hybrid_block and split the string in it: the pipe | is just a separator, so the two hybrid lines sit at string index 0 and 2, i.e. A and C. So all I want from this line is

defaultdict(<class 'dict'>, {'M1': ['A'], 'M2': ['T'], 'M3': ['T']...., 'hybrid_block': ['A'], ['C']...}

As I keep reading row by row, I want to append each column's value under its key and finally end up with the following (a rough sketch of such a loop follows right after it):

defaultdict(<class 'dict'>, {'M1': ['A', 'T', 'C', 'G'], 'M2': ['T', 'G', 'A', 'T'], 'M3': ['T', 'C', 'A', 'G']...., 'hybrid_block': ['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']...}
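
Roughly, this is the kind of loop I have in mind for this part (only a sketch; data.txt is a placeholder filename, and I keep the two hybrid lines in a separate list of lists because one key cannot hold two anonymous lists):

    from collections import defaultdict

    observations = defaultdict(list)   # column name -> list of letters, one per pos
    hybrid = [[], []]                  # hybrid_block line 1 and line 2

    with open("data.txt") as fh:       # whitespace-delimited file shaped like the table above
        header = fh.readline().split()
        for line in fh:
            if not line.strip():
                continue
            row = dict(zip(header, line.split()))
            for name in header:
                if name == "pos":
                    continue
                if name == "hybrid_block":
                    first, second = row[name].split("|")
                    hybrid[0].append(first)
                    hybrid[1].append(second)
                else:
                    observations[name].append(row[name])

    # observations['M1'] -> ['A', 'T', 'C', 'G']
    # hybrid            -> [['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']]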

Problem Level 02:

  • I read the data in hybrid_block for the first row, which are A and C

  • Now I want to create keys, but unlike the fixed keys above, these keys are generated while reading the data from hybrid_block. For the first line, since there is no preceding line, the keys will simply be AgA and CgC, meaning (A given A) and (C given C), and for the values I count the number of A (and of C) on that line in block M and in block S. Thus, the data will be saved as:

defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'CgC': 1}, 'S': {'AgA': 2, 'CgC': 2}})

As I read the other lines, I want to create new keys from the letters in the hybrid block and count how often that transition occurs in block M vs block S, conditioned on the letter in the previous line. This means that the keys when reading line 2 will be TgA, which means (T given A), and AgC. For the value under TgA I count the number of columns in each block where T appears on this line directly after an A on the previous line, and the same for AgC (a rough sketch of this counting loop follows after the expected output below).

The defaultdict after reading 3 lines will be:

defaultdict(<class 'dict'>,
    {'M': {'AgA': 4, 'TgA': 3, 'CgT': 2}, {'CgC': 1, 'AgC': 0, 'GgA': 0},
     'S': {'AgA': 2, 'TgA': 1, 'CgT': 0}, {'CgC': 2, 'AgC': 2, 'GgA': 2}})
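
And this is a rough sketch of the dynamic-key counting I am imagining for Problem Level 02, reusing the observations and hybrid structures from the sketch above (I am not sure this is a sensible way to organise it):

    from collections import defaultdict

    m_cols = [v for k, v in observations.items() if k.startswith("M")]
    s_cols = [v for k, v in observations.items() if k.startswith("S")]

    # one counter per block and per hybrid line
    counts = {"M": [defaultdict(int), defaultdict(int)],
              "S": [defaultdict(int), defaultdict(int)]}

    n_rows = len(hybrid[0])
    for i in range(n_rows):
        for h_idx, h_line in enumerate(hybrid):
            prev = h_line[i - 1] if i > 0 else h_line[i]   # first row conditions on itself
            key = h_line[i] + "g" + prev
            for block, cols in (("M", m_cols), ("S", s_cols)):
                counts[block][h_idx][key] += 0   # create the key even if the count stays 0
                for col in cols:
                    col_prev = col[i - 1] if i > 0 else col[i]
                    if col[i] == h_line[i] and col_prev == prev:
                        counts[block][h_idx][key] += 1

    # counts['M'][0] -> {'AgA': 4, 'TgA': 3, 'CgT': 2, 'GgC': 1}
    # counts['S'][1] -> {'CgC': 2, 'AgC': 2, 'GgA': 2, 'TgG': 3}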

I understand that this looks complicated. I have gone through several dict and defaultdict examples, but couldn't find a way to do this.

A solution to either part, if not both, would be highly appreciated.

1 answer

pandas setup

    from io import StringIO
    import pandas as pd
    import numpy as np

    txt = """pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T"""

    df = pd.read_csv(StringIO(txt), delim_whitespace=True, index_col='pos')
    df

(screenshot of df)

Solution

mostly pandas with some numpy


  • split hybrid column
  • add identical first row
  • add with shifted version of self to get strings like 'AgA'

    # split the hybrid column into two single-letter columns H0 and H1
    # and place them between the M and S blocks
    d1 = pd.concat([
        df.filter(like='M'),
        df.hybrid_block.str.split('|', expand=True).rename(columns='H{}'.format),
        df.filter(like='S')
    ], axis=1)

    # prepend a copy of the first row (as position 0) so that the first
    # position is conditioned on itself
    d1 = pd.concat([d1.loc[[1]].rename(index={1: 0}), d1])

    # join each cell with 'g' and the cell above it -> strings like 'AgA'
    d1 = d1.add('g').add(d1.shift()).dropna()
    d1

(screenshot of d1)

Assign the blocks to convenient variable names

    m = d1.filter(like='M')
    s = d1.filter(like='S')
    h = d1.filter(like='H')

Count how many matches each hybrid column has in each block and combine

    # broadcast-compare each hybrid column against every column of a block and
    # sum over that block's columns -> number of matching 'XgY' strings per row
    mcounts = pd.DataFrame(
        (m.values[:, :, None] == h.values[:, None, :]).sum(1),
        h.index, h.columns
    )
    scounts = pd.DataFrame(
        (s.values[:, :, None] == h.values[:, None, :]).sum(1),
        h.index, h.columns
    )

    counts = pd.concat([mcounts, scounts], axis=1, keys=['M', 'S'])
    counts

(screenshot of counts)

If you really need a dictionary

    from collections import defaultdict

    d = defaultdict(lambda: defaultdict(list))

    # pair every count with the 'XgY' string it was conditioned on
    dict_df = counts.stack().join(h.stack().rename('condition')).unstack()

    for pos, row in dict_df.iterrows():
        d['M']['H0'].append((row.loc[('condition', 'H0')], row.loc[('M', 'H0')]))
        d['S']['H0'].append((row.loc[('condition', 'H0')], row.loc[('S', 'H0')]))
        d['M']['H1'].append((row.loc[('condition', 'H1')], row.loc[('M', 'H1')]))
        d['S']['H1'].append((row.loc[('condition', 'H1')], row.loc[('S', 'H1')]))

    dict(d)

    {'M': defaultdict(list,
                 {'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
                  'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]}),
     'S': defaultdict(list,
                 {'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
                  'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]})}
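
If you prefer one plain dict of counts per hybrid line, closer to the shape sketched in the question, a small optional post-processing step on top of the d built above (a sketch):

    # collapse the (condition, count) pairs into plain dicts per hybrid line
    final = {block: {hcol: dict(pairs) for hcol, pairs in d[block].items()}
             for block in ('M', 'S')}

    final['M']['H0']   # {'AgA': 4, 'TgA': 3, 'CgT': 2, 'GgC': 1}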
