How to read two lines from a file and create dynamic keys in a loop, track

This question follows the issue in question: How to read two lines from a file and create dynamic keys in a for loop?

But the nature of the problem has evolved to a certain complexity, which I want to address.

Below is the structure of my data, separated by a space.

chr pos M1 M2 Mk Mg1 F1_hybrid F1_PG F1_block S1 Sk1 S2 Sj 2 16229767 T/TT/TT/TG/TC|T 1|0 726 . T/CT/CT/C 2 16229783 C/CC/CC/CA/CG|C 0|1 726 G/CG/CG/CC|G 2 16229992 A/AA/AA/AG/AG|A 1|0 726 A/AA/AA/AA|G 2 16230007 T/TT/TT/TA/TA|T 1|0 726 A|TA|TA|TA|T 2 16230011 G/GG/GG/GG/GC|G 1|0 726 G/CC|GC|GG/C 2 16230049 A/AA/AA/AA/AT|A 1|0 726 A|T . A/TA/T 2 16230174 . . . C/CT|C 1|0 726 C|TT|CT|CC|T 2 16230190 A/AA/AA/AA/AT|A 1|0 726 T|GG|TT|GT|G 2 16230260 A/AA/AA/AA/AG|A 1|0 726 G/GG/GG/GG/G 

Explanation:

  • There are two main categories of data in the above file. Data from Group M has a pattern name starting with M , as well as group S , which has several column names starting with S.

  • And there is a hybrid column (represented by F1_hybrid ).

  • data is a line along a line of a position. F1_hybrid in stages with a pipe (|) distinguishing two letters. So the values ​​of the two rows from F1 are CGGACTTTG and the other string value is TCATGACAA. One of this line is from M-group , and the other is from S-group , but I need to do some statistical analysis for this. However, I can say visually that the TCATGACAA string was most likely derived from the M-group.

Procedure:

  • I read the first row and create unique keys using column information.

  • Then I read the second and third rows and values ​​in F1_hybrid, which is C | T with G | C. Now I need to calculate how many GgC (explained as G given C) and CgT (C given T) exist between the M-group and the group S.

  • Then read the 3rd (G | C) with the 4th (G | A) line in F1_hybrid. So, the states are GgG and AgC. In the same way, now I believe that in the group M and S. there are many GcG and AgC.

So I am trying to build a Markov-model that counts the number of states for a phased row from F1 and takes the observed counts in Group M vs group S

Now I explain how to count the number of any XgY (X given by Y) based on F1_hyrbid:

  • It is important to note the conditions before making an account. The existing condition can be phased (which is represented by a pipe) compared to non-spherical (if two lines have at least one forward slash (/).

Condition 01:

Sample M1 has state as (T / T with C / C) for 2nd and 3rd rows. since the delimiter is a slash (/) and not a pipe (|) , we cannot determine what state the M1 sample is in. But we can create a combination matrix (for the previous state with the current state)

  TT C CgT CgT C CgT CgT 

Now we can say that there are 4 common CgT

and we continue to make the same matrix if this condition is satisfied.

State 02

The same applies to other samples from group M, with the exception of Mg1, where G / T precedes A / C. So the matrix:

  GT A AgG AgT C CgG CgT 

So, here we observed 1 count of CgT.

Condition 03:

But if the previous state is the current state in stages along the pipe in both states (for example, A | T at position 16230007 with C | G at position 16230011 for sample Sk1) , we can make a direct amount of the phase state of the observed state in this position that only exist CgA and GgT, so the amount of CgT is 0.

Condition 04: If one of the states has pipe (|) and the other has a slash (/), the condition will be the same as the state with a slash.

Condition 05: If any of the previous_state or present_state is a period (.), The Case counter is automatically zero (0) for the state expected from F1_hybrid.

So, the expected result should be something like this:

 pos M1 M2 Mk Mg1 H0 H1 S1 Sk1 S2 Sj 16..9783 4-CgT 4-CgT 4-CgT 1-CgT GgC CgT 0 1-CgT 1-CgT 1-CgT 16..9992 4-AgC 4-AgC 4-AgC 2-AgC GgG AgC 1-AgC 1-AgC 1-AgC 1-AgC,1-GgG 16..0007 4-TgA 4-TgA 4-TgA 1-AgG,1-TgA AgG TgA 2-TgA 2-TgA 2-TgA1 1-TgA ..................contd 

Or, dictionary-formatted values ​​for each column will work the same way. Something like ['4-CgT','4-CgT','4-CgT','4-CgT'] for the first M1 at position 16..9783 and the same for the other.

+9
source share
1 answer

The question is a little old, but interesting, because you have a very clear specification, and you need help writing code. I will lay out the solution after a top-down approach, which is a very well-known method using simple old python. It’s easy to adapt to pandas.

The top-down approach means for me: if you don’t know how to write it, just name it!

You have a file (or line) as input, and you want to output the file (or line). This seems pretty straightforward, but you want to combine line pairs to build each new line. The idea is this:

  • get input lines like dictionaries
  • take them two
  • build a new line for each pair
  • output result

Until you know how to write a string generator. You do not know how to build a new line for each pair. Don't stay in touch with difficulties, just name solutions. Imagine you have a get_rows function and a build_new_row function. Let it be written:

 def build_new_rows(f): """generate the new rows. Output may be redirected to a file""" rows = get_rows(f) # get a generator on rows = dictionaries. r1 = next(rows) # store the first row for r2 in rows: # for every following row yield build_new_row(r1, r2) # yield a new row built of the previous stored row and the current row. r1 = r2 # store the current row, which becomes the previous row 

Now consider two β€œmissing” functions: get_rows and build_new_row . The get_rows function is simple enough to write. Here is the main part:

 header = process_line(next(f)) for line in f: yield {k:v for k,v in zip(header, process_line(line))} 

where process_line just breaks the line into a space, for example. with re.split("\s+", line.strip()) .

The second part is build_new_row . Nevertheless, the approach is from top to bottom: you need to build H0 and H1 from the expected table, and then build the score H1 for each M and S in accordance with the conditions that you set. Imagine you have a pipe_compute function that calculates the functions H0 and H1 and a build_count that builds an H1 score for each M and S:

 def build_new_row(r1, r2): """build a row""" h0, h1 = pipe_compute(r1["F1_hybrid"], r2["F1_hybrid"]) # initialize the dict whith the pos, H0 and H1 new_row = {"pos":r2["pos"], "H0":h0, "H1":h1} for key in r1.keys(): if key[0] in ("M", "S"): new_row[key] = build_count(r1[key], r2[key], h1) return new_row 

You have almost everything now. Take a look at pipe_compute : this is exactly what you wrote in your state 03.

 def pipe_compute(v1, v2): """build H0 H1 according to condition 03""" xs = v1.split("|") ys = v2.split("|") return [ys[0]+"g"+xs[0], ys[1]+"g"+xs[1]] 

And for buid_count stick with the top down approach:

 def build_count(v1, v2, to_count): """nothing funny here: just follow the conditions""" if is_slash_count(v1, v2): # are conditions 01, 02, 04 true ? c = slash_count(v1, v2)[to_count] # count how many "to_count" we find in the 2 x 2 table of condtions 01 or 02. elif "|" in v1 and "|" in v2: # condition 03 c = pipe_count(v1, v2)[to_count] elif "." in v1 or "." in v2: # condition 05 return '0' else: raise Exception(v1, v2) return "{}-{}".format(c, to_count) # n-XgY 

We are still falling. When do we have is_slash_count ? Two slashes (conditions 01 and 02) or one slash and one pipe (condition 04):

 def is_slash_count(v1, v2): """conditions 01, 02, 04""" return "/" in v1 and "/" in v2 or "/" in v1 and "|" in v2 or "|" in v1 and "/" in v2 

The slash_count function is just a 2 x 2 table of conditions 01 and 02:

 def slash_count(v1, v2): """count according to conditions 01, 02, 04""" cnt = collections.Counter() for x in re.split("[|/]", v1): # cartesian product for y in re.split("[|/]", v2): # cartesian product cnt[y+"g"+x] += 1 return cnt # a dictionary XgY -> count(XgY) 

The pipe_count function pipe_count even simpler because you just need to count the result of pipe_compute :

 def pipe_count(v1, v2): """count according to condition 03""" return collections.Counter(pipe_compute(v1, v2)) 

Now you are done (and down). I get this result, which is slightly different from your expectation, but you probably already saw my error (s):

 pos M1 M2 Mk Mg1 H0 H1 S1 Sk1 S2 Sj 16229783 4-CgT 4-CgT 4-CgT 1-CgT GgC CgT 0 1-CgT 1-CgT 1-CgT 16229992 4-AgC 4-AgC 4-AgC 1-AgC GgG AgC 2-AgC 2-AgC 2-AgC 1-AgC 16230007 4-TgA 4-TgA 4-TgA 1-TgA AgG TgA 2-TgA 2-TgA 2-TgA 0-TgA 16230011 4-GgT 4-GgT 4-GgT 2-GgT CgA GgT 1-GgT 1-GgT 1-GgT 1-GgT 16230049 4-AgG 4-AgG 4-AgG 4-AgG TgC AgG 1-AgG 0 1-AgG 1-AgG 16230174 0 0 0 4-CgA TgT CgA 1-CgA 0 1-CgA 1-CgA 16230190 0 0 0 4-AgC TgT AgC 0-AgC 0-AgC 0-AgC 0-AgC 16230260 4-AgA 4-AgA 4-AgA 4-AgA GgT AgA 0-AgA 0-AgA 0-AgA 0-AgA 

Bonus: Try it online!

It is important, in addition to solving this particular problem, the method that I used and which is widely used in software development. The code can be greatly improved.

+3
source

Source: https://habr.com/ru/post/1263560/


All Articles