The question is a little old, but interesting, because you have a very clear specification and you need help writing the code. I will lay out a solution following a top-down approach, which is a very well-known method, using plain old Python. It's easy to adapt to pandas.
The top-down approach means, for me: if you don't know how to write it, just name it!
You have a file (or rows) as input, and you want to output a file (or rows). This seems pretty straightforward, except that you need to combine pairs of rows to build each new row. The idea is this:
- get the input rows as dictionaries
- take them two by two
- build a new row for each pair
- output the result
For now, you don't know how to write the generator of rows, and you don't know how to build a new row for each pair either. Don't get stuck on the difficulties, just name the solutions. Imagine you have a get_rows function and a build_new_row function. Let's write this:
    def build_new_rows(f):
        """generate the new rows. Output may be redirected to a file"""
        rows = get_rows(f)
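The "two by two" step can be sketched as follows. This is only an illustration, not the original code: it assumes the pairs are overlapping consecutive rows (row 1 with row 2, row 2 with row 3, and so on), it takes the row-building function as a parameter, and `build_new_rows_sketch`, `pairwise` and `demo_build_new_row` are hypothetical names with made-up sample data.

```python
from itertools import tee


def pairwise(rows):
    """Yield overlapping pairs (r1, r2), (r2, r3), ... from any iterable."""
    a, b = tee(rows)
    next(b, None)  # advance the second iterator by one row
    return zip(a, b)


def build_new_rows_sketch(rows, build_new_row):
    """Yield one tab-separated output line per pair of consecutive rows."""
    for r1, r2 in pairwise(rows):
        yield "\t".join(build_new_row(r1, r2))


def demo_build_new_row(r1, r2):
    """Hypothetical stand-in: keep the second position, join the two values."""
    return [r2["pos"], r1["v"] + r2["v"]]


# Made-up rows, only to show the pairing.
rows = [{"pos": "1", "v": "a"}, {"pos": "2", "v": "b"}, {"pos": "3", "v": "c"}]
lines = list(build_new_rows_sketch(rows, demo_build_new_row))
print(lines)
```

If the pairs should not overlap, `zip(it, it)` on a single iterator `it = iter(rows)` would replace `pairwise`.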
Now consider the two "missing" functions: get_rows and build_new_row. The get_rows function is simple enough to write. Here is the main part:
    header = process_line(next(f))
    for line in f:
        yield {k: v for k, v in zip(header, process_line(line))}
where process_line just splits the line on whitespace, for example with re.split(r"\s+", line.strip()).
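Put together, the two pieces could look like the sketch below. The input is a made-up two-row sample read from an `io.StringIO`; only the `F1_hybrid` column name comes from the code in this answer, the values are assumptions.

```python
import io
import re


def process_line(line):
    """Split a line into fields on runs of whitespace."""
    return re.split(r"\s+", line.strip())


def get_rows(f):
    """Yield each data line of f as a dict keyed by the header fields."""
    header = process_line(next(f))  # first line holds the column names
    for line in f:
        yield dict(zip(header, process_line(line)))


# Hypothetical sample input, standing in for the real file.
sample = io.StringIO("pos  F1_hybrid\n16229783  C|T\n16229992  A|C\n")
rows = list(get_rows(sample))
print(rows)
```

Because get_rows is a generator, the file is read lazily, one line at a time.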
The second part is build_new_row. Again, the approach is top-down: you need to build H0 and H1 from your expected table, and then build the count of H1 for each M and S according to the conditions you stated. Imagine you have a pipe_compute function that computes H0 and H1, and a build_count function that builds the count of H1 for each M and S:
    def build_new_row(r1, r2):
        """build a row"""
        h0, h1 = pipe_compute(r1["F1_hybrid"], r2["F1_hybrid"])
You have almost everything now. Take a look at pipe_compute: this is exactly what you wrote in your condition 03.
    def pipe_compute(v1, v2):
        """build H0 H1 according to condition 03"""
        xs = v1.split("|")
        ys = v2.split("|")
        return [ys[0] + "g" + xs[0], ys[1] + "g" + xs[1]]
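A quick check of pipe_compute on two phased values shows how each allele of the second row is joined to the allele at the same index in the first row. The inputs `"C|T"` and `"G|C"` are assumed for illustration, they are not taken from your data.

```python
def pipe_compute(v1, v2):
    """Build H0 and H1 by pairing alleles at the same index (condition 03)."""
    xs = v1.split("|")
    ys = v2.split("|")
    return [ys[0] + "g" + xs[0], ys[1] + "g" + xs[1]]


# Assumed example values for two consecutive rows.
h0, h1 = pipe_compute("C|T", "G|C")
print(h0, h1)  # GgC CgT
```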
And for build_count, stick with the top-down approach:
    def build_count(v1, v2, to_count):
        """nothing funny here: just follow the conditions"""
        if is_slash_count(v1, v2):
We are still going down. When does is_slash_count hold? With two slashes (conditions 01 and 02) or with one slash and one pipe (condition 04):
    def is_slash_count(v1, v2):
        """conditions 01, 02, 04"""
        return (("/" in v1 and "/" in v2)
                or ("/" in v1 and "|" in v2)
                or ("|" in v1 and "/" in v2))
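To make the three cases concrete, here is the predicate run on the four slash/pipe combinations. The allele values are made up; only the separators matter.

```python
def is_slash_count(v1, v2):
    """True for slash/slash, slash/pipe and pipe/slash (conditions 01, 02, 04)."""
    return (("/" in v1 and "/" in v2)
            or ("/" in v1 and "|" in v2)
            or ("|" in v1 and "/" in v2))


# Made-up values covering each combination of separators.
results = {
    "slash/slash": is_slash_count("A/C", "G/T"),
    "slash/pipe": is_slash_count("A/C", "G|T"),
    "pipe/slash": is_slash_count("A|C", "G/T"),
    "pipe/pipe": is_slash_count("A|C", "G|T"),
}
print(results)
```

Only the pipe/pipe case is False, and that is the one left to pipe_count.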
The slash_count function is just a 2 x 2 table of conditions 01 and 02:
    def slash_count(v1, v2):
        """count according to conditions 01, 02, 04"""
        cnt = collections.Counter()
        for x in re.split("[|/]", v1):
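One way to read the "2 x 2 table" is as the four combinations of one allele from each value. The sketch below counts those combinations. It is an assumption about how the loop continues, not the original body, and `slash_count_sketch` is a hypothetical name; check it against your conditions 01, 02 and 04 before relying on it.

```python
import collections
import re


def slash_count_sketch(v1, v2):
    """Count every y + "g" + x combination of the alleles (assumed logic)."""
    cnt = collections.Counter()
    for x in re.split("[|/]", v1):      # alleles of the first value
        for y in re.split("[|/]", v2):  # alleles of the second value
            cnt[y + "g" + x] += 1
    return cnt


# Assumed example values.
cnt = slash_count_sketch("C/T", "G/C")
print(cnt)
```

With two homozygous values such as `"A/A"` and `"G/G"`, all four cells hold the same combination, so its count is 4.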
The pipe_count function is even simpler, because you just need to count the result of pipe_compute:
    def pipe_count(v1, v2):
        """count according to condition 03"""
        return collections.Counter(pipe_compute(v1, v2))
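A small run shows what the Counter gives you. The input values are assumed. Note that a Counter returns 0 for a key it has never seen, so looking up a haplotype that does not occur needs no special case.

```python
import collections


def pipe_compute(v1, v2):
    """Build H0 and H1 by pairing alleles at the same index (condition 03)."""
    xs = v1.split("|")
    ys = v2.split("|")
    return [ys[0] + "g" + xs[0], ys[1] + "g" + xs[1]]


def pipe_count(v1, v2):
    """Count the two haplotypes produced by pipe_compute."""
    return collections.Counter(pipe_compute(v1, v2))


# Assumed example values: both haplotypes are identical here.
cnt = pipe_count("A|A", "G|G")
print(cnt["GgA"], cnt["TgA"])  # 2 0
```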
Now you are done (and down). I get this result, which is slightly different from what you expected, but you have probably already spotted my error(s):
    pos       M1     M2     Mk     Mg1    H0   H1   S1     Sk1    S2     Sj
    16229783  4-CgT  4-CgT  4-CgT  1-CgT  GgC  CgT  0      1-CgT  1-CgT  1-CgT
    16229992  4-AgC  4-AgC  4-AgC  1-AgC  GgG  AgC  2-AgC  2-AgC  2-AgC  1-AgC
    16230007  4-TgA  4-TgA  4-TgA  1-TgA  AgG  TgA  2-TgA  2-TgA  2-TgA  0-TgA
    16230011  4-GgT  4-GgT  4-GgT  2-GgT  CgA  GgT  1-GgT  1-GgT  1-GgT  1-GgT
    16230049  4-AgG  4-AgG  4-AgG  4-AgG  TgC  AgG  1-AgG  0      1-AgG  1-AgG
    16230174  0      0      0      4-CgA  TgT  CgA  1-CgA  0      1-CgA  1-CgA
    16230190  0      0      0      4-AgC  TgT  AgC  0-AgC  0-AgC  0-AgC  0-AgC
    16230260  4-AgA  4-AgA  4-AgA  4-AgA  GgT  AgA  0-AgA  0-AgA  0-AgA  0-AgA
What matters, beyond the solution to this specific problem, is the method I used, which is widely used in software development. The code can be improved a lot.