Iterate through a list in Python

I have a list l of sequences (many thousands of them): l = [ABCD, AABA, ...]. I also have a file f with many 4-letter sequences (about a million of them). For each sequence in f, I want to find the closest sequence in l and, if it is within a Hamming distance of at most 2, increment the good_count counter. I wrote the following code for this, but it is very slow. I was wondering if this could be done faster.

    def hamming(s1, s2):
        if len(s1) != len(s2):
            raise ValueError("Undefined for sequences of unequal length")
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

    f = open("input.txt", "r")
    l = ["ABCD", "AABA", ...]
    good_count = 0

    for s in f:
        x = f.readline()
        dist_array = []
        for ll in l:
            dist = hamming(x, ll)
            dist_array.append(dist)
        min_dist = min(dist_array)
        if min_dist <= 2:
            good_count += 1

    print good_count

It runs fast when l and f are small, but takes far too long when they are large. Please suggest a faster solution.

+5
2 answers

Use existing libraries, such as jellyfish:

 from jellyfish import hamming_distance 

This gives you a C implementation of the Hamming distance.
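
For instance, here is a minimal sketch of the question's counting loop with jellyfish's hamming_distance() swapped in for the pure-Python hamming(). It assumes jellyfish is installed (pip install jellyfish); the contents of l are placeholders:

    # The question's counting loop, with jellyfish's C-backed
    # hamming_distance() replacing the pure-Python hamming().
    from jellyfish import hamming_distance

    l = ["ABCD", "AABA"]  # placeholder; your real list of sequences

    good_count = 0
    with open("input.txt") as f:
        for line in f:
            line = line.strip()
            if min(hamming_distance(line, ll) for ll in l) <= 2:
                good_count += 1

    print(good_count)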

+2

Oh, you just want to count HOW MANY lines match with a Hamming distance of 2 or less? That can be done much faster.

    total_count = 0
    for line in f:
        # skip the s = f.readline(), since that's what `line` already is here
        line = line.strip()  # just in case
        for ll in l:
            if hamming(line, ll) <= 2:
                total_count += 1
                break  # skip the rest of the `ll in l` loop
    # and then you don't need any processing afterwards either

Note that most of your code's running time will be spent on this line:

  if hamming(line, ll) <= 2: 

Thus, any improvement you can make to that algorithm will have the LARGEST effect on your overall script speed. Bude's answer praises the virtues of jellyfish's hamming_distance function; without any personal experience with it, I can't recommend it myself, but his advice to use a faster Hamming distance implementation is sound!
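
If you want to see how much a faster implementation buys you on your own machine, a quick timeit comparison will show it. This is only a sketch, assuming jellyfish is installed; the sample strings are made up:

    # Time the pure-Python hamming() against jellyfish's C-backed version.
    import timeit

    setup = '''
    from jellyfish import hamming_distance

    def hamming(s1, s2):
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

    a, b = "ABCD", "AACD"
    '''

    print(timeit.timeit("hamming(a, b)", setup=setup, number=1000000))
    print(timeit.timeit("hamming_distance(a, b)", setup=setup, number=1000000))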


Peter DeGlopper suggests expanding the list l into six different sets of "two or fewer mismatches" matches: a group of sets, one per pair of positions. Two 4-letter strings are within Hamming distance 2 exactly when they agree in at least two of the four positions, so matching some sequence in l on any one of the six position pairs is enough, and a set membership test replaces the distance computation. It might look like this:

    # hamming_sets is [ {AB??}, {A?C?}, {A??D}, {?BC?}, {?B?D}, {??CD} ]
    hamming_sets = [set(), set(), set(), set(), set(), set()]

    for ll in l:  # this should take the lion's share of the time in your program
        hamming_sets[0].add(ll[0] + ll[1])
        hamming_sets[1].add(ll[0] + ll[2])
        hamming_sets[2].add(ll[0] + ll[3])
        hamming_sets[3].add(ll[1] + ll[2])
        hamming_sets[4].add(ll[1] + ll[3])
        hamming_sets[5].add(ll[2] + ll[3])

    total_count = 0
    for line in f:  # and this should be fast, even if `f` is large
        line = line.strip()
        if line[0] + line[1] in hamming_sets[0] or \
           line[0] + line[2] in hamming_sets[1] or \
           line[0] + line[3] in hamming_sets[2] or \
           line[1] + line[2] in hamming_sets[3] or \
           line[1] + line[3] in hamming_sets[4] or \
           line[2] + line[3] in hamming_sets[5]:
            total_count += 1

You could gain readability by making hamming_sets a dictionary of transform_function: set_of_results key-value pairs.

    hamming_sets = {lambda s: s[0] + s[1]: set(),
                    lambda s: s[0] + s[2]: set(),
                    lambda s: s[0] + s[3]: set(),
                    lambda s: s[1] + s[2]: set(),
                    lambda s: s[1] + s[3]: set(),
                    lambda s: s[2] + s[3]: set()}

    for func, set_ in hamming_sets.items():
        for ll in l:
            set_.add(func(ll))

    total_count = 0
    for line in f:
        line = line.strip()
        if any(func(line) in set_ for func, set_ in hamming_sets.items()):
            total_count += 1
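
If writing out all six position pairs by hand bothers you, the same sets can be generated with itertools.combinations. A minimal sketch, assuming l and f are defined as in the question (the constant names SEQ_LEN and MAX_DIST are mine):

    # Build the position-pair sets programmatically, so nothing is
    # hard-wired to 4-letter strings and a distance threshold of 2.
    from itertools import combinations

    SEQ_LEN = 4   # length of every sequence
    MAX_DIST = 2  # distance <= MAX_DIST means >= SEQ_LEN - MAX_DIST positions agree

    pairs = list(combinations(range(SEQ_LEN), SEQ_LEN - MAX_DIST))
    hamming_sets = {pos: set() for pos in pairs}  # {(0, 1): set(), (0, 2): set(), ...}

    for ll in l:
        for pos in pairs:
            hamming_sets[pos].add("".join(ll[i] for i in pos))

    total_count = 0
    for line in f:
        line = line.strip()
        if any("".join(line[i] for i in pos) in hamming_sets[pos] for pos in pairs):
            total_count += 1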
+2

Source: https://habr.com/ru/post/1262954/

