There are some weird things in your code.
What you call “permutations” is really a Cartesian product, which can be computed with itertools.product.
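As a small illustration, here is a sketch of generating every fixed-width motif with itertools.product. The helper name `all_motifs` and the `AGCT` alphabet are assumptions made for the example, not something taken from your code.

```python
from itertools import product

def all_motifs(width, alphabet="AGCT"):
    """Return every string of the given width over the alphabet.

    Hypothetical helper: this is a Cartesian product of the alphabet
    with itself, not a set of permutations.
    """
    return ["".join(p) for p in product(alphabet, repeat=width)]

motifs = all_motifs(2)
print(len(motifs))   # 16, i.e. 4 ** 2
print(motifs[:4])    # ['AA', 'AG', 'AC', 'AT']
```

For width 7 the same call yields 4 ** 7 = 16384 motifs.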
Since Python is zero-indexed, the first character of a string has index 0, so a comparison like i[2].find(sMotif) < 1 returns True both when the motif is absent (find returns -1) and when it sits right at the start of the string, which seems a little strange.
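To make that concrete, the sketch below shows what str.find returns in the three relevant cases. The sample sequences and the motif are made up for the demonstration. If the intent was "motif not present", comparing against -1 (or using `in`) would say so directly, but that is a guess about what you meant.

```python
motif = "GAT"
sequences = [
    "GATTACA",  # motif at index 0
    "CCGATTA",  # motif at index 2
    "CCCCCCC",  # motif absent
]

positions = [s.find(motif) for s in sequences]
print(positions)                         # [0, 2, -1]

# "< 1" lumps "at the start" together with "absent":
below_one = [p < 1 for p in positions]
print(below_one)                         # [True, False, True]

# An explicit absence test separates the two cases:
absent = [motif not in s for s in sequences]
print(absent)                            # [False, False, True]
```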
Your OddsRatio, PValue and Enrichment calculations are inside the loop, but neither the zeroing of the counts nor the print is. That means you compute them cumulatively for each new row but do nothing with the result.
You recompute i[2].find(sMotif) several times in the typical case. The result is not cached.
Assuming I understand the numbers you are trying to compute (and I may well not, because there are a few things you do that I don't understand), I would flip the logic. Instead of looping over each motif and trying to count it in each row, loop over each row and see what is there. That costs roughly 7 * the number of rows instead of the number of motifs * the number of rows.
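A minimal sketch of the caching idea: call find once, keep the position in a local variable, and branch on that. The function name `classify` and its return strings are hypothetical and only serve the example.

```python
def classify(seq, motif):
    """Hypothetical helper: scan the sequence once and reuse the result."""
    pos = seq.find(motif)        # single scan; -1 means "not found"
    if pos == -1:
        return "absent"
    if pos == 0:
        return "at start"
    return "at offset %d" % pos

print(classify("GATTACA", "GAT"))   # at start
print(classify("CCGATTA", "GAT"))   # at offset 2
print(classify("CCCCCCC", "GAT"))   # absent
```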
For instance:
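```python
import random
from itertools import product
from collections import defaultdict, Counter

N = 12000
datalength = 400
# Random test data: [symbol, value, sequence] triples.
listoflists = [[str(i), random.uniform(-1, 1),
                ''.join([random.choice('AGCT') for c in range(datalength)])]
               for i in range(N)]

def chunk(seq, width):
    # Yield every overlapping window of the given width.
    for i in range(len(seq)-width+1):
        yield seq[i:i+width]

def count_motifs(datatriples, width=7):
    motif_counts_by_down = defaultdict(Counter)
    nonmotif_counts_by_down = defaultdict(Counter)
    all_motifs = set(''.join(p) for p in product('AGCT', repeat=width))
    for symbol, value, sdata in datatriples:
        down = value < -0.5
        # The listing is cut off after the line above. The rest is a
        # reconstruction inferred from the names used in the text
        # (motifs_not_seen) and from the two mappings shown in the output.
        motifs_seen = set(chunk(sdata, width))
        motifs_not_seen = all_motifs - motifs_seen
        # Count each motif at most once per row.
        motif_counts_by_down[down].update(motifs_seen)
        nonmotif_counts_by_down[down].update(motifs_not_seen)
    return motif_counts_by_down, nonmotif_counts_by_down
```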
(I reduced the string length just to get the output faster; if the strings are 10 times longer, the code takes 10 times as long.)
On my slow laptop this produces (with some line breaks inserted):
```
>>> %time mot, nomot = count_motifs(listoflists, 7)
CPU times: user 1min 50s, sys: 60 ms, total: 1min 50s
Wall time: 1min 50s
```
So I would estimate about 20 minutes for the full problem, which is not bad for so little code. (We could speed up the motifs_not_seen part by doing arithmetic instead, but that would only give us a factor of two anyway.)
On a much smaller case, where it is easier to see the output:
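The arithmetic shortcut could look roughly like the sketch below: rather than building the complement set for every row, count the rows in each group and subtract the "seen" counts at the end. The name `count_motifs_arith`, the `alphabet` parameter and the three sample rows are hypothetical. The -0.5 threshold mirrors the one used in count_motifs.

```python
from collections import Counter, defaultdict
from itertools import product

def chunk(seq, width):
    # Yield every overlapping window of the given width.
    for i in range(len(seq) - width + 1):
        yield seq[i:i + width]

def count_motifs_arith(datatriples, width=7, alphabet="AGCT"):
    seen = defaultdict(Counter)   # group -> motif -> rows containing it
    rows = Counter()              # group -> number of rows in the group
    for symbol, value, sdata in datatriples:
        down = value < -0.5
        rows[down] += 1
        seen[down].update(set(chunk(sdata, width)))

    # rows lacking a motif = rows in the group - rows containing the motif
    motifs = ["".join(p) for p in product(alphabet, repeat=width)]
    not_seen = {}
    for down, total in rows.items():
        not_seen[down] = Counter({
            m: total - seen[down][m]
            for m in motifs
            if total - seen[down][m] > 0   # drop motifs present in every row
        })
    return seen, not_seen

# Made-up sample: two "down" rows and one other row.
sample = [("a", -0.9, "AGCT"), ("b", -0.7, "AGAG"), ("c", 0.3, "TTTT")]
seen, not_seen = count_motifs_arith(sample, width=2)
print(seen[True]["AG"], not_seen[True]["AG"])   # 2 0
print(seen[True]["GC"], not_seen[True]["GC"])   # 1 1
```

The per-row work drops to building one small set, and the subtraction over all motifs happens once per group instead of once per row.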
```
>>> mot, nomot = count_motifs(listoflists, 2)
>>> mot
defaultdict(<class 'collections.Counter'>,
{False: Counter({'CG': 61, 'TC': 58, 'AT': 55, 'GT': 54, 'CA': 53, 'GA': 53,
                 'AC': 52, 'CT': 51, 'CC': 50, 'AG': 49, 'TA': 48, 'GC': 47,
                 'GG': 45, 'TG': 45, 'AA': 43, 'TT': 40}),
 True: Counter({'CT': 27, 'GT': 26, 'TC': 24, 'GC': 23, 'TA': 23, 'AC': 22,
                'AG': 21, 'TG': 21, 'CC': 19, 'CG': 19, 'CA': 19, 'GG': 18,
                'TT': 17, 'GA': 17, 'AA': 16, 'AT': 16})})
>>> nomot
defaultdict(<class 'collections.Counter'>,
{False: Counter({'TT': 31, 'AA': 28, 'GG': 26, 'TG': 26, 'GC': 24, 'TA': 23,
                 'AG': 22, 'CC': 21, 'CT': 20, 'AC': 19, 'GA': 18, 'CA': 18,
                 'GT': 17, 'AT': 16, 'TC': 13, 'CG': 10}),
 True: Counter({'AA': 13, 'AT': 13, 'GA': 12, 'TT': 12, 'GG': 11, 'CC': 10,
                'CA': 10, 'CG': 10, 'AG': 8, 'TG': 8, 'AC': 7, 'GC': 6,
                'TA': 6, 'TC': 5, 'GT': 3, 'CT': 2})})
```