Counting the presence and absence of substrings in a sequence list with Python

You can get the data here (2shared download).

I am analyzing biological data using python.

I wrote code to find given substrings in a list of long strings.

The substrings are listed and each is 7 nucleotides long.

So, in the list from AAAAAAA to TTTTTTT there are 4^7 = 16384 motifs (substrings), i.e. all combinations of A, C, G, T.
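As a sanity check of that 16384 figure (this snippet is my own, not part of the original code), the full motif universe can be enumerated with itertools.product:

```python
from itertools import product

# All length-7 motifs over the DNA alphabet: 4**7 = 16384 combinations,
# in lexicographic order from AAAAAAA to TTTTTTT.
motifs = [''.join(p) for p in product('ACGT', repeat=7)]
print(len(motifs))        # 16384
print(motifs[0], motifs[1], motifs[-1])  # AAAAAAA AAAAAAC TTTTTTT
```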

The code has a for loop over the list of substrings, with a loop over the list of long strings nested inside.

It works fine, but because the list of lists has 12,000 rows, the code is very slow.

In other words, processing one motif (AAAAAAA) and moving on to the next one (AAAAAAC) takes about 2 minutes.

So if each of the 16384 motifs takes 2 minutes to scan the 12,000 rows, the whole run will take 16384 * 2 = 32768 min → 546 hours → 22 days ...

I use scipy and numpy to compute the P-values.

What I want is to count, for each substring, how many sequences it is present in and how many it is absent from.

The list of long strings looks like this:

 list_of_lists_long = [
     [BGN,    -0.054, AGGCAGCTGCAGCCACCGCGGGGCCTCAGTGGGGGTCTCTGG....],
     [ABCB7,   0.109, GTCACATAAGACATTTTCTTTTTTTGTTGTTTTGGACTACAT....],
     [GPR143, -0.137, AGGGGATGTGCTGGGGGTCCAGACCCCATATTCCTCAGACTC....],
     [PLP2,   -0.535, GCGAACTTCCCTCATTTCTCTCTGCAATCTGCAAATAACTCC....],
     [VSIG4,   0.13,  AAATGCCCCATTAGGCCAGGATCTGCTGACATAATTGCCTAG....],
     [CCNB3,  -0.071, CAGCAGCCACAGGGCTAAGCATGCATGTTAACAGGATCGGGA....],
     [TCEAL3,  0.189, TGCCTTTGGCCTTCCATTCTGATTTCTCTGATGAGAATACGA....],
     ....
 ]  # 12,000 rows

Is there faster logic that would speed up the execution?

I need your help!

Thanks in advance.

=======================================

Is there a simpler method that doesn't need anything extra?

I think iteration for pattern matching is the problem ...

What I'm trying to find is BOTH how many times a length-7 motif is detected across the whole list of sequences AND how many times it is NOT detected. So if the motif is present in a string (True as a bool), increment one counter; if it is absent (False), increment the other counter.

NOT the number of occurrences of the motif within each string.
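In other words (a minimal sketch with made-up names, not the original code), each motif needs one counter for sequences that contain it and one for sequences that don't:

```python
def presence_absence(motif, sequences):
    """Count sequences containing `motif` at least once vs. not at all."""
    present = sum(1 for seq in sequences if motif in seq)
    absent = len(sequences) - present
    return present, absent

seqs = ['ACGTACGT', 'TTTTTTTT', 'TTACGTTT']
print(presence_absence('ACGT', seqs))  # (2, 1): in the 1st and 3rd only
```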

6 answers

The problem was the Fisher's exact test. If I move the P-value calculation outside the for loop, the computation becomes much faster.
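For reference, a sketch of what that looks like with scipy.stats.fisher_exact: accumulate the 2×2 counts in the loop, then run the test once per motif afterwards (the table values here are invented for illustration):

```python
from scipy.stats import fisher_exact

# Hypothetical counts for one motif (illustrative numbers only):
# rows = motif present / absent, columns = down-regulated / not.
table = [[8, 2],
         [1, 5]]

# One test per motif, outside the inner loop over sequences:
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)
```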


Great question. This is a classic computer-science problem. Yes, there is a better algorithm: your approach processes each long string 16,384 times. It is better to process each long string only once.

Instead of looking for each motif in each long string, you just need to record which motifs appear in each long string. For example, if you searched for motifs of length 2 in the following string:

 s = 'ACGTAC' 

then you can run a loop over substrings of length 2 and record which ones are present in a dict :

 motifAppearances = {}
 for i in range(len(s)-1):
     motif = s[i:i+2]                  # grab a length=2 substring
     if motif not in motifAppearances:
         motifAppearances[motif] = 0   # initialize the count
     motifAppearances[motif] += 1      # increment the count

Now you have processed the entire string exactly once and found all the motifs present in it. In this case, the resulting dict looks like this:

 motifAppearances = {'AC':2, 'CG':1, 'GT':1, 'TA':1} 

Doing something similar for your case should reduce the execution time by a factor of roughly 16,384.
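Adapting that idea to the question (a sketch with my own names, assuming length-7 motifs and once-per-sequence counting): take the set of length-7 substrings of each sequence, update a presence counter, and derive absence as total minus presence.

```python
from collections import Counter

def count_presence(sequences, width=7):
    """For each motif of length `width`, count how many sequences contain it."""
    presence = Counter()
    for seq in sequences:
        # a set ensures each motif counts at most once per sequence
        seen = {seq[i:i+width] for i in range(len(seq) - width + 1)}
        presence.update(seen)
    return presence

seqs = ['AAAAAAAC', 'CAAAAAAA']
presence = count_presence(seqs)
print(presence['AAAAAAA'])                # 2 (appears in both sequences)
absent = len(seqs) - presence['AAAAAAC']  # 1
```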


A clean and very fast way (~15 s with the OP's data) would be to use scikit-learn's CountVectorizer , since it uses numpy under the hood, for example:

 from sklearn.feature_extraction.text import CountVectorizer

 def make_chunks(s):
     width = 2
     return [s[i:i+width] for i in range(len(s)-width+1)]

 l = ['ATTGCGGCTCACGAA', 'ACCTAGATACGACGG', 'CCCCTGTCCATGGTA']
 vectorizer = CountVectorizer(tokenizer=make_chunks)
 X = vectorizer.fit_transform(l)

Now X is a sparse matrix with all possible chunks as columns and the sequences as rows, where each value is the number of occurrences of that chunk in that sequence:

 >>> X.toarray()
 #  aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 [[1 1 0 1 1 0 2 1 1 2 1 0 0 1 1 1]   # ATTGCGGCTCACGAA
  [0 3 1 1 0 1 2 1 2 0 1 0 2 0 0 0]   # ACCTAGATACGACGG
  [0 0 0 1 1 4 0 1 0 0 1 2 1 1 2 0]]  # CCCCTGTCCATGGTA
 >>> (X.toarray() > 0).astype(int)  # the same, but counting at most once per sequence
 [[1 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1]
  [0 1 1 1 0 1 1 1 1 0 1 0 1 0 0 0]
  [0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 0]]
 >>> vectorizer.get_feature_names()  # the columns (chunks)
 [u'aa', u'ac', u'ag', u'at', u'ca', u'cc', u'cg', u'ct', u'ga', u'gc', u'gg',
  u'gt', u'ta', u'tc', u'tg', u'tt']

Now you can sum along the columns, mask unneeded values, or do whatever manipulations you need, for example:

 >>> X.sum(axis=0)
 [[1 4 1 3 2 5 4 3 3 2 3 2 3 2 3 1]]

Finally, to find out how many occurrences a given motif has, look up the index of the corresponding motif/chunk and read off that entry of the sum above:

 >>> vectorizer.vocabulary_.get('ag')  # 'ag' is your motif
 2                                     # i.e. the third column

In your case, you will need to split your list into two parts (positive and negative values) to handle the down-regulation condition. I quickly checked with the list from DSM's answer, and it takes about 3-4 seconds on my computer. If I use 12,000 sequences of length 4,000, it takes a little under a minute.
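The split itself is plain numpy once you have the dense presence matrix from (X.toarray() > 0); the tiny arrays below are stand-ins for illustration:

```python
import numpy as np

# Stand-in for (X.toarray() > 0).astype(int): rows = sequences, cols = motifs.
presence = np.array([[1, 0, 1],
                     [0, 1, 1],
                     [1, 1, 0]])
values = np.array([-0.054, 0.109, -0.137])  # per-sequence scores

neg = values < 0
# How many negative-score sequences contain each motif:
present_neg = presence[neg].sum(axis=0)
absent_neg = neg.sum() - present_neg
print(present_neg, absent_neg)  # [2 1 1] [0 1 1]
```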

EDIT: the full code using the OP's data can be found here.


There are some weird things in your code.

  • What you call "permutations" is actually a Cartesian product, which can be computed with itertools.product .

  • Since Python indexing is zero-based, the first character of a string is at index 0, so a comparison like i[2].find(sMotif) < 1 is True both when the motif sits right at the beginning and when it is absent ( find returns -1), which seems a little odd.

  • Your OddsRatio, PValue and Enrichment calculations are inside the loop, but you neither zero the counts nor print inside it, which means you compute them cumulatively for each new row yet do nothing with the information.

  • You call i[2].find(sMotif) several times per row in the typical case, and the result is not cached.
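To make the find pitfall concrete (my own toy example): find returns -1 when the motif is absent, so a < 1 test conflates "absent" with "found at position 0"; the unambiguous presence test is in , and its result is cheap to cache.

```python
s = 'GATTACA'
print(s.find('GAT'))   # 0  -> 0 < 1 is True (found at the very start)
print(s.find('CCC'))   # -1 -> -1 < 1 is also True (not found at all!)

# Unambiguous presence test, computed once and reused:
found = 'GAT' in s
print(found)           # True
```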


Assuming I understand the numbers you're trying to compute (and I could be wrong, since there are a few things in your code I don't follow), I would flip the logic. Instead of iterating over each motif and trying to count it in each row, iterate over each row once and record which motifs occur in it. That costs roughly (length of each string) × (number of rows) instead of (number of motifs) × (number of rows).

For instance:

 import random
 from itertools import product
 from collections import defaultdict, Counter

 N = 12000
 datalength = 400
 listoflists = [[str(i), random.uniform(-1, 1),
                 ''.join([random.choice('AGCT') for c in range(datalength)])]
                for i in range(N)]

 def chunk(seq, width):
     for i in range(len(seq)-width+1):
         yield seq[i:i+width]

 def count_motifs(datatriples, width=7):
     motif_counts_by_down = defaultdict(Counter)
     nonmotif_counts_by_down = defaultdict(Counter)
     all_motifs = set(''.join(p) for p in product('AGCT', repeat=width))
     for symbol, value, sdata in datatriples:
         down = value < -0.5
         # what did we see?
         motifs_seen = set(chunk(sdata, width))
         # what didn't we see?
         motifs_not_seen = all_motifs - motifs_seen
         # accumulate these
         motif_counts_by_down[down].update(motifs_seen)
         nonmotif_counts_by_down[down].update(motifs_not_seen)
     return motif_counts_by_down, nonmotif_counts_by_down

(I reduced the string length just to make it run faster; if the strings are 10 times longer, the code takes about 10 times longer.)

This gives, on my slow laptop (after pasting in the lines above):

 >>> %time mot, nomot = count_motifs(listoflists, 7)
 CPU times: user 1min 50s, sys: 60 ms, total: 1min 50s
 Wall time: 1min 50s

So I'd estimate about 20 minutes for the complete problem, which isn't bad for so little code. (The motifs_not_seen part could be sped up by doing arithmetic instead of set subtraction, but at best that gains us a factor of two.)

In a much smaller case, where the output is easier to read:

 >>> mot, nomot = count_motifs(listoflists, 2)
 >>> mot
 defaultdict(<class 'collections.Counter'>,
             {False: Counter({'CG': 61, 'TC': 58, 'AT': 55, 'GT': 54, 'CA': 53,
                              'GA': 53, 'AC': 52, 'CT': 51, 'CC': 50, 'AG': 49,
                              'TA': 48, 'GC': 47, 'GG': 45, 'TG': 45, 'AA': 43,
                              'TT': 40}),
              True: Counter({'CT': 27, 'GT': 26, 'TC': 24, 'GC': 23, 'TA': 23,
                             'AC': 22, 'AG': 21, 'TG': 21, 'CC': 19, 'CG': 19,
                             'CA': 19, 'GG': 18, 'TT': 17, 'GA': 17, 'AA': 16,
                             'AT': 16})})
 >>> nomot
 defaultdict(<class 'collections.Counter'>,
             {False: Counter({'TT': 31, 'AA': 28, 'GG': 26, 'TG': 26, 'GC': 24,
                              'TA': 23, 'AG': 22, 'CC': 21, 'CT': 20, 'AC': 19,
                              'GA': 18, 'CA': 18, 'GT': 17, 'AT': 16, 'TC': 13,
                              'CG': 10}),
              True: Counter({'AA': 13, 'AT': 13, 'GA': 12, 'TT': 12, 'GG': 11,
                             'CC': 10, 'CA': 10, 'CG': 10, 'AG': 8, 'TG': 8,
                             'AC': 7, 'GC': 6, 'TA': 6, 'TC': 5, 'GT': 3,
                             'CT': 2})})
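From those two dicts you can then assemble the 2×2 table for any single motif; the stand-in counters below reuse the 'CG' numbers from the sample output above rather than a real run of count_motifs:

```python
from collections import defaultdict, Counter

# Tiny stand-ins for the counters returned by count_motifs.
mot = defaultdict(Counter, {True: Counter({'CG': 19}),
                            False: Counter({'CG': 61})})
nomot = defaultdict(Counter, {True: Counter({'CG': 10}),
                              False: Counter({'CG': 10})})

def contingency_table(motif, mot, nomot):
    """2x2 table: rows = down / not down, columns = present / absent."""
    return [[mot[True][motif], nomot[True][motif]],
            [mot[False][motif], nomot[False][motif]]]

print(contingency_table('CG', mot, nomot))  # [[19, 10], [61, 10]]
```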

Basically your problem is sequence comparison. Searching for a motif in a sequence is a fundamental problem in bioinformatics, so you could look for existing algorithms or packages. I searched for the keywords "motif match" on Google, and this is what I found on the first page: http://biowhat.ucsd.edu/homer/motif/ http://meme.nbcr.net/meme/doc/mast.html http://www.biomedcentral.com/1471-2105/8/189 http://www.benoslab.pitt.edu/stamp/


We wrote a tool called Sylamer. It counts occurrences of words of a given fixed length. By default it computes hypergeometric tests in the context of ranked gene lists, but it can also just output the counts. It can do this for all possible words of a given length, or you can specify a smaller list of words. It is written in C and heavily optimized for speed; we made the hypergeometric computation very fast by linking against the GNU Scientific Library.
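The hypergeometric test mentioned here can be sketched with nothing but math.comb (an illustration of the statistic, not Sylamer's actual implementation): the probability of seeing at least k word-containing sequences among the top n of a ranked list of N sequences, K of which contain the word overall.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for a hypergeometric draw of n items from N, K of them marked."""
    total = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(K, n) + 1)) / total

# e.g. 40 of 100 sequences contain the word, and 12 of the top 20 do:
print(hypergeom_sf(12, 100, 40, 20))
```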


Source: https://habr.com/ru/post/958203/

