Trying to parallelize a Python algorithm using multithreading while avoiding GIL restrictions


I implement the algorithm in Python using Biopython. I have several alignments (sets of sequences of equal length) stored in FASTA files. Each alignment contains from 500 to 30,000 sequences, and each sequence is about 17,000 elements long. Each sequence is stored as a Bio.SeqRecord.SeqRecord object (see the SeqRecord object API documentation for more information), which contains not only the sequence itself but also some information about it. I read each alignment from disk with Bio.AlignIO.read() (see the AlignIO module API documentation for more information), which returns a MultipleSeqAlignment object:

    seqs = AlignIO.read(seqs_filename, 'fasta')
    len_seqs = seqs.get_alignment_length()
    stats = {'-': [0.0] * len_seqs, 'A': [0.0] * len_seqs,
             'G': [0.0] * len_seqs, 'C': [0.0] * len_seqs,
             'T': [0.0] * len_seqs}


I use this sketch for clarity: [image not available]


Since I want to parallelize the analysis of the alignment, I assign a fragment to each available processor using the threading module (more on why I made this decision later):

    num_cpus = cpu_count()
    num_columns = ceil(len_seqs / float(num_cpus))
    start_column = 0
    threads = []
    for cpu in range(0, num_cpus):
        section = (start_column, start_column + num_columns)
        threads.append(CI_Thread(seqs_type, seqs, section, stats))
        threads[cpu].start()
        start_column += num_columns
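(Not shown above: before stats is read back, the main thread also has to wait for all workers to finish; a minimal sketch of that step:)

    # Wait for every worker to finish before the shared "stats" dict is used
    for thread in threads:
        thread.join()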

The stats variable is shared; in it I store some information about the analysis (as you can see in the first code fragment). Since each thread is going to modify different positions of this shared variable, I think there is no need for access control or any synchronization primitive. The main reason I use threads instead of processes is that I want to avoid copying the entire MultipleSeqAlignment object for each CPU. I did some research and found some Stack Overflow posts about this.
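CI_Thread itself is not listed here. As a rough sketch, it is a threading.Thread subclass whose constructor arguments follow the call above; the run() body and the analyze_section name below are only placeholders for the analysis loop shown further down:

    from threading import Thread

    class CI_Thread(Thread):
        """Illustrative sketch: analyzes one column section of the alignment."""

        def __init__(self, seqs_type, seqs, section, stats):
            super(CI_Thread, self).__init__()
            self.seqs_type = seqs_type
            self.seqs = seqs
            self.start_column, self.end_column = section
            self.stats = stats  # shared dict; this thread only writes its own columns

        def run(self):
            # analyze_section is a placeholder for the per-fragment analysis
            # loop shown further down
            analyze_section(self.seqs, self.start_column, self.end_column,
                            self.stats)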

I also read some information about the "dreaded" Python Global Interpreter Lock (GIL) (I found some great info both in https://stackoverflow.com/a/4129/... and on Programmers Stack Exchange), but I'm still not 100% sure whether my algorithm is affected by it. As far as I know, I load the sequences into memory one at a time, doing I/O on each iteration. This is why I considered it a good idea to use threads, as pointed out in https://stackoverflow.com/a/3/5/855008/..., which I already mentioned above. The basic structure of the analysis (which is performed by each thread) looks something like this:

    for seq in seqs:
        num_column = start_column
        for column in seq.seq[start_column:end_column].upper():
            # [...]
            try:
                stats[elem][num_column] = get_some_info(column)
            except TypeError:
                raise TypeError(('"stats" argument should be a '
                                 'dict of lists of int'))

I conducted several performance tests using the timeit module and the time command with the -f "%e %M" arguments, to check not only the elapsed real time (in seconds) but also the maximum resident set size of the process during its lifetime (in kilobytes). It seems that the runtime with threads is the runtime of the sequential implementation divided by the number of threads. For the maximum resident set size, I still can't find a pattern.
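For reference, a minimal sketch of that kind of measurement (the module and function names below are placeholders, not the actual ones):

    import timeit

    # Wall-clock time of one full run of the threaded analysis;
    # "analysis" / "run_analysis" are placeholder names
    elapsed = timeit.timeit('run_analysis()',
                            setup='from analysis import run_analysis',
                            number=1)
    print(elapsed)

    # Peak memory comes from GNU time instead, e.g.:
    #   /usr/bin/time -f "%e %M" python analysis.py
    # which prints elapsed real time (seconds) and maximum RSS (kilobytes)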


If you have any other suggestions for performance or clarity, I would really appreciate it.


Thanks in advance.

1 answer

Threading will not help you if you want to run a CPU-intensive algorithm over your data. Take a look at the multiprocessing module, which I used with great success in a project that did custom OCR on a 100 MB scanned image.

Consider this to address the shared-memory problem:

    from multiprocessing import Pool

    def f(x):
        return x * x

    if __name__ == '__main__':
        pool = Pool(processes=4)            # start 4 worker processes
        result = pool.apply_async(f, [10])  # evaluate "f(10)" asynchronously
        print(result.get(timeout=1))        # prints "100" unless your computer is *very* slow
        print(pool.map(f, range(10)))       # prints "[0, 1, 4, ..., 81]"

Take a look at pool.imap() and pool.apply_async().
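As a rough sketch of how that could map onto the column sections from the question (the worker function below is only illustrative, and seqs / len_seqs are assumed to be loaded as in the first code fragment):

    from math import ceil
    from multiprocessing import Pool, cpu_count

    def analyze_section(args):
        """Illustrative worker: analyze one column slice and return a partial
        stats dict instead of writing into a shared one."""
        seqs, start_column, end_column = args
        partial_stats = {}
        # ... the same per-column analysis as in the question ...
        return partial_stats

    if __name__ == '__main__':
        num_cpus = cpu_count()
        num_columns = int(ceil(len_seqs / float(num_cpus)))
        sections = [(seqs, start, min(start + num_columns, len_seqs))
                    for start in range(0, len_seqs, num_columns)]
        with Pool(processes=num_cpus) as pool:
            # Each worker gets its own copy of the alignment and returns its
            # partial results, which are merged in the parent process
            partial_results = pool.map(analyze_section, sections)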

Hope this helps.
