Multiprocessing imap_unordered in Python

I am writing a program that reads several files and writes a summary of each file to an output file. The output is quite large, so storing it all in memory is not a good idea. I am trying to do this with multiprocessing. So far, the simplest approach I have come up with is:

    from multiprocessing import Pool
    import glob

    pool = Pool(processes=4)
    it = pool.imap_unordered(do, glob.iglob(aglob))
    for summary in it:
        writer.writerows(summary)

do is a function that summarizes a single file; writer is a csv.writer object.
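For reference, a minimal sketch of how do and writer could be defined. The summary logic here (line and byte counts) and the output filename are purely hypothetical; only the names come from the question:

    import csv

    def do(path):
        # Hypothetical per-file summary: count lines and bytes.
        # Must be a module-level function so the pool can pickle it.
        lines = nbytes = 0
        with open(path, "rb") as f:
            for line in f:
                lines += 1
                nbytes += len(line)
        return [(path, lines, nbytes)]  # a list of rows for writer.writerows

    outfile = open("summary.csv", "w", newline="")
    writer = csv.writer(outfile)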

But the truth is that I still do not fully understand multiprocessing.imap. Does this mean that 4 summaries are computed in parallel, and that when I read one of them, computation of the fifth one starts?

Is there a better way to do this?

Thanks.

+6
1 answer

processes=4 means that multiprocessing starts a pool with four worker processes and sends work items to them. Ideally, if your system supports it, i.e. you either have four cores or the workers are not fully CPU-bound, 4 work items will be processed in parallel.

I do not know the multiprocessing implementation in detail, but I think the results of do will be buffered internally even before you read them, i.e. the fifth item will be computed as soon as any worker finishes an item from the first batch.
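This behavior is easy to observe with a toy example (not from the question; the random sleep just simulates variable per-item cost). Results arrive in completion order rather than input order, and the pool keeps handing new items to workers while you consume earlier results:

    import random
    import time
    from multiprocessing import Pool

    def work(n):
        time.sleep(random.random())  # simulate variable per-item cost
        return n

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            # Items are printed as workers finish them, generally out of
            # input order; a free worker immediately picks up the next item.
            for n in pool.imap_unordered(work, range(10)):
                print(n)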

Whether there is a better way depends on your data: how many files need processing in total, how large the summary objects are, and so on. If you have many files (say, more than 10 thousand), batching them may be an option via

 it = pool.imap_unordered(do, glob.iglob(aglob), chunksize=100) 

That way, a work item is not a single file but 100 files, and results are also reported in batches of 100. If you have many work items, chunking reduces the overhead of pickling and unpickling the result objects.
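Put together with the names from the question (do, writer, aglob as above), a sketch of the chunked version. Note that do is still called once per file; chunksize only controls how many items and results travel per inter-process message:

    from multiprocessing import Pool
    import glob

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            it = pool.imap_unordered(do, glob.iglob(aglob), chunksize=100)
            # do() still runs once per file; only the IPC traffic is batched,
            # so results tend to arrive in bursts of up to 100.
            for summary in it:
                writer.writerows(summary)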

+4

Source: https://habr.com/ru/post/890208/

