I have 110 PDF files from which I am trying to extract images. Once the images have been extracted, I would like to remove any duplicates and delete images smaller than 4 KB. My code for this is as follows:
import os
import shutil
import sys
import md5
from glob import glob
from multiprocessing import Pool
from subprocess import call

import pandas as pd
from PIL import Image

def extract_images_from_file(pdf_file):
    # Extract every image in the PDF as a PNG, then delete the PDF copy.
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    call(["pdfimages", "-png", pdf_file, file_name])
    os.remove(pdf_file)

def dedup_images():
    os.mkdir("unique_images")
    md5_library = []
    images = glob("*.png")
    print "Deleting images smaller than 4KB and generating the MD5 hash values for all other images..."
    for image in images:
        if os.path.getsize(image) <= 4000:
            os.remove(image)
        else:
            # Hash the pixel data so identical images get identical digests.
            m = md5.new()
            image_data = list(Image.open(image).getdata())
            image_string = "".join(["".join([str(tpl[0]), str(tpl[1]), str(tpl[2])]) for tpl in image_data])
            m.update(image_string)
            md5_library.append([image, m.digest()])
    headers = ['image_file', 'md5']
    dat = pd.DataFrame(md5_library, columns=headers).sort(['md5'])
    dat.drop_duplicates(subset="md5", inplace=True)
    print "Extracting the unique images."
    unique_images = dat.image_file.tolist()
    for image in unique_images:
        old_file = image
        new_file = "unique_images\\" + image
        shutil.copy(old_file, new_file)
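(Side note on the hashing step: from what I've read, hashing the raw file bytes with hashlib should be much faster than building a giant string of pixel values, at the cost of only catching byte-identical duplicates. The sketch below is just my stripped-down understanding of that idea; hash_images and min_size are made-up names, not part of my actual script.)

# Stripped-down sketch of the dedup idea: hash the raw file bytes
# with hashlib instead of building a huge pixel string. This only
# catches byte-identical duplicates, but it is much faster.
import hashlib
import os
from glob import glob

def hash_images(min_size=4000):
    seen = {}                      # digest -> first file seen with that hash
    for image in glob("*.png"):
        if os.path.getsize(image) <= min_size:
            os.remove(image)       # drop anything at or under ~4 KB
            continue
        with open(image, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        seen.setdefault(digest, image)
    return seen.values()           # one representative per unique hash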
This process can take some time, so I started reading up on multiprocessing. Feel free to interpret that as me saying I have no idea what I'm doing. I figured the image extraction would parallelize easily, but that the deduplication would not, since it performs many I/O operations on individual files and I didn't know how to coordinate that. So here is my attempt at a parallel version:
if __name__ == '__main__':
    filepath = sys.argv[1]
    folder_name = os.getcwd() + "\\all_images\\"
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
    pdfs = glob("*.pdf")
    print "Copying all PDFs to the images folder..."
    for pdf in pdfs:
        shutil.copy(pdf, ".\\all_images\\")
    os.chdir("all_images")
    pool = Pool(processes=8)
    print "Extracting images from PDFs..."
    pool.map(extract_images_from_file, pdfs)
    print "Extracting unique images into a new folder..."
    dedup_images()
    print "All images have been extracted and deduped."
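(For what it's worth, my mental model of Pool.map is the toy example below, where shout is a made-up name: each worker process pulls items from the list and runs the function on them independently.)

# Toy illustration (made-up names) of how I understand Pool.map:
# the iterable is split across worker processes, each of which
# calls the function on its items independently.
from multiprocessing import Pool

def shout(word):
    return word.upper()

if __name__ == '__main__':
    pool = Pool(processes=4)
    print pool.map(shout, ["a", "b", "c", "d"])  # -> ['A', 'B', 'C', 'D']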
Everything seemed to be fine while the images were being extracted, but then it all went haywire. So here are my questions:
1) Did I set up the parallel process correctly?
2) Does it keep trying to use all 8 worker processes during dedup_images()?
3) Is there anything I'm missing and/or doing wrong?
Thanks in advance!
EDIT: Here is what I mean by haywire. The errors begin with lines like the following:
I/O Error: Couldn't open image If/iOl eE r'rSourb:p oICe/onOua l EdNrner'wot r Y:oo prCekon u Cliodmunan'gttey of1pi0e l2ne1 1i'4mS auogbiepl o2fefinrlaee e N@ 'egSwmu abYipolor ekcn oaCm o Nupentwt y1Y -o18r16k11 8.C1po4nu gn3't4 y7 5160120821143 3p4t7I 9/49O-8 88E78r81r.3op rnp:gt ' C 3o-u3l6d0n.'ptn go'p en image file 'Ia/ ON eEwr rYoorr:k CCIoo/uuOln dtEnyr' rt1o 0ro2:p1 e1Cn4o uiolmidalng2'eft rm ' ai gpceoo emfn iapl teN e1'w-S 8uY6bo2pr.okpe nnCgao' u Nnetwy Y1o0r2k8 1C4o u3n4t7y9 918181881134 3p4t7 536-1306211.3p npgt' 4-879.png' I/O Error: CoulId/nO' tE rorpoern: iCmoaugled nf'itl eo p'eub piomeangae fNielwe Y'oSrukb pCooeunnat yN e1w0 2Y8o1r 4k 3C4o7u9n9t8y8 811032 1p1t4 3o-i3l622f pt 1-863.png'
And then it becomes more readable with a few lines like this:
I/O Error: Couldn't open image file 'pt 1-864.png' I/O Error: Couldn't open image file 'pt 1-865.png' I/O Error: Couldn't open image file 'pt 1-866.png' I/O Error: Couldn't open image file 'pt 1-867.png'
This goes on for a while, alternating between the scrambled error text and the readable lines.
Finally, it gets to this:
Deleting images smaller than 4KB and generating the MD5 hash values for all other images...
Extracting unique images into a new folder...
which means the code picks back up and continues the process. So what could be going wrong?
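(One thing I'm considering, just to make the errors readable: giving each pdfimages call its own log file, so output from different worker processes can't interleave. This is only a sketch; extract_images_logged is a made-up name.)

# Rough idea for making the errors readable: send each pdfimages
# call's stderr to a per-PDF log file so messages from different
# worker processes cannot interleave on the console.
import os
from subprocess import call

def extract_images_logged(pdf_file):
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    with open(file_name + ".log", "w") as log:
        call(["pdfimages", "-png", pdf_file, file_name], stderr=log)
    os.remove(pdf_file)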