I have 110 PDF files from which I am trying to extract images. Once the images have been extracted, I would like to remove any duplicates and delete images smaller than 4 KB. My code for this is as follows:
import os
import shutil
import sys
import md5
from glob import glob
from multiprocessing import Pool
from subprocess import call

import pandas as pd
from PIL import Image

def extract_images_from_file(pdf_file):
    # Extract every image in the PDF as a PNG, then delete the PDF copy.
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    call(["pdfimages", "-png", pdf_file, file_name])
    os.remove(pdf_file)

def dedup_images():
    os.mkdir("unique_images")
    md5_library = []
    images = glob("*.png")
    print "Deleting images smaller than 4KB and generating the MD5 hash values for all other images..."
    for image in images:
        if os.path.getsize(image) <= 4000:
            os.remove(image)
        else:
            # Hash the pixel data so identical images get identical digests.
            m = md5.new()
            image_data = list(Image.open(image).getdata())
            image_string = "".join(["".join([str(tpl[0]), str(tpl[1]), str(tpl[2])]) for tpl in image_data])
            m.update(image_string)
            md5_library.append([image, m.digest()])
    headers = ['image_file', 'md5']
    dat = pd.DataFrame(md5_library, columns=headers).sort(['md5'])
    dat.drop_duplicates(subset="md5", inplace=True)
    print "Extracting the unique images."
    unique_images = dat.image_file.tolist()
    for image in unique_images:
        old_file = image
        new_file = "unique_images\\" + image
        shutil.copy(old_file, new_file)
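(Side note on the hashing step: from what I've read, hashing the raw file bytes with hashlib should be much faster than building a giant string of pixel values, at the cost of only catching byte-identical duplicates. The sketch below is just my stripped-down understanding of that idea; hash_images and min_size are made-up names, not part of my actual script.)

# Stripped-down sketch of the dedup idea: hash the raw file bytes
# with hashlib instead of building a huge pixel string. This only
# catches byte-identical duplicates, but it is much faster.
import hashlib
import os
from glob import glob

def hash_images(min_size=4000):
    seen = {}                      # digest -> first file seen with that hash
    for image in glob("*.png"):
        if os.path.getsize(image) <= min_size:
            os.remove(image)       # drop anything at or under ~4 KB
            continue
        with open(image, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        seen.setdefault(digest, image)
    return seen.values()           # one representative per unique hash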
This process can take some time, so I started reading up on multiprocessing. Feel free to interpret that as me saying I have no idea what I'm doing. I figured the image extraction would parallelize easily, but that the deduplication would not, since it performs many I/O operations on individual files and I didn't know how to coordinate that. So here is my attempt at a parallel version:
if __name__ == '__main__':
    filepath = sys.argv[1]
    folder_name = os.getcwd() + "\\all_images\\"
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
    pdfs = glob("*.pdf")
    print "Copying all PDFs to the images folder..."
    for pdf in pdfs:
        shutil.copy(pdf, ".\\all_images\\")
    os.chdir("all_images")
    pool = Pool(processes=8)
    print "Extracting images from PDFs..."
    pool.map(extract_images_from_file, pdfs)
    print "Extracting unique images into a new folder..."
    dedup_images()
    print "All images have been extracted and deduped."
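(For what it's worth, my mental model of Pool.map is the toy example below, where shout is a made-up name: each worker process pulls items from the list and runs the function on them independently.)

# Toy illustration (made-up names) of how I understand Pool.map:
# the iterable is split across worker processes, each of which
# calls the function on its items independently.
from multiprocessing import Pool

def shout(word):
    return word.upper()

if __name__ == '__main__':
    pool = Pool(processes=4)
    print pool.map(shout, ["a", "b", "c", "d"])  # -> ['A', 'B', 'C', 'D']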
Everything seemed to be fine while the images were being extracted, but then it all went haywire. So here are my questions:
1) Did I set up the parallel process correctly?
2) Does it keep trying to use all 8 worker processes during dedup_images()?
3) Is there anything I'm missing and/or doing wrong?
Thanks in advance!
EDIT: Here is what I mean by haywire. The errors begin with lines like the following:
I/O Error: Couldn't open image If/iOl eE r'rSourb:p oICe/onOua l EdNrner'wot r Y:oo prCekon u Cliodmunan'gttey of1pi0e l2ne1 1i'4mS auogbiepl o2fefinrlaee e N@ 'egSwmu abYipolor ekcn oaCm o Nupentwt y1Y -o18r16k11 8.C1po4nu gn3't4 y7 5160120821143 3p4t7I 9/49O-8 88E78r81r.3op rnp:gt ' C 3o-u3l6d0n.'ptn go'p en image file 'Ia/ ON eEwr rYoorr:k CCIoo/uuOln dtEnyr' rt1o 0ro2:p1 e1Cn4o uiolmidalng2'eft rm ' ai gpceoo emfn iapl teN e1'w-S 8uY6bo2pr.okpe nnCgao' u Nnetwy Y1o0r2k8 1C4o u3n4t7y9 918181881134 3p4t7 536-1306211.3p npgt' 4-879.png' I/O Error: CoulId/nO' tE rorpoern: iCmoaugled nf'itl eo p'eub piomeangae fNielwe Y'oSrukb pCooeunnat yN e1w0 2Y8o1r 4k 3C4o7u9n9t8y8 811032 1p1t4 3o-i3l622f pt 1-863.png'
And then it becomes more readable with a few lines like this:
I/O Error: Couldn't open image file 'pt 1-864.png' I/O Error: Couldn't open image file 'pt 1-865.png' I/O Error: Couldn't open image file 'pt 1-866.png' I/O Error: Couldn't open image file 'pt 1-867.png'
This goes on for a while, alternating between the scrambled error text and the readable lines.
Finally, it gets to this:
Deleting images smaller than 4KB and generating the MD5 hash values for all other images...
Extracting unique images into a new folder...
which means the code picks back up and continues the process. So what could be going wrong?
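(One thing I'm considering, just to make the errors readable: giving each pdfimages call its own log file, so output from different worker processes can't interleave. This is only a sketch; extract_images_logged is a made-up name.)

# Rough idea for making the errors readable: send each pdfimages
# call's stderr to a per-PDF log file so messages from different
# worker processes cannot interleave on the console.
import os
from subprocess import call

def extract_images_logged(pdf_file):
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    with open(file_name + ".log", "w") as log:
        call(["pdfimages", "-png", pdf_file, file_name], stderr=log)
    os.remove(pdf_file)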