Approximately 600 GB of photos, collected over 13 years, are now stored on a FreeBSD/ZFS server.
The photos come from family computers, from several partial backups to various external USB drives, from images reconstructed after disk disasters, and from various photo-management programs (iPhoto, Picasa, HP and many others :(), scattered across deeply nested subdirectories. In short: a TERRIBLE MESS with many duplicates.
So, as a first pass, I did the following (sketch after the list):
- searched the tree for files of the same size (fast) and computed an MD5 checksum for each of them
- collected the duplicate images (same size + same MD5 = duplicate)
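For reference, a minimal sketch of that first pass, assuming only core Perl (File::Find, Digest::MD5); only size collisions get hashed, since a file with a unique size cannot be a duplicate:

```perl
#!/usr/bin/env perl
# Sketch: group files by size, then MD5 only the size collisions.
use strict;
use warnings;
use File::Find;
use Digest::MD5;

my %by_size;
find({ no_chdir => 1,
       wanted   => sub { push @{ $by_size{ -s _ } }, $_ if -f $_ } },
     shift // '.');

for my $size (keys %by_size) {
    my @files = @{ $by_size{$size} };
    next unless @files > 1;          # unique size => cannot be a duplicate
    my %by_md5;
    for my $f (@files) {
        open my $fh, '<:raw', $f or next;
        push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $f;
    }
    for my $md5 (keys %by_md5) {
        next unless @{ $by_md5{$md5} } > 1;
        print "duplicates ($size bytes, $md5):\n",
              map { "  $_\n" } @{ $by_md5{$md5} };
    }
}
```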
It helped a lot, but there are still MANY duplicates left:
- photos that differ only in the EXIF/IPTC data added by some photo-management software, while the image itself is the same (or at least "looks the same" and is the same size)
- or modified versions of the original image
- or "enhanced" versions of the originals, etc.
Now the questions are:
- How can I find duplicates by checksumming only the "pure image bytes" of a JPEG, ignoring EXIF/IPTC and similar meta-information? In other words, I want to filter out duplicate photos which differ only in their EXIF tags while the image itself is the same (so a checksum of the whole file does not work, but a checksum of the image data might...). This is (I hope) not very difficult - but I need some direction.
- Which Perl module can extract the "clean" image data from a JPEG file, so that it can be used for comparison/checksumming? (A sketch of what I mean follows.)
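To make the question concrete, this is the kind of thing I am imagining - a sketch assuming Image::ExifTool's documented SetNewValue('*') (delete all tags) and that WriteInfo accepts a scalar reference as an in-memory destination, which I have not verified on my installed version:

```perl
#!/usr/bin/env perl
# Sketch: MD5 of the image with all metadata stripped, so two copies
# that differ only in EXIF/IPTC hash the same.
use strict;
use warnings;
use Image::ExifTool;
use Digest::MD5 qw(md5_hex);

sub clean_md5 {
    my ($file) = @_;
    my $et = Image::ExifTool->new;
    $et->SetNewValue('*');           # schedule deletion of ALL tags
    my $stripped = '';
    # write the metadata-free copy into memory instead of a temp file
    $et->WriteInfo($file, \$stripped)
        or die "cannot strip $file: " . $et->GetValue('Error');
    return md5_hex($stripped);
}

print clean_md5($_), "  $_\n" for @ARGV;
```

Of course this only helps when the image bytes really are identical; if a program re-encoded the JPEG, the stripped copies will still hash differently.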
More difficult:
- how to find "similar" images that are only:
  - modified versions of the originals
  - "enhanced" versions of the originals (from some photo-processing programs)
- Is there any algorithm already available as a Unix command or a Perl module (XS?) that I can use to detect these special "duplicates"? (A perceptual-hash sketch follows.)
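From what I have read, a perceptual hash might be the right tool; below is a minimal average-hash (aHash) sketch using PerlMagick (Image::Magick). The 8x8 size, grayscale conversion, and mean-brightness threshold are the standard aHash recipe, not anything specific to a module:

```perl
#!/usr/bin/env perl
# Sketch: 8x8 average hash (aHash) with PerlMagick.
use strict;
use warnings;
use Image::Magick;

sub ahash {
    my ($file) = @_;
    my $img = Image::Magick->new;
    my $err = $img->Read($file);
    die $err if $err;
    $img->Resize(geometry => '8x8!');     # "!" ignores aspect ratio
    $img->Quantize(colorspace => 'Gray');
    my (@px, $sum);
    $sum = 0;
    for my $y (0 .. 7) {
        for my $x (0 .. 7) {
            my ($v) = $img->GetPixel(x => $x, y => $y);  # normalized 0..1
            push @px, $v;
            $sum += $v;
        }
    }
    my $mean = $sum / 64;
    # 64-bit fingerprint: 1 where the pixel is brighter than the mean
    return join '', map { $_ > $mean ? 1 : 0 } @px;
}

# Hamming distance between two binary-string hashes
sub hamming { ( $_[0] ^ $_[1] ) =~ tr/\0//c }

my ($h1, $h2) = map { ahash($_) } @ARGV[0, 1];
printf "%s\n%s\ndistance: %d\n", $h1, $h2, hamming($h1, $h2);
```

Images whose hashes differ by only a few bits (small Hamming distance) are likely the same picture resized or lightly enhanced; crops and heavy edits would not be caught this way.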
I can write complex bash scripts and know Perl "more or less" :). I can use FreeBSD/Linux utilities directly on the server, and I can use OS X over the network (but working with 600 GB over the LAN is not the fastest way)...
My rough idea:
- delete images only at the very end of the workflow
- use an Image::ExifTool script to collect duplicate candidates based on image creation date and camera model (and possibly other EXIF data) - see the sketch after this list
- make a checksum of the clean image data (or extract a histogram - identical images must have identical histograms) - not sure about this
- use similarity detection to find duplicates that were resized or "enhanced" - I don't know how to do this yet...
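For the grouping step, I imagine something like this (a sketch; it assumes the photos carry the standard DateTimeOriginal and Model EXIF tags, which ImageInfo from Image::ExifTool can read):

```perl
#!/usr/bin/env perl
# Sketch: group photos by EXIF creation date + camera model.
use strict;
use warnings;
use File::Find;
use Image::ExifTool qw(ImageInfo);

my %group;
find({ no_chdir => 1, wanted => sub {
    return unless -f && /\.jpe?g$/i;
    my $info = ImageInfo($_, 'DateTimeOriginal', 'Model');
    my $key  = join '|', $info->{DateTimeOriginal} // '?',
                         $info->{Model}            // '?';
    push @{ $group{$key} }, $_;
}}, shift // '.');

# only groups with more than one member are duplicate candidates
for my $key (sort keys %group) {
    next unless @{ $group{$key} } > 1;
    print "$key\n", map { "  $_\n" } @{ $group{$key} };
}
```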
Any ideas, help, or software/algorithm suggestions on how to bring order to this chaos?
PS:
Here is an almost identical question: Finding duplicate image files, but I have already done what its answer suggests (MD5), and I am looking for more precise checksums and image-comparison algorithms.
jm666