Finding duplicate photos by comparing only the pure image data, and by image similarity?

Approximately 600 GB of photos collected over 13 years are now stored on a FreeBSD ZFS server.

The photos come from family computers, from several partial backups to various external USB drives, from images reconstructed after disk disasters, and from various photo-management programs (iPhoto, Picasa, HP and many others :() in several deep subdirectories. In short: a TERRIBLE MESS with many duplicates.

So as a first step I did the following:

  • I searched the tree for files of the same size (fast) and computed md5 checksums for them.
  • I collected the duplicate images (same size + same md5 = duplicate).
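
For readers who want to reproduce this first pass, here is a minimal sketch of the size-then-md5 approach, using only the Python standard library. The function names `md5_of_file` and `find_exact_duplicates` are made up for this illustration; the poster's actual scripts are not shown in the question.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under `root` that share both size and MD5 checksum."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of_file(path)].append(path)
        groups.extend(sorted(group) for group in by_hash.values() if len(group) > 1)
    return groups
```

Grouping by size first means only files that could possibly match are ever read, which matters on a 600 GB tree.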

It helped a lot, but there are still MANY duplicates:

  • photos that differ from each other only in the exif / iptc data added by some photo-management software, while the image is the same (or at least “looks the same” and has the same dimensions)
  • or they are just resized versions of the original image
  • or they are "enhanced" versions of the originals, etc.

Now the questions are:

  • How can I find duplicates by checksumming only the "pure image bytes" of a JPG, without the exif / IPTC and similar meta information? That is, I want to filter out duplicate photos that differ only in their exif tags while the image is the same (so a checksum of the files does not work, but a checksum of the image data might ...). This is (I hope) not very difficult, but I need some direction.
  • Which perl module can extract the "pure" image data from a JPG file in a form that can be used for comparison / checksumming?
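
To make the "pure image bytes" idea concrete, here is a rough sketch in Python (not the perl module asked about). It walks the JPEG marker segments and hashes everything except the APPn and COM segments, which is where exif, IPTC, embedded thumbnails and comments normally live. The name `jpeg_image_digest` is invented for this example, and dropping every APPn segment (including JFIF and Adobe headers) is an assumption made for deduplication purposes. Note that it only matches files whose compressed image stream is byte-identical; a photo that was decoded and re-saved will still get a different digest.

```python
import hashlib
import struct


def jpeg_image_digest(data: bytes) -> str:
    """MD5 over a JPEG byte string, ignoring APPn (0xE0-0xEF) and COM (0xFE) segments."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    digest = hashlib.md5()
    pos = 2
    while pos < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        while pos < len(data) and data[pos] == 0xFF:
            pos += 1  # skip the marker prefix and any fill bytes
        if pos >= len(data):
            break
        marker = data[pos]
        pos += 1
        if marker == 0xDA:
            # Start of scan: the compressed image data runs to the end of the file.
            digest.update(b"\xff\xda" + data[pos:])
            break
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:
            continue  # standalone markers carry no length field
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        if length < 2:
            raise ValueError(f"bad segment length at offset {pos}")
        if not (0xE0 <= marker <= 0xEF or marker == 0xFE):
            # Keep quantization tables, Huffman tables, frame header, etc.
            digest.update(bytes([0xFF, marker]) + data[pos:pos + length])
        pos += length
    return digest.hexdigest()
```

It would be called as `jpeg_image_digest(open(path, "rb").read())`, and files with equal digests are then candidates for the "same image, different tags" case.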

More difficult

  • How can I find “similar” images that are only
    • resized versions of the originals
    • "enhanced" versions of the originals (from some photo-processing programs)
  • Is there already an algorithm available as a unix command or perl module (XS?) that I can use to detect these special “duplicates”?

I can write complex BASH scripts and I know perl "+-" :) .. I can use FreeBSD / Linux utilities directly on the server, and I can use OS X over the network (but working with 600 GB over the LAN is not the fastest way) ...

My rough idea:

  • delete images only at the end of the workflow
  • use an Image::ExifTool script to collect candidate duplicates based on image creation date and camera model (and possibly other exif data)
  • make a checksum of the pure image data (or extract a histogram, since identical images must have the same histogram); I'm not sure about this
  • use similarity detection to find duplicates that differ only by resizing or photo enhancement; I don't know how to do this ...
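
One caveat on the histogram idea in the list above: identical images do have identical histograms, but the reverse does not hold, so an equal histogram can only nominate candidates. A tiny illustration with made-up grayscale grids (the `histogram` helper is hypothetical, not part of any library):

```python
from collections import Counter


def histogram(pixels):
    """Count how often each grayscale value occurs in a 2D pixel grid."""
    return Counter(value for row in pixels for value in row)


original = [[0, 50, 100], [150, 200, 250]]
mirrored = [row[::-1] for row in original]  # same pixels, flipped left to right

same_histogram = histogram(original) == histogram(mirrored)
same_image = original == mirrored
```

Here `same_histogram` is true while `same_image` is false, which is why a histogram match would still need a second check.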

Any ideas, help, or (software / algorithm) hints on how to bring order to this chaos?

Ps:

Here is a nearly identical question: Finding duplicate image files, but I have already done what that answer suggests (md5), and I am looking for more precise checksumming and image comparison algorithms.

+4
3 answers

Have you looked at this article by Randal Schwartz? It uses a perl script with ImageMagick to shrink each image to a 4x4 RGB grid and then compares those grids to flag “similar” images.
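
The 4x4 grid idea can be sketched without ImageMagick, assuming the image has already been decoded into rows of (r, g, b) tuples and is at least 4 pixels wide and high. This is not the code from the article: `grid_signature`, `looks_similar` and the tolerance of 10 are placeholders chosen for illustration.

```python
def grid_signature(pixels, grid=4):
    """Average an RGB image (rows of (r, g, b) tuples) down to grid x grid cells."""
    height, width = len(pixels), len(pixels[0])
    signature = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            # Mean of each colour channel over the cell.
            signature.append(
                tuple(sum(p[c] for p in cell) / len(cell) for c in range(3))
            )
    return signature


def looks_similar(sig_a, sig_b, tolerance=10.0):
    """True when every channel of every cell differs by at most `tolerance`."""
    return all(
        abs(a - b) <= tolerance
        for cell_a, cell_b in zip(sig_a, sig_b)
        for a, b in zip(cell_a, cell_b)
    )
```

Because the signature has a fixed size regardless of the image dimensions, a resized copy ends up with nearly the same 16 cells as its original.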

+2

You can remove exif data with mogrify -strip from the ImageMagick toolkit. So for each image you could make a copy, strip the exif data from the copy, run md5sum on it, and then compare the md5sums.

When it comes to visually similar images, you can, for example, use compare (also from the ImageMagick toolkit) to produce a black-and-white diff map, as described here, then take a histogram of the difference and check whether there is "enough" white to call the images different.
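
The "enough white" test boils down to counting how many pixels changed noticeably. Here is a plain-Python analogue of that check, not ImageMagick's compare itself; it assumes both images are already decoded to grayscale grids of equal size, and the name `changed_fraction` and the threshold of 32 are arbitrary choices for the sketch.

```python
def changed_fraction(gray_a, gray_b, threshold=32):
    """Fraction of pixels whose grayscale values differ by more than `threshold`."""
    if len(gray_a) != len(gray_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(gray_a, gray_b)
    ):
        raise ValueError("images must have the same dimensions")
    total = changed = 0
    for row_a, row_b in zip(gray_a, gray_b):
        for a, b in zip(row_a, row_b):
            total += 1
            changed += abs(a - b) > threshold  # a "white" pixel in the diff map
    return changed / total
```

Two images would then count as different when the returned fraction exceeds some cutoff that has to be tuned on real photos.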

+2

Assuming you can work with a locally mounted FS:

  • rmlint: the fastest tool I've ever used for finding exact duplicates
  • findimagedupes: automates the whole ImageMagick approach (like the Randal Schwartz script, which I haven't tested?)
  • Detection of similar and identical images using a perceptual hash: goes into good depth (a great, helpful post)
  • dupeguru-pe (gui): a dedicated tool that is fast and does an excellent job
  • geeqie (gui): I find it fast / excellent for finishing the job, using its granular deduplication options. You can also generate an ordered collection of images in which "similar images are next to each other", allowing you to "flip" back and forth between them to see the changes.
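
As a taste of what a perceptual hash does, here is a sketch of the simplest variant, usually called an average hash. It may not be the variant used in the linked post or by the tools above. It assumes the image has already been shrunk to a small grayscale grid (8x8 is a common choice); the function names are made up for the example.

```python
def average_hash(gray):
    """Bit string with a 1 wherever a pixel is brighter than the image mean."""
    values = [value for row in gray for value in row]
    mean = sum(values) / len(values)
    return "".join("1" if value > mean else "0" for value in values)


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))
```

A small Hamming distance suggests the two images are likely variants of each other; a uniform brightness change, for instance, leaves the hash untouched because every pixel moves together with the mean.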
+1

Source: https://habr.com/ru/post/1487613/

