How to group images into "bursts"?

I think this will require a little explanation, so please bear with me ...

I captured 2,000+ images in bursts of 4-6 at a time. They were all dumped into one place, so I need to sort them back into their bursts, but the EXIF data only has one-minute resolution. Images within a burst should be almost exactly the same, and images from different bursts will be significantly different.

I need to look at each image, compare it with the next, and see whether they look alike. If the next one is too different, it must be from another burst, so it needs to go into a new folder along with any following images that are similar to it, and so on.

My thought is to sum the absolute values of the differences between each pixel of the current image and the next. Once this sum exceeds a threshold, that should mean the two come from different bursts (I can do some testing to find a good threshold).
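In code, that idea would look something like the sketch below. It assumes the pixel data is already available as flat sequences of grayscale values (with Pillow you could get those via `list(Image.open(fn).convert('L').getdata())`); the threshold is something to tune experimentally:

```python
def sum_abs_diff(pixels_a, pixels_b):
    """Sum of absolute per-pixel differences between two equally sized images.

    pixels_a / pixels_b are flat sequences of grayscale values.
    """
    return sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))

def same_burst(pixels_a, pixels_b, threshold):
    # Below the threshold -> "basically the same" -> same burst.
    return sum_abs_diff(pixels_a, pixels_b) < threshold
```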

The biggest question is how? Does PIL/Pillow support something like this? Is there a better way to tell whether one image is "basically" the same as another?

I'm more interested in sorting them quickly than in using any particular technique, so other approaches are welcome.

... and it pretty much has to be Python.

EDIT: Here are a couple of sample images that should be in the same folder: [images 001, 002]

These are two images from the next burst and should go in a different folder: [images 003, 004]

+6
4 answers

Sorry, it turns out that the EXIF data was fine after all. It seems there is a good 10-15 seconds between bursts, so it should be very easy to tell when one ends and the next starts.

PIL/Pillow has sufficient tools to read that creation date, using:

```python
from PIL import Image
from PIL.ExifTags import TAGS

def get_exif(fn):
    ret = {}
    i = Image.open(fn)
    info = i._getexif()
    for tag, value in info.items():
        decoded = TAGS.get(tag, tag)
        ret[decoded] = value
    return ret
```

... or something like that.
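Building on that, the `DateTimeOriginal` tag (when present in the dict returned above) is a string like `"2015:06:01 12:00:03"`, with colons in the date part, which `datetime.strptime` can parse. A small sketch — whether your camera actually writes this tag is an assumption to verify:

```python
from datetime import datetime

def parse_exif_datetime(value):
    # EXIF DateTimeOriginal / DateTime use colons in the date part.
    return datetime.strptime(value, "%Y:%m:%d %H:%M:%S")
```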

+2

The OpenCV library is a good bet here if you want to perform content-based matching rather than the time-based sorting suggested by the good people above. Check out this post on different OpenCV methods for measuring image similarity: Checking images for similarity with OpenCV

There are a ton of SO questions on the same topic, so reading them will give you a better idea.

Building on the timestamp idea above, when I plot just the times at which your pictures were taken, this is the plot I get:

[Plot: capture times, colored by cluster]

Different colors represent different folders (I should have used a different colormap for better visibility, but okay ...).

Just on the basis of these times, it seems that your inter-cluster gaps are noticeably larger than your intra-cluster gaps.

I also computed some intra- and inter-cluster statistics, below:

```
folder: ImageBurstsDataset/001
Total image files in folder: 6
Total intra-cluster time: 1.0
Average intra-cluster time: 0.166666666667
Max: 1.0, Min: 0.0

folder: ImageBurstsDataset/002
Total image files in folder: 7
Total intra-cluster time: 1.0
Average intra-cluster time: 0.142857142857
Max: 1.0, Min: 0.0

folder: ImageBurstsDataset/003
Total image files in folder: 6
Total intra-cluster time: 1.0
Average intra-cluster time: 0.166666666667
Max: 1.0, Min: 0.0

folder: ImageBurstsDataset/004
Total image files in folder: 6
Total intra-cluster time: 2.0
Average intra-cluster time: 0.333333333333
Max: 1.0, Min: 0.0

folder: ImageBurstsDataset/005
Total image files in folder: 6
Total intra-cluster time: 2.0
Average intra-cluster time: 0.333333333333
Max: 1.0, Min: 0.0

folder: ImageBurstsDataset/006
Total image files in folder: 6
Total intra-cluster time: 1.0
Average intra-cluster time: 0.166666666667
Max: 1.0, Min: 0.0

folder: ImageBurstsDataset/007
Total image files in folder: 6
Total intra-cluster time: 2.0
Average intra-cluster time: 0.333333333333
Max: 1.0, Min: 0.0

folder: ImageBurstsDataset/008
Total image files in folder: 5
Total intra-cluster time: 2.0
Average intra-cluster time: 0.4
Max: 1.0, Min: 0.0

folder: ImageBurstsDataset/009
Total image files in folder: 6
Total intra-cluster time: 1.0
Average intra-cluster time: 0.166666666667
Max: 1.0, Min: 0.0

folder: ImageBurstsDataset/010
Total image files in folder: 6
Total intra-cluster time: 2.0
Average intra-cluster time: 0.333333333333
Max: 1.0, Min: 0.0

Inter-cluster times: [10.0, 8.0, 7.0, 5.0, 6.0, 6.0, 5.0, 10.0, 6.0]
```

Disclaimer: I wrote this script in a hurry, so I still need to go back and make sure all the edge cases are correct. That said, the conclusion I draw from the dataset you uploaded is this:

  • Within a cluster, each image is at most 1 second after the previous one.

  • The first image of the next cluster is at least 5 seconds after the last image of the previous cluster.
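Given those two observations, any gap threshold between 1 and 5 seconds splits the sorted timestamps cleanly. A sketch of that split (timestamps in seconds; `gap=3.0` is an assumed threshold in that safe range):

```python
def split_into_bursts(timestamps, gap=3.0):
    """Split capture times (seconds) into bursts.

    A new burst starts whenever the gap to the previous image
    exceeds `gap` seconds.
    """
    bursts = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] <= gap:
            bursts[-1].append(t)   # close enough: same burst
        else:
            bursts.append([t])     # big gap: start a new burst
    return bursts
```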

+3

How to measure the similarity of two images is an open research question. However, given that your images were shot in quick succession, using absolute differences is reasonable. Another possibility is correlation: for example, multiply corresponding pixel values, sum them, and accept pairs whose result exceeds a threshold.

The problem will be speed. Depending on your accuracy requirements, you can subsample the images significantly. Comparing the values of 100 or 1000 evenly distributed pixels --- the same pixels in each image --- may give you statistics that are accurate enough for your purposes.

+1

PIL can give you the RGB image data, which in principle can be used to compare images. To measure how close two images are, you would compute the difference of the two images, or a more robust error measure using statistical methods. You can get the RGB data like this:

```python
from PIL import Image

pic = Image.open('/path/to/file')
rgbdata = pic.load()
width, height = pic.size
```

You can then access the RGB value of the pixel at position (i, j) as rgbdata[i, j].

Hope this helps.

[edit] This method only works if all the pictures are taken from the same fixed viewpoint ... If the camera moves even a little, it will not work.

If the camera is on a tripod (stationary) and the subjects are moving, you can even track the moving subject (wherever the difference in pixel values is highest).

Alternatively, you could define tracking points, as is done in face recognition applications. (I'm not an image processing specialist, but I have seen several applications that work this way.)

Another way to compare two images is in the Fourier domain. But I'm not sure how well that would work for you.
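One classic Fourier-domain technique is phase correlation, which recovers the translation between two signals; a 1-D sketch with NumPy (for 2-D images the same idea works with `np.fft.fft2`; whether a translation estimate helps with burst detection here is speculative):

```python
import numpy as np

def phase_correlation_shift(a, b):
    """Estimate the circular shift that turns signal `a` into `b`."""
    fa, fb = np.fft.fft(a), np.fft.fft(b)
    cross_power = fb * np.conj(fa)
    cross_power /= np.abs(cross_power) + 1e-12   # keep the phase only
    response = np.fft.ifft(cross_power).real
    return int(np.argmax(response))              # shift, modulo len(a)
```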

+1

Source: https://habr.com/ru/post/989257/

