Speed โ€‹โ€‹up distance calculation in multidimensional Euclidean space in MySQL

I have the following table that stores image data:

images - id (int) - sample_1_1 (int) - sample_1_2 (int) - sample_1_3 (int) - sample_2_1 (int) - sample_2_2 (int) - sample_2_3 (int) - ... # Up until sample_25_3 

The challenge is to calculate the distance between the collected data. I am currently using a 75-dimensional (correct, 3 * 25 = 75) calculation of the Euclidean distance programmed as stored procedures in the database:

 CREATE DEFINER=`root`@`localhost` FUNCTION `distanceBetween`(compareId INT, toId INT) RETURNS double READS SQL DATA DETERMINISTIC BEGIN DECLARE distance DOUBLE; SELECT euclidDistance( i1.sample_1_1, i1.sample_1_2, i1.sample_1_3, i2.sample_1_1, i2.sample_1_2, i2.sample_1_3, ... ) INTO distance FROM images i1, (SELECT * FROM images WHERE id = toId) i2 WHERE i1.id = compareId; RETURN distance; END 

With another routine that calculates the actual distance between 2 75-dim. vectors:

 CREATE DEFINER=`root`@`localhost` FUNCTION `euclidDistance`( img1_sample1_1 INT, img1_sample1_2 INT, img1_sample1_3 INT, img2_sample1_1 INT, img2_sample1_2 INT, img2_sample1_3 INT, ... ) RETURNS double RETURN SQRT( quadDiff(img1_sample1_1, img2_sample1_1) + quadDiff(img1_sample1_2, img2_sample1_2) + quadDiff(img1_sample1_3, img2_sample1_3) + ... ) 

And another subroutine for calculating the squared difference between two values:

 CREATE DEFINER=`root`@`localhost` FUNCTION `quadDiff`(var1 INT, var2 INT) RETURNS int(11) RETURN POW(var1 - var2, 2) 

The functions themselves are perfectly accurate and return deterministic results that are mathematically and logically correct.

The problem arises when I want to get the "closest" images to a given image . This means that the images have the smallest distance to any image. For this, I use a different procedure:

 CREATE DEFINER=`root`@`localhost` PROCEDURE `getSimilarImages`(imageId INT, `limit` INT) BEGIN SELECT i2.id, i2.filename, distanceBetween(i1.id, i2.id) distance FROM images i1, (SELECT * FROM images WHERE id != imageId AND duplicateImageId IS NULL) i2 WHERE i1.id = imageId ORDER BY distance LIMIT 10; END 

The database currently contains about 30,000 images. This means that a CALL getSimilarImages(123, 10); takes about 12 seconds. It is too long for any application, be it a web application or an application.

So, I want to speed up the process. What are my options? Do you see any potential for optimizing the process of comparing images or calculating distances?

I was thinking of caching the result of the procedure, but I do not know how to do it. I could also compare each image with every other image as soon as new images are added, but this will add images to a very long process, which is also unacceptable.

I can provide additional information on setting up the system if this helps, but I appreciate any pointers you can provide. The current situation is not very good, and I really need to do something, because the image database will grow only every hour in the system.

+6
source share
3 answers

As you have discovered, your ORDER BY distance (a, b) operation computes ALL of these 75-dimensional distances, and it is not surprising that this takes a long time. It must calculate the whole lot so that it can perform the ORDER operation. Uch.

Here's an observation about Euclidean distance that may help you: the bounding box is a useful approximation. When you use GetSimilarImages, can you only include images that are within a certain threshold distance of the image you are using?

Say you only care about images within the rad distance of your image. You can then rework the GetSimilarImages request this way.

 PROCEDURE `getSimilarImages`(imageId INT, `limit` INT, rad INT) BEGIN SELECT i2.id, i2.filename, distanceBetween(i1.id, i2.id) distance FROM images i1, (SELECT * FROM images WHERE id != imageId AND duplicateImageId IS NULL) i2 WHERE i1.id = imageId AND i1.img_sample_1_1 BETWEEN i2.img_sample_1_1 - rad AND i2.img_sample_1_1 + rad AND i1.img_sample_1_2 BETWEEN i2.img_sample_1_2 - rad AND i2.img_sample_1_2 + rad AND i1.img_sample_1_3 BETWEEN i2.img_sample_1_3 - rad AND i2.img_sample_1_3 + rad ORDER BY distance LIMIT 10; END 

In this code example, I arbitrarily selected the first three of your 75 dimensions to use for the bounding box inclusion code (three BETWEEN clauses). For this optimization to work, you will need to create indexes for at least some of the dimensions used to include the bounding box.

There is nothing special about choosing three dimensions or choosing any specific dimensions. If you know from your data that some of your measurements differ better between your images, these are the ones you need to choose. You can choose as many measurements as you need. But of course there are overhead indexing.

+3
source

Build the UDF C function, or call the C function, which calls the GPU function.

+1
source

Optimization tips for this case:

Add id column index and duplicateImageId preferred over a clustured index .

Try to avoid several samples on a huge table of images .

You can increase performance by reducing the number of function calls when you call a function inside a function. All these function calls must be supported on a memory stack that consumes hardware resources.

A performance benchmark after every change you make in your code for optimization.

 CREATE PROCEDURE getSimilarImages(imageId INT unsigned, limit INT unsigned) BEGIN SELECT i2.id, i2.filename, euclidDistance( i1.sample_1_1, i1.sample_1_2, i1.sample_1_3, i2.sample_1_1, i2.sample_1_2, i2.sample_1_3, ... ) AS distance FROM images i1, (SELECT id, filename FROM images WHERE id <> imageId AND duplicateImageId IS NULL ) i2 WHERE i1.id = imageId ORDER BY distance LIMIT 10; END; CREATE FUNCTION euclidDistance( img1_sample1_1 INT, img1_sample1_2 INT, img1_sample1_3 INT, img2_sample1_1 INT, img2_sample1_2 INT, img2_sample1_3 INT, ... ) RETURNS double RETURN SQRT( POW(img1_sample1_1 - img2_sample1_1, 2) + POW(img1_sample1_2 - img2_sample1_2, 2) + POW(img1_sample1_3 - img2_sample1_3, 2) + ... ); 

Hope this helps.

0
source

Source: https://habr.com/ru/post/922024/


All Articles