I have the following table that stores image data:
images - id (int) - sample_1_1 (int) - sample_1_2 (int) - sample_1_3 (int) - sample_2_1 (int) - sample_2_2 (int) - sample_2_3 (int) - ... # Up until sample_25_3
The challenge is to calculate the distance between the collected data. I am currently using a 75-dimensional (correct, 3 * 25 = 75) calculation of the Euclidean distance programmed as stored procedures in the database:
CREATE DEFINER=`root`@`localhost` FUNCTION `distanceBetween`(compareId INT, toId INT) RETURNS double READS SQL DATA DETERMINISTIC BEGIN DECLARE distance DOUBLE; SELECT euclidDistance( i1.sample_1_1, i1.sample_1_2, i1.sample_1_3, i2.sample_1_1, i2.sample_1_2, i2.sample_1_3, ... ) INTO distance FROM images i1, (SELECT * FROM images WHERE id = toId) i2 WHERE i1.id = compareId; RETURN distance; END
With another routine that calculates the actual distance between 2 75-dim. vectors:
CREATE DEFINER=`root`@`localhost` FUNCTION `euclidDistance`( img1_sample1_1 INT, img1_sample1_2 INT, img1_sample1_3 INT, img2_sample1_1 INT, img2_sample1_2 INT, img2_sample1_3 INT, ... ) RETURNS double RETURN SQRT( quadDiff(img1_sample1_1, img2_sample1_1) + quadDiff(img1_sample1_2, img2_sample1_2) + quadDiff(img1_sample1_3, img2_sample1_3) + ... )
And another subroutine for calculating the squared difference between two values:
CREATE DEFINER=`root`@`localhost` FUNCTION `quadDiff`(var1 INT, var2 INT) RETURNS int(11) RETURN POW(var1 - var2, 2)
The functions themselves are perfectly accurate and return deterministic results that are mathematically and logically correct.
The problem arises when I want to get the "closest" images to a given image . This means that the images have the smallest distance to any image. For this, I use a different procedure:
CREATE DEFINER=`root`@`localhost` PROCEDURE `getSimilarImages`(imageId INT, `limit` INT) BEGIN SELECT i2.id, i2.filename, distanceBetween(i1.id, i2.id) distance FROM images i1, (SELECT * FROM images WHERE id != imageId AND duplicateImageId IS NULL) i2 WHERE i1.id = imageId ORDER BY distance LIMIT 10; END
The database currently contains about 30,000 images. This means that a CALL getSimilarImages(123, 10); takes about 12 seconds. It is too long for any application, be it a web application or an application.
So, I want to speed up the process. What are my options? Do you see any potential for optimizing the process of comparing images or calculating distances?
I was thinking of caching the result of the procedure, but I do not know how to do it. I could also compare each image with every other image as soon as new images are added, but this will add images to a very long process, which is also unacceptable.
I can provide additional information on setting up the system if this helps, but I appreciate any pointers you can provide. The current situation is not very good, and I really need to do something, because the image database will grow only every hour in the system.