Understanding the encoding of extracted features

The encoding I am focusing on is the Fisher encoding, since it has given the best results in my work. I therefore want to test this encoding on my extracted SIFT features and check the system performance with and without encoding.

Instead of starting from scratch, I found that VLFeat has a built-in library for Fisher encoding, and they have a tutorial for it, as well as a related one here.

I have already completed most of what is required, but what actually gets encoded confuses me. For example, the tutorial explains that Fisher encoding is performed using the parameters obtained from a GMM, namely [means, covariances, priors], and that the SIFT features should be used to build the GMM. From the tutorial:

Fisher encoding uses a GMM to construct a visual word dictionary. To illustrate constructing a GMM, consider a number of two-dimensional data points. In practice, these points would be a collection of SIFT or other local image features.

    numFeatures = 5000;
    dimension = 2;
    data = rand(dimension, numFeatures);
    numClusters = 30;
    [means, covariances, priors] = vl_gmm(data, numClusters);
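
In my case the data points are not random numbers but SIFT descriptors. For a single image they would look roughly like this (only a sketch, with a placeholder file name and assuming an RGB image that needs grayscale conversion):

    im     = im2single(rgb2gray(imread('someImage.jpg')));  % placeholder image
    [~, d] = vl_sift(im);   % d is 128 x numKeypoints, one SIFT descriptor per column
    data   = single(d);     % so dimension = 128 and numFeatures = size(d, 2)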

But once I have completed this step, am I supposed to encode a different dataset? That is what bothers me: I have already used my extracted SIFT features to generate the GMM parameters.

Next, we create another random set of vectors, which should be encoded using the Fisher vector representation and the GMM just obtained:

    encoding = vl_fisher(datatoBeEncoded, means, covariances, priors);

So here encoding is the end result, but WHAT is being encoded? I need the SIFT features I extracted from my images to be encoded, but if I follow the tutorial they have already been used to build the GMM. If so, what is datatoBeEncoded? Can I use the SIFT features again?

thanks

Update

@Shai

Thank you, but I believe I am doing something wrong. I don't quite understand what you mean by "comparing the images with themselves". I have 4 classes, with 1000 images in each class. I used the first 600 images from class 1 to estimate the GMM parameters, and then use these parameters to encode the Fisher vectors:

    numClusters = 128;
    [means, covariances, priors] = vl_gmm(data, numClusters);

So means and covariances each have a size of 128 x 128, and priors has a size of 1 x 128.

Now when I use them to encode Fisher vectors for the remaining 400 images with

    encoding = vl_fisher(datatoBeEncoded, means, covariances, priors);

the size of encoding is very different from 12000 x 1, so the encoded vectors cannot be compared with the generated models.
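
As far as I understand, the vector returned by vl_fisher should have length 2*D*K, where D is the descriptor dimension and K the number of clusters, so with SIFT (D = 128) and 128 clusters every image should give a 32768 x 1 vector regardless of how many descriptors it has. This is the sanity check I run (just a sketch):

    D = size(means, 1);        % descriptor dimension (128 for SIFT)
    K = size(means, 2);        % number of GMM clusters (128 here)
    expectedLen = 2 * D * K;   % 2 * 128 * 128 = 32768
    assert(numel(encoding) == expectedLen, 'unexpected Fisher vector length');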

I already had a system working on the unencoded version of the dataset, and it performed well, but I wanted to see what difference the encoding makes; theoretically the results should improve.

I can add the code here if necessary, but it is for UBM-GMM, and the reason I got confused is that the training procedure you describe is what I use for the UBM.

If I simply encode test images, I cannot use them in the classifier due to size mismatch.

Perhaps I am not approaching this correctly or have made some silly mistake. Would it be possible to get a simple example from which I can understand how this should work?

thanks a lot

1 answer

The process has two phases:
(1) training, in which you learn some statistical properties of your domain, and
(2) testing, in which you take the learned representation / models and apply them to new samples.

Accordingly, you should split your dataset into two "splits": one split to learn the GMM for the Fisher encoding (the training set) and another split to apply the encoding to (the test set).

Usually you pick a significant number of images that represent your domain of interest well (for example, if you are interested in people, you should use many photos of people indoors and outdoors, close-ups and group photos, etc.). You extract as many SIFT descriptors as you can from these training images and use them to learn the model:

    numClusters = 30;
    [means, covariances, priors] = vl_gmm(TrainingData, numClusters);
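
Here TrainingData is simply the SIFT descriptors of all training images pooled into one D x N matrix, along these lines (only a sketch; the file list is a placeholder and in practice you may want to subsample the descriptors):

    trainFiles   = {'train001.jpg', 'train002.jpg'};   % placeholder list of training images
    TrainingData = [];
    for i = 1:numel(trainFiles)
        im = im2single(rgb2gray(imread(trainFiles{i})));
        [~, d] = vl_sift(im);                       % 128 x numKeypoints descriptors
        TrainingData = [TrainingData, single(d)];   %#ok<AGROW>
    end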

Once you have saved this model, you can apply it to new photos to encode them:

    encoding = vl_fisher(TestData, means, covariances, priors);

Note that while TrainingData is generally very large and can be assembled from dozens (or even hundreds) of images, TestData can be significantly smaller and may even consist of the descriptors collected from a single image.
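
For example, to encode one test image you extract its SIFT descriptors and pass them to vl_fisher together with the stored model; every image then yields a vector of the same fixed length 2*D*K, which is what goes into your classifier (again only a sketch, with a placeholder file name):

    im     = im2single(rgb2gray(imread('test001.jpg')));   % placeholder test image
    [~, d] = vl_sift(im);
    fv     = vl_fisher(single(d), means, covariances, priors);
    % fv is (2 * 128 * numClusters) x 1 no matter how many descriptors the image had,
    % so the Fisher vectors of different images can be stacked into a matrix for a classifier.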

