Cost / comparison function to determine the center of an object based on detected features

I wrote an object tracker that tries to detect and track a moving object in a recorded video. To maximize the detection rate, my algorithm uses several detection and tracking algorithms (cascade, foreground and particle tracker). Each tracking algorithm returns a point of interest that may be part of the object I'm trying to track. Assume (for the simplicity of this example) that my object is a rectangle and that the three tracking algorithms return points 1, 2 and 3:

[Image: Step 0]

Based on the relative positions / distances of these three points, one can compute the center of gravity (the blue X in the image above) of the tracked object. So for each frame I can come up with a good estimate of the center of gravity. However, the object can move from one frame to the next:

[Image: Step 1]

In this example I just rotated the original object. My algorithm now gives me three new points of interest: 1', 2' and 3'. I could again compute the center of gravity from these three new points, but then I would throw away the important information I gained from the previous frame: from points 1, 2 and 3 I already know something about the relationship between these points, so by combining the information from 1, 2, 3 and 1', 2', 3' I should get a more accurate estimate of the center of gravity.

In addition, the next frame might yield a fourth data point:

[Image: Step 2]

This is what I would like to do (but I don't know how):

Based on the individual points (and their relationships to each other) returned by the different tracking algorithms, I want to build a localization map of the tracked object. Intuitively I feel that I need to come up with A) an identification function that identifies individual points across frames and B) some kind of cost function that determines how similar the tracked points (and the relations / distances between them) are from frame to frame, but I can't figure out how to implement this. Alternatively, some kind of point-based map building might work, but again I don't know how to approach it. Any advice (and sample code) is much appreciated!

EDIT1: a simple particle filter might also work, but again I don't know how to define the cost function. A particle filter for tracking a specific color is easy to program: for each pixel you compute the difference between the target color and the color of that pixel. But how would I do the same to evaluate the relationship between tracked points?
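
For reference, the color-based weighting I mean looks roughly like this (a minimal numpy sketch; the Gaussian likelihood and the sigma value are just one way to turn the color difference into a particle weight):

    import numpy as np

    def color_weights(frame, particles, target_color, sigma=20.0):
        """Weight each particle by how close the pixel under it is to the target color.

        frame:        HxWx3 BGR image (numpy array)
        particles:    Nx2 array of (x, y) particle positions
        target_color: length-3 array, the BGR color being tracked
        """
        h, w = frame.shape[:2]
        xs = np.clip(particles[:, 0].astype(int), 0, w - 1)
        ys = np.clip(particles[:, 1].astype(int), 0, h - 1)
        pixel_colors = frame[ys, xs].astype(float)            # N x 3
        dist = np.linalg.norm(pixel_colors - target_color, axis=1)
        weights = np.exp(-0.5 * (dist / sigma) ** 2)          # Gaussian likelihood
        return weights / (weights.sum() + 1e-12)              # normalize to sum to 1

What I am missing is the analogous weight for a whole configuration of points rather than a single pixel color.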

EDIT2: intuitively I feel that Kalman filters could also help with the prediction step. See slides 24 to 32 of this pdf. Or am I misguided?

4 answers

What I think you are trying to do is define a state space that you can plug into a filtering framework such as the Extended Kalman Filter. This is a useful framework when you have several observations in each frame and you are trying to estimate or measure something indicated by those observations.
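
For the center of gravity alone, that state space can be as simple as position plus velocity. Here is a minimal sketch with OpenCV's KalmanFilter (a plain linear filter with a constant-velocity model and placeholder noise values, not a full EKF over all the points):

    import cv2
    import numpy as np

    # State: [x, y, vx, vy], measurement: [x, y] (the per-frame center-of-gravity estimate)
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], dtype=np.float32)  # constant velocity
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], dtype=np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2           # placeholder values
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1.0
    # Initialize kf.statePost with the first measured center before the tracking loop.

    def track_center(kf, measured_center):
        """Predict, then correct with the center of gravity computed from this frame's points."""
        prediction = kf.predict()
        if measured_center is not None:              # skip correction if detection failed
            kf.correct(np.array(measured_center, dtype=np.float32).reshape(2, 1))
        return prediction[:2].ravel()

To also exploit the relations between the points, you would extend the state vector with the offset of each point relative to the center, which is where the EKF formulation becomes worthwhile.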

To determine the similarity of the tracked points, you can do simple frame-to-frame template matching on small areas around the points. One way to do this is to extract an NxN region (say, 7x7) around point a in frame n and around point a' in frame n+1, and then compute the normalized cross-correlation between the extracted patches. That gives you a reasonable measure of how similar the patches are. If the patches do not look alike, you have probably lost that point.
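
A minimal sketch of that comparison in plain numpy (the 7x7 size and the 0.7 threshold are arbitrary, and the patch extraction assumes the point is not too close to the image border):

    import numpy as np

    def extract_patch(gray, pt, n=7):
        """Cut an n x n patch (n odd) around pt = (x, y) from a grayscale frame."""
        x, y = int(pt[0]), int(pt[1])
        r = n // 2
        return gray[y - r:y + r + 1, x - r:x + r + 1].astype(float)

    def ncc(patch_a, patch_b):
        """Normalized cross-correlation in [-1, 1]; values near 1 mean the patches look alike."""
        a = patch_a - patch_a.mean()
        b = patch_b - patch_b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return (a * b).sum() / denom if denom > 1e-9 else 0.0

    # Example usage (frame_n, frame_n1, p, p_prime are hypothetical):
    # score = ncc(extract_patch(frame_n, p), extract_patch(frame_n1, p_prime))
    # if score < 0.7:   # threshold is a guess, tune it
    #     ...point is probably lost...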


There is a huge body of literature on this and related problems going back to the 80s. Look for "optical flow" algorithms. The input to such an algorithm is two consecutive frames of the same scene. The output is a vector field, one vector per pixel of the second image, showing in which direction and how fast the feature at that pixel is moving. This presentation is a pretty nice summary.

The nice thing about optical flow is that many of the algorithms for it are embarrassingly parallel and map nicely onto your favorite GPU, so they can run in real time. Think ESPN overlays.
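
If you only need the flow at your handful of tracked points rather than a dense field, the sparse pyramidal Lucas-Kanade variant in OpenCV is a common starting point; a minimal sketch (window size, pyramid depth and termination criteria are just reasonable defaults, not tuned values):

    import cv2
    import numpy as np

    def flow_points(prev_gray, next_gray, points):
        """Propagate points from the previous frame to the next one with pyramidal LK flow.

        points: N x 2 array of (x, y) locations in prev_gray.
        Returns the new locations and a boolean mask of points that were found again.
        """
        pts = points.reshape(-1, 1, 2).astype(np.float32)
        new_pts, status, err = cv2.calcOpticalFlowPyrLK(
            prev_gray, next_gray, pts, None,
            winSize=(21, 21), maxLevel=3,
            criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
        found = status.ravel() == 1
        return new_pts.reshape(-1, 2), found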


In my opinion, to determine which point is which in each frame, you will need to use a higher-dimensional description. For example, if you want to know which point corresponds to which between two frames (assuming the extracted points are the same), you will have to build vectors or a simplex from them and then derive an organization between your points (for example, angle values).

The main problem is that the number of combinations grows with the number of points. If your camera is fixed, you can use the background as a reference to recover the rotations and translations of the objects; I mean building vectors between fixed points of interest and the objects in order to identify them unambiguously. I hope this helps you move forward.
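
With only three or four points per frame, one brute-force way to derive such an organization is to compare the pairwise-distance pattern of every possible assignment between two frames and keep the cheapest one; a minimal sketch (the cost below uses distances only, angle terms could be added in the same way):

    import numpy as np
    from itertools import permutations

    def pairwise_dists(pts):
        """Matrix of distances between every pair of points (N x N)."""
        diff = pts[:, None, :] - pts[None, :, :]
        return np.linalg.norm(diff, axis=2)

    def match_points(prev_pts, curr_pts):
        """Find the assignment of current points to previous points whose
        pairwise-distance pattern (the 'shape' of the constellation) changes least.

        prev_pts, curr_pts: N x 2 arrays with the same small N,
        since this tries all N! permutations.
        """
        d_prev = pairwise_dists(prev_pts)
        best_cost, best_perm = np.inf, None
        for perm in permutations(range(len(curr_pts))):
            d_curr = pairwise_dists(curr_pts[list(perm)])
            cost = np.abs(d_curr - d_prev).sum()   # how much the inter-point distances changed
            if cost < best_cost:
                best_cost, best_perm = cost, perm
        return best_perm, best_cost

The resulting cost is also a natural candidate for the frame-to-frame similarity measure asked about in the question.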


I would recommend looking into the divided difference filter (DDF), which is similar to the extended Kalman filter (EKF) but does not require an analytic model of your system dynamics (which you may not have). Basically, the DDF approximates the derivatives used in the EKF with divided differences. There are plenty of papers about it online, but I do not know whether you have access to them, so I have not linked them here. If you work at a university or a company with access to online journals (for example, IEEE Xplore), just google "divided difference filter" and check out some of the papers.
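
To make the idea concrete, the core trick is replacing the analytical Jacobian used by the EKF with divided differences of your dynamics function; a heavily simplified sketch (this is only the prediction step, not a full DDF, and the step size h is a placeholder):

    import numpy as np

    def numerical_jacobian(f, x, h=1e-4):
        """Approximate the Jacobian of f at x with central divided differences."""
        n = len(x)
        J = np.zeros((len(f(x)), n))
        for i in range(n):
            dx = np.zeros(n)
            dx[i] = h
            J[:, i] = (f(x + dx) - f(x - dx)) / (2 * h)
        return J

    def predict(f, x, P, Q):
        """EKF-style prediction step that never needs an analytical derivative of f."""
        F = numerical_jacobian(f, x)     # divided-difference stand-in for the Jacobian
        x_pred = f(x)
        P_pred = F @ P @ F.T + Q
        return x_pred, P_pred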


Source: https://habr.com/ru/post/1491607/

