Assuming the camera is moving, you can try to estimate the ground plane (the road).
You can get an estimate of the ground plane's homography by extracting features (SURF rather than SIFT, for speed), matching them across pairs of frames, and solving for a homography using RANSAC, since a plane in 3D moves according to a homography between two camera frames.
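In practice this is one call to OpenCV's `cv2.findHomography(..., cv2.RANSAC)` on matched keypoints. As a minimal sketch of what happens inside, here is the DLT + RANSAC core in plain NumPy, run on synthetic point correspondences (which stand in for real SURF matches; the point data and thresholds are illustrative assumptions):

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H (dst ~ H @ src, homogeneous) from >= 4 point pairs via the DLT."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)          # null vector = flattened homography
    return H / H[2, 2]

def apply_h(H, pts):
    """Apply homography H to an (N, 2) array of points."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, iters=500, thresh=2.0, seed=0):
    """RANSAC: repeatedly fit H to 4 random pairs, keep the H with most inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = dlt_homography(src[idx], dst[idx])
        err = np.linalg.norm(apply_h(H, src) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit on all inliers of the best model
    return dlt_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

The outliers that RANSAC rejects here are exactly the matches on moving cars, so the inlier mask is already a first hint of where the cars are.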
Once you have a ground plane, you can identify cars by looking at clusters of pixels that do not move according to the estimated homography.
A more sophisticated approach would be to run Structure from Motion on the ground. It only assumes that the ground is rigid, not that it is flat.
Update
I was wondering, could you elaborate on how you would look for clusters of pixels that do not move according to the homography?
Sure. Say I and K are two video frames and H is the homography mapping features of I to features of K. First you warp I towards K according to H, i.e. you compute the warped image Iw as Iw( [x y]' ) = I( inv(H) [x y]' ) (roughly Matlab notation). Then you look at the squared or absolute difference image Diff = (Iw - K).*(Iw - K). Image content that moves according to the homography H should give small differences (assuming constant illumination and exposure between the frames). Image content that violates H, such as moving cars, should stand out.
For clustering pixels with large errors in Diff I would start with a simple threshold ("every pixel in Diff greater than X is interesting", possibly with an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and grouped into connected components. This may be too simplistic, but it is easy to implement for a first attempt and should be fast. For something fancier, have a look at Clustering on Wikipedia. A 2D Gaussian mixture model might also be interesting; when you initialize it with the detection result from the previous frame, it should be quite fast.
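In OpenCV this is `cv2.threshold` / `cv2.morphologyEx` / `cv2.connectedComponentsWithStats`. As a self-contained sketch of the threshold-and-group step (skipping the morphological cleanup, and using 4-connectivity; both are assumptions for brevity):

```python
import numpy as np
from collections import deque

def detect_blobs(diff, thresh):
    """Threshold a difference image and group hot pixels into
    4-connected components; return one bounding box per component
    as (min_x, min_y, max_x, max_y)."""
    mask = diff > thresh
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    h, w = mask.shape
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                # breadth-first flood fill of one component
                q = deque([(y, x)])
                seen[y, x] = True
                ys, xs = [], []
                while q:
                    cy, cx = q.popleft()
                    ys.append(cy); xs.append(cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Each returned box is a candidate car region; small boxes can then be discarded as noise, which does part of the job of the morphological cleanup.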
I experimented a bit with the two frames you provided, and I have to say I am a little surprised how well this works. :-) Left image: the difference (color-coded) between the two frames you posted. Right image: the difference after aligning the frames with the estimated homography. The remaining differences are clearly the moving cars, and they are strong enough for simple thresholding.

Thinking about the approach you are currently using, it might be interesting to combine it with my suggestion:
- You could try to learn and classify the cars in the difference image D instead of the original image. That would mean learning what a moving car looks like rather than what a car looks like, which may be more reliable.
- You could get rid of the expensive sliding-window search and run the classifier only on regions of D with sufficiently high values.
Some additional notes:
- In theory, cars should stand out even when they are not moving, since they are not flat, but given the distance to the scene and the resolution of the camera this effect may be too subtle.
- You could replace the feature extraction/matching part of my proposal with Optical Flow, if you like. This boils down to identifying flow vectors that deviate from the homography-consistent motion of the ground between the frames. Be aware, though, that optical flow can be prone to outliers. You could also try to estimate the homography from the flow vectors.
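For dense flow you would typically use something like `cv2.calcOpticalFlowFarneback`. Just to make the flow-vector idea tangible, here is a crude block-matching flow estimator in NumPy (SSD matching over an integer search window; patch and search sizes are arbitrary illustrative choices):

```python
import numpy as np

def block_flow(f0, f1, y, x, patch=5, search=6):
    """Crude block-matching optical flow: find the integer displacement
    (dy, dx) of the patch centred at (y, x) in f0 that best matches f1,
    by minimizing the sum of squared differences (SSD)."""
    r = patch // 2
    tpl = f0[y - r:y + r + 1, x - r:x + r + 1]
    best_ssd, best_dv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = f1[y + dy - r:y + dy + r + 1, x + dx - r:x + dx + r + 1]
            ssd = ((cand - tpl) ** 2).sum()
            if ssd < best_ssd:
                best_ssd, best_dv = ssd, (dy, dx)
    return best_dv
```

Flow vectors on the road should agree with the homography-induced motion; vectors on a car will deviate, which is the cue to look for.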
- This is important: whatever method you use, once you have found the cars in one frame, you should use this information to boost the probability of finding them near the same location in subsequent frames (Kalman filter, etc.). That's what tracking is all about!
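OpenCV ships `cv2.KalmanFilter` for exactly this. As a hedged sketch of the idea, here is a minimal constant-velocity Kalman filter for one car in NumPy; `predict()` returns the position where you should search next, and `update()` folds in the detection you actually found (noise parameters `q` and `r` are illustrative guesses, to be tuned):

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal constant-velocity Kalman filter for one tracked car.

    State is [x, y, vx, vy]'; only the position (x, y) is observed."""

    def __init__(self, x0, y0, q=1e-2, r=1.0):
        self.x = np.array([x0, y0, 0.0, 0.0])   # initial state, zero velocity
        self.P = np.eye(4) * 100.0              # large initial uncertainty
        self.F = np.eye(4)                      # constant-velocity transition
        self.F[0, 2] = self.F[1, 3] = 1.0
        self.H = np.zeros((2, 4))               # we measure position only
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * q                  # process noise
        self.R = np.eye(2) * r                  # measurement noise

    def predict(self):
        """Propagate the state one frame; returns the predicted position,
        i.e. where to search for the car in the next frame."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, zx, zy):
        """Fold in a detection (zx, zy); returns the corrected position."""
        z = np.array([zx, zy])
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

A typical loop is: `predict()`, search the difference image only near the predicted position, then `update()` with whatever detection you found.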