This is an open problem in image recognition. In addition to sliding windows, existing approaches include predicting the location of an object in an image as CNN output, predicting boundaries (class pixels that belong to the image border or not), etc. See, for example, this article and the links in it.
Also note that when using CNN using max-pooling, you can determine the positions of function detectors that contributed to the recognition of objects, and use them to determine the possible location area of ββthe object.
source share