I run the following code on two different computers: the first has an NVIDIA Quadro FX 880M GPU and the second a Quadro FX 1000M (compiled with VS2010, OpenCV 2.4.2, 64-bit; OpenCV was compiled from source). The code I run is as follows:
#include &lt;ctime&gt;
#include &lt;iostream&gt;
#include &lt;opencv2/opencv.hpp&gt;
#include &lt;opencv2/gpu/gpu.hpp&gt;

int n = 1000;              //number of iterations
int t = CV_TM_CCORR_NORMED; //correlation type

//reset GPU, print device info
cv::gpu::printCudaDeviceInfo(cv::gpu::getDevice());
cv::gpu::resetDevice();

//read big image
cv::Mat imgA = cv::imread("img.bmp", CV_LOAD_IMAGE_GRAYSCALE);
//read small, template image
cv::Mat imgB = cv::imread("tmplt.bmp", CV_LOAD_IMAGE_GRAYSCALE);

//upload images to GPU
cv::gpu::GpuMat imgA_GPU, imgB_GPU;
imgA_GPU.upload(imgA);
imgB_GPU.upload(imgB);

cv::gpu::GpuMat imgC_GPU; //correlation results, computed on GPU
cv::Mat imgC_CPU;         //correlation results, computed on CPU

//matchTemplate on GPU, print average time (msec)
size_t t1 = clock();
for (int i = 0; i != n; ++i)
    cv::gpu::matchTemplate(imgA_GPU, imgB_GPU, imgC_GPU, t);
std::cout << "GPU: " << (double(clock()) - t1) / CLOCKS_PER_SEC * 1000.0 / n << std::endl;

//matchTemplate on CPU, print average time (msec)
size_t t2 = clock();
for (int i = 0; i != n; ++i)
    cv::matchTemplate(imgA, imgB, imgC_CPU, t);
std::cout << "CPU: " << (double(clock()) - t2) / CLOCKS_PER_SEC * 1000.0 / n << std::endl;

//download GPU result to host
cv::Mat imgC_GPUhost;
imgC_GPU.download(imgC_GPUhost);

//convert images to 8U
imgC_CPU.convertTo(imgC_CPU, CV_8U, 255);
imgC_GPUhost.convertTo(imgC_GPUhost, CV_8U, 255);

//!!!!!! imgC_GPUhost should be equal to imgC_CPU
cv::Mat diff;
cv::absdiff(imgC_CPU, imgC_GPUhost, diff);

//expected: RESULTS DIFF: 0
std::cout << "RESULTS DIFF: " << cv::sum(diff).val[0] << std::endl;

cv::imwrite("cor2.bmp", imgC_CPU);
cv::imwrite("cor.bmp", imgC_GPUhost);

char s;
std::cin >> s;
There are two main things that puzzle me:
On the Quadro FX 880M this function DOES NOT WORK: the GPU output image (imgC_GPU) is all zeros, regardless of the input type (8U or 32F) or the correlation method (CCORR, CCOEFF, etc.). On the Quadro FX 1000M, by contrast, the CPU and GPU results are consistent. How can this be, and what do I need to do to make it work on the Quadro FX 880M?
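One thing I have not verified yet (this is my own guess, not something the program reported) is whether my OpenCV GPU build actually contains code for the 880M's compute capability (1.2). A minimal check I plan to run, assuming the cv::gpu::DeviceInfo and cv::gpu::TargetArchs interfaces of OpenCV 2.4:

#include &lt;iostream&gt;
#include &lt;opencv2/gpu/gpu.hpp&gt;

int main()
{
    // Report the active device and its compute capability.
    cv::gpu::DeviceInfo info(cv::gpu::getDevice());
    std::cout << "Device: " << info.name()
              << ", compute capability " << info.majorVersion()
              << "." << info.minorVersion() << std::endl;

    // isCompatible() reports whether this OpenCV GPU build contains
    // binary or PTX code that can run on the active device.
    std::cout << "Compatible with this OpenCV build: "
              << (info.isCompatible() ? "yes" : "no") << std::endl;

    // Which architectures the gpu module was compiled for.
    std::cout << "Has CC 1.2 binaries: "
              << cv::gpu::TargetArchs::hasBin(1, 2) << std::endl;
    std::cout << "Has PTX for CC <= 1.2: "
              << cv::gpu::TargetArchs::hasEqualOrLessPtx(1, 2) << std::endl;
    return 0;
}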
The correlation between the template and the image patch corresponding to each pixel of the output image can be computed independently of all other pixels, so the problem parallelizes trivially and a GPU implementation should be ideal. How can it be that, even looking at the average time over many iterations (as in the code), the CPU is about 3 times faster than the GPU? I tested this on both computers with no other processes running in the background.
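My timing loop also includes the very first gpu::matchTemplate call, so one-time CUDA initialization and result-buffer allocation are counted as well; with n = 1000 that should average out, but to be safe I could exclude a warm-up call, roughly as below (the helper name timeGpuMatchTemplate is mine, just for illustration):

#include &lt;opencv2/opencv.hpp&gt;
#include &lt;opencv2/gpu/gpu.hpp&gt;

// Average time (ms) of n gpu::matchTemplate calls, excluding a warm-up
// call so one-time initialization and allocation are not measured.
static double timeGpuMatchTemplate(const cv::gpu::GpuMat& img,
                                   const cv::gpu::GpuMat& tmpl,
                                   cv::gpu::GpuMat& result,
                                   int method, int n)
{
    cv::gpu::matchTemplate(img, tmpl, result, method); // warm-up, not timed

    int64 start = cv::getTickCount();
    for (int i = 0; i < n; ++i)
        cv::gpu::matchTemplate(img, tmpl, result, method);
    return (cv::getTickCount() - start) / cv::getTickFrequency() * 1000.0 / n;
}

Even measured this way I would expect the GPU to win, unless the image and template are small enough that the GPU implementation's overhead dominates. Is that a known limitation?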
Ogad