To reiterate what Jeff said in the comments, you have a Xeon host with a Xeon Phi coprocessor attached. The current generation of Xeon Phi (Knights Corner) is only available as a coprocessor, not as a standalone Xeon Phi host (which should become available with the next generation, Knights Landing).
When you run your program on the Xeon host without offloading, it appears from this website that you can use up to 16 threads. Note that each of those cores runs at about 2.2 GHz.
When you run your program natively on the Xeon Phi coprocessor, you can use many more threads. The optimal number of threads depends on the Xeon Phi model you have (some work best with 56, others with 60). Note, however, that each Xeon Phi core (approximately 1.2 GHz) is noticeably weaker than one Xeon core (approximately 2.2 GHz). The advantage of the Xeon Phi's many-core design is that you have far more cores to run on at once.
The last, very important thing to keep in mind is that the Xeon Phi has a 512-bit SIMD instruction set. On the Xeon Phi coprocessor you can therefore gain much more from SIMD vectorization than on the host. In your case, I believe your Xeon host has only a 256-bit SIMD vector processing unit. So, if you have not already done so, you can improve performance on the Xeon Phi by taking advantage of SIMD vectorization (up to x16 if you are working in single precision), whereas the Xeon host will only give you up to x8. To put you on the right track, OpenMP 4.0 lets you write things like #pragma omp simd to tell the compiler to vectorize the innermost loops of your code. If you really want to get the most out of the Xeon Phi, adding SIMD vectorization is a must.
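As a rough illustration of the point above, here is a minimal sketch in C. The function names `simd_lanes` and `saxpy` are made up for this example and do not come from your code. The first function only does the arithmetic behind the x16 and x8 figures (register width divided by element width); the second shows where `#pragma omp simd` would go on an inner loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many elements fit in one SIMD register.
 * 512-bit register / 32-bit float = 16 lanes; 256-bit / 32-bit = 8 lanes. */
size_t simd_lanes(size_t register_bits, size_t element_bits)
{
    assert(element_bits > 0);
    return register_bits / element_bits;
}

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
 * The pragma asks an OpenMP 4.0 compiler to vectorize this loop.
 * If OpenMP is not enabled, the pragma is ignored and the loop
 * still compiles and produces the same result. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The lane count is only an upper bound on the speedup from vectorization; memory bandwidth and loop structure usually keep the real gain below it.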
So, to answer your question directly: comparing performance between the Xeon host and the Xeon Phi coprocessor using the same number of cores is not meaningful. We already know that each Xeon Phi core is slower than each Xeon core. If you want a direct comparison, you should compare results using the maximum number of cores each one allows (60 and 16, respectively), and with full use of the vector processing unit on both.
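For such a comparison, one option is to build the same source once for the host and once natively for the coprocessor, then let each run use all of its threads. The sketch below assumes a simple reduction as the workload; `available_threads` and `parallel_sum` are hypothetical names chosen for this example. It combines threading and vectorization with the OpenMP 4.0 `parallel for simd` construct.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: number of threads the OpenMP runtime will use
 * by default, or 1 when the code is built without OpenMP. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Hypothetical workload: sum an array using all threads, with each
 * thread's chunk of the loop also vectorized. The reduction clause
 * gives every thread a private partial sum that is combined at the end. */
double parallel_sum(const double *data, size_t n)
{
    double total = 0.0;
    #pragma omp parallel for simd reduction(+:total)
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```

The thread count can be set per run with the standard `OMP_NUM_THREADS` environment variable, so the same binary logic can be timed at each platform's maximum rather than at an equal core count.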