Get maximum FLOPS for dense matrix multiplication with Xeon Phi Knights Landing

Question

I recently started working with a Xeon Phi Knights Landing (KNL) 7250 machine (http://ark.intel.com/products/94035/Intel-Xeon-Phi-Processor-7250-16GB-1_40-GHz-68-core).

It has 68 cores and supports AVX-512. The base frequency is 1.4 GHz and the turbo frequency is 1.6 GHz. I do not know what the turbo frequency is with all cores active, because the turbo frequency is usually quoted for a single core only.

Each Knights Landing core can perform two 8-wide FMA operations per cycle. Since each FMA counts as two floating-point operations, that gives 32 double-precision floating-point operations per cycle per core.

Therefore, the peak for all cores at the base frequency is 32*68*1.4 = 3046.4 DP GFLOPS.

For one core at the turbo frequency, the peak is 32*1.6 = 51.2 DP GFLOPS.

Dense matrix multiplication is one of the few operations that can actually approach peak FLOPS. The Intel MKL library provides optimized dense matrix multiplication functions. On Sandy Bridge systems I got over 97% of peak FLOPS with DGEMM. On Haswell I got about 90% of peak when I checked a few years ago, so it was clearly harder to reach the peak with FMA at that time. With Knights Landing and MKL, however, I get much less than 50% of peak.

I modified the dgemm_example.c file in the MKL sample directory to calculate GFLOPS using 2.0*1E-9*n*n*n/time (see below).

I also tried export KMP_AFFINITY=scatter and export OMP_NUM_THREADS=68, but that does not seem to make much difference. However, KMP_AFFINITY=compact is significantly slower, and so is OMP_NUM_THREADS=1, so the default thread topology appears to be scatter anyway and threading is working.

The best I have seen is around 1301 GFLOPS, which is about 43% of peak. With one thread I saw 38 GFLOPS, which is about 74% of peak. This tells me that MKL DGEMM is optimized for AVX-512, since otherwise I would see less than 50%. On the other hand, for a single thread I think I should get 90% of peak.

KNL memory can operate in three modes (cache, flat, and hybrid), which can be set in the BIOS (http://www.anandtech.com/show/9794/a-few-notes-on-intels-knights-landing-and-mcdram-modes-from-sc15). I do not know which mode my (or rather, my employer's) KNL system is running in. Could this affect DGEMM?

My question is: why are the FLOPS from DGEMM so low, and what can I do to improve them? Perhaps I have not configured MKL optimally (I am using ICC 17.0).

    source /opt/intel/mkl/bin/mklvars.sh intel64
    icc -O3 -mkl dgemm_example.c

Here is the code

    #define min(x,y) (((x) < (y)) ? (x) : (y))

    #include <stdio.h>
    #include <stdlib.h>
    #include "mkl.h"
    #include "omp.h"

    int main()
    {
        double *A, *B, *C;
        int m, n, k, i, j;
        double alpha, beta;

        printf ("\n This example computes real matrix C=alpha*A*B+beta*C using \n"
                " Intel(R) MKL function dgemm, where A, B, and C are matrices and \n"
                " alpha and beta are double precision scalars\n\n");

        m = 30000, k = 30000, n = 30000;
        printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
                " A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);
        alpha = 1.0; beta = 0.0;

        printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n"
                " performance \n\n");
        A = (double *)mkl_malloc( m*k*sizeof( double ), 64 );
        B = (double *)mkl_malloc( k*n*sizeof( double ), 64 );
        C = (double *)mkl_malloc( m*n*sizeof( double ), 64 );
        if (A == NULL || B == NULL || C == NULL) {
            printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
            mkl_free(A);
            mkl_free(B);
            mkl_free(C);
            return 1;
        }

        printf (" Intializing matrix data \n\n");
        for (i = 0; i < (m*k); i++) {
            A[i] = (double)(i+1);
        }

        for (i = 0; i < (k*n); i++) {
            B[i] = (double)(-i-1);
        }

        for (i = 0; i < (m*n); i++) {
            C[i] = 0.0;
        }

        printf (" Computing matrix product using Intel(R) MKL dgemm function via CBLAS interface \n\n");
        double dtime;
        dtime = -omp_get_wtime();

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, alpha, A, k, B, n, beta, C, n);
        dtime += omp_get_wtime();
        printf ("\n Computations completed.\n\n");
        printf ("time %f\n", dtime);
        printf ("GFLOPS %f\n", 2.0*1E-9*m*n*k/dtime);

        printf (" Top left corner of matrix A: \n");
        for (i=0; i<min(m,6); i++) {
            for (j=0; j<min(k,6); j++) {
                printf ("%12.0f", A[j+i*k]);
            }
            printf ("\n");
        }

        printf ("\n Top left corner of matrix B: \n");
        for (i=0; i<min(k,6); i++) {
            for (j=0; j<min(n,6); j++) {
                printf ("%12.0f", B[j+i*n]);
            }
            printf ("\n");
        }

        printf ("\n Top left corner of matrix C: \n");
        for (i=0; i<min(m,6); i++) {
            for (j=0; j<min(n,6); j++) {
                printf ("%12.5G", C[j+i*n]);
            }
            printf ("\n");
        }

        printf ("\n Deallocating memory \n\n");
        mkl_free(A);
        mkl_free(B);
        mkl_free(C);

        printf (" Example completed. \n\n");
        return 0;
    }

x86 openmp icc intel-mkl xeon-phi

Z boson Dec 23 '16 at 9:47
