MKL Performance on Intel Phi

I have a procedure that makes several MKL calls on small matrices (50-100 x 1000 elements) to fit a model, and I call this procedure for a number of different models. In pseudocode:

```c
double doModelFit(int model, ...) {
    ...
    while (!done) {
        cblas_dgemm(...);
        cblas_dgemm(...);
        ...
        dgesv(...);
        ...
    }
    return result;
}

int main(int argc, char **argv) {
    ...
    c_start = 1;
    c_stop = nmodel;
    for (int c = c_start; c < c_stop; c++) {
        ...
        result = doModelFit(c, ...);
        ...
    }
}
```

Call the above version 1. Since the models are independent, I can use OpenMP threads to parallelize the model fitting, as follows (version 2):

```c
int main(int argc, char **argv) {
    ...
    int numthreads = omp_get_max_threads();
    #pragma omp parallel for
    for (int t = 0; t < numthreads; t++) {
        // assuming nmodel is divisible by numthreads...
        int c_start = t * nmodel / numthreads + 1;
        int c_stop  = (t + 1) * nmodel / numthreads;
        for (int c = c_start; c <= c_stop; c++) {
            ...
            result = doModelFit(c, ...);
            ...
        }
    }
}
```

When I run version 1 on the host machine, it takes ~11 seconds, and VTune reports poor parallelization, with most of the time spent idle. Version 2 on the host machine takes ~5 seconds, and VTune reports good parallelization (about 100% of the time is spent using 8 processors). Now, when I compile the code to run on the Phi card in native mode (with -mmic), versions 1 and 2 both take about 30 seconds when run from the command line on mic0. When I use VTune to profile them:

  • Version 1 takes about 30 seconds, and hotspot analysis shows that most of the time is spent in __kmp_wait_sleep and __kmp_static_yield. Of 7710 s of CPU time, 5804 s are spin time.
  • Version 2 takes foreeeever... I killed it after it had run for a couple of minutes in VTune. Hotspot analysis shows that of 25254 s of CPU time, 21585 s are spent in [vmlinux].

Can anyone shed some light on what is going on here and why I am getting such poor performance? I use the default value for OMP_NUM_THREADS and set KMP_AFFINITY=compact,granularity=fine (as recommended by Intel). I'm new to MKL and OpenMP, so I'm sure I'm making a rookie mistake.

Thanks, Andrew

+6
2 answers

The most likely reason for this behavior, given that most of the time is spent in the OS (vmlinux), is oversubscription caused by nested OpenMP parallel regions inside the MKL implementations of cblas_dgemm() and dgesv(). E.g. see this example.
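One common way to avoid this oversubscription is to restrict each MKL call to a single thread and let the outer OpenMP loop over models supply all the parallelism. A hypothetical launch script (the binary name and thread count are placeholders; the Phi thread count depends on your card's core count):

```shell
# Serialize each cblas_dgemm/dgesv call so MKL does not spawn a nested
# thread team inside every iteration of the outer OpenMP model loop.
export MKL_NUM_THREADS=1
# One outer thread per hardware thread, e.g. 60 cores x 4 threads on a Phi.
export OMP_NUM_THREADS=240
export KMP_AFFINITY=compact,granularity=fine
./modelfit
```

Equivalently, the program can call mkl_set_num_threads(1) before the parallel region instead of relying on the environment.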

This explanation is confirmed and discussed by Jim Dempsey on the Intel forum.

+1

How about using the sequential MKL library? If you link against MKL's sequential variant, MKL does not spawn OpenMP threads internally, so the threaded MKL calls no longer compete with your outer OpenMP loop. I think you will get better results than you do now.
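For illustration, linking against the sequential layer might look like the following (source file name is a placeholder; exact library names vary by MKL version and platform, so check the MKL Link Line Advisor for your setup):

```shell
# Intel compiler shorthand for the sequential MKL layer:
icc -mmic -O2 modelfit.c -o modelfit -mkl=sequential

# Or the explicit link line, swapping the threaded layer
# (mkl_intel_thread) for the sequential one:
icc -mmic -O2 modelfit.c -o modelfit \
    -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm
```

With this linkage, each doModelFit() call runs its BLAS/LAPACK work on a single thread, and the outer OpenMP loop over models provides all the concurrency.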

0

Source: https://habr.com/ru/post/957215/
