OpenMP and GSL RNG - performance - implementing 4 threads is 10 times slower than pure serial (quadcore CPU)

Question

OpenMP and GSL RNG - performance - implementing 4 threads is 10 times slower than pure serial (quadcore CPU)

I am trying to turn my C project from serial to parallel programming. Although most of the code has now been redesigned from scratch for this purpose, random number generation is still at its core. Thus, poor performance of the random number generator (RNG) has a very bad effect on the overall performance of the program.

I wrote a few lines of code (see below) to show the problem I encountered, without much clarity.

The problem is this: every time the number of nt threads increases, performance becomes especially worse. On this workstation (linux kernel 2.6.33.4; gcc 4.4.4, intel quadcore CPU), a parallel for-loop takes about 10 times longer with nt = 4 than with nt = 1, regardless of the number of iterations n.

This situation seems to be described here , but the focus is mainly on fortran, a language that I know very little about, so I would really appreciate some help.

I tried to follow their idea of creating different RNGs (with a different seed) to access each thread, but the performance is still very poor. In fact, this different seeding point for each thread also scares me, because I do not see how you can guarantee the quality of the generated numbers at the end (no correlations, etc.).

I already thought about abandoning GSL altogether and implementing a random generator algorithm (for example, Mersenne-Twister), but I suspect that I later encountered the same problem.

Thank you in advance for your answers and advice. Please ask something important that I may have forgotten to mention.

EDIT: Implemented fixes proposed by lucas1024 (pragma for-loop declaration) and JonathanDursi (sowing, setting "a" as a private variable). Performance is still very sluggish in multi-threaded mode.

EDIT 2: Implemented solution proposed by Jonathan Dursi (see comments).

#include <stdio.h> #include <stdlib.h> #include <time.h> #include <gsl/gsl_rng.h> #include <omp.h> double d_t (struct timespec t1, struct timespec t2){ return (t2.tv_sec-t1.tv_sec)+(double)(t2.tv_nsec-t1.tv_nsec)/1000000000.0; } int main (int argc, char *argv[]){ double a, b; int i,j,k; int n=atoi(argv[1]), seed=atoi(argv[2]), nt=atoi(argv[3]); printf("\nn\t= %d", n); printf("\nseed\t= %d", seed); printf("\nnt\t= %d", nt); struct timespec t1, t2, t3, t4; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1); //initialize gsl random number generator const gsl_rng_type *rng_t; gsl_rng **rng; gsl_rng_env_setup(); rng_t = gsl_rng_default; rng = (gsl_rng **) malloc(nt * sizeof(gsl_rng *)); #pragma omp parallel for num_threads(nt) for(i=0;i<nt;i++){ rng[i] = gsl_rng_alloc (rng_t); gsl_rng_set(rng[i],seed*i); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t2); for (i=0;i<n;i++){ a = gsl_rng_uniform(rng[0]); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t3); omp_set_num_threads(nt); #pragma omp parallel private(j,a) { j = omp_get_thread_num(); #pragma omp for for(i=0;i<n;i++){ a = gsl_rng_uniform(rng[j]); } } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t4); printf("\n\ninitializing:\t\tt1 = %f seconds", d_t(t1,t2)); printf("\nsequencial for loop:\tt2 = %f seconds", d_t(t2,t3)); printf("\nparalel for loop:\tt3 = %f seconds (%f * t2)", d_t(t3,t4), (double)d_t(t3,t4)/(double)d_t(t2,t3)); printf("\nnumber of threads:\tnt = %d\n", nt); //free random number generator for (i=0;i<nt;i++) gsl_rng_free(rng[i]); free(rng); return 0; }

+6

c parallel-processing openmp gsl

dd_rlwll Mar 28 '12 at 23:49

source share

1 answer

Lubo Antonov · Answer 1 · 2012-03-29T00:11:08+0000

The problem is the second line of #pragma omp. The first #pragma omp spawns 4 threads. After that, you should just say #pragma omp for - not #pragma omp parallel for.

With the current code, depending on your omp attachment settings, you create 4 x 4 threads that do the same job and gain access to the same data.

OpenMP and GSL RNG - performance - implementing 4 threads is 10 times slower than pure serial (quadcore CPU)

More articles: