I am trying to turn my C project from serial to parallel programming. Although most of the code has now been redesigned from scratch for this purpose, random number generation is still at its core. Thus, poor performance of the random number generator (RNG) has a very bad effect on the overall performance of the program.
I wrote a few lines of code (see below) to show the problem I encountered, without much clarity.
The problem is this: every time the number of nt threads increases, performance becomes especially worse. On this workstation (linux kernel 2.6.33.4; gcc 4.4.4, intel quadcore CPU), a parallel for-loop takes about 10 times longer with nt = 4 than with nt = 1, regardless of the number of iterations n.
This situation seems to be described here , but the focus is mainly on fortran, a language that I know very little about, so I would really appreciate some help.
I tried to follow their idea of ββcreating different RNGs (with a different seed) to access each thread, but the performance is still very poor. In fact, this different seeding point for each thread also scares me, because I do not see how you can guarantee the quality of the generated numbers at the end (no correlations, etc.).
I already thought about abandoning GSL altogether and implementing a random generator algorithm (for example, Mersenne-Twister), but I suspect that I later encountered the same problem.
Thank you in advance for your answers and advice. Please ask something important that I may have forgotten to mention.
EDIT: Implemented fixes proposed by lucas1024 (pragma for-loop declaration) and JonathanDursi (sowing, setting "a" as a private variable). Performance is still very sluggish in multi-threaded mode.
EDIT 2: Implemented solution proposed by Jonathan Dursi (see comments).
#include <stdio.h> #include <stdlib.h> #include <time.h> #include <gsl/gsl_rng.h> #include <omp.h> double d_t (struct timespec t1, struct timespec t2){ return (t2.tv_sec-t1.tv_sec)+(double)(t2.tv_nsec-t1.tv_nsec)/1000000000.0; } int main (int argc, char *argv[]){ double a, b; int i,j,k; int n=atoi(argv[1]), seed=atoi(argv[2]), nt=atoi(argv[3]); printf("\nn\t= %d", n); printf("\nseed\t= %d", seed); printf("\nnt\t= %d", nt); struct timespec t1, t2, t3, t4; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1); //initialize gsl random number generator const gsl_rng_type *rng_t; gsl_rng **rng; gsl_rng_env_setup(); rng_t = gsl_rng_default; rng = (gsl_rng **) malloc(nt * sizeof(gsl_rng *)); #pragma omp parallel for num_threads(nt) for(i=0;i<nt;i++){ rng[i] = gsl_rng_alloc (rng_t); gsl_rng_set(rng[i],seed*i); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t2); for (i=0;i<n;i++){ a = gsl_rng_uniform(rng[0]); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t3); omp_set_num_threads(nt); #pragma omp parallel private(j,a) { j = omp_get_thread_num(); #pragma omp for for(i=0;i<n;i++){ a = gsl_rng_uniform(rng[j]); } } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t4); printf("\n\ninitializing:\t\tt1 = %f seconds", d_t(t1,t2)); printf("\nsequencial for loop:\tt2 = %f seconds", d_t(t2,t3)); printf("\nparalel for loop:\tt3 = %f seconds (%f * t2)", d_t(t3,t4), (double)d_t(t3,t4)/(double)d_t(t2,t3)); printf("\nnumber of threads:\tnt = %d\n", nt); //free random number generator for (i=0;i<nt;i++) gsl_rng_free(rng[i]); free(rng); return 0; }