Why in my case is multithreading slower than sequential programming?

Question


I am new to multithreading and am trying to learn it with a simple program that adds the numbers from 1 to n and returns the sum. In the sequential case, main calls the sumFrom1 function twice, for n = 1e5 and 2e5; in the multithreaded case, two threads are created with pthread_create, and each sum is computed in its own thread. The multithreaded version is much slower than the sequential version (see the results below). I run this on a machine with 12 processors, and there is no communication between the threads.

Multithreaded:

    Thread 1 returns: 0
    Thread 2 returns: 0
    sum of 1..10000: 50005000
    sum of 1..20000: 200010000
    time: 156 seconds

Sequential:

    sum of 1..10000: 50005000
    sum of 1..20000: 200010000
    time: 56 seconds

When I add -O2 to the compilation, the multithreaded version (9 s) is faster than the serial version (11 s), but not by as much as I expected. I can always use the -O2 flag, but I am curious why multithreading is so slow in the unoptimized case. Should it be slower than the serial version? If not, what can I do to make it faster?

The code:

    #include <stdio.h>
    #include <pthread.h>
    #include <time.h>

    typedef struct my_struct
    {
        int n;
        int sum;
    }my_struct_t;

    void *sumFrom1(void* sit)
    {
        my_struct_t* local_sit = (my_struct_t*) sit;
        int i;
        int nsim = 500000; // Loops for consuming time
        int j;

        for(j = 0; j < nsim; j++)
        {
            local_sit->sum = 0;
            for(i = 0; i <= local_sit->n; i++)
                local_sit->sum += i;
        }
    }

    int main(int argc, char *argv[])
    {
        pthread_t thread1;
        pthread_t thread2;
        my_struct_t si1;
        my_struct_t si2;
        int iret1;
        int iret2;
        time_t t1;
        time_t t2;

        si1.n = 10000;
        si2.n = 20000;

        if(argc == 2 && atoi(argv[1]) == 1) // Use "./prog 1" to test the time of multithreaded version
        {
            t1 = time(0);
            iret1 = pthread_create(&thread1, NULL, sumFrom1, (void*)&si1);
            iret2 = pthread_create(&thread2, NULL, sumFrom1, (void*)&si2);
            pthread_join(thread1, NULL);
            pthread_join(thread2, NULL);
            t2 = time(0);

            printf("Thread 1 returns: %d\n",iret1);
            printf("Thread 2 returns: %d\n",iret2);
            printf("sum of 1..%d: %d\n", si1.n, si1.sum);
            printf("sum of 1..%d: %d\n", si2.n, si2.sum);
            printf("time: %d seconds", t2 - t1);
        }
        else // Use "./prog" to test the time of sequential version
        {
            t1 = time(0);
            sumFrom1((void*)&si1);
            sumFrom1((void*)&si2);
            t2 = time(0);

            printf("sum of 1..%d: %d\n", si1.n, si1.sum);
            printf("sum of 1..%d: %d\n", si2.n, si2.sum);
            printf("time: %d seconds", t2 - t1);
        }
        return 0;
    }

Update1:

After a little searching on “false sharing” (thanks, Martin James!), I think it is the main cause. There are (at least) two ways to fix it:

The first way is to insert a buffer zone between the two structures (thanks, @dasblinkenlight):

    my_struct_t si1;
    char memHolder[4096];
    my_struct_t si2;

Without -O2, the time is reduced from ~156 s to ~38 s.
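A variation on the buffer-zone idea, as a minimal sketch: since the compiler is free to lay out local variables as it likes, a char array between the two declarations is not guaranteed to separate them, whereas aligning each structure to a cache line with C11 `_Alignas` is. The 64-byte line size, the `padded_struct_t` wrapper, and the `same_cache_line` helper are assumptions made for illustration and are not part of the original program; the real line size depends on the CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines (common on x86-64, not guaranteed). */
#define CACHE_LINE 64

typedef struct my_struct
{
    int n;
    int sum;
} my_struct_t;

/* Hypothetical wrapper: aligning the member to CACHE_LINE also rounds the
 * wrapper's size up to a multiple of CACHE_LINE, so two wrappers never
 * occupy the same cache line, even as neighbours in an array. */
typedef struct padded_struct
{
    _Alignas(CACHE_LINE) my_struct_t s;
} padded_struct_t;

/* Hypothetical helper: returns 1 if both addresses lie in the same
 * CACHE_LINE-sized block, 0 otherwise. */
static int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

With `padded_struct_t slots[2];` in main, the two threads would receive `&slots[0].s` and `&slots[1].s` in place of `&si1` and `&si2`.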

The second way is to avoid frequent updates to sit->sum, which can be done with a temporary variable in sumFrom1 (as @Jens Gustedt replied):

    int sum = 0; // declared outside the loop so it is still in scope afterwards
    for(j = 0; j < nsim; j++)
    {
        sum = 0;
        for(i = 0; i <= local_sit->n; i++)
            sum += i;
    }
    local_sit->sum = sum;

Without -O2, the time drops from ~156 s to either ~35 s or ~109 s, varying from run to run (there are two peaks, and I don’t know why). With -O2, the time stays at ~8 s.


performance c gcc multithreading pthreads

cogitovita Apr 11 '12 at 9:18


1 answer

Jens Gustedt · Accepted Answer · 2012-04-11T10:26:33+0000

Changing the code to

    typedef struct my_struct
    {
        size_t n;
        size_t sum;
    }my_struct_t;

    void *sumFrom1(void* sit)
    {
        my_struct_t* local_sit = sit;
        size_t nsim = 500000; // Loops for consuming time
        size_t n = local_sit->n;
        size_t sum = 0;

        for(size_t j = 0; j < nsim; j++)
        {
            for(size_t i = 0; i <= n; i++)
                sum += i;
        }
        local_sit->sum = sum;
        return 0;
    }

the phenomenon disappears. The problems you had:

Using int as the data type is completely wrong for such a test. Your numbers were such that the sum overflowed. Overflow of signed types is undefined behavior. You are lucky that it didn't eat your lunch.
Having the bounds and the summation variable behind an indirection buys you additional loads and stores, which at -O0 really are performed as such, with all the implications of false sharing and the like.
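To make the overflow point concrete, here is a small sketch that computes the sum 1..n in 64 bits with the closed form n(n+1)/2 and checks whether the result would still fit in an int. The helper names `sum_to` and `fits_in_int` are made up for this illustration, and the conclusions in the usage note assume a 32-bit int, which is typical but not required by the C standard.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: sum of 1..n via n(n+1)/2, computed in 64 bits.
 * The product n*(n+1) itself wraps once n exceeds roughly 4e9, so this
 * is only meant for the small n discussed here. */
static uint64_t sum_to(uint64_t n)
{
    return n * (n + 1) / 2;
}

/* Hypothetical helper: returns 1 if v can be stored in an int without
 * overflow, 0 otherwise. */
static int fits_in_int(uint64_t v)
{
    return v <= (uint64_t)INT_MAX;
}
```

Under that assumption, `sum_to(20000)` still fits in an int, while `sum_to(100000)` does not. A total accumulated over all 500000 outer iterations without resetting, as in the size_t version above, is far beyond the int range either way.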

There are also other errors in your code:

Missing include for atoi
Superfluous casts to and from void*
Printing time_t as int

Please compile your code with -Wall before posting.
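For illustration, the three smaller fixes could look roughly like the sketch below. The helper functions `want_threads`, `as_my_struct`, and `format_elapsed` are hypothetical names introduced only to keep the example short; they are not part of the original program.

```c
#include <stdio.h>
#include <stdlib.h>  /* declares atoi */
#include <time.h>    /* time_t, difftime */

typedef struct my_struct
{
    size_t n;
    size_t sum;
} my_struct_t;

/* Hypothetical helper: returns 1 when "./prog 1" was used, i.e. when the
 * multithreaded variant is requested. */
static int want_threads(int argc, char *argv[])
{
    return argc == 2 && atoi(argv[1]) == 1;
}

/* In C, void* converts implicitly to and from other object pointer types,
 * so no cast is needed in either direction. */
static my_struct_t *as_my_struct(void *sit)
{
    my_struct_t *local_sit = sit;
    return local_sit;
}

/* Hypothetical helper: time_t need not be an int, so the difference is
 * taken with difftime, which returns a double, and printed with %f. */
static int format_elapsed(char *buf, size_t len, time_t start, time_t end)
{
    return snprintf(buf, len, "time: %.0f seconds", difftime(end, start));
}
```

Similarly, `pthread_create(&thread1, NULL, sumFrom1, &si1)` works without the `(void*)` cast.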
