Only very recently have the prospects of parallel programming caught my attention. Since then, I have used several parallel programming libraries. My first stop was Intel Threading Building Blocks (TBB). What often became a bottleneck, however, were errors caused by factors such as round-off and the unpredictable behavior of these programs on different processor architectures. The following is a code snippet that calculates the Pearson correlation coefficient for two sets of values. It uses the most basic TBB parallel patterns, parallel_for and parallel_reduce:
Good! It worked perfectly on a Windows machine with a Core i5 inside. It gave me exactly the same values for each parameter in the output, with the parallel code roughly twice as fast as the serial code. Here is my output:
OS: Windows 7 Ultimate 64-bit
Processor: Core i5

    Serial Part
    -----------
    Mean of a :1.81203e-05
    Mean of b :1.0324e-05
    Standard deviation of a :0.707107
    Standard deviation of b :0.707107
    Pearson Correlation Coefficient: 3.65091e-07

    Parallel Part
    -------------
    Mean of a :1.81203e-05
    Mean of b :1.0324e-05
    Standard deviation of a :0.707107
    Standard deviation of b :0.707107
    Pearson Correlation Coefficient: 3.65091e-07

    Time Estimates
    --------------
    Serial Time : 0.0204829 Seconds
    Parallel Time : 0.00939971 Seconds
What about other machines? If I said that it works fine everywhere, at least some of my friends would say, "Wait, something's fishy." There were slight differences in the answers (between those produced by the parallel and the serial code) on different machines, although the parallel code was always faster than the serial code. So what caused these differences? The conclusion we reached about this abnormal behavior was that it comes from rounding errors caused by excessive parallelism and from differences in processor architectures.
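To illustrate how the grouping of additions alone can change a floating-point result, here is a minimal sketch in plain C++ (no TBB). The helper names `serial_sum` and `chunked_sum` are hypothetical and are not part of the original snippet; `chunked_sum` only imitates the way a parallel reduction adds up partial sums per chunk before combining them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum left to right, the order a simple serial loop uses.
double serial_sum(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Sum `chunks` contiguous pieces separately, then combine the partial sums.
// This imitates (sequentially) how a parallel reduction groups additions.
double chunked_sum(const std::vector<double>& v, std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t n = v.size();
    double total = 0.0;
    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t begin = c * n / chunks;
        const std::size_t end = (c + 1) * n / chunks;
        double partial = 0.0;  // each chunk starts from its own zero
        for (std::size_t i = begin; i < end; ++i) partial += v[i];
        total += partial;      // combine step, like a reduction's join
    }
    return total;
}
```

For the deliberately ill-conditioned input {1e20, 1.0, -1e20, 1.0}, `serial_sum` returns 1 while `chunked_sum` with two chunks returns 0, because each chunk absorbs its 1.0 into a value of magnitude 1e20 before the large terms cancel. With well-scaled data the discrepancy from regrouping is normally confined to the last few digits.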
This leads to my questions:
- What precautions do we need to take when we use parallel-processing libraries in our code to exploit multi-core processors?
- In what situations should we not use a parallel approach even though there are several processors available?
- What is the best we can do to avoid rounding errors? (Let me point out that I'm not talking about enforcing mutexes and barriers, which can put a cap on the extent of parallelism, but about simple programming tips that can be handy at times.)
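As one example of the kind of simple tip the last question asks about, compensated (Kahan) summation reduces the rounding error accumulated inside each partial sum. The sketch below is plain C++ with hypothetical helper names (`naive_sum`, `kahan_sum`); it is an assumption about what might help, not something taken from the original snippet. It lowers the error of each sum but does not by itself make parallel and serial results bit-identical. TBB also offers parallel_deterministic_reduce, which may be worth a look when run-to-run reproducibility matters.

```cpp
#include <vector>

// Plain left-to-right summation, for comparison.
double naive_sum(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum;
}

// Kahan (compensated) summation: `c` carries the low-order bits
// that were lost when the previous term was added to `sum`.
// Note: flags such as -ffast-math may optimize the compensation away.
double kahan_sum(const std::vector<double>& v) {
    double sum = 0.0;
    double c = 0.0;
    for (double x : v) {
        const double y = x - c;    // apply the correction to the new term
        const double t = sum + y;  // low-order bits of y may be lost here
        c = (t - sum) - y;         // recover what was lost (negated)
        sum = t;
    }
    return sum;
}
```

For a vector holding 1.0 followed by ten values of 1e-16, `naive_sum` returns exactly 1.0 because every small term is rounded away, whereas `kahan_sum` returns a value greater than 1, close to the true sum of 1.000000000000001.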
I would be very glad to see your suggestions on these issues. If you are short on time, please feel free to answer only whichever question suits you best.
Edit: here I have added more results
OS: Linux Ubuntu 64-bit
Processor: Core i5

    Serial Part
    -----------
    Mean of a :1.81203e-05
    Mean of b :1.0324e-05
    Standard deviation of a :0.707107
    Standard deviation of b :0.707107
    Pearson Correlation Coefficient: 3.65091e-07

    Parallel Part
    -------------
    Mean of a :-0.000233041
    Mean of b :0.00414375
    Standard deviation of a :2.58428
    Standard deviation of b :54.6333
    Pearson Correlation Coefficient: -0.000538456

    Time Estimates
    --------------
    Serial Time :0.0161237 Seconds
    Parallel Time :0.0103125 Seconds
OS: Linux Fedora 64-bit
Processor: Core i3

    Serial Part
    -----------
    Mean of a :1.81203e-05
    Mean of b :1.0324e-05
    Standard deviation of a :0.707107
    Standard deviation of b :0.707107
    Pearson Correlation Coefficient: 3.65091e-07

    Parallel Part
    -------------
    Mean of a :-0.00197118
    Mean of b :0.00124329
    Standard deviation of a :0.707783
    Standard deviation of b :0.703951
    Pearson Correlation Coefficient: -0.129055

    Time Estimates
    --------------
    Serial Time :0.02257 Seconds
    Parallel Time :0.0107966 Seconds
Edit: results after making the change suggested by timday
OS: Linux Ubuntu 64-bit
Processor: Core i5

    Serial Part
    -----------
    Mean of a :1.81203e-05
    Mean of b :1.0324e-05
    Standard deviation of a :0.707107
    Standard deviation of b :0.707107
    Pearson Correlation Coefficient: 3.65091e-07

    Parallel Part
    -------------
    Mean of a :-0.000304446
    Mean of b :0.00172593
    Standard deviation of a :0.708465
    Standard deviation of b :0.7039
    Pearson Correlation Coefficient: -0.140716

    Time Estimates
    --------------
    Serial Time :0.0235391 Seconds
    Parallel time :0.00810775 Seconds
Best wishes.
Note1: I cannot guarantee that the above code snippet is correct, but I believe it is.
Note2: This piece of code has also been tested on Linux boxes.
Note3: Various combinations of grain sizes and auto-partitioner options have been tried.