I am learning OpenMP using the example of computing the value of pi by numerical quadrature. For the serial version, I run the following C code:
    double serial() {
        double step;
        double x, pi, sum = 0.0;
        step = 1.0 / (double) num_steps;
        for (int i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x*x);
        }
        pi = step * sum;
        return pi;
    }
I compare this with an OpenMP implementation that uses a parallel for with a reduction to keep the code short:
    double SPMD_for_reduction() {
        double step;
        double pi, sum = 0.0;
        step = 1.0 / (double) num_steps;
        #pragma omp parallel for reduction (+:sum)
        for (int i = 0; i < num_steps; i++) {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x*x);
        }
        pi = step * sum;
        return pi;
    }
With num_steps = 1,000,000,000 and 6 threads in the OpenMP case, I compile and time the two functions like this:
    double start_time = omp_get_wtime();
    serial();
    double end_time = omp_get_wtime();
    start_time = omp_get_wtime();
    SPMD_for_reduction();
    end_time = omp_get_wtime();
Without compiler optimizations (using cc), the runtimes are about 4 s (serial) and 0.66 s (OpenMP). With the -O3 flag, the serial runtime drops to ".000001s", while the OpenMP runtime is basically unchanged. What is going on here? Is the compiler using vector instructions, or is this caused by bad code or a flawed measurement method? If it is vectorization, why does the OpenMP function not benefit from it as well?
In case it is relevant, my machine has a modern 6-core Xeon processor.
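In case it helps anyone reproduce this, below is a minimal self-contained sketch of the test. Several things in it are my assumptions rather than part of the snippets above: num_steps as a file-scope variable, the helper names wall_seconds, time_call and report (all hypothetical), and the fallback to C11 timespec_get when the file is built without OpenMP. Unlike the timing snippet above, it stores each return value and prints it, so both calls produce output that can be checked.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
#endif

/* Assumption: num_steps is a file-scope variable; its declaration
   is not shown in the snippets above. */
long num_steps = 1000000000;

/* Hypothetical helper: wall-clock time in seconds. Uses omp_get_wtime()
   when built with OpenMP, and C11 timespec_get() otherwise. */
double wall_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
#endif
}

/* Serial midpoint-rule sum for the integral of 4/(1+x^2) over [0,1]. */
double serial(void) {
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;
    for (int i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}

/* Same sum, with the loop shared among threads and the per-thread
   partial sums combined by the reduction clause. */
double SPMD_for_reduction(void) {
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;
    #pragma omp parallel for reduction (+:sum)
    for (int i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}

/* Hypothetical helper: times one call of fn, stores its return value
   in *result, and returns the elapsed wall-clock seconds. */
double time_call(double (*fn)(void), double *result) {
    double start = wall_seconds();
    *result = fn();
    return wall_seconds() - start;
}

/* Hypothetical helper: runs both versions and prints value and time. */
void report(void) {
    double pi_serial, pi_omp;
    double t_serial = time_call(serial, &pi_serial);
    double t_omp = time_call(SPMD_for_reduction, &pi_omp);
    printf("serial: pi = %.12f, %f s\n", pi_serial, t_serial);
    printf("omp:    pi = %.12f, %f s\n", pi_omp, t_omp);
}
```

Calling report() from main prints both estimates of pi next to their wall-clock times, which makes it easier to compare an unoptimized build against an -O3 build.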
Thanks!