Following a multi-threaded C++11 example, I am trying to use multiple threads to compute the dot product of two vectors. The basic idea is to split the vectors into two or four parts, compute the partial_sum of each part simultaneously, and add the partial sums together once the threads have been joined. Only CPU and RAM resources are used here. I also made the vectors large so that the work per thread outweighs the overhead of creating and scheduling the threads.
The problem is that two or four threads are even slower than one thread, although CPU usage is much higher with two threads. I therefore believe that the program really is using two physical cores.
The platform and timing results are shown below. Some have said that the running time is rather high. I'm not sure whether the performance is related to the OS. I have not tested the code on Linux; could anyone help me run that test?
Any help would be greatly appreciated.
Here is my implementation:
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
#include <ctime>
#include <numeric>
#include <iterator>
using namespace std;

// Returns parts + 1 indices that split [0, num) into contiguous ranges;
// the last range absorbs the remainder.
vector<int> boundary(int num, int parts)
{
    vector<int> bounds;
    int delta = num / parts;
    int remainder = num % parts;
    int prev = 0, next = 0;
    bounds.push_back(prev);
    for (int i = 0; i < parts; i++)
    {
        next = prev + delta;
        if (i == parts - 1)
            next = next + remainder;
        bounds.push_back(next);
        prev = next;
    }
    return bounds;
}

// Computes the dot product of v1 and v2 over the index range [L, R)
// and writes it to result. tid is currently unused.
void dot_product(const vector<int>& v1, const vector<int>& v2, int& result, int L, int R, int tid)
{
    int partial_sum = 0;
    for (int i = L; i < R; i++)
        partial_sum += v1[i] * v2[i];
    result = partial_sum;
}

int main(int argc, char* argv[])
{
    clock_t start, end;
    int numOfElement = 500000000;
    const int numOfThread = 2;  // const, so that result[] is a standard fixed-size array
    int result[numOfThread] = {0};
    vector<thread> threads;
    vector<int> v1(numOfElement, 1), v2(numOfElement, 2);
    vector<int> limits = boundary(numOfElement, numOfThread);

    // Note: clock() measures processor time. On POSIX systems that is CPU time
    // summed over all threads, not wall-clock time.
    start = clock();
    for (int i = 0; i < numOfThread; i++)
        threads.push_back(thread(dot_product, ref(v1), ref(v2), ref(result[i]), limits[i], limits[i+1], i));
    for (auto& t : threads)
        t.join();
    int sum = accumulate(result, result + numOfThread, 0);
    end = clock();

    cout << "results: " << sum << " time elapsed: " << double(end - start) / CLOCKS_PER_SEC << endl;
    return 0;
}
Platform:
OS: Win8 64bit
CPU: I3-3220 (2C4T)
RAM: 12G
IDE: Dev-C ++ 5.11, TDM-GCC 4.9.2 Release
Results:
1 thread: 14.42 seconds, CPU: 60%, RAM: 3.82G (program only, fixed)
2 threads: 19.65 seconds, CPU: 82%, RAM: 3.82G (program only, fixed)
4 threads: 22.33 seconds, CPU: 99%, RAM: 3.82G (program only, fixed)
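Update:
After compiling with the -O2 option of GCC, and taking @yzt's comment about the timing into account, the new results are:
1 thread: 0.57
2 threads: 0.31
4 threads: 0.28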
Going beyond 2 threads gains little, presumably because the I3 has 2 cores and 4 threads.