Why are two or four threads even slower than a single thread when computing a vector inner_product with C++11 multithreading?

Question

Following a C++11 multithreading example, I am trying to compute a vector dot_product with multiple threads. The basic idea is to split the vectors into two or four parts and compute the partial_sum of each part concurrently, then add the partial sums together once the threads have been joined. Only CPU and RAM resources are used. I also use a large vector so that the actual work outweighs the thread scheduling overhead.

The problem is that two or four threads are even slower than a single thread, although CPU usage is much higher with two threads. I therefore believe the program really is using two physical cores.

The platform and runtime results are shown below. Some have said that the running time is slightly high. I am not sure whether the performance is related to the OS. I have not tested the code on Linux; could anyone help me run the test there?

Any help would be greatly appreciated.

Here is my implementation:

#include<iostream>
#include<thread>
#include<vector>
#include<mutex>
#include<ctime>
#include<numeric>
#include<iterator>

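using namespace std;

// Return parts + 1 indices that split [0, num) into consecutive ranges;
// the last range also takes the remainder of num / parts.
vector<int> boundary(int num, int parts)
{
    vector<int> bounds;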
    int delta = num / parts;
    int remainder = num % parts;
    int prev = 0, next = 0;
    bounds.push_back(prev);

    for(int i = 0; i < parts; i++)
    {
        next = prev + delta;
        if(i == parts - 1)
            next = next + remainder;
        bounds.push_back(next);
        prev = next;
    } 

    return bounds;
}

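// Compute the dot product of v1 and v2 over the index range [L, R)
// and store it in result; tid is only used by the debug output below.
void dot_product(const vector<int>& v1, const vector<int>& v2, int& result, int L, int R, int tid)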
{
    int partial_sum = 0;

    for(int i = L; i < R; i++)
        partial_sum += v1[i] * v2[i];

    //lock_guard<mutex> lock(barrier);
    //cout << "tid: " << tid<< endl;
    result = partial_sum;
}

int main(int argc, char* argv[])
{
    clock_t start, end;
    int numOfElement = 500000000;
    // Change the thread number here (const, so the array size below is a constant expression)
    const int numOfThread = 2;
    int result[numOfThread] = {0};
    vector<thread> threads;

    // Fill two vectors with some values 
    vector<int> v1(numOfElement, 1), v2(numOfElement, 2);

    // Split numOfElement into numOfThread parts
    vector<int> limits = boundary(numOfElement, numOfThread);

    start = clock();
    // Launch multi_threads:
    for(int i = 0; i < numOfThread; i++)
        threads.push_back(thread(dot_product, ref(v1), ref(v2), ref(result[i]), limits[i], limits[i+1], i)); 

    // Join the threads with the main thread    
    for(auto &t:threads)
        t.join();

    int sum = accumulate(result, result+numOfThread, 0);
    end = clock();  
    //cout << limits[0] <<" " << limits[1]<<" "<<limits[2]<<endl;
    cout << "results: " << sum << " time elapsed: "<< double(end - start) / CLOCKS_PER_SEC << endl;
    return 0;
}