Why is my O(n log n) heapsort slower than my O(n^2) selection sort?

I implemented two algorithms for sorting items from highest to lowest.

The first runs in quadratic time in the real-RAM model, and the second runs in O(n log(n)) time. The second uses a priority queue to achieve the speedup.

Below are the timings output by the program.

  • the first column is the size of a random array of integers
  • the second column is the time in seconds for the O(n^2) method
  • the third column is the time in seconds for the O(n log(n)) method

    9600  1.92663    7.58865
    9800  1.93705    7.67376
    10000 2.08647    8.19094

Despite this large difference in complexity, the third column is larger than the second for the array sizes in question. Why is this? Is the C++ priority_queue implementation slow?

I ran this code on Windows 7 with Visual Studio 2012, 32-bit.

Here is the code:

    #include "stdafx.h"
    #include <iostream>
    #include <iomanip>
    #include <cstdlib>
    #include <algorithm>
    #include <vector>
    #include <queue>
    #include <Windows.h>
    #include <assert.h>

    using namespace std;

    double time_slower_sort(vector<int>& a)
    {
        LARGE_INTEGER frequency, start, end;
        if (::QueryPerformanceFrequency(&frequency) == FALSE) exit(0);
        if (::QueryPerformanceCounter(&start) == FALSE) exit(0);

        for (size_t i = 0; i < a.size(); ++i)
        {
            // selection sort step: find the largest remaining element
            // and swap it into position i
            vector<int>::iterator it = max_element(a.begin() + i, a.end());
            int max_value = *it;
            *it = a[i];
            a[i] = max_value;
        }

        if (::QueryPerformanceCounter(&end) == FALSE) exit(0);
        return static_cast<double>(end.QuadPart - start.QuadPart) / frequency.QuadPart;
    }

    double time_faster_sort(vector<int>& a)
    {
        LARGE_INTEGER frequency, start, end;
        if (::QueryPerformanceFrequency(&frequency) == FALSE) exit(0);
        if (::QueryPerformanceCounter(&start) == FALSE) exit(0);

        // Push into the priority queue. Logarithmic cost per insertion
        // => O(n log(n)) total insertion cost
        priority_queue<int> pq;
        for (size_t i = 0; i < a.size(); ++i)
        {
            pq.push(a[i]);
        }

        // Read the elements back from the priority queue in priority order.
        // Logarithmic cost per read => O(n log(n)) cost for the entire vector
        for (size_t i = 0; i < a.size(); ++i)
        {
            a[i] = pq.top();
            pq.pop();
        }

        if (::QueryPerformanceCounter(&end) == FALSE) exit(0);
        return static_cast<double>(end.QuadPart - start.QuadPart) / frequency.QuadPart;
    }

    int main(int argc, char** argv)
    {
        // Iterate over vectors of different sizes and time the two variants
        for (size_t N = 1000; N <= 10000; N += 100)
        {
            // initialize two vectors with identical random elements
            vector<int> a(N), b(N);
            for (size_t i = 0; i < N; ++i)
            {
                a[i] = rand() % 1000;
                b[i] = a[i];
            }

            // Sort with the two variants and time them
            cout << N << " " << time_slower_sort(a) << "\t\t" << time_faster_sort(b) << endl;

            // Sanity check
            for (size_t i = 0; i <= N - 2; ++i)
            {
                assert(a[i] == b[i]);     // both should return the same answer
                assert(a[i] >= a[i + 1]); // else not sorted
            }
        }
        return 0;
    }
+5
4 answers

I think the problem is indeed more subtle than it looks. In your O(N^2) solution you make no allocations: the algorithm works in place, finding the largest remaining element and swapping it with the one at the current position. That is fine.

But in the O(N log N) priority_queue version, the priority_queue internally uses a std::vector by default to store its state. As you push elements one by one, this vector sometimes has to grow (and it does), and that is time the O(N^2) version never spends. If you make the following small change to how the priority_queue is initialized:

    priority_queue<int> pq(a.begin(), a.end()); // instead of the for loop

the O(N log N) time beats the O(N^2) time, as it rightly should. The proposed change still allocates in the priority_queue version, but only once (you save many reallocations for large vector sizes, and allocation is one of the more time-consuming operations), and the initialization may be cheaper too (the range constructor can build the whole heap state in O(N); I don't know whether the STL actually does this).

Sample code (to compile and run):

    #include <iostream>
    #include <iomanip>
    #include <cstdlib>
    #include <algorithm>
    #include <vector>
    #include <queue>
    #include <Windows.h>
    #include <assert.h>

    using namespace std;

    double time_slower_sort(vector<int>& a)
    {
        LARGE_INTEGER frequency, start, end;
        if (::QueryPerformanceFrequency(&frequency) == FALSE) exit(0);
        if (::QueryPerformanceCounter(&start) == FALSE) exit(0);

        for (size_t i = 0; i < a.size(); ++i)
        {
            vector<int>::iterator it = max_element(a.begin() + i, a.end());
            int max_value = *it;
            *it = a[i];
            a[i] = max_value;
        }

        if (::QueryPerformanceCounter(&end) == FALSE) exit(0);
        return static_cast<double>(end.QuadPart - start.QuadPart) / frequency.QuadPart;
    }

    double time_faster_sort(vector<int>& a)
    {
        LARGE_INTEGER frequency, start, end;
        if (::QueryPerformanceFrequency(&frequency) == FALSE) exit(0);
        if (::QueryPerformanceCounter(&start) == FALSE) exit(0);

        // Build the priority queue from the whole range at once
        priority_queue<int> pq(a.begin(), a.end()); // <----- THE ONLY CHANGE IS HERE

        // Read the elements back from the priority queue in priority order.
        // Logarithmic cost per read => O(n log(n)) cost for the entire vector
        for (size_t i = 0; i < a.size(); ++i)
        {
            a[i] = pq.top();
            pq.pop();
        }

        if (::QueryPerformanceCounter(&end) == FALSE) exit(0);
        return static_cast<double>(end.QuadPart - start.QuadPart) / frequency.QuadPart;
    }

    int main(int argc, char** argv)
    {
        // Iterate over vectors of different sizes and time the two variants
        for (size_t N = 1000; N <= 10000; N += 100)
        {
            // initialize two vectors with identical random elements
            vector<int> a(N), b(N);
            for (size_t i = 0; i < N; ++i)
            {
                a[i] = rand() % 1000;
                b[i] = a[i];
            }

            // Sort with the two variants and time them
            cout << N << " " << time_slower_sort(a) << "\t\t" << time_faster_sort(b) << endl;

            // Sanity check
            for (size_t i = 0; i <= N - 2; ++i)
            {
                assert(a[i] == b[i]);     // both should return the same answer
                assert(a[i] >= a[i + 1]); // else not sorted
            }
        }
        return 0;
    }

On my PC (Core 2 Duo 6300), the output is:

    1100  0.000753738    0.000110263
    1200  0.000883201    0.000115749
    1300  0.00103077     0.000124526
    1400  0.00126994     0.000250698
    ...
    9500  0.0497966      0.00114377
    9600  0.051173       0.00123429
    9700  0.052551       0.00115804
    9800  0.0533245      0.00117614
    9900  0.0555007      0.00119205
    10000 0.0552341      0.00120466
+3

The third column is larger than the second for the array sizes in question.

Big-O notation tells you how the running time grows with the size of the input.

Your times are (or should be):

    A + B*N^2        for the quadratic case,
    C + D*N*log(N)   for the linearithmic case.

But it is quite possible that C is much larger than A, which leads to a longer execution time for the linearithmic code when N is sufficiently small.

What makes linearithmic complexity interesting is this: if your input grows from 9600 to 19200 (doubling), the runtime of the quadratic algorithm should roughly quadruple, from about two seconds to about eight, while the linearithmic algorithm should take only slightly more than double its previous runtime.

Thus the ratio of runtimes goes from 2:8 to 8:16, i.e. the quadratic algorithm is now only twice as fast.

Double the input size again and 8:16 becomes 32:32: the two algorithms are equally fast at an input size of about 40,000.

Double the input size once more, to 80,000, and the ratio flips: four times 32 is 128, while two times 32 is only 64. 128:64 means the linearithmic algorithm is now twice as fast as the other.

You should run tests with very different sizes, perhaps N, 2*N and 4*N, to better estimate your constants A, B, C and D.

It all boils down to this: don't rely blindly on the Big-O classification. Use it if you expect your inputs to become big; but for small inputs, it is entirely possible that a less scalable algorithm is more efficient.

For instance, here you see that for small input sizes the faster algorithm is the one that runs in quadratic time, which is several times faster than the linearithmic one. But once the input size grows roughly eightfold, the quadratic algorithm's time balloons while the other's barely does.

You might even decide to implement both versions of the algorithm and use one or the other depending on the input size. Some recursive algorithms do exactly that, switching to a simpler method for the smallest subproblems at the bottom of the recursion. In the case shown, you could implement the best algorithm for each size range; but the best compromise is to settle for just two algorithms, the quadratic one up to some small threshold like N = 15 and the linearithmic one beyond it.

I found a link here to Introsort, which

is a sorting algorithm that initially uses Quicksort, but switches to Heapsort when the recursion depth exceeds a level based on the logarithm of the number of elements being sorted, and uses Insertion sort for small cases because of its good locality of reference, i.e. when the data is most likely still in memory and easily referenced.

In the case above, Insertion sort exploits memory locality, which means its constant B is very small; a recursive algorithm is likely to incur higher costs and have a significant value of C. Therefore, for small datasets the more compact algorithms win even if their Big-O classification is worse.

+12
source

You have an O(N^2) algorithm that runs 4 times faster than an O(N log N) algorithm. Or at least you think you do.

The obvious thing to do is to test that assumption. There is not much you can conclude from sizes 9600, 9800 and 10000. Try sizes 1000, 2000, 4000, 8000, 16000 and 32000. Does the first algorithm's time quadruple with each step, as it should? Does the second algorithm's time increase by a factor of slightly more than two, as it should?

If yes, then the O(N^2) and O(N log N) behaviors both look genuine, and the second simply has massive constant factors. If not, then your assumption about the running times is wrong, and you can start investigating why. An O(N log N) algorithm taking 4 times longer than an O(N^2) one at N = 10000 would be very unusual, and looks highly suspicious.

+3

Visual Studio likely has excessive overhead in non-optimized/debug builds of the std:: containers, in particular the priority queue class. Note the comment by @msandifords.

I tested your program with g++, first with no optimizations:

    9800  1.42229    0.014159
    9900  1.45233    0.014341
    10000 1.48106    0.014606

Note that my vector times are close to yours, while my priority-queue times are much shorter. This suggests a debug (and therefore very slow) implementation of the priority queue, which would contribute heavily to the large constant factor mentioned in the top-voted answer.

Then with -O3, full optimization (close to your Release mode):

    1000  0.000837    7.4e-05
    9800  0.077041    0.000754
    9900  0.078601    0.000762
    10000 0.080205    0.000771

Now, as a sanity check, you can use a simple complexity formula:

    time = k * N * N    // at N = 1000, time ≈ 0.0008 s, so k ≈ 8e-10

Evaluate this for N = 10000:

    time = 8e-10 * 10000 * 10000    // which conveniently gives time = 0.08

A perfectly consistent result, matching O(N^2) and a good implementation. The same check can be done for the O(N log N) part.

+2

Source: https://habr.com/ru/post/1201602/

