Is there a sort procedure faster than qsort?

This is not an algorithmic issue, but an implementation issue.

I have a data structure that looks like this:

    struct MyStruct {
        float val;
        float val2;
        int idx;
    };

I walk over an array of about 40 million values and fill in the field val with each value and the field idx with the element's original index.
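
For concreteness, here is a minimal sketch of that fill step as I read it (it assumes the allocation shown below has already happened, and sourceVals is a hypothetical name for the input data):

    // Hypothetical fill loop; "sourceVals" stands in for the real input array.
    for (int i = 0; i < totalNum; ++i) {
        theElements[i].val = sourceVals[i]; // the value to sort by
        theElements[i].idx = i;             // remember the original position
    }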

Then I call:

    MyStruct* theElements = new MyStruct[totalNum];
    qsort(theElements, totalNum, sizeof(MyStruct), ValOrdering);

and then, once val2 has been filled in, I sort again with:

    qsort(theElements, totalNum, sizeof(MyStruct), IndexOrdering);

Where

    static int ValOrdering(const void* const v1, const void* const v2)
    {
        if (((struct MyStruct*) v1)->val < ((struct MyStruct*) v2)->val)
            return -1;
        if (((struct MyStruct*) v1)->val > ((struct MyStruct*) v2)->val)
            return 1;
        return 0;
    }

and

    static int IndexOrdering(const void* const v1, const void* const v2)
    {
        return ((struct MyStruct*) v1)->idx - ((struct MyStruct*) v2)->idx;
    }

This setup takes 4 seconds to complete both sorts. Four seconds seems like a long time for 40 million elements on a 3 GHz i5; is there a faster approach? I use VS2010 with the Intel compiler (it has sort routines, but none over structures that I can see).

Update. Using std::sort shaves about 0.4 seconds off the runtime. It is called as:

    std::sort(theElements, theElements + totalPixels, ValOrdering);
    std::sort(theElements, theElements + totalPixels, IndexOrdering);

and

    bool GradientOrdering(const MyStruct& i, const MyStruct& j) { return i.val < j.val; }
    bool IndexOrdering(const MyStruct& i, const MyStruct& j) { return i.idx < j.idx; }

Adding the inline keyword to the predicates makes no difference. Since I have a quad-core machine (and the spec allows for one), I will look at multithreading the sort next.

Update 2. Following @SirGeorge and @stark, I tried a single sort using pointer indirection:

    bool GradientOrdering(MyStruct* i, MyStruct* j) { return i->val < j->val; }
    bool IndexOrdering(MyStruct* i, MyStruct* j) { return i->idx < j->idx; }

Even though only one sort is requested (via the GradientOrdering predicate), the resulting run takes 5 seconds, a second longer than the qsort approach. For now, std::sort still wins.

Update 3. Intel's tbb::parallel_sort seems to be the winner, taking a single sort down to 0.5 s on my system (1.0 s for both, which scales quite well from the original 4.0 s for both). I tried the parallel approach suggested by Microsoft here, but since I already use TBB, and the syntax of parallel_sort is identical to that of std::sort, I could reuse my earlier std::sort comparators and be done.
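
For reference, a minimal sketch of the call, reusing the comparator from the std::sort update (the header and signature are TBB's standard ones, but this is not the exact code from my project):

    #include <tbb/parallel_sort.h>

    // Same iterator/comparator interface as std::sort, so the existing
    // predicate drops straight in.
    tbb::parallel_sort(theElements, theElements + totalNum, GradientOrdering);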

I also took @gbulmer's suggestion (obvious in hindsight): I already have the original indices, so instead of doing a second sort I just fill a second array, using the index stored in the first array to put each element back in its original order. I can get away with the extra memory because I only deploy on 64-bit machines with at least 4 GB of RAM (good that those specs were settled ahead of time); without that knowledge, a second sort would be required.

Verdict

@gbulmer's suggestion gives the biggest speedup, but the original question asked for the fastest sort. std::sort is the fastest single-threaded sort and parallel_sort the fastest multithreaded one, but nobody offered that as an answer, so @gbulmer gets the check mark.

+6

6 answers

The data set is huge compared to the cache, so the sort will be cache-limited.

Using indirection makes it worse, because cache is consumed by the pointers and memory is accessed in a more random order, i.e. comparisons are no longer with neighbours. The program then works against whatever prefetching mechanisms the CPU has.

Consider splitting the structure into two structures in two arrays.

As an experiment, compare pass 1 as it is with a pass 1 where the struct is only { float val; int idx; }.

If it's cache and bandwidth, that should make a big difference.
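
A minimal sketch of the split, under the assumption of a reduced record type (the name SortRec and the lambda are mine, not the asker's code):

    #include <algorithm>

    // Assumed reduced record: just what pass 1 needs, 8 bytes instead of 12.
    struct SortRec { float val; int idx; };

    SortRec* keys = new SortRec[totalNum];
    for (int i = 0; i < totalNum; ++i) {
        keys[i].val = theElements[i].val; // copy the sort key only
        keys[i].idx = i;                  // original position
    }
    // Sort the small records; val2 stays behind in the big array.
    std::sort(keys, keys + totalNum,
              [](const SortRec& a, const SortRec& b) { return a.val < b.val; });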

If cache locality is the key issue, it might be worth considering multi-way merges or Shell's sort; anything to improve locality.

Try sorting cache-sized subsets of the records, then doing multi-way merges (it might be worth looking at the processor's cache specification to find out how many prefetch streams it tries to predict). Again, reducing the amount of data transferred from RAM by shrinking the structure may be the winner.
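
A hedged sketch of that chunk-then-merge idea using standard algorithms (the chunk size is an arbitrary placeholder, not a tuned value, and a, n, and byVal stand in for the array, its length, and a comparator):

    #include <algorithm>

    const size_t kChunk = 1 << 20; // assumed run length; tune to the cache
    // Pass 1: sort each cache-sized run independently.
    for (size_t lo = 0; lo < n; lo += kChunk)
        std::sort(a + lo, a + std::min(lo + kChunk, n), byVal);
    // Pass 2: merge runs pairwise, doubling the run width each sweep.
    for (size_t w = kChunk; w < n; w *= 2)
        for (size_t lo = 0; lo + w < n; lo += 2 * w)
            std::inplace_merge(a + lo, a + lo + w,
                               a + std::min(lo + 2 * w, n), byVal);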

How is the idx field generated? It looks like it is the original position in the array. Is it the index of the source record?

If so, just allocate a second array and copy from the first into it:

    struct { float val; float val2; int idx; } sortedByVal[40000000];
    struct { float val; float val2; } sortedByIdx[40000000];

    for (int i = 0; i < 40000000; ++i) {
        sortedByIdx[sortedByVal[i].idx].val  = sortedByVal[i].val;
        sortedByIdx[sortedByVal[i].idx].val2 = sortedByVal[i].val2;
    }

There is no second sort at all. If it works out, combine filling in val2 with this pass.

Edit

I was curious about the relative performance, so I wrote a program to compare the C "library" sort functions qsort, mergesort, and heapsort, and to compare sorting by idx against copying by idx. It also re-sorts the already-sorted values to get some insight into that case, which is interesting too. I have not implemented or tested Shell's sort, which often beats qsort in practice.

The program uses command-line options to select which sort to use, and whether to sort by idx or just copy. Code: http://pastebin.com/Ckc4ixNp

The jitter in the run times is quite noticeable. I should really use process CPU time, do many runs, and report the best results, but that is "an exercise for the reader".

I ran this on an old 2.2 GHz Intel Core 2 Duo MacBook Pro. It depends on OS X for the timing calls.

Timing (slightly reformatted):

    qsort(data, number-of-elements=40000000, element-size=12)
      Sorting by val             - duration = 16.304194
      Re-order to idx by copying - duration = 2.904821
      Sort in-order data         - duration = 2.013237
      Total duration = 21.222251
      User Time: 20.754574   System Time: 0.402959

    mergesort(data, number-of-elements=40000000, element-size=12)
      Sorting by val             - duration = 25.948651
      Re-order to idx by copying - duration = 2.907766
      Sort in-order data         - duration = 0.593022
      Total duration = 29.449438
      User Time: 28.428954   System Time: 0.973349

    heapsort(data, number-of-elements=40000000, element-size=12)
      Sorting by val             - duration = 72.236463
      Re-order to idx by copying - duration = 2.899309
      Sort in-order data         - duration = 28.619173
      Total duration = 103.754945
      User Time: 103.107129   System Time: 0.564034

WARNING These are single runs. Getting reasonable statistics will require many runs.

The code on pastebin actually sorts a "reduced size" 8-byte array. On the first pass only val and idx are needed, and since the array is copied when val2 is added, the first array doesn't need val2 at all. This optimization makes the sort functions copy a smaller structure and also fits more structures into the cache, both of which are good. I was disappointed that this only gives a few percent improvement with qsort. I interpret it as: qsort quickly gets blocks down to a size that fits in the cache.

The same reduced-size strategy gives a more than 25% improvement with heapsort.

Timing for 8-byte structures without val2:

    qsort(data, number-of-elements=40000000, element-size=8)
      Sorting by val             - duration = 16.087761
      Re-order to idx by copying - duration = 2.858881
      Sort in-order data         - duration = 1.888554
      Total duration = 20.835196
      User Time: 20.417285   System Time: 0.402756

    mergesort(data, number-of-elements=40000000, element-size=8)
      Sorting by val             - duration = 22.590726
      Re-order to idx by copying - duration = 2.860935
      Sort in-order data         - duration = 0.577589
      Total duration = 26.029249
      User Time: 25.234369   System Time: 0.779115

    heapsort(data, number-of-elements=40000000, element-size=8)
      Sorting by val             - duration = 52.835870
      Re-order to idx by copying - duration = 2.858543
      Sort in-order data         - duration = 24.660178
      Total duration = 80.354592
      User Time: 79.696220   System Time: 0.549068

WARNING These are single runs. Getting reasonable statistics will require many runs.

+3

Generally speaking, the C++ std::sort, found in <algorithm>, will beat qsort, because it allows the compiler to optimize away the indirect call through a function pointer and makes inlining easier. However, that is only a constant speedup factor; qsort already implements a very fast sorting algorithm.

Note that if you decide to switch to std::sort, your comparison function will need to change: std::sort takes a simpler comparison that returns bool, whereas qsort takes a comparator that returns -1, 0, or 1 depending on the inputs.
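
Side by side, using the struct from the question (a sketch of the two calling conventions, not new benchmark code):

    #include <algorithm>
    #include <cstdlib>

    // qsort style: three-way result through void pointers.
    static int ValOrderingC(const void* a, const void* b) {
        float x = ((const MyStruct*) a)->val;
        float y = ((const MyStruct*) b)->val;
        return (x < y) ? -1 : (x > y) ? 1 : 0;
    }

    // std::sort style: a strict-weak-ordering predicate returning bool.
    static bool ValOrderingCpp(const MyStruct& a, const MyStruct& b) {
        return a.val < b.val;
    }

    // qsort(theElements, totalNum, sizeof(MyStruct), ValOrderingC);
    // std::sort(theElements, theElements + totalNum, ValOrderingCpp);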

+14

When sorting by index, a radix sort can be faster than quicksort. You probably want to do it with a base that is a power of 2 (so you can use bitwise operations instead of modulo).
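
A hedged sketch of an LSD radix sort over the idx field, base 256 so that digit extraction is a shift and a mask (my code, not the answerer's; it assumes idx is non-negative):

    #include <cstring>
    #include <vector>

    // Stable LSD radix sort on idx, 8 bits per pass, 4 passes for a 32-bit int.
    void RadixSortByIdx(MyStruct* a, size_t n) {
        std::vector<MyStruct> tmp(n);
        for (int shift = 0; shift < 32; shift += 8) {
            size_t count[257] = {0};
            for (size_t i = 0; i < n; ++i)
                ++count[((a[i].idx >> shift) & 0xFF) + 1];
            for (int d = 0; d < 256; ++d)   // prefix sums -> bucket starts
                count[d + 1] += count[d];
            for (size_t i = 0; i < n; ++i)
                tmp[count[(a[i].idx >> shift) & 0xFF]++] = a[i];
            std::memcpy(a, &tmp[0], n * sizeof(MyStruct));
        }
    }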

+2

std::sort() should be more than 10% faster. However, two things are needed:

  • Passing a function pointer requires heroics from the compiler to discover that the function can be inlined. A function object with an inline operator() is comparatively easy to inline (see the sketch after this list).
  • In debug mode std::sort() is compiled without optimization, while the qsort() inside the C runtime library is already optimized: try compiling in release mode.
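
A minimal sketch of the functor form, equivalent to the GradientOrdering free function in the question:

    // Function object: operator() is visible at the call site inside
    // std::sort, so the compiler can inline it without heroics.
    struct ByVal {
        bool operator()(const MyStruct& a, const MyStruct& b) const {
            return a.val < b.val;
        }
    };

    // std::sort(theElements, theElements + totalNum, ByVal());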
+2

Right now you are sorting an array of structures, which means every swap copies whole structures (at least two copies plus a temporary). You could instead sort an array of pointers to the structures, which saves a lot of copying (you move only pointers) at the cost of extra memory. Another advantage of sorting an array of pointers is that you can have several of them, each sorted differently, again at the cost of more memory. The extra pointer dereference can be costly, however. You can also try combining the approaches suggested here by others, std::sort over an array of pointers, and see whether that gives a speedup in your case.
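
A sketch of the pointer-array variant, reusing the pointer predicate from Update 2 (the setup loop is mine, not from this answer):

    #include <algorithm>

    // Build the array of pointers once, then sort only the pointers.
    MyStruct** byVal = new MyStruct*[totalNum];
    for (int i = 0; i < totalNum; ++i)
        byVal[i] = &theElements[i];

    // Pointer-sized moves instead of whole-struct moves; the structs never
    // move, so a second, differently-sorted pointer array is cheap to add.
    std::sort(byVal, byVal + totalNum, GradientOrdering);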

+1

All the well-known sorting algorithms are out there, and they are easy to implement. Benchmark them.

Quicksort may not be the fastest in every case, but it is quite effective on average. Still, 40 million records is a lot, and sorting them in 3-4 seconds is not unheard of.

Edit

To summarize my comments: it has been proven that, in the Turing model, comparison-based sorting algorithms are bounded below by Ω(n log n). So, complexity-wise, there is not much room for improvement, but the devil is in the details: to find performance differences between algorithms of equivalent complexity, you have to benchmark them and look at the results.

If, however, you have extra knowledge about your data (for example, that idx lies within a preset, relatively small range), you can use non-comparison algorithms with O(n) complexity. You should still benchmark to make sure the improvement actually shows up for your data, but at this volume the difference between Ω(n log n) and O(n) is likely to be noticeable. Bucket sort is an example of such an algorithm.
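
The idx pass in this question is exactly such a case: idx is a permutation of 0..n-1, so each element's "bucket" is its final slot and the sort degenerates into the direct placement @gbulmer described. A minimal sketch (my naming, not code from the thread):

    #include <cstddef>

    // O(n) re-order: each element's idx is exactly its destination slot.
    void PlaceByIdx(const MyStruct* src, MyStruct* dst, size_t n) {
        for (size_t i = 0; i < n; ++i)
            dst[src[i].idx] = src[i];
    }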

For a fuller list of algorithms and their complexities, look here.

0

Source: https://habr.com/ru/post/911539/

