The data set is huge compared to the cache, so it will be limited by the cache.
Using indirectness will make it worse, because there is a cache for pointers, and access to memory is done in a more random order, that is, a comparison is not with neighbors. The program works against any prefetching mechanisms in the CPU
Consider dividing a structure into two structures in two arrays.
As an experiment, compare passage 1 with passage 1, where the struct is only { float val; int idx; };
{ float val; int idx; };
If it's cache and bandwidth, that should make a big difference.
If cache locality is a key issue, it might be worth considering multi-user merges or Shell sort; something to improve the terrain.
Try sorting the subsets of the cache size of the records, then do multi-threaded merges (maybe you should look at the specification of the processor cache manager to find out if this number of prefetch streams is trying to predict, again, reducing the size of the data sets by reducing the size of the structure transferred from RAM may be q winner.
How is the idx field generated? It looks like this is the original position in the array. Is this the index of the source record?
If so, just select the second array and copy first to the second:
struct { float val; float val2; int idx } sortedByVal[40000000]; struct { float val; float val2 } sortedbyIdx[40000000]; for (int i=0; i<40000000; ++i) { sortedbyIdx[sortedByVal[i].idx].val = sortedByVal[i].val; sortedbyIdx[sortedByVal[i].idx].val2 = sortedByVal[i].val2; }
There is no second grade. If so, combine the highlighting of the val2 value with this pass.
Edit
I was curious about the relative performance, so I wrote a program to compare the sorting functions of the "library" C, qsort, mergesort, heapsort, as well as comparing sorting with idx with copy to idx. It also re-sorts the sorted values ββto get some information about it. This is also interesting. I have not implemented or tested Shell, which often types qsort in practice.
The program uses command line options to select which type of sort, and sort by idx or just copy. Code: http://pastebin.com/Ckc4ixNp
Jitter at runtime is pretty clear. I had to use the processor clock, do many runs and show the best results, but this is an βexercise for the reader."
I ran this on an old 2.2GHz MacBook Pro Intel Core 2 Duo. Depends on OS C for some time.
Timing (slightly reformatted):
qsort(data, number-of-elements=40000000, element-size=12) Sorting by val - duration = 16.304194 Re-order to idx by copying - duration = 2.904821 Sort in-order data - duration = 2.013237 Total duration = 21.222251 User Time: 20.754574 System Time: 0.402959 mergesort(data, number-of-elements=40000000, element-size=12) Sorting by val - duration = 25.948651 Re-order to idx by copying - duration = 2.907766 Sort in-order data - duration = 0.593022 Total duration = 29.449438 User Time: 28.428954 System Time: 0.973349 heapsort(data, number-of-elements=40000000, element-size=12) Sorting by val - duration = 72.236463 Re-order to idx by copying - duration = 2.899309 Sort in-order data - duration = 28.619173 Total duration = 103.754945 User Time: 103.107129 System Time: 0.564034
WARNING These are single runs. Getting reasonable statistics will require many runs.
The code in pastebin actually sorts the "reduced size", an 8-byte array. On the first pass, only val and idx are needed, and as the array is copied when val2 is added, the first array does not need val2. This optimization makes the sort functions copy a smaller structure, and also embed more structures in the cache, which are good. I was disappointed that this gives a few percent improvement in qsort. I interpret it like this: qsort quickly gets the sort of blocks to the size that fits in the cache.
The same reduced size strategy gives a more than 25% improvement in heapsort.
Timing for 8-byte structures without val2:
qsort(data, number-of-elements=40000000, element-size=8) Sorting by val - duration = 16.087761 Re-order to idx by copying - duration = 2.858881 Sort in-order data - duration = 1.888554 Total duration = 20.835196 User Time: 20.417285 System Time: 0.402756 mergesort(data, number-of-elements=40000000, element-size=8) Sorting by val - duration = 22.590726 Re-order to idx by copying - duration = 2.860935 Sort in-order data - duration = 0.577589 Total duration = 26.029249 User Time: 25.234369 System Time: 0.779115 heapsort(data, number-of-elements=40000000, element-size=8) Sorting by val - duration = 52.835870 Re-order to idx by copying - duration = 2.858543 Sort in-order data - duration = 24.660178 Total duration = 80.354592 User Time: 79.696220 System Time: 0.549068
WARNING These are single runs. Getting reasonable statistics will require many runs.