Pushing Radix Sort (and python) to its limits

I have been pretty frustrated by the many Python radix sort implementations out there on the Internet.

They consistently use a radix of 10 and get at the digits of the numbers they iterate over by dividing by powers of 10 or taking log10 of the numbers. This is incredibly inefficient, since log10 is not a particularly fast operation compared to a bit shift, which is nearly 100 times faster!
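To make the contrast concrete, here is a toy illustration (not code from the original post) of extracting a "digit" both ways:

    import math

    n = 123456789
    # Decimal radix: digit count via log10, digit extraction via divide + mod.
    num_digits = int(math.log10(n)) + 1    # 9
    hundreds_digit = (n // 100) % 10       # 7
    # Binary radix: the same questions answered with a shift and a mask.
    num_bytes = (n.bit_length() + 7) // 8  # 4
    byte_2 = (n >> 16) & 0xFF              # 0x5B, the byte at offset 2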

A much more efficient implementation uses a radix of 256 and sorts the numbers byte by byte. This lets you do all of the "byte processing" with ridiculously fast bitwise operators. Unfortunately, it seems that absolutely no one has implemented a radix sort in Python that uses bitwise operators instead of logarithms.

So I took matters into my own hands and came up with this beast, which runs at about half the speed of sorted on small arrays and nearly as fast on large ones (e.g. len of about 10,000,000):

    import itertools

    def radix_sort(unsorted):
        "Fast implementation of radix sort for any size num."
        maximum, minimum = max(unsorted), min(unsorted)

        max_bits = maximum.bit_length()
        highest_byte = max_bits // 8 if max_bits % 8 == 0 else (max_bits // 8) + 1
        min_bits = minimum.bit_length()
        lowest_byte = min_bits // 8 if min_bits % 8 == 0 else (min_bits // 8) + 1

        sorted_list = unsorted
        for offset in xrange(lowest_byte, highest_byte):
            sorted_list = radix_sort_offset(sorted_list, offset)
        return sorted_list

    def radix_sort_offset(unsorted, offset):
        "Helper function for radix sort, sorts each offset."
        byte_check = (0xFF << offset*8)
        buckets = [[] for _ in xrange(256)]
        for num in unsorted:
            byte_at_offset = (num & byte_check) >> offset*8
            buckets[byte_at_offset].append(num)
        return list(itertools.chain.from_iterable(buckets))

This version of radix sort works by determining which bytes it needs to sort by (if you pass it only integers below 256, it sorts just one byte, etc.), then sorting on each byte from the LSB up by throwing the numbers into buckets in order and simply chaining the buckets back together. Repeat this for each byte that needs sorting and you have a nicely sorted array in O(n) time.

However, it is not as fast as it could be, and I would like to make it faster before I write about it as a better radix sort than all the other versions out there.

Running cProfile on this tells me that a lot of time is being spent on the append method for lists, which makes me think that this block:

    for num in unsorted:
        byte_at_offset = (num & byte_check) >> offset*8
        buckets[byte_at_offset].append(num)

in radix_sort_offset is eating a lot of time. This is also the block that, if you really look at it, does 90% of the work for the whole sort. This code looks like it could be numpy-ized, which I think would yield a significant performance gain. Unfortunately, I'm not very good with numpy's more complex features, so I haven't been able to figure it out. Help would be very much appreciated.
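For readers wondering what a numpy-ized version of that block might look like, here is one possible sketch (not from the original post; radix_sort_numpy is an illustrative name). It replaces the explicit 256 buckets with a stable argsort on each byte key, which plays the same role:

    import numpy as np

    def radix_sort_numpy(unsorted):
        "Sketch of a vectorized LSD radix sort; assumes non-negative ints."
        a = np.asarray(unsorted, dtype=np.uint64)
        if a.size == 0:
            return []
        n_bytes = (int(a.max()).bit_length() + 7) // 8
        for offset in range(n_bytes):
            key = (a >> np.uint64(8 * offset)) & np.uint64(0xFF)
            # A stable sort preserves the order established by earlier,
            # less significant passes, which is what makes LSD radix correct.
            a = a[np.argsort(key, kind="stable")]
        return a.tolist()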

I'm currently using itertools.chain.from_iterable to flatten the buckets, but if anyone has a faster suggestion, I'm sure that would help too.
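For comparison, the obvious alternative is a plain nested comprehension; whether it actually beats chain.from_iterable is something to measure rather than assume:

    flat = [num for bucket in buckets for num in bucket]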

Originally I had a get_byte function that returned the nth byte of a number, but inlining that code gave me a huge speedup, so I did exactly that.
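That helper isn't shown in the post, but it was presumably something along these lines (a reconstruction for illustration, not the original code); each call costs a Python function-call dispatch per element, which is what inlining avoids:

    def get_byte(num, n):
        # nth byte of num, counting from the least significant byte
        return (num >> (8 * n)) & 0xFF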

Any other implementation comments or ways to squeeze out more performance are also welcome. I want to hear anything and everything you've got.

+6
3 answers

You've already figured out that

    for num in unsorted:
        byte_at_offset = (num & byte_check) >> offset*8
        buckets[byte_at_offset].append(num)

is where most of the time goes. Good ;-)

There are two standard tricks for speeding up this kind of thing, both to do with hoisting invariants out of the loop:

  • Compute offset * 8 outside the loop and store it in a local variable. That saves a multiplication per iteration.
  • Build bucketappender = [bucket.append for bucket in buckets] outside the loop. That saves a method lookup per iteration.

Combine them and the loop looks like this:

    for num in unsorted:
        bucketappender[(num & byte_check) >> ofs8](num)

Folding it all into a single statement also saves a pair of local variable store/fetch opcodes per iteration.
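Put together, the whole helper might look like this (a sketch assembling the two tricks, not verbatim code from this answer):

    import itertools

    def radix_sort_offset(unsorted, offset):
        "Helper with both loop invariants hoisted out of the hot loop."
        ofs8 = offset * 8                  # multiply once, not per element
        byte_check = 0xFF << ofs8
        buckets = [[] for _ in range(256)]
        bucketappender = [bucket.append for bucket in buckets]  # hoisted lookups
        for num in unsorted:
            bucketappender[(num & byte_check) >> ofs8](num)
        return list(itertools.chain.from_iterable(buckets))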

But, at a higher level, the standard way to speed up a radix sort is to use a larger radix. What's magical about 256? Nothing, apart from being convenient for bit-shifting. But so are 512, 1024, 2048... it's the classic time/space tradeoff.
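To make the tradeoff concrete, here is a hypothetical generalization to any power-of-two radix (radix_sort_bits is an illustrative name, not code from this answer): bits=8 reproduces the radix-256 version, while bits=11 gives radix 2048, meaning fewer passes over the data in exchange for more, and mostly emptier, buckets:

    import itertools

    def radix_sort_bits(unsorted, bits=11):
        "Sketch of an LSD radix sort with radix 2**bits; non-negative ints."
        if not unsorted:
            return unsorted
        radix = 1 << bits
        mask = radix - 1
        passes = (max(unsorted).bit_length() + bits - 1) // bits
        sorted_list = unsorted
        for p in range(passes):
            shift = p * bits
            buckets = [[] for _ in range(radix)]
            appenders = [bucket.append for bucket in buckets]
            for num in sorted_list:
                appenders[(num >> shift) & mask](num)
            sorted_list = list(itertools.chain.from_iterable(buckets))
        return sorted_list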

PS: for very large numbers

    (num >> offset*8) & 0xff

will run faster. That's because your num & byte_check takes time proportional to log(num): it generally has to create an integer about as big as num.

+9

You could simply use one of the existing C or C++ implementations, such as integer_sort from Boost.Sort or u4_sort from usort. It is surprisingly easy to call native C or C++ code from Python; see How to sort an array of integers faster than quicksort?
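For what it's worth, the glue can be as small as this ctypes sketch. It assumes you have compiled usort into a shared library exposing a function shaped like void u4_sort(unsigned int *a, long n); the library path and the exact signature here are assumptions, not verified against usort itself:

    import array
    import ctypes

    usort = ctypes.CDLL("./libusort.so")   # assumed name/path of your build
    usort.u4_sort.argtypes = [ctypes.POINTER(ctypes.c_uint), ctypes.c_long]
    usort.u4_sort.restype = None

    def u4_sort_list(values):
        "Sort a list of uint32 values with the native routine."
        buf = array.array("I", values)     # contiguous C array of uint32
        c_arr = (ctypes.c_uint * len(buf)).from_buffer(buf)  # shares memory
        usort.u4_sort(c_arr, len(buf))
        return buf.tolist()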

I'm quite disappointed that, although more than two years have passed, numpy still has no radix sort. I will let the NumPy developers know that they could simply grab one of the existing implementations; licensing should not be an issue.

0

This is an old thread, but I came across it when looking to radix sort an array of positive integers. I was trying to see whether I could do any better than the already viciously fast timsort (hats off to you again, Tim Peters) that implements Python's sort and sorted! Either I don't understand some aspect of the code above, or, if I do, the code as presented has some problems IMHO.

  • It only sorts bytes from the highest byte of the smallest element through the highest byte of the largest element. This may be fine for some special cases of data, but in general the approach cannot distinguish elements that differ only in their less significant bits. For instance:

     arr = [65535, 65534]
     radix_sort(arr)

    produces the wrong output:

     [65535, 65534]
  • The range used to loop over the helper function is wrong. I mean, if lowest_byte and highest_byte are the same, the helper function is skipped entirely. BTW, I had to change xrange to range in two places.

  • With changes addressing the above two points, I got it to work. But it takes 10-20 times longer than Python's sort or sorted! I know timsort is very efficient and exploits already-sorted runs in the data, but I was trying to see whether I could use the prior knowledge that my data consists of positive integers to some advantage in my sorting. Why is the radix sort doing so poorly compared to timsort? The array sizes I used are around 80,000 elements. Is it because the timsort implementation, besides its algorithmic efficiency, also has other advantages, such as running as compiled low-level code? Or am I missing something? The modified code I used is below, and a rough timing harness follows it:

     import itertools

     def radix_sort(unsorted):
         "Fast implementation of radix sort for any size num."
         maximum, minimum = max(unsorted), min(unsorted)

         max_bits = maximum.bit_length()
         highest_byte = max_bits // 8 if max_bits % 8 == 0 else (max_bits // 8) + 1
         # min_bits = minimum.bit_length()
         # lowest_byte = min_bits // 8 if min_bits % 8 == 0 else (min_bits // 8) + 1

         sorted_list = unsorted
         # xrange changed to range, lowest_byte deleted from the arguments
         for offset in range(highest_byte):
             sorted_list = radix_sort_offset(sorted_list, offset)
         return sorted_list

     def radix_sort_offset(unsorted, offset):
         "Helper function for radix sort, sorts each offset."
         byte_check = (0xFF << offset*8)
         # xrange changed to range
         buckets = [[] for _ in range(256)]
         for num in unsorted:
             byte_at_offset = (num & byte_check) >> offset*8
             buckets[byte_at_offset].append(num)
         return list(itertools.chain.from_iterable(buckets))
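For reference, a rough harness along these lines reproduces the kind of comparison described above (the array size and repeat count are arbitrary choices, not the original poster's exact benchmark):

     import random
     import timeit

     data = [random.randrange(1 << 32) for _ in range(80000)]
     t_radix = timeit.timeit(lambda: radix_sort(data), number=5)
     t_builtin = timeit.timeit(lambda: sorted(data), number=5)
     print("radix_sort: %.3f s   sorted: %.3f s" % (t_radix, t_builtin))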
0
