I was frustrated by the many Python radix sort implementations out there on the Internet.
They consistently use a radix of 10 and get at the digits of the numbers they iterate over by dividing by powers of 10 or taking the log10 of the numbers. This is incredibly inefficient, as log10 is not a particularly quick operation compared to a bit shift, which is nearly 100 times faster!
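As a rough illustration of the difference (a throwaway timeit sketch; the absolute numbers will vary by machine):

    import timeit

    # Pulling out one decimal digit with division, taking a log10, and pulling
    # out one byte with a shift and mask, a million times each.
    div_time = timeit.timeit("(n // 1000) % 10", setup="n = 123456789",
                             number=1000000)
    log_time = timeit.timeit("math.log10(n)",
                             setup="import math; n = 123456789",
                             number=1000000)
    shift_time = timeit.timeit("(n >> 8) & 0xFF", setup="n = 123456789",
                               number=1000000)
    print(div_time)
    print(log_time)
    print(shift_time)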
A much more efficient implementation uses a radix of 256 and sorts the numbers byte by byte. This lets you do all of the "byte processing" with ridiculously fast bitwise operators. Unfortunately, it seems that absolutely nobody has implemented a radix sort in Python that uses bitwise operators instead of logarithms.
So, I took matters into my own hands and came up with this beast, which runs at about half the speed of the built-in sorted on small arrays and nearly as fast on large ones (e.g., a len of around 10,000,000):
    import itertools

    def radix_sort(unsorted):
        "Fast implementation of radix sort for any size num."
        maximum = max(unsorted)

        # The byte length of the largest value tells us how many passes
        # we need to make.
        max_bits = maximum.bit_length()
        highest_byte = max_bits // 8 if max_bits % 8 == 0 else (max_bits // 8) + 1

        # One stable bucket pass per byte, least significant byte first.
        sorted_list = unsorted
        for offset in xrange(highest_byte):
            sorted_list = radix_sort_offset(sorted_list, offset)

        return sorted_list

    def radix_sort_offset(unsorted, offset):
        "Helper function for radix sort, sorts each offset."
        byte_check = (0xFF << offset*8)

        buckets = [[] for _ in xrange(256)]
        for num in unsorted:
            byte_at_offset = (num & byte_check) >> offset*8
            buckets[byte_at_offset].append(num)

        return list(itertools.chain.from_iterable(buckets))
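A quick sanity check, if you want to try it (the sizes here are arbitrary):

    import random

    data = [random.randrange(2 ** 32) for _ in xrange(10000)]
    assert radix_sort(data) == sorted(data)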
This version of radix sort works by determining how many bytes it needs to sort on (if you pass it only integers below 256, it sorts on just one byte, etc.) and then sorting on each byte, starting from the LSB, by throwing the numbers into buckets and simply chaining the buckets back together. Repeat this for each byte that needs sorting, and you have a nicely sorted array in O(n) time per pass, with one pass per byte.
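To make the mechanics concrete, here is what the passes do to a tiny made-up input:

    # Sorting [513, 2, 258, 1] with a radix of 256 (two bytes, two passes):
    #
    # Pass 0 (low byte):  513 -> 0x01, 2 -> 0x02, 258 -> 0x02, 1 -> 0x01
    #   bucket 1: [513, 1]   bucket 2: [2, 258]
    #   chained:  [513, 1, 2, 258]
    #
    # Pass 1 (high byte): 513 -> 0x02, 1 -> 0x00, 2 -> 0x00, 258 -> 0x01
    #   bucket 0: [1, 2]   bucket 1: [258]   bucket 2: [513]
    #   chained:  [1, 2, 258, 513]
    #
    # Each pass is stable, so the order established by earlier (lower)
    # bytes survives the later passes.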
However, it is not as fast as it could be, and I would like to make it faster before I write about it as a better radix sort than all the other versions out there.
Running cProfile on this tells me that a lot of time is being spent in the append method for lists, which makes me think that this block:
    for num in unsorted:
        byte_at_offset = (num & byte_check) >> offset*8
        buckets[byte_at_offset].append(num)
in radix_sort_offset is eating a lot of time. It is also the block that, if you really look at it, does 90% of the work for the whole sort. This code looks like it could be numpy-ized, which I think would give a significant performance boost. Unfortunately, I am not very good with numpy's more complex features, so I could not get all the way there; my rough attempt is below. Help would be greatly appreciated.
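For what it is worth, this sketch is the rough shape I have in mind: it swaps the per-byte bucket pass for a vectorized byte extraction plus a stable argsort. radix_sort_numpy is just a placeholder name, it assumes non-negative ints that fit in int64, and I have not benchmarked it:

    import numpy as np

    def radix_sort_numpy(unsorted):
        "Rough numpy sketch -- assumes non-negative ints fitting in int64."
        arr = np.array(unsorted, dtype=np.int64)
        max_bits = int(arr.max()).bit_length()
        highest_byte = (max_bits + 7) // 8

        for offset in xrange(highest_byte):
            # Extract the current byte of every element in one shot...
            byte_at_offset = (arr >> (8 * offset)) & 0xFF
            # ...then stably reorder by it; a stable sort on the byte plays
            # the same role as appending to 256 buckets in order.
            arr = arr[np.argsort(byte_at_offset, kind='mergesort')]

        return arr.tolist()

If the inner argsort could be replaced with a counting pass (np.bincount plus np.cumsum), I suspect it would be faster still.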
I am currently using itertools.chain.from_iterable to flatten the buckets, but if anyone has a faster suggestion I am sure it would help too.
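For reference, the obvious alternatives I know of (they all build the same flat list; I have not benchmarked them carefully):

    import itertools

    buckets = [[1, 2], [], [3], [4, 5]]   # stand-in for the 256 buckets

    # Current approach:
    flat = list(itertools.chain.from_iterable(buckets))

    # Nested list comprehension:
    flat = [num for bucket in buckets for num in bucket]

    # Repeated extend:
    flat = []
    for bucket in buckets:
        flat.extend(bucket)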
Initially, I had a get_byte function that returned the nth byte of a number, but inlining that code gave me a huge speedup, so I ended up doing that.
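(For the curious, such a helper is just a shift and a mask, something along these lines:)

    def get_byte(num, n):
        "Return the nth byte of num, counting from the LSB."
        return (num >> 8 * n) & 0xFF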
Any other comments on the implementation, or ways to squeeze out more performance, are also welcome. I want to hear anything and everything you have.