Sorting a list of lists by item frequency in Python 2.3

Question

Sorting a list of lists by item frequency in Python 2.3

I have a list with signatures of such elements.

mylist = [ ['ITEM A', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'NO'], ['ITEM B', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'MAYBE'], ['ITEM C', 'YES', 'YES', 'YES', 'YES', 'NO', 'NO', 'MAYBE', 'NO', 'MAYBE'] ]

Now I want to sort the sub-lists under this condition - the more each line (i.e. the sub-list) has the elements 'YES' and 'MAYBE' , the higher it moves up. The more 'NO' each row has, the lower it moves in the sort list.

The perfect result is

 mylist = [ ['ITEM C', 'YES', 'YES', 'YES', 'YES', 'NO', 'NO', 'MAYBE', 'NO', 'MAYBE'], ['ITEM B', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'MAYBE'], ['ITEM A', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'NO'] ] #Item C has 4 'YES' and 2 'MAYBE' #Item B has 3 'YES' and 1 'MAYBE' #Item C has 3 'YES'

Unfortunately, I am stuck on Python 2.3 and you need to find the most efficient way to do this.

+1

python sorting list

GPX Jun 18 '12 at 13:33

source share

2 answers

You can use the cmp parameter to sort by key in Python 2.3 or lower. But sometimes sorting the key style is easier to read; and in any case, it does less work, since cmp will be called O (n log n) times, and the key function will be called only O (n) times.

With this in mind, you can reproduce the behavior of the key parameter in later versions of Python. He uses the hieromate decorate-sort-undecorate, aka Schwartzian Transform . It will not be so effective because it makes copies, but for large lists it is more likely to be more efficient. I called it sorted because it roughly reproduces the sorted function added in 2.4; check the python version and conditionally import it so that you don't break the built-in sorted in newer versions - or just rename it.

 def sorted(seq, key=lambda x: None, reverse=False): seq = [(key(x), i, x) for i, x in enumerate(seq)] seq.sort() if reverse: seq.reverse() return [x for k, i, x in seq]

Note that enumerate is only necessary if you care about stable sorting by unequal values with equal keys; It slows down hair function. Tested according to your data:

 >>> key=lambda x: (x.count('YES'), x.count('MAYBE'), x.count('NO')) >>> my_sorted(mylist, key=key, reverse=True) [['ITEM C', 'YES', 'YES', 'YES', 'YES', 'NO', 'NO', 'MAYBE', 'NO', 'MAYBE'], ['ITEM B', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'MAYBE'], ['ITEM A', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'NO']]

You can also use a dictionary to count; Thus, only one pass is required. However, count is optimized enough that three passes are still faster than one Python for loop, at least on my machine. So use this only if you need to count a lot of values. I will leave this here for posterity:

 def my_key(inner_list): counts = {'YES':0, 'MAYBE':0, 'NO':0} for i in inner_list: if i in counts: counts[i] += 1 return (counts['YES'], counts['MAYBE'], counts['NO'])

I did some testing; apologies for the long post. The following is only curious and inquisitive.

My tests show that decorating, sorting, undecorate in a smaller list is already faster than using the built-in + cmp sorting. On a larger list, the difference becomes more dramatic. Definitions:

 def key_count(x): return (x.count('YES'), x.count('MAYBE'), x.count('NO')) def key_dict(inner_list): counts = {'YES':0, 'MAYBE':0, 'NO':0} for i in inner_list: if i in counts: counts[i] += 1 return (counts['YES'], counts['MAYBE'], counts['NO']) def decorate_sort(seq, key=lambda x: None, reverse=False): seq = [(key(x), i, x) for i, x in enumerate(seq)] seq.sort() if reverse: seq.reverse() return [x for k, i, x in seq] def builtin_sort(seq, key, reverse=False): seq.sort(lambda p, q: cmp(key(p), key(q))) if reverse: seq.reverse()

Tests:

 >>> mylist = [ ... ['ITEM A', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'NO'], ... ['ITEM B', 'YES', 'NO', 'YES', 'YES', 'NO', 'NO', 'NO', 'NO', 'MAYBE'], ... ['ITEM C', 'YES', 'YES', 'YES', 'YES', 'NO', 'NO', 'MAYBE', 'NO', 'MAYBE'] ... ] >>> %timeit decorate_sort(mylist, key=key_count, reverse=True) 100000 loops, best of 3: 5.03 us per loop >>> %timeit builtin_sort(mylist, key=key_count, reverse=True) 100000 loops, best of 3: 5.28 us per loop

The built-in version is already slower! A less generalized version of mylist.sort(lambda p, q: -cmp(key(p), key(q))) better for short list hair by adding enumerate to decorate_sort ; without it, decorate_sort works faster (4.28 us per cycle in my previous test):

 >>> %timeit mylist.sort(lambda p, q: -cmp(key_count(p), key_count(q))) 100000 loops, best of 3: 4.74 us per loop

Using key_dict is an error in this case:

 >>> %timeit decorate_sort(mylist, key=key_dict, reverse=True) 100000 loops, best of 3: 8.97 us per loop >>> %timeit builtin_sort(mylist, key=key_dict, reverse=True) 100000 loops, best of 3: 11.4 us per loop

By checking this on a larger list, basically the same results are saved:

 >>> import random >>> mylist = [[random.choice(('YES', 'MAYBE', 'NO')) for _ in range(1000)] for _ in range(100)] >>> %timeit decorate_sort(mylist, key=key_count, reverse=True) 100 loops, best of 3: 6.93 ms per loop >>> %timeit builtin_sort(mylist, key=key_count, reverse=True) 10 loops, best of 3: 34.5 ms per loop

A less generalized version is now slower than decorate_sort .

 >>> %timeit mylist.sort(lambda p, q: -cmp(key_count(p), key_count(q))) 100 loops, best of 3: 13.5 ms per loop

And key_dict is still slower. (But faster than builtin_sort !)

 >>> %timeit decorate_sort(mylist, key=key_dict, reverse=True) 10 loops, best of 3: 20.4 ms per loop >>> %timeit builtin_sort(mylist, key=key_dict, reverse=True) 10 loops, best of 3: 103 ms per loop

So, the result is that the Schwartz transformation provides a solution that is faster and more generalized — a rare and wonderful combination.

+3

senderle Jun 18 '12 at 14:38

source share

ecatmur · Accepted Answer · 2012-06-18T13:39:58+0000

General solution: use list.sort with a key function that returns a tuple:

 mylist.sort(key=lambda sl: (sl.count('YES') + sl.count('MAYBE'), -sl.count('NO')), reverse=True)

key and reverse were added in Python 2.4, so you have to do it manually:

 key = lambda sl: (sl.count('YES') + sl.count('MAYBE'), -sl.count('NO')) mylist.sort(lambda p, q: -cmp(key(p), key(q)))

If key slow, it is better to use a solution that calculates the key function for each element only once (the so-called " Schwartz transform "). Note that> = Python 2.4 performs this optimization (or similar) already:

 def key_sort(seq, cmp=None, key=None, reverse=False): if key is not None: transform = [(key(x), i, x) for i, x in enumerate(seq)] transform.sort(None if cmp is None else lambda (k, _, _), (l, _, _): cmp(k, l)) seq[:] = [x for _, _, x in transform] else: seq.sort(cmp) if reverse: seq.reverse()

Sorting a list of lists by item frequency in Python 2.3

More articles: