How to optimally count items in a python list

Question

How to optimally count items in a python list

This is almost the same question as here , except that I am asking about the most efficient solution for a sorted result.

I have a list (about 10 integers randomly between 0 and 12), for example:

the_list = [5, 7, 6, 5, 5, 4, 4, 7, 5, 4]

I want to create a function that returns a list of tuples (item, counts) ordered by the first element, e.g.

 output = [(4, 3), (5, 4), (6, 1), (7, 2)]

So far I have used:

 def dupli(the_list): return [(item, the_list.count(item)) for item in sorted(set(the_list))]

But I call this function almost a millionth time, and I need to do it as fast as I (python). So my question is: How to make this function less time? (what about memory?)

I played a little, but nothing obvious appeared:

 from timeit import Timer as T number=10000 setup = "the_list=[5, 7, 6, 5, 5, 4, 4, 7, 5, 4]" stmt = "[(item, the_list.count(item)) for item in sorted(set(the_list))]" T(stmt=stmt, setup=setup).timeit(number=number) Out[230]: 0.058799982070922852 stmt = "L = []; \nfor item in sorted(set(the_list)): \n L.append((item, the_list.count(item)))" T(stmt=stmt, setup=setup).timeit(number=number) Out[233]: 0.065041065216064453 stmt = "[(item, the_list.count(item)) for item in set(sorted(the_list))]" T(stmt=stmt, setup=setup).timeit(number=number) Out[236]: 0.098351955413818359

thanks
Christophe

+4

performance python list count

Christophe Dec 16 '10 at 1:35

source share

6 answers

I would try:

 from collections import defaultdict output = defaultdict(lambda: 0) for item in the_list: output[item] += 1 return sorted(output.items())

+3

Karl Knechtel Dec 16 '10 at 1:50

source share

Using the qualification "from 0 to 12":

 >>> the_list = [5, 7, 6, 5, 5, 4, 4, 7, 5, 4] >>> answer1 = [0] * 13 >>> for i in the_list: ... answer1[i] += 1 ... >>> answer1 [0, 0, 0, 0, 3, 4, 1, 2, 0, 0, 0, 0, 0] >>> # You might be able to use that as-is: ... >>> for i, v in enumerate(answer1): ... if v: print i, v ... 4 3 5 4 6 1 7 2 >>> # Otherwise you can build the list that you specified: ... >>> answer2 = [(i, v) for i, v in enumerate(answer1) if v] >>> answer2 [(4, 3), (5, 4), (6, 1), (7, 2)] >>>

+2

John machin Dec 16 '10 at 2:18

source share

It might be faster to write your own function that counts numbers in a single pass through the list. You call the count function for each number in the set, and each of these calls requires passing through a list.

 counts = {} for n in the_list: if n not in counts: counts[n] = 0 counts[n] += 1 sorted(counts.items())

0

Colin Dec 16 '10 at 1:52

source share

This seems pretty optimal in terms of space and speed:

 def dupli2(list_): dict_ = {} for item in list_: dict_[item] = dict_.get(item, 0) + 1 return sorted(dict_.items())

Or that:

 def dupli3(list_): last = None list_ = sorted(list_) i = 0 for item in list_: if item != last and last is not None: yield last, ii = 0 i += 1 last = item yield last, i

Not sure about the speed. For this, I recommend that you either do it in C or use Psyco;)

With Psyco:

 In [33]: %timeit list(dupli3(test.the_list)) 100000 loops, best of 3: 6.46 us per loop In [34]: %timeit list(dupli2(test.the_list)) 100000 loops, best of 3: 2.37 us per loop In [35]: %timeit list(dupli(test.the_list)) 100000 loops, best of 3: 2.7 us per loop

0

Wolph Dec 16 '10 at 1:52

source share

itertools.groupby is perfect for this:

 >>> from itertools import groupby >>> the_list = [5, 7, 6, 5, 5, 4, 4, 7, 5, 4] >>> gb = groupby(sorted(the_list)) >>> print [(i,len(list(j))) for i,j in gb] [(4, 3), (5, 4), (6, 1), (7, 2)]

0

Paulmcg Dec 16 '10 at 9:37

source share

Steven rumbalski · Accepted Answer · 2010-12-16T05:17:42+0000

Change the sorting location to save about 20%.

Change this:

 def dupli(the_list): return [(item, the_list.count(item)) for item in sorted(set(the_list))]

For this:

 def dupli(the_list): count = the_list.count # this optimization added courtesy of Sven comment result = [(item, count(item)) for item in set(the_list)] result.sort() return result

The reason this happens faster is because the sorted iterator sorted to create a temporary list, while sorting the results is sorted in place.

edit Here's a different approach, which is 35% faster than your original:

 def dupli(the_list): counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] for n in the_list: counts[n] += 1 return [(i, counts[i]) for i in (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) if counts[i]]

Note. You can randomize the values for the_list . My final version of dupli tested even faster with other random data sets ( import random; the_list=[random.randint(0,12) for i in xrange(10)] )

How to optimally count items in a python list

More articles: