How to use numpy to get the total score for unique values in linear time?

Question

How to use numpy to get the total score for unique values in linear time?

Consider the following short_list and long_list

 short_list = list('aaabaaacaaadaaac') np.random.seed([3,1415]) long_list = pd.DataFrame( np.random.choice(list(ascii_letters), (10000, 2)) ).sum(1).tolist()

How to calculate the total score by a unique value?

I want to use numpy and do it in linear time. I want this to compare timings with my other methods. This may be easiest to illustrate with my first proposed solution.

 def pir1(l): s = pd.Series(l) return s.groupby(s).cumcount().tolist() print(np.array(short_list)) print(pir1(short_list)) ['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c'] [0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]

I tried myself trying to use np.unique because it returns an array of counts, an inverse array and an index array. I was sure that I could solve it. The best I got is in pir4 below the scale in quadratic time. Also note that I don't care if the count starts at 1 or 0, since we can just add or subtract 1.

Below are some of my attempts (none of which answer my question)

 %%cython from collections import defaultdict def get_generator(l): counter = defaultdict(lambda: -1) for i in l: counter[i] += 1 yield counter[i] def pir2(l): return [i for i in get_generator(l)]

 def pir3(l): return [i for i in get_generator(l)] def pir4(l): unq, inv = np.unique(l, 0, 1, 0) a = np.arange(len(unq)) matches = a[:, None] == inv return (matches * matches.cumsum(1)).sum(0).tolist()

+6

python numpy pandas

piRSquared Nov 15 '16 at 4:51

source share

3 answers

In addition to defaultdict there are several more counters. Testing a slightly simpler case:

 In [298]: from collections import defaultdict In [299]: from collections import defaultdict, Counter In [300]: def foo(l): ...: counter = defaultdict(int) ...: for i in l: ...: counter[i] += 1 ...: return counter ...: In [301]: short_list = list('aaabaaacaaadaaac') In [302]: foo(short_list) Out[302]: defaultdict(int, {'a': 12, 'b': 1, 'c': 2, 'd': 1}) In [303]: Counter(short_list) Out[303]: Counter({'a': 12, 'b': 1, 'c': 2, 'd': 1}) In [304]: arr=[ord(i)-ord('a') for i in short_list] In [305]: np.bincount(arr) Out[305]: array([12, 1, 2, 1], dtype=int32)

I built arr because bincount only works with ints.

 In [306]: timeit np.bincount(arr) The slowest run took 82.46 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 5.63 µs per loop In [307]: timeit Counter(arr) 100000 loops, best of 3: 13.6 µs per loop In [308]: timeit foo(arr) 100000 loops, best of 3: 6.49 µs per loop

I suppose it would be difficult to improve pir2 based on default_dict.

Searching and counting like this is not a strong area for numpy .

+4

hpaulj Nov 15 '16 at 5:16

source share

Customization

 short_list = np.array(list('aaabaaacaaadaaac'))

Functions

dfill takes an array and returns the positions at which the array changes and repeats that index position until the next change.

 # dfill # # Example with short_list # # 0 0 0 3 4 4 4 7 8 8 8 11 12 12 12 15 # [ aaabaaacaaadaaac] # # Example with short_list after sorting # # 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15 # [ aaaaaaaaaaaabccd]

argunsort returns the permutation needed to undo the sort given the argsort array. The existence of this method became known to me through this post. . With this, I can get the argsort array and sort the array with it. Then I can cancel the sorting without the extra cost of sorting.

cumcount will take an array, sort it, find a dfill array. np.arange less dfill will give me a cumulative score. Then i'm un-sort

 # cumcount # # Example with short_list # # short_list: # [ aaabaaacaaadaaac] # # short_list.argsort(): # [ 0 1 2 4 5 6 8 9 10 12 13 14 3 7 15 11] # # Example with short_list after sorting # # short_list[short_list.argsort()]: # [ aaaaaaaaaaaabccd] # # dfill(short_list[short_list.argsort()]): # [ 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15] # # np.range(short_list.size): # [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] # # np.range(short_list.size) - # dfill(short_list[short_list.argsort()]): # [ 0 1 2 3 4 5 6 7 8 9 10 11 0 0 1 0] # # unsorted: # [ 0 1 2 0 3 4 5 0 6 7 8 0 9 10 11 1]

foo function recommended by @hpaulj with defaultdict
div function recommended by @Divakar (old, I'm sure it updated it)

the code

 def dfill(a): n = a.size b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]]) return np.arange(n)[b[:-1]].repeat(np.diff(b)) def argunsort(s): n = s.size u = np.empty(n, dtype=np.int64) u[s] = np.arange(n) return u def cumcount(a): n = a.size s = a.argsort(kind='mergesort') i = argunsort(s) b = a[s] return (np.arange(n) - dfill(b))[i] def foo(l): n = len(l) r = np.empty(n, dtype=np.int64) counter = defaultdict(int) for i in range(n): counter[l[i]] += 1 r[i] = counter[l[i]] return r - 1 def div(l): a = np.unique(l, return_counts=1)[1] idx = a.cumsum() id_arr = np.ones(idx[-1],dtype=int) id_arr[0] = 0 id_arr[idx[:-1]] = -a[:-1]+1 rng = id_arr.cumsum() return rng[argunsort(np.argsort(l))]

Demonstration

 cumcount(short_list) array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])

time testing

the code

 functions = pd.Index(['cumcount', 'foo', 'foo2', 'div'], name='function') lengths = pd.RangeIndex(100, 1100, 100, 'array length') results = pd.DataFrame(index=lengths, columns=functions) from string import ascii_letters for i in lengths: a = np.random.choice(list(ascii_letters), i) for j in functions: results.set_value( i, j, timeit( '{}(a)'.format(j), 'from __main__ import a, {}'.format(j), number=1000 ) ) results.plot()

+4

piRSquared Jan 9 '17 at 10:37

source share

Divakar · Accepted Answer · 2016-11-15T08:32:56+0000

Here we use a vector approach using the function to create an arbitrary range grouping and np.unique to obtain samples -

 def grp_range(a): idx = a.cumsum() id_arr = np.ones(idx[-1],dtype=int) id_arr[0] = 0 id_arr[idx[:-1]] = -a[:-1]+1 return id_arr.cumsum() count = np.unique(A,return_counts=1)[1] out = grp_range(count)[np.argsort(A).argsort()]

Run Example -

 In [117]: A = list('aaabaaacaaadaaac') In [118]: count = np.unique(A,return_counts=1)[1] ...: out = grp_range(count)[np.argsort(A).argsort()] ...: In [119]: out Out[119]: array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])

To get count you can offer several alternatives with a focus on performance -

 np.bincount(np.unique(A,return_inverse=1)[1]) np.bincount(np.fromstring('aaabaaacaaadaaac',dtype=np.uint8)-97)

Also, with A containing single-letter characters, we could get the score simply by using <

 np.bincount(np.array(A).view('uint8')-97)

How to use numpy to get the total score for unique values ​​in linear time?

Customization

Functions

the code

Demonstration

time testing

the code

More articles:

How to use numpy to get the total score for unique values in linear time?