How to use numpy to get the total score for unique values ​​in linear time?

Consider the following short_list and long_list

 short_list = list('aaabaaacaaadaaac') np.random.seed([3,1415]) long_list = pd.DataFrame( np.random.choice(list(ascii_letters), (10000, 2)) ).sum(1).tolist() 

How to calculate the total score by a unique value?

I want to use numpy and do it in linear time. I want this to compare timings with my other methods. This may be easiest to illustrate with my first proposed solution.

 def pir1(l): s = pd.Series(l) return s.groupby(s).cumcount().tolist() print(np.array(short_list)) print(pir1(short_list)) ['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c'] [0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1] 

I tried myself trying to use np.unique because it returns an array of counts, an inverse array and an index array. I was sure that I could solve it. The best I got is in pir4 below the scale in quadratic time. Also note that I don't care if the count starts at 1 or 0, since we can just add or subtract 1.

Below are some of my attempts (none of which answer my question)

 %%cython from collections import defaultdict def get_generator(l): counter = defaultdict(lambda: -1) for i in l: counter[i] += 1 yield counter[i] def pir2(l): return [i for i in get_generator(l)] 

 def pir3(l): return [i for i in get_generator(l)] def pir4(l): unq, inv = np.unique(l, 0, 1, 0) a = np.arange(len(unq)) matches = a[:, None] == inv return (matches * matches.cumsum(1)).sum(0).tolist() 

enter image description here

enter image description here

+6
source share
3 answers

Here we use a vector approach using the function to create an arbitrary range grouping and np.unique to obtain samples -

 def grp_range(a): idx = a.cumsum() id_arr = np.ones(idx[-1],dtype=int) id_arr[0] = 0 id_arr[idx[:-1]] = -a[:-1]+1 return id_arr.cumsum() count = np.unique(A,return_counts=1)[1] out = grp_range(count)[np.argsort(A).argsort()] 

Run Example -

 In [117]: A = list('aaabaaacaaadaaac') In [118]: count = np.unique(A,return_counts=1)[1] ...: out = grp_range(count)[np.argsort(A).argsort()] ...: In [119]: out Out[119]: array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]) 

To get count you can offer several alternatives with a focus on performance -

 np.bincount(np.unique(A,return_inverse=1)[1]) np.bincount(np.fromstring('aaabaaacaaadaaac',dtype=np.uint8)-97) 

Also, with A containing single-letter characters, we could get the score simply by using <

 np.bincount(np.array(A).view('uint8')-97) 
+4
source

In addition to defaultdict there are several more counters. Testing a slightly simpler case:

 In [298]: from collections import defaultdict In [299]: from collections import defaultdict, Counter In [300]: def foo(l): ...: counter = defaultdict(int) ...: for i in l: ...: counter[i] += 1 ...: return counter ...: In [301]: short_list = list('aaabaaacaaadaaac') In [302]: foo(short_list) Out[302]: defaultdict(int, {'a': 12, 'b': 1, 'c': 2, 'd': 1}) In [303]: Counter(short_list) Out[303]: Counter({'a': 12, 'b': 1, 'c': 2, 'd': 1}) In [304]: arr=[ord(i)-ord('a') for i in short_list] In [305]: np.bincount(arr) Out[305]: array([12, 1, 2, 1], dtype=int32) 

I built arr because bincount only works with ints.

 In [306]: timeit np.bincount(arr) The slowest run took 82.46 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 5.63 Β΅s per loop In [307]: timeit Counter(arr) 100000 loops, best of 3: 13.6 Β΅s per loop In [308]: timeit foo(arr) 100000 loops, best of 3: 6.49 Β΅s per loop 

I suppose it would be difficult to improve pir2 based on default_dict.

Searching and counting like this is not a strong area for numpy .

+4
source

Customization

 short_list = np.array(list('aaabaaacaaadaaac')) 

Functions

  • dfill takes an array and returns the positions at which the array changes and repeats that index position until the next change.

     # dfill # # Example with short_list # # 0 0 0 3 4 4 4 7 8 8 8 11 12 12 12 15 # [ aaabaaacaaadaaac] # # Example with short_list after sorting # # 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15 # [ aaaaaaaaaaaabccd] 
  • argunsort returns the permutation needed to undo the sort given the argsort array. The existence of this method became known to me through this post. . With this, I can get the argsort array and sort the array with it. Then I can cancel the sorting without the extra cost of sorting.
  • cumcount will take an array, sort it, find a dfill array. np.arange less dfill will give me a cumulative score. Then i'm un-sort

     # cumcount # # Example with short_list # # short_list: # [ aaabaaacaaadaaac] # # short_list.argsort(): # [ 0 1 2 4 5 6 8 9 10 12 13 14 3 7 15 11] # # Example with short_list after sorting # # short_list[short_list.argsort()]: # [ aaaaaaaaaaaabccd] # # dfill(short_list[short_list.argsort()]): # [ 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15] # # np.range(short_list.size): # [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] # # np.range(short_list.size) - # dfill(short_list[short_list.argsort()]): # [ 0 1 2 3 4 5 6 7 8 9 10 11 0 0 1 0] # # unsorted: # [ 0 1 2 0 3 4 5 0 6 7 8 0 9 10 11 1] 
  • foo function recommended by @hpaulj with defaultdict
  • div function recommended by @Divakar (old, I'm sure it updated it)

the code

 def dfill(a): n = a.size b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]]) return np.arange(n)[b[:-1]].repeat(np.diff(b)) def argunsort(s): n = s.size u = np.empty(n, dtype=np.int64) u[s] = np.arange(n) return u def cumcount(a): n = a.size s = a.argsort(kind='mergesort') i = argunsort(s) b = a[s] return (np.arange(n) - dfill(b))[i] def foo(l): n = len(l) r = np.empty(n, dtype=np.int64) counter = defaultdict(int) for i in range(n): counter[l[i]] += 1 r[i] = counter[l[i]] return r - 1 def div(l): a = np.unique(l, return_counts=1)[1] idx = a.cumsum() id_arr = np.ones(idx[-1],dtype=int) id_arr[0] = 0 id_arr[idx[:-1]] = -a[:-1]+1 rng = id_arr.cumsum() return rng[argunsort(np.argsort(l))] 

Demonstration

 cumcount(short_list) array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]) 

time testing

the code

 functions = pd.Index(['cumcount', 'foo', 'foo2', 'div'], name='function') lengths = pd.RangeIndex(100, 1100, 100, 'array length') results = pd.DataFrame(index=lengths, columns=functions) from string import ascii_letters for i in lengths: a = np.random.choice(list(ascii_letters), i) for j in functions: results.set_value( i, j, timeit( '{}(a)'.format(j), 'from __main__ import a, {}'.format(j), number=1000 ) ) results.plot() 

enter image description here

+4
source

Source: https://habr.com/ru/post/1012329/


All Articles