Consider the following short_list
and long_list
short_list = list('aaabaaacaaadaaac') np.random.seed([3,1415]) long_list = pd.DataFrame( np.random.choice(list(ascii_letters), (10000, 2)) ).sum(1).tolist()
How to calculate the total score by a unique value?
I want to use numpy and do it in linear time. I want this to compare timings with my other methods. This may be easiest to illustrate with my first proposed solution.
def pir1(l): s = pd.Series(l) return s.groupby(s).cumcount().tolist() print(np.array(short_list)) print(pir1(short_list)) ['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c'] [0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]
I tried myself trying to use np.unique
because it returns an array of counts, an inverse array and an index array. I was sure that I could solve it. The best I got is in pir4
below the scale in quadratic time. Also note that I don't care if the count starts at 1 or 0, since we can just add or subtract 1.
Below are some of my attempts (none of which answer my question)
%%cython from collections import defaultdict def get_generator(l): counter = defaultdict(lambda: -1) for i in l: counter[i] += 1 yield counter[i] def pir2(l): return [i for i in get_generator(l)]
def pir3(l): return [i for i in get_generator(l)] def pir4(l): unq, inv = np.unique(l, 0, 1, 0) a = np.arange(len(unq)) matches = a[:, None] == inv return (matches * matches.cumsum(1)).sum(0).tolist()

