Group by max or min in a numpy array

I have two 1D numpy arrays of equal length, id and data, where id is a sequence of repeated, ordered integers that define sub-windows on data. For instance,

    id  data
     1     2
     1     7
     1     3
     2     8
     2     9
     2    10
     3     1
     3   -10

I would like to combine data by grouping on id and taking either the max or the min. In SQL, this would be a typical aggregation query, for example SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id. Is there a way to avoid Python loops and do it in vectorized form, or do I have to drop down to C?
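In numpy terms, the table above is just a restatement of the question's data (the names id and data are as used throughout):

    import numpy as np

    id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
    data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
    # the SQL MAX query above should yield [7, 10, 1]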

+6
7 answers

I have seen some very similar Stack Overflow questions over the past few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of core numpy machinery it is most likely going to be faster than anything you can do in a Python loop.

    import numpy as np

    def group_min(groups, data):
        # sort with major key groups, minor key data
        order = np.lexsort((data, groups))
        groups = groups[order]  # this is only needed if groups is unsorted
        data = data[order]
        # construct an index which marks borders between groups
        index = np.empty(len(groups), 'bool')
        index[0] = True
        index[1:] = groups[1:] != groups[:-1]
        return data[index]

    # max is very similar
    def group_max(groups, data):
        order = np.lexsort((data, groups))
        groups = groups[order]  # this is only needed if groups is unsorted
        data = data[order]
        index = np.empty(len(groups), 'bool')
        index[-1] = True
        index[:-1] = groups[1:] != groups[:-1]
        return data[index]
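As a quick sanity check (my own addition, assuming the id/data arrays defined under the question):

    print group_max(id, data)  # -> [ 7 10  1]
    print group_min(id, data)  # -> [  2   8 -10]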
+8

In pure Python:

    from itertools import groupby, imap, izip
    from operator import itemgetter as ig

    print [max(imap(ig(1), g)) for k, g in groupby(izip(id, data), key=ig(0))]
    # -> [7, 10, 1]

Alternatively:

    print [data[id==i].max() for i, _ in groupby(id)]
    # -> [7, 10, 1]

Based on @Howl's answer:

    import numpy as np

    # sort by `id`, then by `data`
    ndx = np.lexsort(keys=(data, id))
    id, data = id[ndx], data[ndx]

    # get max()
    print data[np.r_[np.diff(id), True].astype(np.bool)]
    # -> [ 7 10  1]

If pandas is installed:

    from pandas import DataFrame

    df = DataFrame(dict(id=id, data=data))
    print df.groupby('id')['data'].max()
    # id
    # 1     7
    # 2    10
    # 3     1
+6

I am new to Python and numpy, but it looks like you can use the .at method of ufuncs rather than reduceat:

    import numpy as np

    data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
    data_val = np.random.rand(len(data_id))

    ans = np.empty(data_id[-1]+1)  # might want to use max(data_id) and zeros instead
    np.maximum.at(ans, data_id, data_val)

For instance:

    data_val = array([ 0.65753453,  0.84279716,  0.88189818,  0.18987882,  0.49800668,
                       0.29656994,  0.39542769,  0.43155428,  0.77982853,  0.44955868,
                       0.22080219,  0.4807312 ,  0.9288989 ,  0.10956681,  0.73215416,
                       0.33184318,  0.10936647])
    ans = array([ 0.98969952,  0.84044947,  0.63460516,  0.92042078,  0.75738113,
                  0.37976055])

Of course, this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge; presumably, if they are large or sparse, you could initialize ans from np.unique(data_id) or something).

I should point out that data_id does not actually need to be sorted.
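A rough sketch of the np.unique idea mentioned above, for ids that are large, sparse, or negative (this is my own elaboration, not part of the original answer; group_max_at is a hypothetical helper name):

    import numpy as np

    def group_max_at(data_id, data_val):
        # remap arbitrary ids to dense indices 0..n_groups-1
        unique_ids, dense_id = np.unique(data_id, return_inverse=True)
        # start from -inf so that maximum.at can only raise the values
        out = np.full(len(unique_ids), -np.inf)
        np.maximum.at(out, dense_id, data_val)
        return unique_ids, out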

+3

I've packaged my previous answer into the numpy_indexed package; it's nice to have it all wrapped up and tested in a neat interface; plus it has a lot more functionality:

    import numpy_indexed as npi

    group_id, group_max_data = npi.group_by(id).max(data)

And so on.
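With the arrays from the question, that call should give something like the following (my own check, assuming the setup shown under the question and that numpy_indexed is installed):

    group_id, group_max_data = npi.group_by(id).max(data)
    print group_id         # -> [1 2 3]
    print group_max_data   # -> [ 7 10  1]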

+1

I think this accomplishes what you are looking for:

    [max([val for idx, val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]

Reading the outer list comprehension from right to left: set(id) collects the unique ids, sorted() puts them in order, for k ... iterates over them, and max takes the maximum of, in this case, another list comprehension. Moving to that inner list comprehension: enumerate(data) returns both the index and the value from data, and if id[idx] == k selects the members of data belonging to id k.

This iterates over the full data list once for each id. With some preprocessing into sublists it could probably be sped up, but then it would no longer be a one-liner.
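Spelled out as plain loops, the one-liner above is equivalent to (my own expansion, just for readability):

    result = []
    for k in sorted(set(id)):
        vals = [val for idx, val in enumerate(data) if id[idx] == k]
        result.append(max(vals))
    # result -> [7, 10, 1]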

0

The following solution only requires a sort of the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r, then r[o] = x fills r with the last value of x for each index in o, so that r[[0, 0]] = [1, 2] leaves r[0] == 2. It requires that your groups be integers from 0 to the number of groups - 1, as for numpy.bincount, and that there is a value for every group:

    import numpy as np

    def group_min(groups, data):
        n_groups = np.max(groups) + 1
        result = np.empty(n_groups)
        order = np.argsort(data)[::-1]
        result[groups.take(order)] = data.take(order)
        return result

    def group_max(groups, data):
        n_groups = np.max(groups) + 1
        result = np.empty(n_groups)
        order = np.argsort(data)
        result[groups.take(order)] = data.take(order)
        return result
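A quick check against the question's data (my addition; the ids there start at 1, so they are shifted down to the 0..n_groups-1 range this approach requires):

    groups = id - 1
    print group_max(groups, data)  # -> [  7.  10.   1.]
    print group_min(groups, data)  # -> [  2.   8. -10.]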
0

A somewhat faster and more general answer than the already accepted one; like joeln's answer, it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only requires that the keys be sortable, not that they be ints in a specific range. The accepted answer may still be faster, given that it does not compute the max/min through a general reduction. The accepted solution's ability to ignore nans is neat; but you can also just assign a dummy key to the nan values.

    import numpy as np

    def group(key, value, operator=np.add):
        """
        group the values by key
        any ufunc operator can be supplied to perform the reduction
        (np.maximum, np.minimum, np.subtract, and so on)
        returns the unique keys, their corresponding per-key reduction
        over the operator, and the keycounts
        """
        # upcast to numpy arrays
        key = np.asarray(key)
        value = np.asarray(value)
        # first, sort by key
        I = np.argsort(key)
        key = key[I]
        value = value[I]
        # the slicing points of the bins to sum over
        slices = np.concatenate(([0], np.where(key[:-1] != key[1:])[0] + 1))
        # first entry of each bin is a unique key
        unique_keys = key[slices]
        # reduce over the slices specified by index
        per_key_sum = operator.reduceat(value, slices)
        # number of counts per key is the difference of our slice points;
        # cap off with the number of keys for the last bin
        key_count = np.diff(np.append(slices, len(key)))
        return unique_keys, per_key_sum, key_count


    names = ["a", "b", "b", "c", "d", "e", "e"]
    values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

    unique_keys, reduced_values, key_count = group(names, values)
    print 'per group mean'
    print reduced_values / key_count

    unique_keys, reduced_values, key_count = group(names, values, np.minimum)
    print 'per group min'
    print reduced_values

    unique_keys, reduced_values, key_count = group(names, values, np.maximum)
    print 'per group max'
    print reduced_values
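Applied to the question's id/data arrays (my own check, using the setup shown under the question), the same function reproduces the grouped maxima:

    unique_keys, reduced_values, key_count = group(id, data, np.maximum)
    print unique_keys     # -> [1 2 3]
    print reduced_values  # -> [ 7 10  1]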
0

Source: https://habr.com/ru/post/904424/

