Group numbers in changing the value of an array in steps

Question

Group numbers in changing the value of an array in steps

i have an array such as [101, 107, 106, 199, 204, 205, 207, 306, 310, 312, 312, 314, 317, 318, 380, 377, 379, 382, 466, 469, 471 , 472, 557, 559, 562, 566, 569 ...]

In this array, after several integers, the step in the value will change (for example, between [101,107,106] and [199,204, ...]). Or else, an array consists of groups of integers, each group with values centered around an unknown value. But I will not know how many groups in total, and the number of integers in each of them is indefinite.

How can I group these integers every time I change a step into different arrays.

thanks

+1

python arrays algorithm grouping

melonb Nov 14 '17 at 16:29

source share

3 answers

I tried to do what I suggested in my comment. I think this will provide a good solution to a more generalized problem, but I caution that I did not think about all cases of edges or considered algorithmic complexity here.

 import numpy as np # function to initialize clusters def init_clusters(x, num_elements_per_cluster=3): # initialize clusters by splitting into n groups x.sort() # sort the list nclusters = len(x) / num_elements_per_cluster clusters = {i: {'values': []} for i in range(nclusters)} # assign to clusters (helps that list is sorted) for i in range(len(x)): index = min(i/num_elements_per_cluster, nclusters-1) clusters[index]['values'].append(x[i]) # compute variance for index in clusters: clusters[index]['var'] = np.var(clusters[index]['values']) return clusters def get_avg_var(clusters): total_var = 0.0 denom = 0.0 for index in clusters: total_var += clusters[index]['var'] * len(clusters[index]['values']) denom += len(clusters[index]['values']) return total_var / denom # possible div by 0, but shouldn't happen def assign_value_to_cluster(clusters, value): """ add value to a cluster such that results in the lowest variance """ new_cluster_vars = [] indices = [] for index in clusters: new_cluster_vars.append(np.var(clusters[index]['values'] + [value])) indices.append(index) index_min_new_cluster_var = indices[np.argmin(new_cluster_vars)] clusters[index_min_new_cluster_var]['values'].append(value) # update the variances clusters[index_min_new_cluster_var]['var'] = new_cluster_vars[index_min_new_cluster_var] def purify(clusters): curr_var = get_avg_var(clusters) prev_var = curr_var*10 max_iter = 1000 iter_count = 0 while(curr_var < prev_var): if iter_count > max_iter: break prev_var = curr_var # start with the cluster with the highest variance sorted_vars = sorted( [{'index': i, 'var': clusters[i]['var']} for i in clusters], key=lambda x: x['var'], reverse=True ) highest_var_index = sorted_vars[0]['index'] vals = clusters[highest_var_index]['values'] if len(vals) > 2: # find the element that when removed will minimize the variance of this cluster dropout_variance = [np.var([vals[j] for j in range(len(vals)) if j != i]) for i in range(len(vals))] index_to_drop = np.argmin(dropout_variance) value_to_reassign = clusters[highest_var_index]['values'].pop(index_to_drop) # update the variances clusters[highest_var_index]['var'] = dropout_variance[index_to_drop] assign_value_to_cluster(clusters, value_to_reassign) else: # break this cluster and assign values to others clusters.pop(highest_var_index) for val in vals: assign_value_to_cluster(clusters, val) curr_var = get_avg_var(clusters) print "after iter %04d: %04.2f" % (iter_count, curr_var) iter_count += 1 return clusters

Run the algorithm on the OP data sample:

 # vector x of values that we want to cluster x = [ 101, 107, 106, 199, 204, 205, 207, 306, 310, 312, 312, 314, 317, 318, 380, 377, 379, 382, 466, 469, 471, 472, 557, 559, 562, 566, 569 ] clusters = init_clusters(x) final_clusters = purify(clusters) # print values of the final clusters [final_clusters[y]['values'] for y in final_clusters]

Output:

 [[101, 106, 107], [204, 205, 207, 199], [306, 310], [312, 312, 314], [317, 318], [379, 380, 382, 377], [466, 469, 471, 472], [557, 559], [562, 566, 569]]

Edit: Fixed a bug in get_avg_var() , and I realized that I was not updating cluster deviations. This is sensitive to initialization, but it usually provides reasonable clusters. In this case, you can define your own optimization parameter (instead of using the average dispersion of the cluster, like me).

0

pault Nov 14 '17 at 18:51

source share

It seems like a step arises from your published code when abs(array[i]-array[i+1]) > 6 . You can use this:

 final = [] current = [] arr = [101, 107, 106, 199, 204, 205, 207, 306, 310, 312, 312, 314, 317, 318, 380, 377, 379, 382, 466, 469, 471, 472, 557, 559, 562, 566, 569] for i in range(len(arr)-1): if abs(arr[i] - arr[i+1]) > 6: current.append(arr[i]) final.append(current) current = [] else: current.append(arr[i])

Output:

 [[101, 107, 106], [199, 204, 205, 207], [306, 310, 312, 312, 314, 317, 318], [380, 377, 379, 382], [466, 469, 471, 472]]

-1

Ajax1234 Nov 14 '17 at 16:38

source share

tobias_k · Accepted Answer · 2017-11-14T19:20:13+0000

You can try the following: Determine the difference for each pair of consecutive numbers, and from those that determine the average difference.

nums = [101, 107, 106, 199, 204, 205, 207, 306, 310, 312, 312, 314, 317, 318, 380, 377, 379, 382, 466, 469, 471, 472, 557, 559, 562, 566, 569] pairs = list(zip(nums, nums[1:])) diffs = [abs(xy) for x, y in pairs] avg_diff = sum(diffs) / len(diffs) # ~ 18.31

Now you can group the numbers by whether the difference with the previous number is lower or higher than average:

 groups = [[nums[0]]] # first group already has first number for (x, y), d in zip(pairs, diffs): if d < avg_diff: groups[-1].append(y) # add to last group else: groups.append([y]) # start new group

Or, if you prefer single-line lines spanning three lines, this might be for you:

 groups = [functools.reduce(lambda A, b: A+(b[1],) if A else b, group, None) for key, group in itertools.groupby(zip(nums, nums[1:]), key=lambda t: abs(t[0]-t[1]) < 18.3) if key]

The result for your example is this:

 [[101, 107, 106], [199, 204, 205, 207], [306, 310, 312, 312, 314, 317, 318], [380, 377, 379, 382], [466, 469, 471, 472], [557, 559, 562, 566, 569]]

Of course, this breaks down if there are groups with completely different group differences, as in [1, 4, 2, 5, 1042, 1230, 920, 3, 2, 5] . If so, you can try instead the relative difference of the numbers, for example. max(x,y)/min(x,y) instead of abs(xy) .

Group numbers in changing the value of an array in steps

More articles: