Compare multiple columns in a numpy array

Question


I have a two-dimensional numpy array with about 12 columns and 1000+ rows, and each cell contains a number from 1 to 5. I am looking for the best sextuple of columns according to my point system, where 1 and 2 give -1 point and 4 and 5 give +1.

If a row in a certain sextuple contains, for example, [1, 4, 5, 3, 4, 3], the score for this row should be +2, because 3 * 1 + 1 * (-1) = 2. The next row may be [1, 2, 2, 3, 3, 3] and should score -3 points.

At first I tried a straightforward loop solution, but I realized that there are 665,280 possible combinations of columns to compare, and when I also need to search for the best quintuple, quadruple, etc., the loop takes forever.

Is there a smarter way to solve my problem?


python numpy compare

user1649268 Sep 05 '12 at 14:39


3 answers

unutbu · Answer 1 · 2012-09-05T15:18:33+0000

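    import itertools

    import numpy as np

    N_rows = 10
    # randint's upper bound is exclusive, so this draws integers 1..5
    # (replaces the deprecated np.random.random_integers(5, ...))
    arr = np.random.randint(1, 6, size=(N_rows, 12))
    # lookup table indexed by cell value: 1, 2 -> -1; 3 -> 0; 4, 5 -> +1
    # (index 0 is unused)
    x = np.array([0, -1, -1, 0, 1, 1])
    y = x[arr]
    print(y)

    # score every 6-column combination and keep the best (score, columns) pair
    score, best_sextuple = max((y[:, cols].sum(), cols)
                               for cols in itertools.combinations(range(12), 6))
    print('''\
    score: {s}
    sextuple: {c}
    '''.format(s=score, c=best_sextuple))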

gives, for example:

    score: 6
    sextuple: (0, 1, 5, 8, 10, 11)

Explanation

First, create a random example with 12 columns and 10 rows:

    N_rows = 10
    # randint's upper bound is exclusive, so this draws integers 1..5
    arr = np.random.randint(1, 6, size=(N_rows, 12))

Now we can use numpy indexing to convert the numbers 1, 2, ..., 5 in arr to the values -1, 0, 1 (according to your scoring system):

    x = np.array([0, -1, -1, 0, 1, 1])
    y = x[arr]
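As a quick check of the lookup-table idea, here is a minimal sketch that applies the same table to the two example rows from the question. The names `rows` and `row_scores` are only illustrative and do not appear in the answer's code.

```python
import numpy as np

# lookup table from the answer: index = cell value, entry = points
x = np.array([0, -1, -1, 0, 1, 1])

# the two example rows given in the question
rows = np.array([[1, 4, 5, 3, 4, 3],
                 [1, 2, 2, 3, 3, 3]])

row_points = x[rows]                  # same shape as rows, values in {-1, 0, 1}
row_scores = row_points.sum(axis=1)   # one score per row
print(row_scores)                     # [ 2 -3]
```

The scores 2 and -3 match the values worked out by hand in the question.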

Then, use itertools.combinations to create all the possible combinations of 6 columns:

 for cols in itertools.combinations(range(12),6)

and

 y[:,cols].sum()

then gives the score for cols, the chosen columns (the sextuple).
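It may help to see how small this search actually is. The 665,280 figure in the question equals the number of ordered selections of 6 columns out of 12, but the sum does not depend on column order, so itertools.combinations only has to visit the unordered ones. A small sketch of the count (requires Python 3.8+ for math.perm; the variable names are illustrative):

```python
import itertools
import math

n_cols, k = 12, 6

# ordered selections of 6 distinct columns: 12 * 11 * 10 * 9 * 8 * 7
n_ordered = math.perm(n_cols, k)

# unordered selections, counted by actually iterating over them
n_unordered = sum(1 for _ in itertools.combinations(range(n_cols), k))

print(n_ordered, n_unordered)   # 665280 924
```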

Finally, use max to select the sextuple with the best score:

 score, best_sextuple = max((y[:,cols].sum(), cols) for cols in itertools.combinations(range(12),6))

chthonicdaemon · Answer 2 · 2012-09-05T15:47:58+0000

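    import numpy

    A = numpy.random.randint(1, 6, size=(1000, 12))
    # per-cell points: 1 and 2 -> -1, 3 -> 0, 4 and 5 -> +1
    points = -1*(A == 1) + -1*(A == 2) + 1*(A == 4) + 1*(A == 5)
    # total points per column (sum over rows)
    columnsums = numpy.sum(points, 0)

    def best6(row):
        # indices of the six largest entries, lowest of the six first
        return numpy.argsort(row)[-6:]

    bestcolumns = best6(columnsums)
    # list(...) materializes the result on Python 3, where map is lazy
    allbestcolumns = list(map(best6, points))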

bestcolumns will now contain the indices of the top 6 columns, ordered from lowest to highest column total. By the same logic, allbestcolumns will contain the indices of the six highest-scoring columns in each row.
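This shortcut works because the score of a set of columns, summed over all rows, is just the sum of the individual column totals, so the k columns with the largest totals give the best k-tuple for any k. The sketch below compares that shortcut against brute force on a small, made-up data set. The helper `best_columns`, the seed, and the array size are assumptions for illustration only.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)           # fixed seed, only for repeatability
A = rng.integers(1, 6, size=(50, 12))    # hypothetical data, values 1..5
points = np.sign(A - 3)                  # 1, 2 -> -1; 3 -> 0; 4, 5 -> +1
column_sums = points.sum(axis=0)

def best_columns(column_sums, k):
    """Indices of the k columns with the largest totals (illustrative helper)."""
    return np.sort(np.argsort(column_sums)[-k:])

results = []
for k in range(1, 7):
    top = best_columns(column_sums, k)
    shortcut_score = int(column_sums[top].sum())
    # brute force over every k-column combination, as in the first answer
    brute_score = int(max(points[:, list(cols)].sum()
                          for cols in itertools.combinations(range(12), k)))
    results.append((k, shortcut_score, brute_score))

for k, shortcut_score, brute_score in results:
    print(k, shortcut_score, brute_score)
```

When several columns tie, the shortcut and the brute-force search may pick different columns, but the best score is the same.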

abought · Answer 3 · 2012-09-05T15:25:54+0000

Expanding on unutbu's longer answer above, it is possible to generate the array of scores automatically. Since the score for each value is the same on every pass through the loop, the scores only need to be calculated once. Here is a slightly inelegant way of doing this on an example 6x10 array, shown before and after your scores are applied.

    >>> import numpy
    >>> values = numpy.random.randint(6, size=(6,10))
    >>> values
    array([[4, 5, 1, 2, 1, 4, 0, 1, 0, 4],
           [2, 5, 2, 2, 3, 1, 3, 5, 3, 1],
           [3, 3, 5, 4, 2, 1, 4, 0, 0, 1],
           [2, 4, 0, 0, 4, 1, 4, 0, 1, 0],
           [0, 4, 1, 2, 0, 3, 3, 5, 0, 1],
           [2, 3, 3, 4, 0, 1, 1, 1, 3, 2]])
    >>> b = values.copy()
    >>> b[ b<3 ] = -1
    >>> b[ b==3 ] = 0
    >>> b[ b>3 ] = 1
    >>> b
    array([[ 1,  1, -1, -1, -1,  1, -1, -1, -1,  1],
           [-1,  1, -1, -1,  0, -1,  0,  1,  0, -1],
           [ 0,  0,  1,  1, -1, -1,  1, -1, -1, -1],
           [-1,  1, -1, -1,  1, -1,  1, -1, -1, -1],
           [-1,  1, -1, -1, -1,  0,  0,  1, -1, -1],
           [-1,  0,  0,  1, -1, -1, -1, -1,  0, -1]])
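The three masked assignments can probably be collapsed into one expression. A minimal sketch, assuming the same thresholds (below 3 gives -1, exactly 3 gives 0, above 3 gives +1), uses numpy.sign on the distance from 3. It reuses the first two rows of the example array above.

```python
import numpy as np

# first two rows of the example array from this answer
values = np.array([[4, 5, 1, 2, 1, 4, 0, 1, 0, 4],
                   [2, 5, 2, 2, 3, 1, 3, 5, 3, 1]])

# sign(values - 3) is -1 below 3, 0 at 3, and +1 above 3
b = np.sign(values - 3)
print(b)
```

This leaves `values` untouched, so no explicit copy is needed.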

By the way, this thread claims that generating the combinations directly in numpy is about 5 times faster than using itertools, though perhaps at the expense of some readability.
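The sketch below is not the method from that thread. It still builds the combinations with itertools and only vectorizes the scoring step, by turning the combinations into an index array and summing precomputed column totals. The data, the seed, and the names `col_totals`, `combos` and `scores` are assumptions for illustration.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)                 # fixed seed, only for repeatability
arr = rng.integers(1, 6, size=(1000, 12))      # hypothetical data, values 1..5
x = np.array([0, -1, -1, 0, 1, 1])             # lookup table from the first answer
col_totals = x[arr].sum(axis=0)                # shape (12,): one total per column

# every 6-column combination as rows of an index array, shape (924, 6)
combos = np.array(list(itertools.combinations(range(12), 6)))

# fancy indexing gathers the six totals per combination; sum them row-wise
scores = col_totals[combos].sum(axis=1)        # shape (924,)

best = combos[scores.argmax()]
print(best, scores.max())
```

Because the column totals are computed once, each combination costs only six additions instead of a pass over all rows.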
