Is there an elegant way to save only the top value [2 ~ 3] for each row in the matrix?

A simpler method has been added at the end of the question.

What I have

I have a user-user correlation matrix called corr_of_user, as shown below:

    userId       316       320       359       370       910
    userId
    316     1.000000  0.202133  0.208618  0.176050  0.174035
    320     0.202133  1.000000  0.242837  0.019035  0.031737
    359     0.208618  0.242837  1.000000  0.357620  0.175914
    370     0.176050  0.019035  0.357620  1.000000  0.317371
    910     0.174035  0.031737  0.175914  0.317371  1.000000

What I want

For each user, I just want the 2 other users most similar to him (the highest correlation values in each row after excluding the diagonal). For instance:

    Out[40]:
    userId          316       320       359       370       910
    corr_user
    316             NaN  0.202133  0.208618       NaN       NaN
    320        0.202133       NaN  0.242837       NaN       NaN
    359             NaN  0.242837       NaN  0.357620       NaN
    370             NaN       NaN  0.357620       NaN  0.317371
    910             NaN       NaN  0.175914  0.317371       NaN

I know how to achieve this, but the way I came up with is too complicated. Can anyone suggest a better approach?

What I tried

First, melt the matrix:

    melted_corr = corr_of_user.reset_index().melt(id_vars="userId", var_name="corr_user")
    melted_corr.head()
    Out[23]:
       userId corr_user     value
    0     316       316  1.000000
    1     320       316  0.202133
    2     359       316  0.208618
    3     370       316  0.176050
    4     910       316  0.174035

then filter it row by row:

    get_secend_third = lambda x: x.sort_values(ascending=False).iloc[1:3]
    filted = melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_secend_third)
    filted
    Out[39]:
    corr_user  userId
    316        359       0.208618
               320       0.202133
    320        359       0.242837
               316       0.202133
    359        370       0.357620
               320       0.242837
    370        359       0.357620
               910       0.317371
    910        370       0.317371
               359       0.175914

and finally reshape it:

    filted.reset_index().pivot_table("value", "corr_user", "userId")
    Out[40]:
    userId          316       320       359       370       910
    corr_user
    316             NaN  0.202133  0.208618       NaN       NaN
    320        0.202133       NaN  0.242837       NaN       NaN
    359             NaN  0.242837       NaN  0.357620       NaN
    370             NaN       NaN  0.357620       NaN  0.317371
    910             NaN       NaN  0.175914  0.317371       NaN
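Put together, the three steps run as one snippet. This is a sketch that rebuilds the example matrix by hand from the values shown above, since the original corr_of_user isn't included:

```python
import pandas as pd

# Rebuild the example user-user correlation matrix from the values shown above.
users = [316, 320, 359, 370, 910]
corr_of_user = pd.DataFrame(
    [[1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
     [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
     [0.174035, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=pd.Index(users, name="userId"),
)

# Step 1: melt the matrix into long form (userId, corr_user, value).
melted_corr = corr_of_user.reset_index().melt(id_vars="userId", var_name="corr_user")

# Step 2: per group, sort descending and skip iloc[0] (the self-correlation of 1.0),
# keeping the next two values.
get_second_third = lambda x: x.sort_values(ascending=False).iloc[1:3]
top2 = melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_second_third)

# Step 3: pivot back to a square matrix; missing cells become NaN.
result = top2.reset_index().pivot_table("value", "corr_user", "userId")
print(result)
```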

Updated:

I came up with an easier way to do this after seeing @John Zwinck's answer.

Let's say there is a new DataFrame df with some duplicated values and NaNs:

    userId  316       320       359       370       910
    userId
    316     1.0  0.500000  0.500000  0.500000       NaN
    320     0.5  1.000000  0.242837  0.019035  0.031737
    359     0.5  0.242837  1.000000  0.357620  0.175914
    370     0.5  0.019035  0.357620  1.000000  0.317371
    910     NaN  0.031737  0.175914  0.317371  1.000000

First, I rank each row:

 rank = df.rank(1, ascending=False, method="first") 

Then I use isin() on the ranks to get the required mask.

 mask = rank.isin(list(range(2,4))) 

Finally:

    df.where(mask)

And I get what I wanted:

    userId  316  320       359       370  910
    userId
    316     NaN  0.5  0.500000       NaN  NaN
    320     0.5  NaN  0.242837       NaN  NaN
    359     0.5  NaN       NaN  0.357620  NaN
    370     0.5  NaN  0.357620       NaN  NaN
    910     NaN  NaN  0.175914  0.317371  NaN
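The whole rank/isin/where approach as one runnable sketch, with the example matrix (duplicates and NaN included) rebuilt by hand:

```python
import numpy as np
import pandas as pd

# Rebuild the example matrix with duplicated values and a NaN.
users = [316, 320, 359, 370, 910]
df = pd.DataFrame(
    [[1.0, 0.500000, 0.500000, 0.500000, np.nan],
     [0.5, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.5, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.5, 0.019035, 0.357620, 1.000000, 0.317371],
     [np.nan, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=pd.Index(users, name="userId"),
)

# Rank each row in descending order; method="first" breaks ties by position,
# so duplicated values still get distinct ranks, and NaNs get no rank at all.
rank = df.rank(1, ascending=False, method="first")

# Ranks 2 and 3 are the two largest values after the diagonal 1.0 (rank 1).
mask = rank.isin([2, 3])
result = df.where(mask)
print(result)
```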
3 answers

First use np.argsort() to find which locations have the highest values:

 sort = np.argsort(df) 

This gives a DataFrame whose column names are meaningless, but the second and third columns from the right contain the required column positions for each row:

            316  320  359  370  910
    userId
    316       4    3    1    2    0
    320       3    4    0    2    1
    359       4    0    1    3    2
    370       1    0    4    2    3
    910       1    0    2    3    4

Then create a boolean mask that is True in those positions:

    mask = np.zeros(df.shape, bool)
    rows = np.arange(len(df))
    mask[rows, sort.iloc[:, -2]] = True
    mask[rows, sort.iloc[:, -3]] = True

Now you have the necessary mask:

    array([[False,  True,  True, False, False],
           [ True, False,  True, False, False],
           [False,  True, False,  True, False],
           [False, False,  True, False,  True],
           [False, False,  True,  True, False]], dtype=bool)

Finally, df.where(mask) :

                 316       320       359       370       910
    userId
    316          NaN  0.202133  0.208618       NaN       NaN
    320     0.202133       NaN  0.242837       NaN       NaN
    359          NaN  0.242837       NaN  0.357620       NaN
    370          NaN       NaN  0.357620       NaN  0.317371
    910          NaN       NaN  0.175914  0.317371       NaN
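The steps above combined into one runnable sketch. One deviation to flag: I call argsort on the underlying NumPy array rather than on the DataFrame itself, since the behavior of np.argsort(df) varies across pandas versions; the positions it returns are the same.

```python
import numpy as np
import pandas as pd

# Rebuild the example correlation matrix from the values shown earlier.
users = [316, 320, 359, 370, 910]
df = pd.DataFrame(
    [[1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
     [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
     [0.174035, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=users,
)

# Per-row column positions, ascending by value; the last position is the
# diagonal 1.0, so positions -2 and -3 are the two most similar users.
sort = np.argsort(df.values)

mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort[:, -2]] = True
mask[rows, sort[:, -3]] = True

result = df.where(mask)
print(result)
```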

This should work:

    melted_corr['group_rank'] = melted_corr.groupby('userId')['value']\
        .rank(ascending=False)

then select the top-ranked rows for each user with:

 melted_corr[melted_corr.group_rank<=2] 
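As a runnable sketch, rebuilding melted_corr from the example values. One caveat: rank 1 within each group is the self-correlation of 1.0, so to match the desired output (the top 2 *other* users) I keep ranks 2 and 3 here rather than ranks 1 and 2:

```python
import pandas as pd

# Rebuild melted_corr from the example correlation matrix.
users = [316, 320, 359, 370, 910]
corr_of_user = pd.DataFrame(
    [[1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
     [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
     [0.174035, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=pd.Index(users, name="userId"),
)
melted_corr = corr_of_user.reset_index().melt(id_vars="userId", var_name="corr_user")

# Rank each user's correlations in descending order within the group.
melted_corr['group_rank'] = melted_corr.groupby('userId')['value'].rank(ascending=False)

# Rank 1 is the self-correlation of 1.0, so keep ranks 2 and 3.
top2 = melted_corr[melted_corr['group_rank'].between(2, 3)]
print(top2)
```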

Here is my numpy-esque solution:

    top_k = 3
    top_corr = corr_of_user.copy()
    top_ndarray = top_corr.values
    np.fill_diagonal(top_ndarray, np.nan)
    rows = np.arange(top_corr.shape[0])[:, np.newaxis]
    columns = top_ndarray.argsort()[:, :-top_k]
    top_ndarray[rows, columns] = np.nan
    top_corr

which gives:

    userId       316       320       359       370       910
    userId
    316          NaN  0.202133  0.208618       NaN       NaN
    320     0.202133       NaN  0.242837       NaN       NaN
    359          NaN  0.242837       NaN  0.357620       NaN
    370          NaN       NaN  0.357620       NaN  0.317371
    910          NaN       NaN  0.175914  0.317371       NaN

You can replace top_corr = corr_of_user.copy() with top_corr = corr_of_user if you don't need a copy but want the matrix modified in place.

The idea is pretty much the same as John Zwinck's: get the indexes of the unwanted fields and use them to index into the array and clear the values we don't need. A small advantage of this solution is that K (the number of top results we want) is a parameter rather than hardcoded. It also works when corr_of_user contains all 1s.
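A sketch wrapping this idea in a reusable helper (keep_top_k is a name I'm introducing, not from the answer). It works on an explicit copy via to_numpy(copy=True), so it doesn't rely on .values returning a view, and it takes k as the number of off-diagonal values to keep; it assumes the matrix has no NaNs other than the diagonal it blanks itself:

```python
import numpy as np
import pandas as pd

def keep_top_k(corr, k=2):
    """Keep the k largest off-diagonal values in each row; NaN out the rest."""
    a = corr.to_numpy(dtype=float, copy=True)
    np.fill_diagonal(a, np.nan)
    rows = np.arange(a.shape[0])[:, np.newaxis]
    # NaNs sort last in argsort, so the diagonal occupies one of the last
    # k+1 slots; blanking everything before them keeps exactly k real values.
    cols = a.argsort()[:, :-(k + 1)]
    a[rows, cols] = np.nan
    return pd.DataFrame(a, index=corr.index, columns=corr.columns)

# Demo on the example matrix from the question.
users = [316, 320, 359, 370, 910]
corr_of_user = pd.DataFrame(
    [[1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
     [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
     [0.174035, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=pd.Index(users, name="userId"),
)
result = keep_top_k(corr_of_user, k=2)
print(result)
```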


Source: https://habr.com/ru/post/1273576/