Is there an elegant way to save only the top value [2 ~ 3] for each row in the matrix?

A simpler method has been added at the end of the question.

What I have

I have a user-user correlation matrix called corr_of_user, as shown below:

    userId       316       320       359       370       910
    userId
    316     1.000000  0.202133  0.208618  0.176050  0.174035
    320     0.202133  1.000000  0.242837  0.019035  0.031737
    359     0.208618  0.242837  1.000000  0.357620  0.175914
    370     0.176050  0.019035  0.357620  1.000000  0.317371
    910     0.174035  0.031737  0.175914  0.317371  1.000000

What I want

For each user, I just want the 2 other users most similar to him (the highest correlation values in each row after excluding the diagonal). For instance:

    Out[40]:
    userId          316       320       359       370       910
    corr_user
    316             NaN  0.202133  0.208618       NaN       NaN
    320        0.202133       NaN  0.242837       NaN       NaN
    359             NaN  0.242837       NaN  0.357620       NaN
    370             NaN       NaN  0.357620       NaN  0.317371
    910             NaN       NaN  0.175914  0.317371       NaN

I know how to achieve this, but the way I came up with is too complicated. Can anyone suggest a better approach?

What I tried

First, melt the matrix:

    melted_corr = corr_of_user.reset_index().melt(id_vars="userId", var_name="corr_user")
    melted_corr.head()
    Out[23]:
       userId corr_user     value
    0     316       316  1.000000
    1     320       316  0.202133
    2     359       316  0.208618
    3     370       316  0.176050
    4     910       316  0.174035

then filter it row by row:

    get_secend_third = lambda x: x.sort_values(ascending=False).iloc[1:3]
    filted = melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_secend_third)
    filted
    Out[39]:
    corr_user  userId
    316        359       0.208618
               320       0.202133
    320        359       0.242837
               316       0.202133
    359        370       0.357620
               320       0.242837
    370        359       0.357620
               910       0.317371
    910        370       0.317371
               359       0.175914

and finally reshape it:

    filted.reset_index().pivot_table("value", "corr_user", "userId")
    Out[40]:
    userId          316       320       359       370       910
    corr_user
    316             NaN  0.202133  0.208618       NaN       NaN
    320        0.202133       NaN  0.242837       NaN       NaN
    359             NaN  0.242837       NaN  0.357620       NaN
    370             NaN       NaN  0.357620       NaN  0.317371
    910             NaN       NaN  0.175914  0.317371       NaN
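Put together, the three steps run as one snippet. This is a sketch that rebuilds the example matrix by hand from the values shown above, since the original corr_of_user isn't included:

```python
import pandas as pd

# Rebuild the example user-user correlation matrix from the values shown above.
users = [316, 320, 359, 370, 910]
corr_of_user = pd.DataFrame(
    [[1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
     [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
     [0.174035, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=pd.Index(users, name="userId"),
)

# Step 1: melt the matrix into long form (userId, corr_user, value).
melted_corr = corr_of_user.reset_index().melt(id_vars="userId", var_name="corr_user")

# Step 2: per group, sort descending and skip iloc[0] (the self-correlation of 1.0),
# keeping the next two values.
get_second_third = lambda x: x.sort_values(ascending=False).iloc[1:3]
top2 = melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_second_third)

# Step 3: pivot back to a square matrix; missing cells become NaN.
result = top2.reset_index().pivot_table("value", "corr_user", "userId")
print(result)
```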

Updated:

I came up with an easier way to do this after seeing @John Zwinck's answer.

Let's say there is a new DataFrame df with some duplicated values and NaNs:

    userId  316       320       359       370       910
    userId
    316     1.0  0.500000  0.500000  0.500000       NaN
    320     0.5  1.000000  0.242837  0.019035  0.031737
    359     0.5  0.242837  1.000000  0.357620  0.175914
    370     0.5  0.019035  0.357620  1.000000  0.317371
    910     NaN  0.031737  0.175914  0.317371  1.000000

First, I rank each row:

 rank = df.rank(1, ascending=False, method="first") 

Then I use isin() on the ranks to get the required mask.

 mask = rank.isin(list(range(2,4))) 

Finally:

    df.where(mask)

And I get what I wanted:

    userId  316  320       359       370  910
    userId
    316     NaN  0.5  0.500000       NaN  NaN
    320     0.5  NaN  0.242837       NaN  NaN
    359     0.5  NaN       NaN  0.357620  NaN
    370     0.5  NaN  0.357620       NaN  NaN
    910     NaN  NaN  0.175914  0.317371  NaN
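The whole rank/isin/where approach as one runnable sketch, with the example matrix (duplicates and NaN included) rebuilt by hand:

```python
import numpy as np
import pandas as pd

# Rebuild the example matrix with duplicated values and a NaN.
users = [316, 320, 359, 370, 910]
df = pd.DataFrame(
    [[1.0, 0.500000, 0.500000, 0.500000, np.nan],
     [0.5, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.5, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.5, 0.019035, 0.357620, 1.000000, 0.317371],
     [np.nan, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=pd.Index(users, name="userId"),
)

# Rank each row in descending order; method="first" breaks ties by position,
# so duplicated values still get distinct ranks, and NaNs get no rank at all.
rank = df.rank(1, ascending=False, method="first")

# Ranks 2 and 3 are the two largest values after the diagonal 1.0 (rank 1).
mask = rank.isin([2, 3])
result = df.where(mask)
print(result)
```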
3 answers

First use np.argsort() to find which locations have the highest values:

 sort = np.argsort(df) 

This gives a DataFrame whose column names are meaningless, but the second and third columns from the right contain the required column positions for each row:

            316  320  359  370  910
    userId
    316       4    3    1    2    0
    320       3    4    0    2    1
    359       4    0    1    3    2
    370       1    0    4    2    3
    910       1    0    2    3    4

Then create a boolean mask that is True in those positions:

    mask = np.zeros(df.shape, bool)
    rows = np.arange(len(df))
    mask[rows, sort.iloc[:, -2]] = True
    mask[rows, sort.iloc[:, -3]] = True

Now you have the necessary mask:

    array([[False,  True,  True, False, False],
           [ True, False,  True, False, False],
           [False,  True, False,  True, False],
           [False, False,  True, False,  True],
           [False, False,  True,  True, False]], dtype=bool)

Finally, df.where(mask) :

                 316       320       359       370       910
    userId
    316          NaN  0.202133  0.208618       NaN       NaN
    320     0.202133       NaN  0.242837       NaN       NaN
    359          NaN  0.242837       NaN  0.357620       NaN
    370          NaN       NaN  0.357620       NaN  0.317371
    910          NaN       NaN  0.175914  0.317371       NaN
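The steps above combined into one runnable sketch. One deviation to flag: I call argsort on the underlying NumPy array rather than on the DataFrame itself, since the behavior of np.argsort(df) varies across pandas versions; the positions it returns are the same.

```python
import numpy as np
import pandas as pd

# Rebuild the example correlation matrix from the values shown earlier.
users = [316, 320, 359, 370, 910]
df = pd.DataFrame(
    [[1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
     [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
     [0.174035, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=users,
)

# Per-row column positions, ascending by value; the last position is the
# diagonal 1.0, so positions -2 and -3 are the two most similar users.
sort = np.argsort(df.values)

mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort[:, -2]] = True
mask[rows, sort[:, -3]] = True

result = df.where(mask)
print(result)
```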

This should work:

    melted_corr['group_rank'] = melted_corr.groupby('userId')['value']\
        .rank(ascending=False)

then select the top-ranked rows for each user with:

 melted_corr[melted_corr.group_rank<=2] 
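As a runnable sketch, rebuilding melted_corr from the example values. One caveat: rank 1 within each group is the self-correlation of 1.0, so to match the desired output (the top 2 *other* users) I keep ranks 2 and 3 here rather than ranks 1 and 2:

```python
import pandas as pd

# Rebuild melted_corr from the example correlation matrix.
users = [316, 320, 359, 370, 910]
corr_of_user = pd.DataFrame(
    [[1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
     [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
     [0.174035, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=pd.Index(users, name="userId"),
)
melted_corr = corr_of_user.reset_index().melt(id_vars="userId", var_name="corr_user")

# Rank each user's correlations in descending order within the group.
melted_corr['group_rank'] = melted_corr.groupby('userId')['value'].rank(ascending=False)

# Rank 1 is the self-correlation of 1.0, so keep ranks 2 and 3.
top2 = melted_corr[melted_corr['group_rank'].between(2, 3)]
print(top2)
```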

Here is my numpy-esque solution:

    top_k = 3
    top_corr = corr_of_user.copy()
    top_ndarray = top_corr.values
    np.fill_diagonal(top_ndarray, np.nan)
    rows = np.arange(top_corr.shape[0])[:, np.newaxis]
    columns = top_ndarray.argsort()[:, :-top_k]
    top_ndarray[rows, columns] = np.nan
    top_corr

which gives:

    userId       316       320       359       370       910
    userId
    316          NaN  0.202133  0.208618       NaN       NaN
    320     0.202133       NaN  0.242837       NaN       NaN
    359          NaN  0.242837       NaN  0.357620       NaN
    370          NaN       NaN  0.357620       NaN  0.317371
    910          NaN       NaN  0.175914  0.317371       NaN

You can replace top_corr = corr_of_user.copy() with top_corr = corr_of_user if you don't need a copy but want the matrix modified in place.

The idea is pretty much the same as John Zwinck's: get the indexes of the unwanted fields and use them to index into the array and clear the values we don't need. A small advantage of this solution is that K (the number of top results we want) is a parameter rather than hardcoded. It also works when corr_of_user contains all 1s.
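A sketch wrapping this idea in a reusable helper (keep_top_k is a name I'm introducing, not from the answer). It works on an explicit copy via to_numpy(copy=True), so it doesn't rely on .values returning a view, and it takes k as the number of off-diagonal values to keep; it assumes the matrix has no NaNs other than the diagonal it blanks itself:

```python
import numpy as np
import pandas as pd

def keep_top_k(corr, k=2):
    """Keep the k largest off-diagonal values in each row; NaN out the rest."""
    a = corr.to_numpy(dtype=float, copy=True)
    np.fill_diagonal(a, np.nan)
    rows = np.arange(a.shape[0])[:, np.newaxis]
    # NaNs sort last in argsort, so the diagonal occupies one of the last
    # k+1 slots; blanking everything before them keeps exactly k real values.
    cols = a.argsort()[:, :-(k + 1)]
    a[rows, cols] = np.nan
    return pd.DataFrame(a, index=corr.index, columns=corr.columns)

# Demo on the example matrix from the question.
users = [316, 320, 359, 370, 910]
corr_of_user = pd.DataFrame(
    [[1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
     [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
     [0.174035, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=pd.Index(users, name="userId"),
)
result = keep_top_k(corr_of_user, k=2)
print(result)
```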


Source: https://habr.com/ru/post/1273576/