A simpler method was updated at the end of the question.
What i have
I have a user user correlation matrix called matrixcorr_of_user , as shown below:
userId 316 320 359 370 910 userId 316 1.000000 0.202133 0.208618 0.176050 0.174035 320 0.202133 1.000000 0.242837 0.019035 0.031737 359 0.208618 0.242837 1.000000 0.357620 0.175914 370 0.176050 0.019035 0.357620 1.000000 0.317371 910 0.174035 0.031737 0.175914 0.317371 1.000000
What I want
For each user, I just want the 2 other users to be most similar to him (the highest correlation values ββfor each row after excluding diagonal elements). For instance:
Out[40]: userId 316 320 359 370 910 corr_user 316 NaN 0.202133 0.208618 NaN NaN 320 0.202133 NaN 0.242837 NaN NaN 359 NaN 0.242837 NaN 0.357620 NaN 370 NaN NaN 0.357620 NaN 0.317371 910 NaN NaN 0.175914 0.317371 NaN
I know how to achieve this, but the way I came up with is too complicated. Can anyone suggest a better idea?
What i tried
First melt matrix:
melted_corr = corr_of_user.reset_index().melt(id_vars ="userId",var_name="corr_user") melted_corr.head() Out[23]: userId corr_user value 0 316 316 1.000000 1 320 316 0.202133 2 359 316 0.208618 3 370 316 0.176050 4 910 316 0.174035
filter is line by line:
get_secend_third = lambda x : x.sort_values(ascending =False).iloc[1:3] filted= melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_secend_third) filted Out[39]: corr_user userId 316 359 0.208618 320 0.202133 320 359 0.242837 316 0.202133 359 370 0.357620 320 0.242837 370 359 0.357620 910 0.317371 910 370 0.317371 359 0.175914
and finally reshape it:
filted.reset_index().pivot_table("value","corr_user","userId") Out[40]: userId 316 320 359 370 910 corr_user 316 NaN 0.202133 0.208618 NaN NaN 320 0.202133 NaN 0.242837 NaN NaN 359 NaN 0.242837 NaN 0.357620 NaN 370 NaN NaN 0.357620 NaN 0.317371 910 NaN NaN 0.175914 0.317371 NaN
Updated:
I came up with an easier way to do this after I saw @John Zwinck's answer
Let's say there is a new df matrix with some duplicated value and NaN
userId 316 320 359 370 910 userId 316 1.0 0.500000 0.500000 0.500000 NaN 320 0.5 1.000000 0.242837 0.019035 0.031737 359 0.5 0.242837 1.000000 0.357620 0.175914 370 0.5 0.019035 0.357620 1.000000 0.317371 910 NaN 0.031737 0.175914 0.317371 1.000000
First I get the rank each line.
rank = df.rank(1, ascending=False, method="first")
Then I use df.isin() to get the required mask.
mask = rank.isin(list(range(2,4)))
Finally
df.where(mask)
Then I want, I want.
userId 316 320 359 370 910 userId 316 NaN 0.5 0.500000 NaN NaN 320 0.5 NaN 0.242837 NaN NaN 359 0.5 NaN NaN 0.357620 NaN 370 0.5 NaN 0.357620 NaN NaN 910 NaN NaN 0.175914 0.317371 NaN