Returning groups of correlated columns in a pandas DataFrame

Question


I computed the correlation matrix of a pandas DataFrame:

    import pandas as pd

    df = pd.DataFrame({
        'one':   [0.1, .32, .2, 0.4, 0.8],
        'two':   [.23, .18, .56, .61, .12],
        'three': [.9, .3, .6, .5, .3],
        'four':  [.34, .75, .91, .19, .21],
        'zive':  [0.1, .32, .2, 0.4, 0.8],
        'six':   [.9, .3, .6, .5, .3],
        'drive': [.9, .3, .6, .5, .3]})

    corrMatrix = df.corr()
    corrMatrix

           drive  four   one   six  three   two  zive
    drive   1.00 -0.04 -0.75  1.00   1.00  0.24 -0.75
    four   -0.04  1.00 -0.49 -0.04  -0.04  0.16 -0.49
    one    -0.75 -0.49  1.00 -0.75  -0.75 -0.35  1.00
    six     1.00 -0.04 -0.75  1.00   1.00  0.24 -0.75
    three   1.00 -0.04 -0.75  1.00   1.00  0.24 -0.75
    two     0.24  0.16 -0.35  0.24   0.24  1.00 -0.35
    zive   -0.75 -0.49  1.00 -0.75  -0.75 -0.35  1.00

Now I want to write some code to return columns that are perfectly correlated (i.e. correlation == 1) in groups.

Ideally, I would like: [['zive', 'one'], ['three', 'six', 'drive']]

I wrote the code below, which gives me ['drive', 'one', 'six', 'three', 'zive']. As you can see, this is just a flat bag of columns that each have a perfect correlation with some other column; it does not put them into distinct groups together with the columns they are perfectly correlated with.

    correlatedCols = []
    for col in corrMatrix:
        data = corrMatrix[col][corrMatrix[col] == 1]
        if len(data) > 1:
            correlatedCols.append(data.name)

    correlatedCols
    ['drive', 'one', 'six', 'three', 'zive']

EDIT: Using @Karl D.'s recommendations, I get the following:

    cor = df.corr()
    cor.loc[:,:] = np.tril(cor.values, k=-1)
    cor = cor.stack()
    cor[cor == 1]

    six    drive    1.00
    three  drive    1.00
           six      1.00
    zive   one      1.00

This is not exactly what I want: ['six', 'drive'] is not a complete group, because 'three' is missing from it.

+6

python numpy pandas

Bryan Jun 2 '14 at 20:35


2 answers

You can do something like the following:

    >>> cor = df.corr()
    >>> cor.loc[:,:] = np.tril(cor, k=-1)
    >>> cor = cor.stack()
    >>> cor[cor > 0.9999]
    three  six    1
    zive   one    1

To more accurately match the expected result, you can do something like the following:

    >>> cor[cor > 0.9999].to_dict().keys()
    [('zive', 'one'), ('three', 'six')]

Explanation. First, I create a lower triangular version of the correlation matrix that excludes the diagonal (using numpy's tril):

    >>> cor.loc[:,:] = np.tril(cor.values, k=-1)
               four       one       six     three       two  zive
    four   0.000000 -0.000000 -0.000000 -0.000000  0.000000    -0
    one   -0.489177  0.000000 -0.000000 -0.000000 -0.000000     0
    six   -0.039607 -0.747365  0.000000  0.000000  0.000000    -0
    three -0.039607 -0.747365  1.000000  0.000000  0.000000    -0
    two    0.159583 -0.351531  0.238102  0.238102  0.000000    -0
    zive  -0.489177  1.000000 -0.747365 -0.747365 -0.351531     0
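To make the effect of `k=-1` concrete, here is a minimal sketch on a small made-up symmetric matrix (the values are illustrative and not taken from the question's data). It only demonstrates `np.tril` itself:

```python
import numpy as np

# A small symmetric matrix standing in for a correlation matrix
# (made-up values, for illustration only).
m = np.array([
    [1.0, 0.5, 1.0],
    [0.5, 1.0, 0.5],
    [1.0, 0.5, 1.0],
])

# k=-1 keeps only the entries strictly below the main diagonal;
# the diagonal and everything above it are set to 0.
lower = np.tril(m, k=-1)
print(lower)
```

Because a correlation matrix is symmetric, dropping the upper triangle loses no information, and dropping the diagonal removes each column's trivial correlation with itself.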

Then I stack the DataFrame:

    >>> cor = cor.stack()
    four   four     0.000000
           one     -0.000000
           six     -0.000000
           three   -0.000000
           two      0.000000
           zive    -0.000000
    one    four    -0.489177
           one      0.000000
           six     -0.000000
           three   -0.000000
           two     -0.000000
           zive     0.000000
    six    four    -0.039607
           one     -0.747365
           six      0.000000
           three    0.000000
           two      0.000000
           zive    -0.000000
    three  four    -0.039607
           one     -0.747365
           six      1.000000
           three    0.000000
           two      0.000000
           zive    -0.000000
    two    four     0.159583
           one     -0.351531
           six      0.238102
           three    0.238102
           two      0.000000
           zive    -0.000000
    zive   four    -0.489177
           one      1.000000
           six     -0.747365
           three   -0.747365
           two     -0.351531
           zive     0.000000

And then I can just grab the rows equal to one.
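The three steps (lower triangle, stack, filter) can be sketched end to end as a standalone script. This is a sketch rather than the exact session above: it uses the seven-column frame from the question, builds the lower-triangular frame explicitly instead of assigning through `.loc`, and compares against 1 with `np.isclose` instead of a hand-picked threshold. The names `lower`, `stacked` and `pairs` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one':   [0.1, .32, .2, 0.4, 0.8],
    'two':   [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
    'four':  [.34, .75, .91, .19, .21],
    'zive':  [0.1, .32, .2, 0.4, 0.8],
    'six':   [.9, .3, .6, .5, .3],
    'drive': [.9, .3, .6, .5, .3],
})

cor = df.corr()

# Step 1: zero out the diagonal and the upper triangle so that
# each pair of columns is represented exactly once.
lower = pd.DataFrame(np.tril(cor.values, k=-1),
                     index=cor.index, columns=cor.columns)

# Step 2: stack into a Series indexed by (row label, column label).
stacked = lower.stack()

# Step 3: keep the pairs whose correlation is 1 up to floating-point error.
pairs = stacked[np.isclose(stacked, 1.0)].index.tolist()
print(pairs)
```

Note that the result is a list of pairs, not groups: three mutually correlated columns show up as three separate pairs, which is the limitation raised in the question's edit.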

Edit: I think this will get the form you want, but it is not elegant:

    >>> from itertools import chain
    >>> cor.loc[:,:] = np.tril(cor, k=-1)
    >>> cor = cor.stack()
    >>> ones = cor[cor > 0.999].reset_index().loc[:,['level_0','level_1']]
    >>> ones = ones.query('level_0 not in level_1')
    >>> ones.groupby('level_0').agg(lambda x: set(chain(x.level_0,x.level_1))).values
    [[set(['six', 'drive', 'three'])]
     [set(['zive', 'one'])]]

+9

Karl D. Jun 2 '14 at 21:11


Akavall · Accepted Answer · 2014-06-02T21:46:48+0000

Here is a naive approach:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'one':   [0.1, .32, .2, 0.4, 0.8],
        'two':   [.23, .18, .56, .61, .12],
        'three': [.9, .3, .6, .5, .3],
        'four':  [.34, .75, .91, .19, .21],
        'zive':  [0.1, .32, .2, 0.4, 0.8],
        'six':   [.9, .3, .6, .5, .3],
        'drive': [.9, .3, .6, .5, .3]})

    corrMatrix = df.corr()
    corrMatrix.loc[:,:] = np.tril(corrMatrix, k=-1)  # borrowed from Karl D's answer

    already_in = set()  # columns already placed in some group
    result = []
    for col in corrMatrix:
        # rows below the diagonal that are perfectly correlated with col
        perfect_corr = corrMatrix[col][corrMatrix[col] == 1].index.tolist()
        if perfect_corr and col not in already_in:
            already_in.update(set(perfect_corr))
            perfect_corr.append(col)
            result.append(perfect_corr)

Result:

    >>> result
    [['six', 'three', 'drive'], ['zive', 'one']]
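For comparison, here is a sketch of a different technique, not the one used in either answer: treat every near-perfect correlation as an edge between two columns and collect the connected components with a small union-find. The helper name `perfectly_correlated_groups` and the `tol` parameter are hypothetical, introduced only for this illustration; the tolerance replaces the exact `== 1` comparison, which can be fragile with floating-point results.

```python
import pandas as pd

def perfectly_correlated_groups(df, tol=1e-8):
    """Group columns whose pairwise correlation is within tol of 1.

    Hypothetical helper for illustration; not part of pandas.
    """
    cor = df.corr()
    cols = list(cor.columns)
    parent = {c: c for c in cols}  # union-find forest, one node per column

    def find(c):
        # Follow parent links up to the representative of c's group.
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    # Merge the groups of every pair of columns with correlation ~ 1.
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if abs(cor.loc[a, b] - 1.0) <= tol:
                parent[find(a)] = find(b)

    # Collect columns by representative and drop singleton groups.
    groups = {}
    for c in cols:
        groups.setdefault(find(c), []).append(c)
    return [g for g in groups.values() if len(g) > 1]

df = pd.DataFrame({
    'one':   [0.1, .32, .2, 0.4, 0.8],
    'two':   [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
    'four':  [.34, .75, .91, .19, .21],
    'zive':  [0.1, .32, .2, 0.4, 0.8],
    'six':   [.9, .3, .6, .5, .3],
    'drive': [.9, .3, .6, .5, .3],
})

groups = perfectly_correlated_groups(df)
print(groups)
```

On the question's data this should yield two groups, one containing 'one' and 'zive' and one containing 'three', 'six' and 'drive'. The nested loop is quadratic in the number of columns, which is fine for a handful of columns but may be slow for very wide frames.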
