How to check the correlation between comparable columns of two datasets?

Question

How to check the correlation between comparable columns of two datasets?

Suppose we have two datasets:

    import pandas as pd

    a = pd.DataFrame({"A": [34, 12, 78, 84, 26],
                      "B": [54, 87, 35, 25, 82],
                      "C": [56, 78, 0, 14, 13],
                      "D": [0, 23, 72, 56, 14],
                      "E": [78, 12, 31, 0, 34]})

    b = pd.DataFrame({"A": [45, 24, 65, 65, 65],
                      "B": [45, 87, 65, 52, 12],
                      "C": [98, 52, 32, 32, 12],
                      "D": [0, 23, 1, 365, 53],
                      "E": [24, 12, 65, 3, 65]})

How can I create a correlation matrix in which the y axis represents "a" and the x axis represents "b"?

The goal is to see the correlations between the corresponding columns of the two datasets.

+2

python numpy pandas correlation

ishido Dec 6 '16 at 21:09


4 answers

If you don't mind a NumPy-based vectorized solution, here is one based on this solution post to Computing the correlation coefficient between two multi-dimensional arrays -

 corr2_coeff(a.values.T,b.values.T).T # func from linked solution post.

Sample run -

    In [621]: a
    Out[621]:
        A   B   C   D   E
    0  34  54  56   0  78
    1  12  87  78  23  12
    2  78  35   0  72  31
    3  84  25  14  56   0
    4  26  82  13  14  34

    In [622]: b
    Out[622]:
        A   B   C    D   E
    0  45  45  98    0  24
    1  24  87  52   23  12
    2  65  65  32    1  65
    3  65  52  32  365   3
    4  65  12  12   53  65

    In [623]: corr2_coeff(a.values.T,b.values.T).T
    Out[623]:
    array([[ 0.71318502, -0.5923714 , -0.9704441 ,  0.48775228, -0.07401011],
           [ 0.0306753 , -0.0705457 ,  0.48801177,  0.34685977, -0.33942737],
           [-0.26626431, -0.01983468,  0.66110713, -0.50872017,  0.68350413],
           [ 0.58095645, -0.55231196, -0.32053858,  0.38416478, -0.62403866],
           [ 0.01652716,  0.14000468, -0.58238879,  0.12936016,  0.28602349]])
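Note that `corr2_coeff` is defined in the linked post, not in NumPy itself. A minimal sketch of such a helper, assuming it computes the Pearson correlation between every row of its first argument and every row of its second (the body below is a reconstruction under that assumption, not a verbatim copy of the linked code):

```python
import numpy as np
import pandas as pd


def corr2_coeff(A, B):
    """Pearson correlation between every row of A and every row of B."""
    # Center each row by subtracting its mean.
    A_c = A - A.mean(axis=1, keepdims=True)
    B_c = B - B.mean(axis=1, keepdims=True)
    # Row-wise sums of squared deviations.
    ss_a = (A_c ** 2).sum(axis=1)
    ss_b = (B_c ** 2).sum(axis=1)
    # Cross products divided by the product of the row norms.
    return A_c @ B_c.T / np.sqrt(np.outer(ss_a, ss_b))


a = pd.DataFrame({"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 25, 82],
                  "C": [56, 78, 0, 14, 13], "D": [0, 23, 72, 56, 14],
                  "E": [78, 12, 31, 0, 34]})
b = pd.DataFrame({"A": [45, 24, 65, 65, 65], "B": [45, 87, 65, 52, 12],
                  "C": [98, 52, 32, 32, 12], "D": [0, 23, 1, 365, 53],
                  "E": [24, 12, 65, 3, 65]})

# Columns become rows after .T, so each column of a is paired with each column of b.
result = corr2_coeff(a.values.T, b.values.T).T
print(np.round(result, 6))
```

With the trailing `.T`, entry `[i, j]` is the correlation between column `j` of `a` and column `i` of `b`, which is the orientation shown in the sample run.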

+3

Divakar Dec 6 '16 at 21:47


Do you need to use Pandas? It seems that this can be done through numpy quite easily. Did I misunderstand the task?

    import numpy

    X = {"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 25, 82],
         "C": [56, 78, 0, 14, 13], "D": [0, 23, 72, 56, 14],
         "E": [78, 12, 31, 0, 34]}
    Y = {"A": [45, 24, 65, 65, 65], "B": [45, 87, 65, 52, 12],
         "C": [98, 52, 32, 32, 12], "D": [0, 23, 1, 365, 53],
         "E": [24, 12, 65, 3, 65]}

    # numpy.corrcoef returns a 2x2 matrix; the off-diagonal entries hold the coefficient
    for key, value in X.items():
        print("correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key])))
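The loop above only pairs columns that share a name. If the full matrix across all column pairs is wanted, `numpy.corrcoef` can also take both sets of variables at once. A minimal sketch, assuming the goal is rows for the columns of `a` and columns for the columns of `b` (the names `full` and `cross` are illustrative only):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 25, 82],
                  "C": [56, 78, 0, 14, 13], "D": [0, 23, 72, 56, 14],
                  "E": [78, 12, 31, 0, 34]})
b = pd.DataFrame({"A": [45, 24, 65, 65, 65], "B": [45, 87, 65, 52, 12],
                  "C": [98, 52, 32, 32, 12], "D": [0, 23, 1, 365, 53],
                  "E": [24, 12, 65, 3, 65]})

n = a.shape[1]

# np.corrcoef treats each row as one variable, so pass the transposed values.
# The result is a (2n x 2n) matrix covering a's columns followed by b's columns.
full = np.corrcoef(a.values.T, b.values.T)

# The top-right block holds the correlations of a's columns with b's columns.
cross = pd.DataFrame(full[:n, n:], index=a.columns, columns=b.columns)
print(cross.round(6))
```

Here `cross.loc["B", "A"]` is the correlation between `a["B"]` and `b["A"]`. The other blocks of `full` (correlations within `a` and within `b`) are computed as well and simply discarded, which is fine for small frames.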

0

SeedofWInd Dec 6 '16 at 21:21


I use this function, which breaks the computation down with numpy:

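    import numpy as np
    import pandas as pd

    def corr_ab(a, b):
        a_ = a.values
        b_ = b.values
        ab = a_.T.dot(b_)                              # sums of cross products
        n = len(a)
        sums_squared = np.outer(a_.sum(0), b_.sum(0))  # products of column sums
        stds_squared = np.outer(a_.std(0), b_.std(0))  # products of population std devs
        # covariance / (std_a * std_b); rows are a's columns, columns are b's columns
        return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                            a.columns, b.columns)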

demo

 corr_ab(a, b)

0

piRSquared Dec 7 '16 at 0:23


blacksite · Accepted Answer · 2016-12-06T21:19:25+0000

This achieves exactly what you want:

    from scipy.stats import pearsonr

    # create a new DataFrame where the values for the indices and columns
    # align on the diagonals
    c = pd.DataFrame(columns = a.columns, index = a.columns)

    # since we know set(a.columns) == set(b.columns), we can just iterate
    # through the columns in a (although a more robust way would be to iterate
    # through the intersection of the two sets of columns, in the case your
    # actual dataframes' columns don't match up)
    for col in a.columns:
        correl_signif = pearsonr(a[col], b[col])  # correlation of those two Series
        correl = correl_signif[0]  # grab the actual Pearson R value from the tuple from above
        c.loc[col, col] = correl  # locate the diagonal for that column and assign the correlation coefficient

Edit: Well, it achieved exactly what you wanted, until the question was modified. This can easily be adapted, though:

    c = pd.DataFrame(columns = a.columns, index = a.columns)

    for col in c.columns:
        for idx in c.index:
            correl_signif = pearsonr(a[col], b[idx])
            correl = correl_signif[0]
            c.loc[idx, col] = correl

c is now this:

    Out[16]:
               A          B         C         D          E
    A   0.713185  -0.592371 -0.970444  0.487752 -0.0740101
    B  0.0306753 -0.0705457  0.488012   0.34686  -0.339427
    C  -0.266264 -0.0198347  0.661107  -0.50872   0.683504
    D   0.580956  -0.552312 -0.320539  0.384165  -0.624039
    E  0.0165272   0.140005 -0.582389   0.12936   0.286023
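For the original, column-by-column version of the question (only the diagonal of the table above), pandas also offers `DataFrame.corrwith`, which pairs columns by label. A short sketch, assuming plain Pearson correlation is what is needed and that no p-values are required (the name `diag` is illustrative only):

```python
import pandas as pd

a = pd.DataFrame({"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 25, 82],
                  "C": [56, 78, 0, 14, 13], "D": [0, 23, 72, 56, 14],
                  "E": [78, 12, 31, 0, 34]})
b = pd.DataFrame({"A": [45, 24, 65, 65, 65], "B": [45, 87, 65, 52, 12],
                  "C": [98, 52, 32, 32, 12], "D": [0, 23, 1, 365, 53],
                  "E": [24, 12, 65, 3, 65]})

# Correlate each column of a with the same-named column of b.
# The result is a Series indexed by column label.
diag = a.corrwith(b)
print(diag.round(6))
```

Unlike `scipy.stats.pearsonr`, this returns only the coefficients, so the explicit loop is still the way to go when the significance values are needed too.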
