Pandas column comparison with statistical significance

What is the best way, given a pandas DataFrame df, to get the correlation between its columns df['1'] and df['2']?

I do not want the output to count rows with NaN, which the pandas built-in correlation does. But I also want it to output a p-value or a standard error, which the built-in does not.

SciPy seems to get caught up by the NaNs, though I believe it does report significance.

Sample data:

         1    2
    0    2   NaN
    1   NaN   1
    2    1    2
    3   -4    3
    4   1.3   1
    5   NaN  NaN
8 answers

The answer provided by @Shashank is good. However, if you want a solution in pure pandas, you might like the following. (Note that pandas.io.data and pd.ols come from older pandas versions and were removed later; a rough modern equivalent is sketched after the results below.)

    import pandas as pd
    from pandas.io.data import DataReader
    from datetime import datetime
    import scipy.stats as stats

    gdp = pd.DataFrame(DataReader("GDP", "fred", start=datetime(1990, 1, 1)))
    vix = pd.DataFrame(DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1)))

    # Do it with a pandas regression to get the p-value from the F-test
    df = gdp.merge(vix, left_index=True, right_index=True, how='left')
    vix_on_gdp = pd.ols(y=df['VIXCLS'], x=df['GDP'], intercept=True)
    print(df['VIXCLS'].corr(df['GDP']), vix_on_gdp.f_stat['p-value'])

Results:

 -0.0422917932738 0.851762475093 

The same results using the stats functions:

    # Do it with the stats functions
    df_clean = df.dropna()
    stats.pearsonr(df_clean['VIXCLS'], df_clean['GDP'])

Results:

  (-0.042291793273791969, 0.85176247509284908) 
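Since pandas.io.data and pd.ols no longer exist in current pandas, here is a rough modern equivalent of the regression step above, assuming the pandas_datareader and statsmodels packages are installed (a sketch, not part of the original answer):

    import pandas as pd
    import statsmodels.api as sm
    from datetime import datetime
    from pandas_datareader.data import DataReader  # replacement for pandas.io.data

    gdp = DataReader("GDP", "fred", start=datetime(1990, 1, 1))
    vix = DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1))
    df = gdp.merge(vix, left_index=True, right_index=True, how='left')

    # OLS with an intercept; missing='drop' skips NaN rows, as pd.ols did
    model = sm.OLS(df['VIXCLS'], sm.add_constant(df['GDP']), missing='drop').fit()
    print(df['VIXCLS'].corr(df['GDP']), model.f_pvalue)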

To extend this to more variables, here is an ugly loop-based approach:

    import numpy as np  # needed for np.zeros below

    # Add a third series
    oil = pd.DataFrame(DataReader("DCOILWTICO", "fred", start=datetime(1990, 1, 1)))
    df = df.merge(oil, left_index=True, right_index=True, how='left')

    # Construct two matrices: one of correlations, the other of p-values
    rho = df.corr()
    pval = np.zeros([df.shape[1], df.shape[1]])
    for i in range(df.shape[1]):  # loop over every pair of columns
        for j in range(df.shape[1]):
            JonI = pd.ols(y=df.icol(i), x=df.icol(j), intercept=True)
            pval[i, j] = JonI.f_stat['p-value']

Rho results:

                     GDP    VIXCLS  DCOILWTICO
    GDP         1.000000 -0.042292    0.870251
    VIXCLS     -0.042292  1.000000   -0.004612
    DCOILWTICO  0.870251 -0.004612    1.000000

Pval results:

    [[ 0.00000000e+00  8.51762475e-01  1.11022302e-16]
     [ 8.51762475e-01  0.00000000e+00  9.83747425e-01]
     [ 1.11022302e-16  9.83747425e-01  0.00000000e+00]]
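The raw numpy p-value matrix above carries no labels; for readability it can be wrapped in a DataFrame indexed like rho (a small convenience added here, not part of the original answer):

    # label the p-value matrix with the same index/columns as the data
    pval_df = pd.DataFrame(pval, index=df.columns, columns=df.columns)
    print(pval_df.round(4))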

You can use the correlation functions from scipy.stats to get the p-value.

For example, if you are looking for a correlation such as the Pearson correlation, you can use the pearsonr function:

    from scipy.stats import pearsonr
    pearsonr([1, 2, 3], [4, 3, 7])

This produces the output:

 (0.7205766921228921, 0.48775429164459994) 

Where the first value in the tuple is the correlation value, and the second is the p-value.
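The other correlation functions in scipy.stats follow the same (statistic, p-value) pattern, so rank-based alternatives are a drop-in swap; for example:

    from scipy.stats import spearmanr, kendalltau

    # both return (correlation, p-value), just like pearsonr
    print(spearmanr([1, 2, 3], [4, 3, 7]))
    print(kendalltau([1, 2, 3], [4, 3, 7]))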

In your case, you can use pandas' dropna function to remove the NaN values first.

    df_clean = df[['column1', 'column2']].dropna()
    pearsonr(df_clean['column1'], df_clean['column2'])

To calculate all p-values at once, you can use the calculate_pvalues function below:

    df = pd.DataFrame({'A': [1, 2, 3],
                       'B': [2, 5, 3],
                       'C': [5, 2, 1],
                       'D': ['text', 2, 3]})
    calculate_pvalues(df)
  • The output is similar to corr() (but with p-values):

           A       B       C
    A      0  0.7877  0.1789
    B 0.7877       0  0.6088
    C 0.1789  0.6088       0
  • p-values rounded to 4 decimal places

  • Column D is ignored because it contains text.
  • You can also specify exact columns: calculate_pvalues(df[['A', 'B', 'C']])

The function code is as follows:

    from scipy.stats import pearsonr
    import pandas as pd

    def calculate_pvalues(df):
        # drop rows with NaNs and keep only numeric columns
        # (_get_numeric_data is a private pandas helper)
        df = df.dropna()._get_numeric_data()
        dfcols = pd.DataFrame(columns=df.columns)
        pvalues = dfcols.transpose().join(dfcols, how='outer')
        for r in df.columns:
            for c in df.columns:
                pvalues[r][c] = round(pearsonr(df[r], df[c])[1], 4)
        return pvalues
Correlations with asterisks:

    rho = df.corr()
    rho = rho.round(2)
    pval = calculate_pvalues(df)  # toto_tico's answer above
    # create three masks
    r1 = rho.applymap(lambda x: '{}*'.format(x))
    r2 = rho.applymap(lambda x: '{}**'.format(x))
    r3 = rho.applymap(lambda x: '{}***'.format(x))
    # apply them where appropriate
    rho = rho.mask(pval <= 0.1, r1)
    rho = rho.mask(pval <= 0.05, r2)
    rho = rho.mask(pval <= 0.01, r3)
    rho
    # note: I prefer readability over conciseness of code;
    # instead of six lines this could have been a one-liner like:
    # [rho.mask(pval <= p, rho.applymap(lambda x: '{}*'.format(x)), inplace=True) for p in [.1, .05, .01]]


I tried to summarize the logic in a function. It may not be the most efficient approach, but it will give you output similar to pandas' df.corr(). To use it, simply put the following function in your code and call it with your dataframe object, i.e. corr_pvalue(your_dataframe).

I have rounded the values to 4 decimal places; if you need different precision, change the value in the round function.

    from scipy.stats import pearsonr
    import numpy as np
    import pandas as pd

    def corr_pvalue(df):
        numeric_df = df.dropna()._get_numeric_data()
        cols = numeric_df.columns
        mat = numeric_df.values

        arr = np.zeros((len(cols), len(cols)), dtype=object)
        for xi, x in enumerate(mat.T):
            for yi, y in enumerate(mat.T[xi:]):
                # tuple(...) is needed on Python 3, where the original map() is lazy
                arr[xi, yi + xi] = tuple(round(v, 4) for v in pearsonr(x, y))
                arr[yi + xi, xi] = arr[xi, yi + xi]
        return pd.DataFrame(arr, index=cols, columns=cols)

I tested it with pandas v0.18.1.
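A minimal usage sketch (the example frame is borrowed from the other answers here, not from this one):

    # hypothetical usage: each cell holds a (correlation, p-value) tuple
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})
    print(corr_pvalue(df))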


This is very useful code from oztalha. I just changed the formatting (rounding to 2 digits) wherever r was not significant.

    rho = data.corr()
    pval = calculate_pvalues(data)  # toto_tico's answer above
    # create four masks
    r1 = rho.applymap(lambda x: '{:.2f}*'.format(x))
    r2 = rho.applymap(lambda x: '{:.2f}**'.format(x))
    r3 = rho.applymap(lambda x: '{:.2f}***'.format(x))
    r4 = rho.applymap(lambda x: '{:.2f}'.format(x))
    # apply them where appropriate -- this could be a one-liner
    rho = rho.mask(pval > 0.1, r4)
    rho = rho.mask(pval <= 0.1, r1)
    rho = rho.mask(pval <= 0.05, r2)
    rho = rho.mask(pval <= 0.01, r3)
    rho

Great answers from @toto_tico and @Somendra-joshi. However, they drop NA values unnecessarily. In this snippet, I only drop the NAs relevant to the correlation currently being computed. The actual corr implementation does the same.

    from scipy.stats import pearsonr
    import pandas as pd

    def calculate_pvalues(df):
        df = df._get_numeric_data()
        dfcols = pd.DataFrame(columns=df.columns)
        pvalues = dfcols.transpose().join(dfcols, how='outer')
        for r in df.columns:
            for c in df.columns:
                if c == r:
                    df_corr = df[[r]].dropna()
                else:
                    # drop NaNs only in the pair of columns being correlated
                    df_corr = df[[r, c]].dropna()
                pvalues[r][c] = pearsonr(df_corr[r], df_corr[c])[1]
        return pvalues
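A quick sketch of how this behaves on data containing NaNs, using a frame shaped like the one in the question (the column names are assumptions):

    import numpy as np

    df = pd.DataFrame({'1': [2, np.nan, 1, -4, 1.3, np.nan],
                       '2': [np.nan, 1, 2, 3, 1, np.nan]})
    # each pairwise p-value is computed on the rows where both columns are non-NaN
    print(calculate_pvalues(df))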

In pandas v0.24.0, the method argument of corr gained support for callables. Now you can do:

    import pandas as pd
    import numpy as np
    from scipy.stats import pearsonr

    df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})
    df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))
              A         B         C
    A  0.000000  0.787704  0.178912
    B  0.787704  0.000000  0.608792
    C  0.178912  0.608792  0.000000

Pay attention to the workaround with np.eye(len(df.columns)), which is necessary because self-correlations are always set to 1.0 (see https://github.com/pandas-dev/pandas/issues/25726).
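Combining the two calls already shown, you can keep the correlation matrix and the p-value matrix side by side and, for example, flag the significant pairs (a small sketch built only from the API demonstrated above):

    rho = df.corr()
    pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))
    # keep only the correlations whose p-value is below 0.05; the rest become NaN
    print(rho[pval < 0.05])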


Source: https://habr.com/ru/post/974556/
