Pandas column comparison with statistical significance

What is the best way, given a pandas DataFrame df, to get the correlation between its columns df['1'] and df['2']?

I do not want the output to count rows with NaN, which the pandas built-in correlation does. But I also want it to output a p-value or a standard error, which the built-in does not.

SciPy seems to get caught up by the NaNs, though I believe it does report significance.

Sample data:

         1    2
    0    2   NaN
    1   NaN   1
    2    1    2
    3   -4    3
    4   1.3   1
    5   NaN  NaN
8 answers

The answer provided by @Shashank is good. However, if you want a solution in pure pandas, you might like the following. (Note that pandas.io.data and pd.ols come from older pandas versions and were removed later; a rough modern equivalent is sketched after the results below.)

    import pandas as pd
    from pandas.io.data import DataReader
    from datetime import datetime
    import scipy.stats as stats

    gdp = pd.DataFrame(DataReader("GDP", "fred", start=datetime(1990, 1, 1)))
    vix = pd.DataFrame(DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1)))

    # Do it with a pandas regression to get the p-value from the F-test
    df = gdp.merge(vix, left_index=True, right_index=True, how='left')
    vix_on_gdp = pd.ols(y=df['VIXCLS'], x=df['GDP'], intercept=True)
    print(df['VIXCLS'].corr(df['GDP']), vix_on_gdp.f_stat['p-value'])

Results:

 -0.0422917932738 0.851762475093 

The same results using the stats functions:

    # Do it with the stats functions
    df_clean = df.dropna()
    stats.pearsonr(df_clean['VIXCLS'], df_clean['GDP'])

Results:

  (-0.042291793273791969, 0.85176247509284908) 
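Since pandas.io.data and pd.ols no longer exist in current pandas, here is a rough modern equivalent of the regression step above, assuming the pandas_datareader and statsmodels packages are installed (a sketch, not part of the original answer):

    import pandas as pd
    import statsmodels.api as sm
    from datetime import datetime
    from pandas_datareader.data import DataReader  # replacement for pandas.io.data

    gdp = DataReader("GDP", "fred", start=datetime(1990, 1, 1))
    vix = DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1))
    df = gdp.merge(vix, left_index=True, right_index=True, how='left')

    # OLS with an intercept; missing='drop' skips NaN rows, as pd.ols did
    model = sm.OLS(df['VIXCLS'], sm.add_constant(df['GDP']), missing='drop').fit()
    print(df['VIXCLS'].corr(df['GDP']), model.f_pvalue)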

To extend this to more variables, here is an ugly loop-based approach:

    import numpy as np  # needed for np.zeros below

    # Add a third series
    oil = pd.DataFrame(DataReader("DCOILWTICO", "fred", start=datetime(1990, 1, 1)))
    df = df.merge(oil, left_index=True, right_index=True, how='left')

    # Construct two matrices: one of correlations, the other of p-values
    rho = df.corr()
    pval = np.zeros([df.shape[1], df.shape[1]])
    for i in range(df.shape[1]):  # loop over every pair of columns
        for j in range(df.shape[1]):
            JonI = pd.ols(y=df.icol(i), x=df.icol(j), intercept=True)
            pval[i, j] = JonI.f_stat['p-value']

Rho results:

                     GDP    VIXCLS  DCOILWTICO
    GDP         1.000000 -0.042292    0.870251
    VIXCLS     -0.042292  1.000000   -0.004612
    DCOILWTICO  0.870251 -0.004612    1.000000

Pval results:

    [[ 0.00000000e+00  8.51762475e-01  1.11022302e-16]
     [ 8.51762475e-01  0.00000000e+00  9.83747425e-01]
     [ 1.11022302e-16  9.83747425e-01  0.00000000e+00]]
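The raw numpy p-value matrix above carries no labels; for readability it can be wrapped in a DataFrame indexed like rho (a small convenience added here, not part of the original answer):

    # label the p-value matrix with the same index/columns as the data
    pval_df = pd.DataFrame(pval, index=df.columns, columns=df.columns)
    print(pval_df.round(4))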

You can use the correlation functions from scipy.stats to get the p-value.

For example, if you are looking for a correlation such as the Pearson correlation, you can use the pearsonr function:

    from scipy.stats import pearsonr
    pearsonr([1, 2, 3], [4, 3, 7])

This produces the output:

 (0.7205766921228921, 0.48775429164459994) 

Where the first value in the tuple is the correlation value, and the second is the p-value.
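The other correlation functions in scipy.stats follow the same (statistic, p-value) pattern, so rank-based alternatives are a drop-in swap; for example:

    from scipy.stats import spearmanr, kendalltau

    # both return (correlation, p-value), just like pearsonr
    print(spearmanr([1, 2, 3], [4, 3, 7]))
    print(kendalltau([1, 2, 3], [4, 3, 7]))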

In your case, you can use pandas' dropna function to remove the NaN values first.

    df_clean = df[['column1', 'column2']].dropna()
    pearsonr(df_clean['column1'], df_clean['column2'])

To calculate all p-values at once, you can use the calculate_pvalues function below:

    df = pd.DataFrame({'A': [1, 2, 3],
                       'B': [2, 5, 3],
                       'C': [5, 2, 1],
                       'D': ['text', 2, 3]})
    calculate_pvalues(df)
  • The output is similar to corr() (but with p-values):

           A       B       C
    A      0  0.7877  0.1789
    B 0.7877       0  0.6088
    C 0.1789  0.6088       0
  • p-values rounded to 4 decimal places

  • Column D is ignored because it contains text.
  • You can also specify exact columns: calculate_pvalues(df[['A', 'B', 'C']])

The function code is as follows:

    from scipy.stats import pearsonr
    import pandas as pd

    def calculate_pvalues(df):
        # drop rows with NaNs and keep only numeric columns
        # (_get_numeric_data is a private pandas helper)
        df = df.dropna()._get_numeric_data()
        dfcols = pd.DataFrame(columns=df.columns)
        pvalues = dfcols.transpose().join(dfcols, how='outer')
        for r in df.columns:
            for c in df.columns:
                pvalues[r][c] = round(pearsonr(df[r], df[c])[1], 4)
        return pvalues
Correlations with asterisks:

    rho = df.corr()
    rho = rho.round(2)
    pval = calculate_pvalues(df)  # toto_tico's answer above
    # create three masks
    r1 = rho.applymap(lambda x: '{}*'.format(x))
    r2 = rho.applymap(lambda x: '{}**'.format(x))
    r3 = rho.applymap(lambda x: '{}***'.format(x))
    # apply them where appropriate
    rho = rho.mask(pval <= 0.1, r1)
    rho = rho.mask(pval <= 0.05, r2)
    rho = rho.mask(pval <= 0.01, r3)
    rho
    # note: I prefer readability over conciseness of code;
    # instead of six lines this could have been a one-liner like:
    # [rho.mask(pval <= p, rho.applymap(lambda x: '{}*'.format(x)), inplace=True) for p in [.1, .05, .01]]


I tried to summarize the logic in a function. It may not be the most efficient approach, but it will give you output similar to pandas' df.corr(). To use it, simply put the following function in your code and call it with your dataframe object, i.e. corr_pvalue(your_dataframe).

I have rounded the values to 4 decimal places; if you need different precision, change the value in the round function.

    from scipy.stats import pearsonr
    import numpy as np
    import pandas as pd

    def corr_pvalue(df):
        numeric_df = df.dropna()._get_numeric_data()
        cols = numeric_df.columns
        mat = numeric_df.values

        arr = np.zeros((len(cols), len(cols)), dtype=object)
        for xi, x in enumerate(mat.T):
            for yi, y in enumerate(mat.T[xi:]):
                # tuple(...) is needed on Python 3, where the original map() is lazy
                arr[xi, yi + xi] = tuple(round(v, 4) for v in pearsonr(x, y))
                arr[yi + xi, xi] = arr[xi, yi + xi]
        return pd.DataFrame(arr, index=cols, columns=cols)

I tested it with pandas v0.18.1.
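A minimal usage sketch (the example frame is borrowed from the other answers here, not from this one):

    # hypothetical usage: each cell holds a (correlation, p-value) tuple
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})
    print(corr_pvalue(df))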


This is very useful code from oztalha. I just changed the formatting (rounding to 2 digits) wherever r was not significant.

    rho = data.corr()
    pval = calculate_pvalues(data)  # toto_tico's answer above
    # create four masks
    r1 = rho.applymap(lambda x: '{:.2f}*'.format(x))
    r2 = rho.applymap(lambda x: '{:.2f}**'.format(x))
    r3 = rho.applymap(lambda x: '{:.2f}***'.format(x))
    r4 = rho.applymap(lambda x: '{:.2f}'.format(x))
    # apply them where appropriate -- this could be a one-liner
    rho = rho.mask(pval > 0.1, r4)
    rho = rho.mask(pval <= 0.1, r1)
    rho = rho.mask(pval <= 0.05, r2)
    rho = rho.mask(pval <= 0.01, r3)
    rho

Great answers from @toto_tico and @Somendra-joshi. However, they drop NA values unnecessarily. In this snippet, I only drop the NAs relevant to the correlation currently being computed. The actual corr implementation does the same.

    from scipy.stats import pearsonr
    import pandas as pd

    def calculate_pvalues(df):
        df = df._get_numeric_data()
        dfcols = pd.DataFrame(columns=df.columns)
        pvalues = dfcols.transpose().join(dfcols, how='outer')
        for r in df.columns:
            for c in df.columns:
                if c == r:
                    df_corr = df[[r]].dropna()
                else:
                    # drop NaNs only in the pair of columns being correlated
                    df_corr = df[[r, c]].dropna()
                pvalues[r][c] = pearsonr(df_corr[r], df_corr[c])[1]
        return pvalues
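A quick sketch of how this behaves on data containing NaNs, using a frame shaped like the one in the question (the column names are assumptions):

    import numpy as np

    df = pd.DataFrame({'1': [2, np.nan, 1, -4, 1.3, np.nan],
                       '2': [np.nan, 1, 2, 3, 1, np.nan]})
    # each pairwise p-value is computed on the rows where both columns are non-NaN
    print(calculate_pvalues(df))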

In pandas v0.24.0, the method argument of corr gained support for callables. Now you can do:

    import pandas as pd
    import numpy as np
    from scipy.stats import pearsonr

    df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})
    df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))
              A         B         C
    A  0.000000  0.787704  0.178912
    B  0.787704  0.000000  0.608792
    C  0.178912  0.608792  0.000000

Pay attention to the workaround with np.eye(len(df.columns)), which is necessary because self-correlations are always set to 1.0 (see https://github.com/pandas-dev/pandas/issues/25726).
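Combining the two calls already shown, you can keep the correlation matrix and the p-value matrix side by side and, for example, flag the significant pairs (a small sketch built only from the API demonstrated above):

    rho = df.corr()
    pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))
    # keep only the correlations whose p-value is below 0.05; the rest become NaN
    print(rho[pval < 0.05])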


Source: https://habr.com/ru/post/974556/
