How to efficiently get the correlation matrix (with p-values) of a data frame with NaN values?

I am trying to calculate the correlation matrix and filter the correlations by p-value in order to find strongly correlated pairs.

To explain what I mean, let's say I have a data frame like this:

    df
         A    B    C    D
    0    2  NaN    2   -2
    1  NaN    1    1  1.1
    2    1  NaN  NaN  3.2
    3   -4  NaN    2    2
    4  NaN    1  2.1  NaN
    5  NaN    3    1    1
    6    3  NaN    0  NaN

For the correlation coefficients I used df.corr(). This method can handle a data frame with NaN values and, more importantly, it tolerates a pair of columns with zero overlap (col A and col B), returning NaN for that pair:

    rho = df.corr()

              A    B         C         D
    A  1.000000  NaN -0.609994  0.041204
    B       NaN  1.0 -0.500000 -1.000000
    C -0.609994 -0.5  1.000000  0.988871
    D  0.041204 -1.0  0.988871  1.000000
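To see which pairs have too little data behind them, the number of rows where both columns are non-NaN can be counted directly. A minimal sketch, with the sample frame rebuilt by hand; the names `present` and `overlap` are only illustrative:

```python
import numpy as np
import pandas as pd

# Sample frame from the question, typed in by hand.
df = pd.DataFrame({
    "A": [2, np.nan, 1, -4, np.nan, np.nan, 3],
    "B": [np.nan, 1, np.nan, np.nan, 1, 3, np.nan],
    "C": [2, 1, np.nan, 2, 2.1, 1, 0],
    "D": [-2, 1.1, 3.2, 2, np.nan, 1, np.nan],
})

# 1 where a value is present, 0 where it is NaN.
present = df.notna().astype(int)

# Entry (i, j) counts the rows in which both column i and column j are present.
overlap = present.T @ present
print(overlap)
```

A pair with an overlap of 0 or 1 cannot produce a correlation at all, which is why df.corr() reports NaN for A and B.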

The next task is to calculate the p-values. I did not find a built-in method for this. However, in "pandas columns correlation with statistical significance", @BKay provided a loop that calculates the p-values. That loop raises an error when a pair of columns has fewer than 3 overlapping rows:

    ValueError: zero-size array to reduction operation maximum which has no identity

So I modified it by adding exception handling:

    pval = rho.copy()
    for i in range(df.shape[1]):  # df.shape[1] is the number of columns
        for j in range(df.shape[1]):
            try:
                df_ols = pd.ols(y=df.iloc[:,i], x=df.iloc[:,j], intercept=True)
                pval.iloc[i,j] = df_ols.f_stat['p-value']
            except ValueError:
                pval.iloc[i,j] = None

    pval
              A         B         C         D
    A  0.000000       NaN  0.582343  0.973761
    B       NaN  0.000000  0.666667       NaN
    C  0.582343  0.666667  0.000000  0.011129
    D  0.973761       NaN  0.011129  0.000000

This method produces the p-value matrix, but it becomes extremely slow as the original data frame grows (my real data frame is ~5000 rows by 500 columns). How would you suggest getting this p-value matrix efficiently for a large data frame?

2 answers

This question turned out to have a good solution.


It seems that pandas no longer supports pd.ols, so I changed the code slightly to use statsmodels, which should give the same results:

    # Use statsmodels for OLS
    import matplotlib.pyplot as plt
    import seaborn as sns
    import statsmodels.formula.api as sm

    pval = rho.copy()
    for i in range(df.shape[1]):  # df.shape[1] is the number of columns
        for j in range(df.shape[1]):
            try:
                y = df.columns[i]
                x = df.columns[j]
                df_ols = sm.ols(formula='Q("{}") ~ Q("{}")'.format(y, x), data=df).fit()
                # pvalues holds [intercept, slope]; take the slope by position
                pval.iloc[i, j] = df_ols.pvalues.iloc[1]
            except ValueError:
                pval.iloc[i, j] = None

    pval

    sns.heatmap(pval, center=0, cmap="Blues", annot=True)
    plt.show()
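The nested loop can probably be avoided altogether. With a single regressor, the OLS p-value equals the p-value of the t-test on the Pearson correlation, so the whole matrix can be derived from df.corr() and the pairwise overlap counts. This is a sketch of that swapped-in technique rather than the method used in the answers above. It assumes SciPy is installed, and `corr_pvalues` is a hypothetical helper name. Pairs with fewer than 3 overlapping rows are set to NaN here, mirroring the NaN entries in the question's output.

```python
import numpy as np
import pandas as pd
from scipy import stats


def corr_pvalues(df):
    """Return (rho, pval) computed from pairwise-complete observations."""
    rho = df.corr()                        # pairwise Pearson r, NaN-aware
    present = df.notna().astype(int)
    n = present.T @ present                # overlapping row count per pair
    dof = (n - 2).where(n > 2)             # NaN where fewer than 3 rows overlap

    r = rho.clip(-1, 1)                    # guard against rounding past +/-1
    with np.errstate(divide="ignore", invalid="ignore"):
        # t statistic of a Pearson correlation with dof degrees of freedom
        t = r * np.sqrt(dof / (1.0 - r ** 2))
        # two-sided p-value from the t distribution
        p = 2.0 * stats.t.sf(np.abs(t.to_numpy()), dof.to_numpy())

    return rho, pd.DataFrame(p, index=rho.index, columns=rho.columns)


# Sample frame from the question, typed in by hand.
df = pd.DataFrame({
    "A": [2, np.nan, 1, -4, np.nan, np.nan, 3],
    "B": [np.nan, 1, np.nan, np.nan, 1, 3, np.nan],
    "C": [2, 1, np.nan, 2, 2.1, 1, 0],
    "D": [-2, 1.1, 3.2, 2, np.nan, 1, np.nan],
})

rho, pval = corr_pvalues(df)
print(pval.round(6))
```

Everything here is a whole-matrix operation, so the cost is dominated by df.corr() and one matrix product instead of one regression fit per column pair.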

Source: https://habr.com/ru/post/1207584/

