I cannot find a function that computes the matrix of correlation coefficients for arrays containing observations of more than two variables when the data contain NaNs. There are functions that do this for pairs of variables (or you can simply mask the arrays with ~np.isnan()). But looping over a large number of variables with these functions and computing the correlation for each pair can be very time-consuming.
Therefore, I tried it myself and soon realized that the difficulty lies in normalizing the covariance correctly. I would be very interested in your opinions on how to do this.
Here is the code:
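
    import numpy as np

    def nancorr(X, nanfact=False):
        # Center each row (variable) with its NaN-aware mean
        X = X - np.nanmean(X, axis=1, keepdims=True)
        if nanfact:
            # fact[i, j] = (# observations where i and j are not both NaN) - 1
            mask = np.isnan(X).astype(int)
            fact = X.shape[1] - np.dot(mask, mask.T) - 1
        # NaN entries contribute nothing to the sums of products
        X[np.isnan(X)] = 0
        if nanfact:
            cov = np.dot(X, X.T) / fact
        else:
            cov = np.dot(X, X.T)
        d = np.diag(cov)
        return cov / np.sqrt(np.multiply.outer(d, d))
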
The function assumes that each row is a variable. It is basically a modified version of the code in numpy's corrcoef(). I believe there are three ways to do this:
(1) For each pair of variables, use only those observations for which neither variable is NaN. This is perhaps the most accurate approach, but also the most difficult to program if you want to do the calculation for more than one pair at a time, and it is not covered by the above code. But why throw away information about the mean and variance of each variable only because the corresponding entry of the other variable is NaN? Hence the two other options.
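To make option (1) concrete, here is a minimal loop-based sketch of what I mean. The name nancorr_pairwise_loop is only for illustration, and leaving pairs with fewer than two common observations as NaN is my own assumption. This is the slow baseline I would like to avoid, not a proposed solution.

```python
import numpy as np

def nancorr_pairwise_loop(X):
    """Option (1) with explicit loops; each row of X is a variable."""
    k = X.shape[0]
    R = np.full((k, k), np.nan)
    valid = ~np.isnan(X)
    for i in range(k):
        for j in range(i, k):
            # Keep only observations where neither variable is NaN
            both = valid[i] & valid[j]
            if both.sum() < 2:
                continue  # too few common observations; leave NaN
            xi = X[i, both] - X[i, both].mean()
            xj = X[j, both] - X[j, both].mean()
            denom = np.sqrt(np.dot(xi, xi) * np.dot(xj, xj))
            if denom > 0:
                R[i, j] = R[j, i] = np.dot(xi, xj) / denom
    return R
```

Without NaNs this should agree with np.corrcoef up to rounding, since the normalization factors cancel for every pair.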
(2) The mean of each variable is estimated with nanmean, and the variance of each variable is its nanvar. For the covariance, every observation where one variable or the other is NaN, but not both, contributes nothing to the covariance and is therefore set to zero. The covariance is then normalized by 1/n, where n is the number of observations for which not both variables are NaN, minus 1. The two variances in the denominator of the correlation coefficient are normalized by 1/n1 and 1/n2, where n1 and n2 are the numbers of non-NaN observations of the respective variable, minus 1. This is achieved by setting nanfact=True in the above function.
(3) It would be desirable for the covariance and the variances to share the same normalization factor, as they do in the correlation coefficient without NaNs, where the factor cancels. The only meaningful way to do this here (if option (1) is not feasible) is to simply drop the factor (1/n) / sqrt(1/(n1*n2)). Since this factor is never greater than one, the estimated correlation coefficients will be at least as large (in absolute value) as in (2), but will remain between -1 and 1. This is achieved by setting nanfact=False.
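A small numerical sketch of how (2) and (3) relate for a single pair of variables. The data are made up for illustration, and n, n1, n2 are defined as above (counts minus 1). The two estimates differ exactly by the factor sqrt(n1*n2)/n.

```python
import numpy as np

# Two variables (rows) with NaNs in different positions; values are made up
X = np.array([[1.0, 2.0, np.nan, 4.0, 7.0, np.nan],
              [2.0, np.nan, 1.0, 3.0, 9.0, np.nan]])

valid = ~np.isnan(X)
Xc = X - np.nanmean(X, axis=1, keepdims=True)  # center with nanmean
Xc[~valid] = 0.0                               # NaN entries contribute nothing

n = (valid[0] | valid[1]).sum() - 1   # not both NaN, minus 1 (here 4)
n1 = valid[0].sum() - 1               # non-NaN in variable 1, minus 1 (here 3)
n2 = valid[1].sum() - 1               # non-NaN in variable 2, minus 1 (here 3)

s12 = np.dot(Xc[0], Xc[1])
s11 = np.dot(Xc[0], Xc[0])
s22 = np.dot(Xc[1], Xc[1])

r3 = s12 / np.sqrt(s11 * s22)                       # option (3), nanfact=False
r2 = (s12 / n) / np.sqrt((s11 / n1) * (s22 / n2))   # option (2), nanfact=True
factor = np.sqrt(n1 * n2) / n                       # here 0.75

print(r2, r3, factor)
```

Because n counts every observation that n1 or n2 counts, the factor cannot exceed one, so (2) always shrinks the estimate of (3) toward zero.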
I would be very interested in your opinion on approaches (2) and (3), and especially, I would very much like to see solution (1) without using loops.