“The problem is that each variable corresponds to the answer to a survey question, and not every respondent answered every question. Thus, I want some measure of how the answer to question 2, say, affects the probability of answers to question 8, for example.”
This is a missing data problem. I think what confuses people is that you keep referring to your samples as having different lengths. I think you may be visualizing them like this:
sample 1:
question number: [1,2,3,4,5] response : [1,0,1,1,0]
sample 2:
question number: [2,4,5] response : [1,1,0]
when sample 2 should be something like this:
question number: [1,2,3,4,5] response : [NaN,1,NaN,1,0]
What matters is the question number, not the number of questions that were answered. Without knowing which response belongs to which question, it is impossible to calculate something like a covariance matrix.
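As a rough sketch of that realignment, assuming each sample is stored as a pair of lists (question numbers, responses) like in the example above, you could build one NaN-padded array. The names `samples`, `n_questions` and `data` are made up for illustration:

```python
import numpy as np

# Hypothetical raw data: for each sample, the question numbers that were
# answered and the matching responses (values taken from the example above).
samples = [
    ([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]),
    ([2, 4, 5], [1, 1, 0]),
]
n_questions = 5

# One row per sample, one column per question; NaN marks "not answered".
data = np.full((len(samples), n_questions), np.nan)
for row, (questions, responses) in enumerate(samples):
    cols = np.array(questions) - 1  # question numbers are 1-based
    data[row, cols] = responses

print(data)
```

Every row now has the same length, and column k always refers to question k + 1, whether or not it was answered.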
Anyway, the numpy.ma.cov function that ddodev mentioned calculates the covariance by taking advantage of the fact that each term being summed depends on only two values.
So it calculates the terms it can. Then, when it comes to the step of dividing by n, it divides by the number of terms that were actually calculated (for that particular covariance-matrix element), instead of the total number of samples.
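Here is a minimal sketch of how that could look in practice. The response values are made up, and `pair_counts` is only there to show how many samples contribute to each matrix element; it is not something numpy.ma.cov returns:

```python
import numpy as np

# Made-up responses: rows are samples, columns are questions,
# NaN marks a question that was not answered.
data = np.array([
    [1.0, 0.0, 1.0, 1.0],
    [np.nan, 1.0, 0.0, 1.0],
    [0.0, 1.0, np.nan, 0.0],
    [1.0, np.nan, 1.0, 0.0],
    [0.0, 0.0, 0.0, np.nan],
])

# Mask the NaNs so that the numpy.ma routines skip them.
masked = np.ma.masked_invalid(data)

# rowvar=False: each column (question) is treated as a variable.
cov = np.ma.cov(masked, rowvar=False)

# Number of samples in which both questions of a pair were answered.
answered = (~np.isnan(data)).astype(int)
pair_counts = answered.T @ answered

print(cov)
print(pair_counts)
```

Each question here was answered in 4 of the 5 samples, but every pair of different questions was answered together in only 3, so the off-diagonal elements are averaged over fewer terms than the diagonal ones. Elements with very few shared answers are correspondingly less reliable.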