Approximation of covariance for arrays of various sizes

Are there any common tools in NumPy / SciPy for calculating a correlation measure that works even when the input variables have different numbers of observations? In the standard formulation of covariance and correlation, every variable must have the same number of observations: you typically pass a matrix in which each row is a different variable and each column is a separate observation.
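For concreteness, a minimal sketch of that fixed-size case with np.cov (the numbers are made up):

    import numpy as np

    # Standard layout: each row is a variable, each column an observation,
    # and every variable has the same number of observations.
    data = np.array([
        [1.0, 2.0, 3.0, 4.0],   # variable 1
        [2.1, 3.9, 6.2, 8.1],   # variable 2
    ])
    print(np.cov(data))         # 2 x 2 covariance matrix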

In my case, I have 9 different variables, but the number of observations is not the same for every variable; some variables have more observations than others. I know that fields such as sensor fusion study this kind of problem, so what standard tools exist (preferably in Python) for computing relationship statistics between data series of different lengths?

3 answers

From a purely mathematical point of view, I believe the vectors have to be the same length. To make them the same length, you can apply concepts from the missing-data literature. I suppose what I'm saying is that it isn't really covariance if the vectors aren't the same size. Whatever tool you use, you will have to fill in the missing points in some sensible way so that the vectors end up with equal lengths.
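As a minimal sketch of that idea (mean imputation is just one crude choice, used purely for illustration, and the arrays are made up):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y_short = np.array([2.0, 2.5, 3.5])

    # Pad the shorter series with its own mean so both vectors have the
    # same length; the missing-data literature offers better strategies.
    y = np.concatenate([y_short,
                        np.full(len(x) - len(y_short), y_short.mean())])
    print(np.cov(x, y))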


“The problem is that each variable corresponds to an answer on a survey, and not every respondent answered every question. So I want some measure of how the answer to question 2, say, affects the likelihood of the answers to question 8, for example.”

This is a missing-data problem. I think it confuses people that you keep referring to your samples as having different lengths. You can picture them like this:

sample 1:

    question number: [1, 2, 3, 4, 5]    response: [1, 0, 1, 1, 0]

sample 2:

    question number: [2, 4, 5]          response: [1, 1, 0]

whereas sample 2 should really be represented like this:

    question number: [1, 2, 3, 4, 5]    response: [NaN, 1, NaN, 1, 0]

What matters is the question number, not how many questions were answered. Without knowing which question each response belongs to, you cannot compute something like a covariance matrix.
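A small sketch of that alignment step (the question count and values just mirror the example above):

    import numpy as np

    n_questions = 5
    surveys = [
        ([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]),  # sample 1: answered everything
        ([2, 4, 5],       [1, 1, 0]),        # sample 2: skipped questions 1 and 3
    ]

    # One row per respondent, NaN where a question was skipped, so the
    # columns line up by question number.
    aligned = np.full((len(surveys), n_questions), np.nan)
    for row, (questions, answers) in enumerate(surveys):
        aligned[row, np.array(questions) - 1] = answers

    print(aligned)
    # [[ 1.  0.  1.  1.  0.]
    #  [nan  1. nan  1.  0.]]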

Anyway, the numpy.ma.cov function that ddodev mentioned calculates the covariance by taking advantage of the fact that each element being summed depends on only two values.

So it computes the terms it can. Then, when it comes to the divide-by-n step, it divides by the number of values that were actually available for that particular element of the covariance matrix, instead of by the total number of samples.
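A minimal sketch of that behaviour (the response values are made up; here each row is a question and each column a respondent, with NaN marking a skipped answer):

    import numpy as np
    import numpy.ma as ma

    responses = np.array([
        [1.0, 0.0, 1.0, np.nan, 0.0],
        [0.0, 1.0, np.nan, 1.0, 1.0],
        [1.0, 1.0, 0.0, 0.0, np.nan],
    ])

    # Mask out the missing entries so they are ignored, not treated as 0.
    masked = ma.masked_invalid(responses)

    # Each entry of the result uses only the respondents who answered both
    # questions, and is divided by that pairwise count (minus ddof).
    print(ma.cov(masked))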


I would take a look at this page:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.cov.html

UPDATE:

Suppose each row of your data matrix corresponds to a particular random variable, and the entries in that row are its observations. You have a straightforward missing-data problem as long as there is a correspondence between observations. That is, if one of your rows has only 10 entries, do those 10 entries (i.e., trials) correspond to 10 of the trials of the variable in the first row? For example, suppose you have two temperature sensors that take samples at the same times, but one of them is faulty and sometimes misses a sample. You should then treat the trials for which the faulty sensor failed to produce a reading as “missing data”. In your case, it is as simple as making the two NumPy vectors the same length by placing zeros (or any value, really) in the shorter of the two vectors at the positions corresponding to the missing trials, and then creating a mask matrix that indicates where the missing values are in your data matrix.

Passing such a masked data matrix to the function linked above should let you do exactly what you want.
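A rough sketch of that recipe, with made-up sensor readings (sensor B is assumed to have missed the trials at indices 2 and 5):

    import numpy as np
    import numpy.ma as ma

    sensor_a = np.array([20.1, 20.3, 20.8, 21.0, 21.4, 21.9, 22.3])
    sensor_b_raw = np.array([19.8, 20.0, 20.7, 21.1, 22.0])
    missing_trials = [2, 5]

    # Pad sensor B back to full length; the fill value does not matter
    # because those positions are masked out.
    mask_b = np.zeros(sensor_a.shape, dtype=bool)
    mask_b[missing_trials] = True
    sensor_b = np.zeros_like(sensor_a)
    sensor_b[~mask_b] = sensor_b_raw

    data = ma.array([sensor_a, sensor_b],
                    mask=[np.zeros_like(mask_b), mask_b])
    print(ma.cov(data))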


Source: https://habr.com/ru/post/1390126/

