How to calculate the correlation of two variables in a huge dataset in R?

I have a huge data set with six columns (call them A, B, C, D, E, F) and about 450,000 rows. I just tried to find the correlation between columns A and B:

 cor(A, B) 

and I got

[1] NA

What can I do to fix this problem?

2 answers

Try cor(A, B, use = "pairwise.complete.obs"). This will ignore the NAs in your observations.
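A minimal sketch of the behaviour described above, using made-up toy vectors in place of the real columns A and B (the values are purely illustrative):

```r
# Toy data standing in for columns A and B (hypothetical values)
A <- c(1.2, 2.4, NA, 4.1, 5.3, 6.8)
B <- c(2.0, 3.9, 6.1, NA, 10.2, 13.5)

# The default is use = "everything", so a single NA makes the result NA
cor(A, B)

# Only rows where both A and B are present enter the calculation
r <- cor(A, B, use = "pairwise.complete.obs")

# Equivalent to dropping the incomplete rows by hand
ok <- complete.cases(A, B)
all.equal(r, cor(A[ok], B[ok]))
```

For two vectors, this is the same as removing every row in which either value is missing and then calling cor on what remains.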

To be statistically rigorous, you should also look at the number of missing entries in your data and check whether it is reasonable to assume they are missing at random.
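One way to inspect the missing entries might look like the sketch below. The data frame `df` and its values are hypothetical, and the last step is only a crude first look at whether missingness in one column is related to the values of the other, not a formal test:

```r
# Hypothetical data frame standing in for the real one
df <- data.frame(
  A = c(1.2, 2.4, NA, 4.1, 5.3, 6.8),
  B = c(2.0, 3.9, 6.1, NA, 10.2, 13.5)
)

n_missing    <- colSums(is.na(df))     # number of NAs per column
frac_missing <- colMeans(is.na(df))    # share of NAs per column
n_used       <- sum(complete.cases(df))  # rows that actually enter the correlation

# Crude check: mean of B where A is present (FALSE) vs. where A is missing (TRUE)
tapply(df$B, is.na(df$A), mean, na.rm = TRUE)
```

If the two group means differ a lot on the real data, the rows with missing values may not be a random subset, and the correlation computed on the complete rows should be read with more caution.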

Edit 1: look at ?cor to see other options for the use parameter.
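As an illustration of how the `use` options differ once more than two columns are involved, here is a small sketch on simulated data (the data frame and the positions of the NAs are assumptions made for the example):

```r
set.seed(1)  # simulated, hypothetical data
df <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10))
df$A[2] <- NA
df$C[7] <- NA

# "complete.obs": drop every row that has an NA in any column (rows 2 and 7)
m_complete <- cor(df, use = "complete.obs")

# "pairwise.complete.obs": each pair of columns uses its own complete rows
m_pairwise <- cor(df, use = "pairwise.complete.obs")

# The A-B entries can differ: pairwise keeps row 7, where only C is missing
m_complete["A", "B"]
m_pairwise["A", "B"]
```

With "pairwise.complete.obs" each entry of the matrix may be based on a different set of rows, which is worth keeping in mind when comparing correlations across all six columns.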


You can use the rcorr function in the Hmisc package.

It is very fast and uses only pairwise complete observations. The returned object contains matrices of

  • correlation coefficients
  • the number of observations used for each correlation
  • p-values for each correlation

An example of the code can be found here:
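For reference, a minimal sketch of calling rcorr might look like this. It assumes the Hmisc package is installed, and the simulated data frame is hypothetical; rcorr expects a numeric matrix, hence the as.matrix call:

```r
# Assumes Hmisc is installed: install.packages("Hmisc")
library(Hmisc)

set.seed(42)  # simulated, hypothetical data
df <- data.frame(A = rnorm(100), B = rnorm(100))
df$A[c(3, 10)] <- NA
df$B[25] <- NA

res <- rcorr(as.matrix(df), type = "pearson")

res$r["A", "B"]  # correlation coefficient
res$n["A", "B"]  # number of pairs used for this correlation
res$P["A", "B"]  # p-value for this correlation
```

The `n` matrix is a convenient way to see how many of the 450,000 rows actually contributed to each correlation.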


Source: https://habr.com/ru/post/898041/

