How to solve prcomp.default(): cannot rescale a constant/zero column to unit variance

I have a data set of 9 samples (rows) with 51608 variables (columns), and I keep getting an error when I try to scale it:

This works fine

pca = prcomp(pca_data) 

but

 pca = prcomp(pca_data, scale = T) 

gives

 Error in prcomp.default(pca_data, center = T, scale = T) : 
   cannot rescale a constant/zero column to unit variance 

Obviously, it's a little difficult to post a reproducible example. Any ideas what the deal is?

Search for constant columns:

  library(magrittr)  # for %>%
  sapply(1:ncol(pca_data), function(x) { pca_data[, x] %>% unique %>% length }) %>% table

Output:

  .
      2     3     4     5     6     7     8     9
   3892  4189  2124  1783  1622  2078  5179 30741
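(Aside: an equivalent and more direct check is to look at the variances themselves, since a column can behave as "constant" for scaling purposes if its variance is numerically zero even when it has several distinct values. A minimal sketch, assuming pca_data is a numeric matrix:)

  # hedged sketch: flag columns whose variance is (numerically) zero
  suspect <- which(apply(pca_data, 2, var) < .Machine$double.eps)
  length(suspect)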

So there are no constant columns. Same story with NAs:

  is.na(pca_data) %>% sum
  # [1] 0

This works great:

  pca_data = scale(pca_data) 

But then both still give the same error:

  pca = prcomp(pca_data)
  pca = prcomp(pca_data, center = F, scale = F)
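(One hedged guess at the mechanism: scale() doesn't error on a zero-variance column, it silently returns NaN from the 0/0 division, and those NaNs then break whatever runs downstream — which would also fit the hclust error further down. A tiny sketch:)

  m <- cbind(const = rep(0.1, 5), x = rnorm(5))
  scaled <- scale(m)
  scaled[, "const"]  # NaN NaN NaN NaN NaN
  anyNA(scaled)      # TRUE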

So why can't I get a scaled version of this data? OK, let's make 100% sure it's not constant:

  pca_data = pca_data + rnorm(nrow(pca_data) * ncol(pca_data)) 

Same error. Non-numeric data?

  sapply(1:nrow(pca_data), function(row) {
    sapply(1:ncol(pca_data), function(column) {
      !is.numeric(pca_data[row, column])
    })
  }) %>% sum

Same error again. I'm out of ideas.

Edit: more digging, and at least a hack.

Further along, this data is still hard to cluster, for example:

  Error in hclust(d, method = "ward.D") : NaN dissimilarity value in intermediate results. 

Trimming values below a certain cutoff (e.g. 1) to zero had no effect. What finally worked was dropping all columns that had more than a certain number of zeros. It worked for # zeros <= 6, but 7+ gave errors. I have no idea whether this means this is the actual problem, or whether it just happened to catch the problematic column. Either way, I'd be glad to hear if anyone has any ideas, because this should work fine as long as no variable is all zeros (or constant in some other way).
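(For reference, a sketch of the workaround described above; the cutoff of 6 is just what happened to work here, not a principled choice:)

  max_zeros <- 6
  keep <- colSums(pca_data == 0) <= max_zeros
  pca <- prcomp(pca_data[, keep], center = T, scale = T)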

2 answers

I don't think you're checking for zero-variance columns correctly. Let's try it with some dummy data. First, a well-behaved matrix of 100 rows by 10 columns:

 mat <- matrix(rnorm(1000, 0), ncol = 10) 

And one with a zero-variance column. Let's call it oopsmat .

 const <- rep(0.1, 100)
 oopsmat <- cbind(const, mat)

The first few rows of oopsmat look like this:

      const
 [1,]   0.1  0.75048899  0.5997527 -0.151815650  0.01002536  0.6736613 -0.225324647 -0.64374844 -0.7879052
 [2,]   0.1  0.09143491 -0.8732389 -1.844355560  0.23682805  0.4353462 -0.148243210  0.61859245  0.5691021
 [3,]   0.1 -0.80649512  1.3929716 -1.438738923 -0.09881381  0.2504555 -0.857300053 -0.98528008  0.9816383
 [4,]   0.1  0.49174471 -0.8110623 -0.941413109 -0.70916436  1.3332522  0.003040624  0.29067871 -0.3752594
 [5,]   0.1  1.20068447 -0.9811222  0.928731706 -1.97469637 -1.1374734  0.661594937  2.96029102  0.6040814

Try unscaled and scaled PCA on oopsmat:

 PCs <- prcomp(oopsmat)              # works
 PCs <- prcomp(oopsmat, scale. = T)  # not forgetting the dot
 # Error in prcomp.default(oopsmat, scale. = T) :
 #   cannot rescale a constant/zero column to unit variance

That's because you can't divide by a standard deviation of zero. To identify the zero-variance column, we can use which() as follows to get the variable's name.

 which(apply(oopsmat, 2, var) == 0)
 # const
 #     1

And to remove zero-variance columns from the dataset, you can use the same apply expression, keeping only the columns whose variance is not zero.

 oopsmat[ , apply(oopsmat, 2, var) != 0] 
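To confirm, scaled PCA runs cleanly once the zero-variance column is dropped:

 oopsmat_clean <- oopsmat[, apply(oopsmat, 2, var) != 0]
 PCs <- prcomp(oopsmat_clean, scale. = T)  # no error now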

Hope this helps make things clearer!


In addition to Joe's answer, just make sure the column classes in your data frame are numeric.

If any of them are integer64, the computed variance can come out as 0, which will cause the scaling to fail.

So if

 class(my_df$some_column) 

is integer64, for example, then do the following:

 my_df$some_column <- as.numeric(my_df$some_column) 
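If many columns are affected, the same conversion can be applied across the whole data frame; a minimal sketch, assuming every column of my_df is meant to be numeric:

 my_df[] <- lapply(my_df, as.numeric)
 sapply(my_df, class)  # all "numeric" now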

Hope this helps someone.


Source: https://habr.com/ru/post/1011832/

