Hierarchical clustering in R after multiple correspondence analysis

I want to cluster a data set (600,000 cases) and, for each cluster, extract the principal components. Each observation consists of one email address and 30 qualitative variables. Each qualitative variable has 4 levels: 0, 1, 2 and 3.

So, the first thing I do is load the FactoMineR library and load my data:

 library(FactoMineR)
 mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")

Then I convert my variables to factors (the "email" variable will be dropped afterwards):

 for (n in 1:length(mydata)) { mydata[[n]] <- factor(mydata[[n]]) }

I drop the email column:

 mydata2 = mydata[2:31] 

And I run MCA on this new dataset:

 mca.res <- MCA(mydata2) 

Now I want to cluster my data set using the HCPC function:

 res.hcpc <- HCPC(mca.res) 

But I got the following error message:

 Error: cannot allocate vector of size 1296.0 Gb 

What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?

2 answers

Since it performs hierarchical clustering, HCPC needs to compute the lower triangle of a 600,000 x 600,000 distance matrix (~180 billion elements). You simply do not have the RAM to store this object, and even if you did, computing it would probably take hours if not days.
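For intuition, the required storage can be worked out directly in base R (a back-of-the-envelope sketch; the exact figure reported in the error message depends on the exact number of rows in the data):

```r
## Memory needed for the lower triangle of a 600,000 x 600,000
## distance matrix, stored as doubles (8 bytes each)
n <- 600000
n_elements <- n * (n - 1) / 2         # ~1.8e11 pairwise distances
gigabytes  <- n_elements * 8 / 1024^3 # bytes -> Gb
round(gigabytes, 1)
```

This gives on the order of 1,300 Gb, the same order of magnitude as the 1296.0 Gb reported in the error.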

Clustering large datasets has been discussed several times on Stack Overflow / Cross Validated; some threads with solutions in R include:

k-means clustering in R on a very large, sparse matrix? ( bigkmeans )

Big data cluster in R and matching sample? ( clara )

If you want to use one of these alternative clustering approaches, you should apply it to mca.res$ind$coord in your example.
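As a hedged sketch of that approach (with a small random matrix standing in for the 600,000-row mca.res$ind$coord), clustering the MCA coordinates with clara from the recommended cluster package might look like:

```r
library(cluster)
set.seed(1)

## toy stand-in for mca.res$ind$coord (600,000 rows x a few MCA axes
## in the real problem)
coords <- matrix(rnorm(1000 * 5), ncol = 5)

## CLARA only ever computes distances on subsamples, so it scales to
## very large n; 'samples' controls how many subsamples are drawn
cl <- clara(coords, k = 4, samples = 50)
table(cl$clustering)
```

The cluster sizes here are arbitrary since the data are random; the point is only that clara returns a full-length clustering vector without ever building the n x n distance matrix.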

Another idea, proposed in response to Clustering problem of a very large dataset in R, is to first use k-means to find a large number of cluster centres, and then use hierarchical clustering to build the tree from there. This method is in fact implemented via the kk argument of HCPC.

For example, using the tea dataset from FactoMineR:

 library(FactoMineR)
 data(tea)
 ## run MCA as in ?MCA
 res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
 ## run HCPC for all 300 individuals
 hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
 ## run HCPC from 30 k-means centres
 res.consol <- NULL ## bug work-around
 hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)

The consol argument offers the possibility to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a finite number, so consol is set to FALSE here. The res.consol object is set to NULL to work around a minor bug in FactoMineR 1.27.

The following graph shows the clusters based on all 300 individuals (kk = Inf) and based on 30 k-means centres (kk = 30), plotted on the first two MCA axes:

[image: cluster assignments on the first two MCA axes, kk = Inf vs kk = 30]

You can see that the results are very similar. You should easily be able to apply this to your data with 600 or 1,000 k-means centres, possibly up to 6,000 with 8 GB of RAM. If you want to use more than that, you would probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
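For illustration, the two-stage idea can also be coded by hand in base R (a sketch on simulated coordinates; at 600,000 rows you would swap kmeans for biganalytics::bigkmeans and dist/hclust for SpatialTools::dist1 / fastcluster::hclust, as suggested above):

```r
set.seed(1)
## stand-in for the MCA coordinates of the individuals
x <- matrix(rnorm(10000 * 3), ncol = 3)

## stage 1: compress 10,000 observations into 100 k-means centres
km <- kmeans(x, centers = 100, iter.max = 50, nstart = 5)

## stage 2: hierarchical clustering of the 100 centres only
## (a 100 x 100 distance matrix instead of 10,000 x 10,000)
hc <- hclust(dist(km$centers), method = "ward.D2")

## cut the tree, then map each observation back through its centre
grp_centres <- cutree(hc, k = 5)
grp_obs <- grp_centres[km$cluster]
```

Every observation ends up with a cluster label, but the expensive distance matrix is only ever built for the centres.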


This error message usually indicates that R does not have enough RAM to execute the command. I guess you are running this on 32-bit R, possibly under Windows? If so, killing other processes and deleting unused R variables might help: for example, you can try to delete mydata and mydata2 with

 rm(mydata, mydata2) 

(as well as all other unnecessary R variables) before executing the command that generates the error. However, the ultimate solution in general is to switch to 64-bit R, preferably under 64-bit Linux and with a decent amount of RAM; see also here:

R memory management / cannot allocate a vector of size n Mb

R Memory allocation "Error: cannot allocate a 75.1 MB vector"

http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html
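A minimal sketch of the clean-up advice above (the objects here are hypothetical stand-ins; gc() is optional, since R collects garbage automatically, but calling it makes the freed memory visible):

```r
mydata  <- data.frame(x = 1:10)  # hypothetical large objects
mydata2 <- mydata[2:length(mydata)]

rm(mydata, mydata2)  # delete them once they are no longer needed
gc()                 # trigger collection and print a memory report
exists("mydata")     # FALSE: the object is gone
```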


Source: https://habr.com/ru/post/1208221/

