Large distance matrix in clustering

I am running R 3.2.3 on a machine with 16 GB of RAM. I have a large matrix of 3,000,000 rows x 12 columns. I want to use the hierarchical clustering algorithm in R, so before I do this, I am trying to create a distance matrix. Since the data is of mixed type, I use different matrices for different types. I get an error message:

df <- as.data.frame(matrix(rnorm(36*10^5), nrow = 3*10^5))
d1=as.dist(distm(df[,c(1:2)])/10^5)
d2=dist(df[,c(3:8)], method = "euclidean") 
d3= hamming.distance(df[,c(9:12)]%>%as.matrix(.))%>%as.dist(.)

I get the following error

> d1=as.dist(distm(df1[,c(1:2)])/10^5)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
> d2=dist(df1[,c(3:8)], method = "euclidean") 
Error: cannot allocate vector of size 335.3 Gb
In addition: Warning messages:
1: In dist(df1[, c(3:8)], method = "euclidean") :
 Reached total allocation of 16070Mb: see help(memory.size)
2: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
3: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
4: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
> d3= hamming.distance(df1[,c(9:12)]%>%as.matrix(.))%>%as.dist(.)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
+4
source share
1 answer

Suppose you have 1 row (A) for a cluster with a matrix of 3 ^ 8 (B) at the minimum distance.

Original approach:

1. load A and B
2. distance compute A with each row of B
3. select smallest one from results (reduction)

- , B , .

:

1. load A (suppose it is small)
2. load B.partial with 1 to 1^5 rows of B
3. compute distance of A with each row of B.partial
4. select min one in partial results and save it as res[i]
5. go back 2.) load next 1^5 rows of B 
6. final your got a 3000 partial results and saved in res[1:3000]
7. reduction : select min one from res[1:3000]
   note: if you need all distances as `dist` function, you don't need reduction and just keep this array.

. , . .

, . .


dist , R, openMP. . openMP .

, OMP_NUM_THREADS 4 8 , , !

 void R_distance(double *x, int *nr, int *nc, double *d, int *diag,
    int *method, double *p)
{
     int dc, i, j;
     size_t  ij;  /* can exceed 2^31 - 1 */
     double (*distfun)(double*, int, int, int, int) = NULL;
     #ifdef _OPENMP
        int nthreads;
     #endif
     .....
 }

, dist GPU, talk ParallelR.

+3

Source: https://habr.com/ru/post/1620032/


All Articles