How large a dataset can ELKI handle?

I have 100,000 points that I would like to cluster with the OPTICS algorithm in ELKI. For this set of points I have an upper-triangular distance matrix with about 5 billion entries, which takes about 100 GB in the format ELKI expects. Can ELKI handle loading data of this size? Can anyone confirm they have done this before?
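(Quick sanity check of those figures; the ~20 bytes per text-formatted entry is my own rough assumption, not an ELKI specification:)

    # n points give n*(n-1)/2 pairwise distances in an upper-triangular matrix
    n = 100_000
    pairs = n * (n - 1) // 2                       # ~5.0 billion entries
    print(pairs, round(pairs * 20 / 1e9), "GB")    # ~100 GB at ~20 bytes per text entry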

1 answer

I frequently use ELKI with 100k points, and up to 10 million.

However, for this to be fast, you should use indexes.

For obvious reasons, any dense-matrix approach will scale at best O(n^2) and needs O(n^2) memory. That is why I cannot process such data sets with R, Weka, or scipy: they usually try to compute the full distance matrix first, and either fail halfway through, run out of memory halfway through, or fail with a negative allocation size (in Weka, this happens when your data set overflows 2^31 positive integers, i.e. at around 46k objects).
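As a quick, library-independent sanity check of that ~46k figure, here is a small Python sketch of where a full dense n-by-n distance matrix stops fitting into a signed 32-bit index:

    # A full dense n x n distance matrix overflows a signed 32-bit index
    # (2^31 - 1 entries) at roughly n = 46,341.
    def full_matrix_entries(n: int) -> int:
        return n * n

    for n in (46_340, 46_341, 100_000):
        entries = full_matrix_entries(n)
        print(f"n={n:>7,}  entries={entries:>15,}  fits in int32: {entries <= 2**31 - 1}")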

In binary format, with float precision, the ELKI matrix format should be about 100000*99999/2*4 + 4 bytes, maybe plus another 4 bytes of size information. That is about 20 GB. If you use the "easy to use" ASCII format, it will indeed be larger. But if you use gzip compression, it may end up about the same size; gzip typically compresses such data to 10-20% of the raw size. In my experience, gzip-compressed ASCII can be as small as binary-encoded doubles. The main advantage of the binary format is that it actually stays on disk, and memory caching is handled by your operating system.

In any case, I recommend not computing distance matrices at all.

Because if you decide to go from 100,000 points to 1 million, the raw matrix grows to 2 TB, and at 10 million points it will be 200 TB. If you need double precision, double those numbers.
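To make that growth concrete, here is a minimal sketch that evaluates the same size formula as above (upper triangle, 4 bytes per float or 8 per double) for larger n:

    # Raw size of an upper-triangular distance matrix as n grows.
    for n in (100_000, 1_000_000, 10_000_000):
        pairs = n * (n - 1) // 2
        print(f"n={n:>10,}  float32: {pairs * 4 / 1e12:7.2f} TB   float64: {pairs * 8 / 1e12:7.2f} TB")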

If you use a distance matrix, your method will be at best O(n^2) and therefore will not scale. Avoiding the computation of all pairwise distances in the first place is an important speed factor.

I use indexes for everything. For kNN or radius-bound approaches (for OPTICS, use the epsilon parameter to make the indexes effective, and choose a low epsilon!), you can precompute these queries once if you are going to need them repeatedly.
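ELKI configures its indexes (e.g. an R*-tree) on its own side; purely as a conceptual stand-in, here is a sketch using scipy's cKDTree that shows what an index buys you: bounded-radius and kNN queries that do not touch all pairs. This illustrates the idea only, not ELKI's implementation.

    # Conceptual illustration only: an index answers small-radius and kNN
    # queries without computing all n*(n-1)/2 pairwise distances.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    data = rng.random((100_000, 3))              # toy data: 100k low-dimensional points

    tree = cKDTree(data)                         # build the index once

    eps = 0.01                                   # a low epsilon keeps result sets small
    in_range = tree.query_ball_point(data[0], r=eps)   # radius query for one point
    dists, ids = tree.query(data[0], k=101)      # 101 nearest neighbors of one point

    print(len(in_range), ids[:5])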

On a data set I use frequently, with 75k instances and 27 dimensions, the file storing the precomputed 101 nearest neighbors plus ties, at double precision, is 81 MB (note: this can be seen as a sparse similarity matrix). Using an index to precompute this cache, it takes only a few minutes to compute; afterwards I can run most kNN-based algorithms, such as LOF, on this 75k data set in 108 ms (plus 262 ms for loading the kNN cache and 2364 ms for parsing the raw input data, for a total runtime of about 3 seconds, dominated by parsing the double values).
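For a rough idea of why such a kNN cache stays small compared to the full matrix, here is a back-of-the-envelope estimate, assuming a 4-byte id plus an 8-byte double distance per cached neighbor (the actual ELKI file layout may differ, so this is only an order-of-magnitude figure):

    # Sparse kNN cache vs. full distance matrix for 75k points, k = 101.
    n, k = 75_000, 101

    knn_cache_bytes = n * k * (4 + 8)            # id + double distance per neighbor
    full_matrix_bytes = n * (n - 1) // 2 * 8     # upper triangle, double precision

    print(f"kNN cache  : {knn_cache_bytes / 1e6:,.0f} MB")    # ~91 MB, same ballpark as the 81 MB above
    print(f"full matrix: {full_matrix_bytes / 1e9:,.1f} GB")  # ~22.5 GB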


Source: https://habr.com/ru/post/1501562/

