Please try ELKI. Since it is Java, calling it from Scala should be easy.
ELKI is very well optimized, and with indexes it will scale to fairly large datasets.
We tried to include one of these Spark implementations in our benchmarking study, but it ran out of memory (and it was the only implementation that ran out of memory ... the k-means implementations of Spark and Mahout were also among the slowest):
Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek.
The (black) art of runtime evaluation: Are we comparing algorithms or implementations?
In: Knowledge and Information Systems (KAIS). 2016, 1–38
Professor Neukirchen evaluated parallel DBSCAN implementations in this technical report:
Helmut Neukirchen
Survey and Performance Evaluation of DBSCAN Spatial Clustering Implementations for Big Data and High-Performance Computing Paradigms
apparently he got some of the Spark implementations working, but noted that:
The result is devastating: none of the Apache Spark implementations is anywhere close to the HPC implementations. In particular on larger (but still rather small) data sets, most of them fail completely and do not even deliver correct results.
and earlier:
When running any of the "Spark DBSCAN" implementations using all available cores of our cluster, we ran into out-of-memory exceptions.
(also, “Spark DBSCAN” took 2406 seconds on 928 cores, while ELKI took 997 seconds on a single core for the smaller benchmark; the other Spark implementation did not fare well either, in particular it did not return the correct result ...)
"DBSCAN on Spark" did not crash, but returned completely wrong clusters.
While "DBSCAN on Spark" completes faster, it gave completely incorrect clustering results. Due to the hopelessly long lead time for DBSCAN implementations for Spark with the maximum number of cores, we did not perform measurements with fewer cores.
You can wrap a double[][] array in an ELKI database:
    // Adapter to load data from an existing array.
    DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
    // Create a database (which may contain multiple relations!)
    Database db = new StaticArrayDatabase(dbc, null);
    // Load the data into the database (do NOT forget to initialize...)
    db.initialize();

    Clustering<Model> c = new DBSCAN<NumberVector>(
        EuclideanDistanceFunction.STATIC, eps, minpts).run(db);

    for(Cluster<Model> clu : c.getAllClusters()) {
      // Process clusters
    }
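As a minimal sketch of what the loop body could do, the DBIDs can be mapped back to row indices of the input array; because ArrayAdapterDatabaseConnection assigns consecutive IDs, they form a DBIDRange (class and package names here assume the ELKI 0.7.x API):

    // Sketch (assuming ELKI 0.7.x, de.lmu.ifi.dbs.elki.* packages):
    // map cluster members back to row indices of the original double[][].
    Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
    // ArrayAdapterDatabaseConnection assigns a consecutive ID range:
    DBIDRange ids = (DBIDRange) rel.getDBIDs();
    for(Cluster<Model> clu : c.getAllClusters()) {
      for(DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
        int row = ids.getOffset(it); // row index into the input array
        // e.g., access data[row] here
      }
    }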
See also: the Java API example (in particular, how to map DBIDs back to row indices). For better performance, pass an index factory (for example, new CoverTree.Factory(...)) as the second parameter to the StaticArrayDatabase constructor.
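For example, here is a hedged sketch of that indexed setup; the CoverTree.Factory constructor arguments (expansion rate 1.3, truncation threshold 10) are assumed defaults from ELKI 0.7.x and may differ in other versions:

    // Sketch: build the database with a cover tree index, so DBSCAN's
    // range queries use the index instead of a linear scan.
    // The 1.3 / 10 arguments are assumed ELKI 0.7.x defaults; check the
    // javadoc of your ELKI version.
    Database db = new StaticArrayDatabase(dbc, Arrays.asList(
        new CoverTree.Factory<NumberVector>(
            EuclideanDistanceFunction.STATIC, 1.3, 10)));
    db.initialize();

The index is then picked up automatically by DBSCAN's range queries; no further changes to the clustering code are needed.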