DBSCAN on Spark: which implementation?

I would like to run DBSCAN on Spark. So far I have found two implementations:

  • https://github.com/irvingc/dbscan-on-spark

  • https://github.com/alitouka/spark_dbscan

I tested the first one, using the sbt configuration specified in its GitHub page, but:

  • the functions in the jar do not match the functions in the documentation or in the source on GitHub. For example, I can't find the train function in the jar;

  • I managed to run a test using the fit function (found in the jar), but a poorly chosen epsilon (a bit too large) sent the code into what looked like an infinite loop.

code:

val model = DBSCAN.fit(eps, minPoints, values, parallelism) 

Has anyone had success with the first library?

Has anyone checked the second?

4 answers

Please try ELKI. Since it is Java, calling it from Scala should be easy.

ELKI is very well optimized, and with indexes it will scale to fairly large datasets.

We tried to include one of these Spark implementations in our benchmarking study, but it ran out of memory (and it was the only implementation that ran out of memory... the Spark and Mahout k-means were also among the slowest):

Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek.
The (black) art of runtime evaluation: Are we comparing algorithms or implementations?
In: Knowledge and Information Systems (KAIS), 2016, pp. 1–38.

Professor Neukirchen benchmarked parallel DBSCAN implementations in this technical report:

Helmut Neukirchen.
Survey and Performance Evaluation of DBSCAN Spatial Clustering Implementations for Big Data and High Performance Computing Paradigms.

He apparently got some of the Spark implementations running, but noted:

The result is devastating: none of the implementations for Apache Spark is anywhere near the HPC implementations. In particular on bigger (but still rather small) data sets, most of them fail completely and do not even deliver correct results.

and earlier:

When running any of the "Spark DBSCAN" implementations using all available cores of our cluster, we ran into out-of-memory exceptions.

(Also: "Spark DBSCAN" took 2406 seconds on 928 cores, while ELKI took 997 seconds on 1 core for the smaller benchmark; the other Spark implementation did not fare well either, and in particular it did not return a correct result...)

"DBSCAN on Spark" did not crash, but returned completely wrong clusters.

While "DBSCAN on Spark" completes faster, it gave completely incorrect clustering results. Due to the hopelessly long lead time for DBSCAN implementations for Spark with the maximum number of cores, we did not perform measurements with fewer cores.

You can wrap a double[][] array into an ELKI database:

  // Adapter to load data from an existing array.
  DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
  // Create a database (which may contain multiple relations!)
  Database db = new StaticArrayDatabase(dbc, null);
  // Load the data into the database (do NOT forget to initialize...)
  db.initialize();
  // Run DBSCAN with Euclidean distance.
  Clustering<Model> c = new DBSCAN<NumberVector>(
      EuclideanDistanceFunction.STATIC, eps, minpts).run(db);
  for(Cluster<Model> clu : c.getAllClusters()) {
    // Process each cluster here.
  }

See also the Java API example (in particular, how to map DBIDs back to row indices). To improve performance, pass an index factory (for example, new CoverTree.Factory(...)) as the second parameter to the StaticArrayDatabase constructor instead of null.
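
Since calling ELKI from Scala is suggested above, here is a minimal sketch of the same call written in Scala. It simply mirrors the Java snippet; the package names assume ELKI 0.7.x, and the data, eps and minpts values are placeholders to tune for your own use case:

  // Minimal sketch: the ELKI DBSCAN call above, written in Scala (assumes ELKI 0.7.x packages).
  import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN
  import de.lmu.ifi.dbs.elki.data.NumberVector
  import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase
  import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection
  import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.EuclideanDistanceFunction
  import scala.collection.JavaConverters._

  // Toy data; replace with your own points (one row per point).
  val data: Array[Array[Double]] = Array(
    Array(1.0, 1.0), Array(1.1, 1.0), Array(0.9, 1.1),
    Array(10.0, 10.0), Array(10.1, 9.9), Array(9.9, 10.1))

  val eps = 0.5    // placeholder parameters; tune for your data
  val minpts = 3

  val dbc = new ArrayAdapterDatabaseConnection(data)
  // Pass an index factory (e.g. a cover tree) instead of null for better performance, as noted above.
  val db = new StaticArrayDatabase(dbc, null)
  db.initialize()

  val clustering = new DBSCAN[NumberVector](EuclideanDistanceFunction.STATIC, eps, minpts).run(db)
  for (cluster <- clustering.getAllClusters.asScala) {
    println(s"cluster of size ${cluster.size()}, noise: ${cluster.isNoise}")
  }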


I am successfully using the second library ( https://github.com/alitouka/spark_dbscan ) in my project. However, I could not use it as published, like this:

  libraryDependencies += "org.alitouka" % "spark_dbscan_2.10" % "0.0.4"

  resolvers += "Aliaksei Litouka repository" at "http://alitouka-public.s3-website-us-east-1.amazonaws.com/"

Instead, I downloaded the code and updated it to Spark 2.2.1. Some additional libraries also had to be added. Finally, I added the code to my project, and it works!
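
For reference, once the (updated) code is part of your project, the library is invoked roughly like this. This is only a sketch based on the project's README; the class and method names (IOHelper, DbscanSettings, Dbscan.train) are taken from that README and may differ in the copy you build yourself:

  // Rough sketch based on the spark_dbscan README; IOHelper, DbscanSettings and
  // Dbscan.train are taken from that README and may differ in an updated build.
  import org.alitouka.spark.dbscan._

  val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")

  val clusteringSettings = new DbscanSettings()
    .withEpsilon(25)
    .withNumberOfPoints(30)

  val model = Dbscan.train(data, clusteringSettings)
  IOHelper.saveClusteringResult(model, "/path/to/output/folder")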


I tested https://github.com/irvingc/dbscan-on-spark and can say that it consumes a lot of memory. For a smoothly distributed dataset of 400K points I had to use -Xmx12084m, and even then it ran too long (> 20 minutes). In addition, it only works for 2D data. I used the project with Maven, not sbt.

I also tested the second implementation. It is still the best I have found. Unfortunately, the author has not maintained it since 2015, so it took some time to raise the Spark version and resolve version conflicts; I needed to deploy it on AWS.


You may also consider using smile, which provides a DBSCAN implementation. The most direct way is to combine groupBy with mapGroups or flatMapGroups and run dbscan there. Here is an example:

  import smile.clustering._

  val dataset: Array[Array[Double]] = Array(
    Array(100, 100),
    Array(101, 100),
    Array(100, 101),
    Array(100, 100),
    Array(101, 100),
    Array(100, 101),
    Array(0, 0),
    Array(1, 0),
    Array(1, 2),
    Array(1, 1)
  )

  val dbscanResult = dbscan(dataset, minPts = 3, radius = 5)
  println(dbscanResult)

  // output:
  // DBSCAN clusters of 10 data points:
  //   0       6 (60.0%)
  //   1       4 (40.0%)
  //   Noise   0 ( 0.0%)

You can also write a user-defined aggregate function (UDAF) if you need to improve performance.

I use this approach in my work to cluster time-series data: grouping with Spark's time window function and then running DBSCAN inside each window lets you parallelize the implementation; a sketch of the idea follows.
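
A minimal sketch of the idea, under a few assumptions: the time window is simplified to fixed 10-minute buckets computed from the timestamp (rather than the SQL window() function), and the per-point cluster labels are read from the result's y field, which is how smile 2.x exposes them (older versions expose labels differently):

  // Sketch: run smile's dbscan independently inside each time bucket of a Dataset.
  // Assumptions: fixed 10-minute buckets instead of the SQL window() function,
  // smile 2.x (labels in `model.y`), and illustrative minPts/radius values.
  import org.apache.spark.sql.SparkSession
  import smile.clustering._

  case class Reading(ts: java.sql.Timestamp, x: Double, y: Double)

  val spark = SparkSession.builder().master("local[*]").appName("dbscan-per-window").getOrCreate()
  import spark.implicits._

  val readings = Seq(
    Reading(java.sql.Timestamp.valueOf("2024-01-01 00:00:05"), 100.0, 100.0),
    Reading(java.sql.Timestamp.valueOf("2024-01-01 00:00:10"), 101.0, 100.0),
    Reading(java.sql.Timestamp.valueOf("2024-01-01 00:12:00"), 0.0, 0.0),
    Reading(java.sql.Timestamp.valueOf("2024-01-01 00:12:30"), 1.0, 1.0)
  ).toDS()

  val labeled = readings
    .groupByKey(r => r.ts.getTime / (10 * 60 * 1000))    // 10-minute bucket as the group key
    .flatMapGroups { (bucket, rows) =>
      val pts = rows.toArray
      val model = dbscan(pts.map(r => Array(r.x, r.y)), minPts = 2, radius = 5.0)
      // model.y holds one cluster label per input point (smile 2.x; assumption).
      pts.zip(model.y).iterator.map { case (r, label) => (bucket, r.x, r.y, label) }
    }

  labeled.show()

Each group is clustered independently on an executor, which is what makes this approach parallel even though DBSCAN itself runs single-threaded per group.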

I was inspired by the following article to do this

