TL;DR: It is not possible to partition a GraphFrame efficiently.
GraphFrame algorithms can be divided into two categories:
Methods that delegate processing to their GraphX counterparts. GraphX supports a number of partitioning methods, but these are not exposed through the GraphFrame API. If you use one of these methods, it is probably better to use GraphX directly.
Unfortunately, development of GraphX has almost completely stopped, with only a handful of small fixes over the past two years, and its overall performance is very disappointing compared with both in-core and out-of-core libraries.
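To illustrate the GraphX route mentioned above, here is a minimal sketch: a GraphFrame is converted with toGraphX and its edges are repartitioned with one of GraphX's built-in PartitionStrategy values. The helper name partitionWithGraphX and the choice of EdgePartition2D are assumptions made for illustration only; which strategy works best depends on the graph.

```scala
import org.apache.spark.graphx.{Graph, PartitionStrategy}
import org.apache.spark.sql.Row
import org.graphframes.GraphFrame

// Hypothetical helper (not part of GraphFrames): convert a GraphFrame to a
// GraphX graph and repartition its edges into nPart partitions.
// EdgePartition2D is an arbitrary choice here; EdgePartition1D,
// RandomVertexCut and CanonicalRandomVertexCut are the other built-ins.
def partitionWithGraphX(g: GraphFrame, nPart: Int): Graph[Row, Row] =
  g.toGraphX.partitionBy(PartitionStrategy.EdgePartition2D, nPart)
```

Any GraphX algorithm can then be run on the returned graph, at the cost of leaving the DataFrame-based API.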
Methods that are implemented natively using Spark Datasets, which, given the limited programming model and the single partitioning mode, are poorly suited to complex graph processing.
While relational columnar storage can be used for efficient graph processing, the naive iterative join approach used by GraphFrames just does not scale (though it is fine for shallow traversals with one or two hops).
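To make the join-based approach concrete, here is a minimal sketch of a two-hop traversal written as a self-join on the edges DataFrame. It assumes the usual src and dst columns; the helper name twoHops and the output column names are hypothetical, and this is not GraphFrames' internal implementation.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: return (start, end) pairs connected by a path of
// exactly two edges, by joining the edge list with itself on dst = src.
def twoHops(edges: DataFrame): DataFrame =
  edges.as("e1")
    .join(edges.as("e2"), col("e1.dst") === col("e2.src"))
    .select(col("e1.src").as("start"), col("e2.dst").as("end"))
```

Every additional hop adds another join over the full edge list, which is why this pattern stays manageable for one or two hops but degrades quickly for deeper or iterative traversals.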
You can try repartitioning the vertices and edges DataFrames by id and src, respectively:
val nPart: Int = ???  // number of partitions to use
GraphFrame(v.repartition(nPart, v("id")), e.repartition(nPart, e("src")))
which should help in some cases.
Overall, in its current state (December 2016), Spark is not a good choice for intensive graph analytics.