Partitioning with Spark GraphFrames

I'm working with a rather large (?) graph (60 million vertices and 9.5 billion edges) using Spark GraphFrames. The underlying data is not large: the vertices take about 500 MB on disk and the edges about 40 GB. My containers are frequently shut down because of memory problems, but I think the underlying issue is that GraphFrames is constantly shuffling data (I'm seeing shuffle read/write of up to 150 GB). Is there a way to efficiently partition a GraphFrame, or the underlying edges/vertices, to reduce shuffling?

1 answer

TL;DR It is not possible to efficiently partition a GraphFrame.

GraphFrame algorithms can be divided into two categories:

  • Methods that delegate processing to their GraphX counterparts. GraphX supports a number of partitioning methods, but these are not exposed through the GraphFrame API. If you use one of these, it is probably better to use GraphX directly.

    Unfortunately, development of GraphX has stopped almost completely, with only a few small fixes over the last two years, and overall performance is highly disappointing compared to both in-core libraries and out-of-core graph processing libraries.

  • Methods implemented natively using Spark Datasets, which, given the limited programming model and only a single partitioning mode, are deeply unsuitable for complex graph processing.

    While relational columnar storage can be used for efficient graph processing, the naive iterative join approach employed by GraphFrames just doesn't scale (although it is fine for shallow traversals of one or two hops).

    You can try to repartition the vertices and edges DataFrames by id and src respectively:

     import org.graphframes.GraphFrame

     val nPart: Int = ???  // number of partitions, to be chosen for your cluster
     GraphFrame(v.repartition(nPart, v("id")), e.repartition(nPart, e("src")))

    which should help in some cases.
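
As a slightly fuller sketch of the repartitioning idea above, the helper below wraps it in a function. The object name `PartitionedGraph`, the method `build` and its parameter names are hypothetical and only serve the illustration; it assumes the vertices DataFrame has an `id` column and the edges DataFrame has a `src` column, as GraphFrames requires.

```scala
import org.apache.spark.sql.DataFrame
import org.graphframes.GraphFrame

object PartitionedGraph {
  // Hypothetical helper: hash-partitions vertices by "id" and edges by "src"
  // into the same number of partitions, then builds the GraphFrame from them.
  def build(vertices: DataFrame, edges: DataFrame, nPart: Int): GraphFrame = {
    val v = vertices.repartition(nPart, vertices("id"))
    val e = edges.repartition(nPart, edges("src"))
    GraphFrame(v, e)
  }
}
```

A call such as `PartitionedGraph.build(v, e, nPart)` returns the graph, and `g.edges.rdd.getNumPartitions` can be used to confirm the partition count. Whether later joins actually reuse this partitioning depends on the query plan; matching `spark.sql.shuffle.partitions` to `nPart` may help, but that is an assumption to check with `explain()` rather than a guarantee.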

In general, in the current state (December 2016), Spark is not a good choice for intensive graph analytics.
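
For algorithms in the first category, where using GraphX directly is suggested, a partition strategy can be applied after converting the graph. The following is a minimal sketch, not a tuned setup: the object and method names are hypothetical, and `EdgePartition2D` is picked only as an example of the built-in strategies.

```scala
import org.apache.spark.graphx.{Graph, PartitionStrategy}
import org.apache.spark.sql.Row
import org.graphframes.GraphFrame

object GraphXPartitioning {
  // Hypothetical helper: converts a GraphFrame to a GraphX graph and
  // repartitions its edges with one of GraphX's built-in strategies.
  // Which strategy suits a given graph has to be measured, not assumed.
  def toPartitionedGraphX(g: GraphFrame): Graph[Row, Row] =
    g.toGraphX.partitionBy(PartitionStrategy.EdgePartition2D)
}
```

The resulting `Graph[Row, Row]` can then be passed to GraphX algorithms instead of going through the GraphFrame wrappers.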


Source: https://habr.com/ru/post/1013558/

