Representation of a large graph with 100 million nodes in C++

I am dealing with a very large graph with 500 million nodes, and the average node degree is 100, so this is a rather sparse graph. I also have to store a weight for each edge. I am currently using two arrays of vectors, declared as follows:

    // V could be 100 million
    vector<int> *AdjList = new vector<int>[V];
    vector<int> *Weight = new vector<int>[V];

Using vectors of vectors is not space efficient: it occupies more than 400 GB of memory. Is there a better, more space-efficient way to store this large graph in memory? Any suggestions for a C++ library?
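(Back-of-the-envelope arithmetic, assuming 4-byte int: 500 million nodes × average degree 100 gives 5 × 10^10 adjacency entries, and each entry costs 4 bytes in AdjList plus 4 bytes in Weight, i.e. about 400 GB of payload alone, before any per-vector overhead.)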

+5
3 answers

Preliminary notes

You might consider using a vector of vectors instead of dynamically allocated arrays:

 vector<vector<int>> AdjList(V); 

In any case, you will have V distinct vector<int> objects in the adjacency list. Each vector needs some space to manage the size and location of its elements. Unfortunately, you double this overhead (and the associated hidden memory management when adding new links) by keeping the weights in a separate vector/array.

So why not combine the adjacency list and the weights?

    struct Link {
        int target;  // node number that was in the adjacency list. Hope none is negative!!
        int weight;
    };
    vector<vector<Link>> AdjList(V);

Is the structure sparse?

If the vast majority of nodes have at least one link, this approach is perfectly fine.

If, on the contrary, many nodes have no outgoing link (or if you have large unused ranges of node identifiers), you might consider:

 map<int, vector<Link>> AdjList; 

A map is an associative array: there will be vectors only for nodes that actually have outgoing links. As a bonus, you can use any numbering scheme for your nodes, even a negative one.

You could even go one step further and use a double map. The first map gives you the outgoing node. The second map maps each target node to its weight:

 map<int, map<int, int>> Oulala; 

But this risks using significantly more memory.

Large volumes?

map and vector manage memory dynamically using the default allocator. But you have many small objects of a predetermined size, so you could consider using a custom allocator. This could significantly reduce the memory management overhead.
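A minimal sketch of what that could look like with C++17's polymorphic allocators (the pool size and V here are illustrative placeholders, not tuned values):

    #include <cstddef>
    #include <memory_resource>
    #include <vector>

    struct Link { int target; int weight; };

    int main() {
        const std::size_t V = 1000;  // illustrative; the real graph would use its node count

        // One large upfront buffer: each allocation is a cheap pointer bump with
        // no per-allocation bookkeeping; nothing is freed until the pool dies,
        // which suits a load-once graph.
        std::pmr::monotonic_buffer_resource pool(64 * 1024 * 1024);  // 64 MB, illustrative

        // The outer vector and, via uses-allocator construction, every inner
        // vector draw their memory from the same pool.
        std::pmr::vector<std::pmr::vector<Link>> AdjList(V, &pool);

        AdjList[0].push_back({42, 7});  // target 42, weight 7
    }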

Also, if you use vectors: when you load the adjacency list of a new node, it can be efficient to reserve the vector's size immediately (if you know it). This avoids several successive reallocations as the vector grows, which, across millions of nodes, can be very expensive.
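For example (assuming the degree of each node is known from a preliminary pass over the input; the degree vector here is hypothetical):

    #include <cstddef>
    #include <vector>

    struct Link { int target; int weight; };

    // degree[v] is assumed to come from a first pass over the edge list.
    void preallocate(std::vector<std::vector<Link>>& AdjList,
                     const std::vector<int>& degree) {
        for (std::size_t v = 0; v < AdjList.size(); ++v)
            AdjList[v].reserve(degree[v]);  // one exact allocation per node instead of
                                            // ~log2(degree) successive regrowths and copies
    }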

Libraries?

Searching for third-party libraries is beyond the scope of SO. But if the above tips are not enough, you could use an existing graph library.

There are a couple of graph libraries around, but many seem to be either no longer supported or not designed for large volumes.

+6

You should implement the graph as a binary decision diagram (BDD) data structure.

In short, the idea is that a graph can be represented as a Boolean function, using the characteristic function of the graph.

There are several ways to encode a graph as a Boolean function using its characteristic function. One way to do it is described in the article and video that I reference at the end of my post.
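For a flavor of the encoding (a minimal made-up example, not taken from the article): number the nodes of a 4-node directed graph 0..3, write the source node as two bits (x1, x0) and the target node as (y1, y0). The characteristic function χ(x1, x0, y1, y0) is true exactly on the bit patterns of existing edges; for a graph whose only edges are 0 → 1 and 2 → 3:

    χ = (¬x1 ∧ ¬x0 ∧ ¬y1 ∧ y0) ∨ (x1 ∧ ¬x0 ∧ y1 ∧ y0)

A BDD then stores χ, and thus the whole edge set, as one compact shared diagram.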

A BDD encodes Boolean functions in a compact way with fast operations. It is probably the most powerful data structure in the universe.

The idea of a BDD is almost the same as that of a binary decision tree, except that each node carries an attribute X, the index of an input variable: to evaluate the function, at each node you follow the high branch if X = true and the low branch if X = false, until you reach a leaf labelled true or false. This is called the Shannon expansion of a Boolean function (the same expansion formula also lets you derive a hardware design of a Boolean function out of multiplexers).

In general, for every combination of input values X_i for which the function is true, there is a unique path from the root node to the true leaf, branching at each node according to the corresponding input variable X_i (we branch in the low or high direction depending on whether X_i is false or true). The same diagram can be used to store several functions (each node is a separate function).

There are 2 optimizations that turn a binary decision tree into a binary decision diagram and make it compact. The idea behind them is the same as in the finite state machine minimization algorithm. As with automata, the minimal BDD of a function is unique (so to check whether 2 arbitrary functions are equal, it suffices to convert both to BDDs and see whether the node representing one function is the same node as the one representing the other: an O(1) (constant time) comparison of 2 pointer values).

One optimization says: if a node has all its edges pointing to the same physical nodes as another node (for the same variable), we merge both nodes into one (this can be done during construction by maintaining a hash table of all created nodes).

The other optimization says: if both the low edge and the high edge of a node for variable X point to the same physical node for variable Y, the X node disappears, because the function has the same value either way: F(... X = true ...) = F(... X = false ...).
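Both rules fit in one tiny constructor. Here is a minimal sketch of that "unique table" idea (the names, types, and leaf conventions are my own illustrative choices, not from any particular BDD package):

    #include <cstdint>
    #include <map>
    #include <tuple>
    #include <vector>

    // Convention for this sketch: node 0 is the false leaf, node 1 the true leaf.
    struct BddNode { int var; std::uint32_t low, high; };

    std::vector<BddNode> nodes = {{-1, 0, 0}, {-1, 1, 1}};  // the two leaves
    std::map<std::tuple<int, std::uint32_t, std::uint32_t>, std::uint32_t> unique_table;

    // mk() creates a node only when neither reduction rule applies.
    std::uint32_t mk(int var, std::uint32_t low, std::uint32_t high) {
        if (low == high)
            return low;                        // rule 2: redundant test, the node disappears
        auto key = std::make_tuple(var, low, high);
        auto it = unique_table.find(key);
        if (it != unique_table.end())
            return it->second;                 // rule 1: identical node already exists, share it
        nodes.push_back({var, low, high});
        return unique_table[key] = std::uint32_t(nodes.size() - 1);
    }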

There are thousands of articles about BDDs and their derivatives (by changing the interpretation of the decision at each node we get, for example, ZDDs, for a compact representation of sets). A typical article on the topic is "What graphs can be efficiently represented by BDDs?" by C. Dong and P. Molitor.

Once you understand the basics of BDDs, if you have time for a longer presentation, this video is excellent: it summarizes how to encode graphs as BDDs.

BDDs are what professional software uses today when it needs to handle millions of nodes.

+3

Well, if the graph does not change, consider packing it into two vectors:

    struct edge {
        int target;
        int weight;
    };
    std::vector<edge> edges;
    std::vector<int> nodes;  // begin-index into edges

This should minimize overhead.
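This is essentially the classic compressed sparse row (CSR) layout. A minimal usage sketch (the V + 1 sentinel entry in nodes, with nodes[V] == edges.size(), is my own convention to keep the loop below uniform):

    #include <vector>

    struct edge { int target; int weight; };

    std::vector<edge> edges;  // all adjacency lists, stored back to back
    std::vector<int>  nodes;  // V + 1 entries: nodes[v] = begin-index into edges

    // The out-edges of node v occupy edges[nodes[v]] .. edges[nodes[v + 1] - 1].
    long long total_weight(int v) {
        long long sum = 0;
        for (int i = nodes[v]; i < nodes[v + 1]; ++i)
            sum += edges[i].weight;  // edges[i].target is the neighbour
        return sum;
    }

With 4-byte ints this costs exactly 8 bytes per edge plus about 4 bytes per node, with just two allocations in total instead of one or more per node.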

That is, of course, unless you can shrink the graph itself further.

0
