An efficient way to manipulate a large graph that does not fit in memory

I need to support a large directed graph G, possibly with millions of nodes and edges, which may not fit in memory.

Some of the common operations that I need to perform on this graph include:

  • Each node / edge will have properties associated with it, such as access count, weight, etc.

  • For nodes (vertices) I will need to perform efficient queries based on property values. For example, find the nodes whose X value is greater than v1 but less than v2. This probably requires building an index on certain fields.

  • I will need to find all incoming and outgoing edges of a given node and update the edge weights.

  • I will need to do a local (DFS-based) traversal from a given node and return all paths that satisfy a user-supplied predicate (the predicate can use the node / edge property values along the path).

  • I will need to add / remove nodes / edges efficiently. This is not performed as often as operations 1, 2, 3.
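The DFS-based predicate traversal in the last-but-one requirement can be sketched as follows, assuming the hot part of the graph is resident in memory. All names here (`Edge`, `Graph`, `localPaths`) are illustrative, not from any particular library; the depth cap is what keeps the traversal local.

```cpp
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory adjacency list: node ids are ints, edges carry a weight.
struct Edge { int to; double weight; };
using Graph = std::unordered_map<int, std::vector<Edge>>;

// Depth-first traversal from `start`, collecting every path (as a node sequence)
// for which the user predicate `accept` returns true. Depth is capped at
// `maxDepth` edges to keep the search local to the start node.
std::vector<std::vector<int>> localPaths(
    const Graph& g, int start, int maxDepth,
    const std::function<bool(const std::vector<int>&)>& accept) {
  std::vector<std::vector<int>> result;
  std::vector<int> path{start};
  std::function<void(int, int)> dfs = [&](int node, int depth) {
    if (accept(path)) result.push_back(path);   // predicate sees the whole path
    if (depth == maxDepth) return;
    auto it = g.find(node);
    if (it == g.end()) return;
    for (const Edge& e : it->second) {
      path.push_back(e.to);
      dfs(e.to, depth + 1);
      path.pop_back();                          // backtrack
    }
  };
  dfs(start, 0);
  return result;
}
```

An on-disk variant would keep the same recursion but fetch each node's adjacency list through the cache / storage layer instead of a hash map.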

Hotspots may appear in the graph that are accessed much more often than other parts, and I would like to cache these hotspots in memory.
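A common way to keep such hotspots resident is a fixed-size LRU cache in front of the disk store. A minimal sketch, assuming node records have already been serialized to strings (`NodeCache` and the record type are illustrative assumptions, not an existing API):

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Minimal LRU cache keyed by node id, mapping to a serialized node record.
// Hot nodes stay resident; cold ones are evicted and would be re-read from
// the disk store on the next access.
class NodeCache {
 public:
  explicit NodeCache(std::size_t capacity) : capacity_(capacity) {}

  // Returns a pointer to the cached record, or nullptr on a miss.
  const std::string* get(int nodeId) {
    auto it = index_.find(nodeId);
    if (it == index_.end()) return nullptr;
    order_.splice(order_.begin(), order_, it->second);  // mark most-recent
    return &it->second->second;
  }

  void put(int nodeId, std::string record) {
    auto it = index_.find(nodeId);
    if (it != index_.end()) {                 // update in place, refresh LRU
      it->second->second = std::move(record);
      order_.splice(order_.begin(), order_, it->second);
      return;
    }
    if (index_.size() == capacity_) {         // evict least-recently used
      index_.erase(order_.back().first);
      order_.pop_back();
    }
    order_.emplace_front(nodeId, std::move(record));
    index_[nodeId] = order_.begin();
  }

 private:
  std::size_t capacity_;
  std::list<std::pair<int, std::string>> order_;  // front = most recent
  std::unordered_map<int, std::list<std::pair<int, std::string>>::iterator> index_;
};
```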

What is an efficient way to achieve this with minimal implementation effort?

I am looking at some disk-based databases such as Neo4j / InfiniteGraph / DEX. Although they support all of the operations above, they seem like overkill, because I don't need many of the features they offer, such as transactional consistency / concurrency control or cluster-based replication. In addition, many of them are Java-based, and I would prefer something with a C / C++ interface.

Basically, I just need an on-disk library that efficiently handles persistence, queries on nodes, and local traversal. Do you have any recommendations for an existing (open-source) project I could use? If not, what is the best way to implement such a thing?
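For the property range query in the requirements above, the access pattern is the same whether the index lives in memory or on disk: an ordered map scanned between two bounds. A sketch using `std::multimap` as a stand-in for an on-disk B-tree (Berkeley DB, LMDB, and similar embedded stores expose the same ordered range scan via cursors; the function name is illustrative):

```cpp
#include <map>
#include <vector>

// Secondary index from property value X to node id. std::multimap is a
// stand-in for an on-disk B-tree, which offers the same ordered range scan.
std::vector<int> nodesWithXBetween(const std::multimap<double, int>& idx,
                                   double v1, double v2) {
  std::vector<int> hits;
  // upper_bound(v1) .. lower_bound(v2) selects v1 < X < v2, both bounds exclusive.
  for (auto it = idx.upper_bound(v1); it != idx.lower_bound(v2); ++it)
    hits.push_back(it->second);
  return hits;
}
```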

4 answers

Since you "prefer something with a C / C++ interface," you might give GraphChi a try.

"GraphChi can perform very large graph calculations on only one machine, using a new graph processing algorithm from disk (SSD or hard disk).

The GraphChi C++ source code and documentation are available on its Google Code project page.

Example applications include algorithms such as PageRank, community detection, and connected components. You can modify them to suit your needs.


Another option might be e4Graph, a simple C++ persistence library. I have not tried it myself, but it looks promising.


I have worked with several large graphs with millions and millions of nodes. What I recommend is that past a certain point you perform weighted compression: take N nodes, compress them into N / M nodes using averages and weights, and then rebuild the graph.

You would recompress after every so many added nodes, at an interval of your choosing. The reason is that, since everything is weighted, you can, in a sense, normalize the graph over time.

You have a directed graph. As you move through the larger (merged) nodes, if A > B > (E & D) > H, you can infer things like A > H.

This lets you find common routes between nodes based on the shortest transitions between them. If a node is not in the compressed list, the search will at least land in the right region and can, if necessary, be decompressed in a sense.
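A minimal sketch of the merge step described above, assuming the grouping of nodes into supernodes has already been chosen (all names are illustrative; parallel edges into a supernode are merged by averaging their weights, and intra-group edges are dropped):

```cpp
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

// Adjacency list: node id -> list of (target id, weight).
using WeightedEdges =
    std::unordered_map<int, std::vector<std::pair<int, double>>>;

// Rebuilds the graph over supernodes. `group` maps every original node id to
// its supernode id; edges between members of the same group are dropped, and
// parallel edges between two groups are averaged into a single weighted edge.
WeightedEdges compress(const WeightedEdges& g,
                       const std::unordered_map<int, int>& group) {
  // Accumulate (weight sum, edge count) per merged edge, then average.
  std::map<std::pair<int, int>, std::pair<double, int>> acc;
  for (const auto& [from, edges] : g) {
    int f = group.at(from);
    for (const auto& [to, w] : edges) {
      int t = group.at(to);
      if (f == t) continue;               // drop intra-group edges
      auto& a = acc[{f, t}];
      a.first += w;
      a.second += 1;
    }
  }
  WeightedEdges out;
  for (const auto& [key, sumCount] : acc)
    out[key.first].push_back({key.second, sumCount.first / sumCount.second});
  return out;
}
```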


If you are looking for a C++ solution, GraphLab might be a good choice. It is a distributed computing framework, but since you are dealing with millions of nodes / edges, it could be a good fit. This article lists some other graph libraries:

http://www.ibm.com/developerworks/library/os-giraph/


Source: https://habr.com/ru/post/1484748/
