Can PageRank be computed without the entire dataset?

Sorry if this is a stupid question, but I thought I'd take a shot. Say I have a huge graph (100 billion nodes, for example). Neo4j supports up to 32 billion and others support roughly the same, so say I cannot have the entire dataset in a database at once. Can I still run PageRank on it if it is a directed graph with no cycles, and each set of nodes only links forward to the next set of nodes (so no links are ever created back to older data, only new links for new datasets)?

Is there a way to take the previous PageRank estimates and apply them to new datasets? (I only care about the PageRank of the most recent dataset, but I need the previous PageRanks to compute it.)

Does that make sense? If so, can it be done?

1 answer

You need to compute the principal eigenvector of a 100 billion by 100 billion matrix. Unless it is extremely sparse, that will not fit on your machine. So you need a way to compute the leading eigenvector of a matrix when you can only look at a small part of the matrix at a time.

Iterative methods for computing eigenvectors only require that you store a few vectors at each iteration (each with 100 billion elements). Those might not fit on your machine (with 4-byte floats you would need about 375 GB per vector). Once you have a candidate ranking vector, you can (very slowly) apply your giant matrix to it by reading the matrix in chunks (since you can look at 32 billion rows at a time, you will need a little over 3 chunks). Repeating this process gives you the basics of the power method, which is what is used in PageRank. See http://www.ams.org/samplings/feature-column/fcarc-pagerank and http://en.wikipedia.org/wiki/Power_iteration
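To make that concrete, here is a minimal sketch of a chunked power iteration in Python/NumPy. The binary edge-list layout, the chunk size, and the toy node count are all assumptions for illustration, not anything your database requires; in practice the "chunks" would be whatever slices of the graph your storage can hand you, and at your scale the rank vectors themselves would have to live on disk rather than in a NumPy array.

```python
import numpy as np

N = 1_000_000          # toy node count; the idea is the same at 100 billion
DAMPING = 0.85
CHUNK_EDGES = 10_000_000  # how many edges we read from disk per chunk

def pagerank_pass(rank, out_degree, edge_file):
    """One application of the damped link matrix, streaming edges in chunks.

    Assumes a hypothetical file of (src, dst) pairs stored as 8-byte ints,
    and that every node has at least one out-link (dangling nodes would
    need the usual extra handling).
    """
    new_rank = np.zeros_like(rank)
    with open(edge_file, "rb") as f:
        while True:
            buf = np.fromfile(f, dtype=np.int64, count=2 * CHUNK_EDGES)
            if buf.size == 0:
                break
            edges = buf.reshape(-1, 2)
            src, dst = edges[:, 0], edges[:, 1]
            # each source spreads its current rank evenly over its out-links;
            # np.add.at handles repeated destination indices correctly
            np.add.at(new_rank, dst, rank[src] / out_degree[src])
    return (1 - DAMPING) / N + DAMPING * new_rank

def pagerank(out_degree, edge_file, iters=50):
    rank = np.full(N, 1.0 / N)      # uniform starting guess
    for _ in range(iters):
        rank = pagerank_pass(rank, out_degree, edge_file)
    return rank
```

Each call to `pagerank_pass` is one full (slow) read of the data, which is exactly why the number of passes becomes the thing you want to minimize.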

Of course, the limiting factor here is how many passes over the matrix you need. It turns out that by storing multiple candidate vectors and using some randomized algorithms, you can get good accuracy with fewer reads of your data. This is a current research topic in the applied math world. You can find more information here http://arxiv.org/abs/0909.4061 and here http://arxiv.org/abs/0809.2274 . There is code available here: http://code.google.com/p/redsvd/ , but you can't just use it off-the-shelf for data sizes like the ones you are talking about.
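As a rough illustration of the "multiple candidate vectors" idea, here is a sketch of a randomized block iteration in the spirit of those papers. The `matvec` callback (one pass over the data per call), the block size `k`, and the pass count are assumptions for the sketch; the real algorithms in the papers above add stabilization and error estimates on top of this skeleton.

```python
import numpy as np

def randomized_leading_eigvec(matvec, n, k=8, passes=4, seed=0):
    """Sketch of a randomized block method: push a small block of random
    vectors through the matrix a few times, then solve a tiny k-by-k
    eigenproblem in the subspace they span. `matvec(Q)` applies the
    (possibly out-of-core) matrix to a block, so each call is one pass
    over the data, and `passes` is the quantity you are economizing."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((n, k))
    for _ in range(passes):
        Q, _ = np.linalg.qr(matvec(Q))    # re-orthonormalize after each pass
    # Rayleigh-Ritz step: project the matrix into the k-dimensional subspace
    B = Q.T @ matvec(Q)
    vals, vecs = np.linalg.eig(B)
    lead = np.argmax(np.abs(vals))
    return Q @ np.real(vecs[:, lead])     # lift the best small vector back up
```

The point of the block is that k vectors per pass extract far more information from each expensive read than a single power-iteration vector does.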

Another approach is to look into "incremental SVD", which may fit your problem better but is a bit more complicated. Consider this paper: http://www.cs.usask.ca/~spiteri/CSDA-06T0909e.pdf and this discussion: https://mathoverflow.net/questions/32158/distributed-incremental-svd
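For flavor, here is a minimal sketch of one incremental update step (a Brand-style column update; tracking only the left factor and truncating to a fixed rank are simplifying assumptions for illustration). The appeal for your use case is that each new batch of data is folded into the existing factorization without revisiting old data, which is the "reuse previous estimates" workflow you asked about.

```python
import numpy as np

def svd_append_column(U, s, c, rank):
    """Fold a new data column `c` into a truncated SVD (U, s) in place,
    without touching previously processed columns."""
    p = U.T @ c                       # part of c inside the current subspace
    r = c - U @ p                     # part orthogonal to it
    rho = np.linalg.norm(r)
    # small (k+1)x(k+1) core matrix that absorbs the new column
    k = len(s)
    K = np.zeros((k + 1, k + 1))
    K[:k, :k] = np.diag(s)
    K[:k, -1] = p
    K[-1, -1] = rho
    Uk, sk, _ = np.linalg.svd(K)      # cheap: the core is tiny
    new_dir = (r / rho if rho > 1e-12 else r)[:, None]
    U_new = np.hstack([U, new_dir]) @ Uk
    return U_new[:, :rank], sk[:rank]  # truncate back to the fixed rank
```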
