Python page rank

I am new to Python and I am trying to calculate the PageRank vector in Python according to this equation:

    Pi(k+1) = Pi(k) * G = alpha * Pi(k) * H + (alpha * Pi(k) * a + (1 - alpha)) * e^T / n

where Pi(k) is the PageRank vector after the k-th iteration, G is the Google matrix, H is the hyperlink matrix, a is the dangling node vector, alpha = 0.85, e is the vector of ones, and n is the number of URLs.

Calculating with G takes a lot of time, while using the hyperlink matrix H, which is sparse, should take significantly less time.

Here is my code:

    for i in range(1, k_steps+1):
        for j in range(0, len(dictionary_urls)):
            for k in range(0, len(dictionary_urls)):
                if matrix_H[k][j] != 0:
                    matrix_pi_k[i][j] += matrix_pi_k[i-1][k] * float(matrix_H[k][j])
                alpha_pi_k_a += matrix_pi_k[i-1][k] * float(vector_a[k])
            alpha_pi_k_a = alpha_pi_k_a * float(alpha)
            alpha_pi_k_a = alpha_pi_k_a + float((1 - alpha))
            alpha_pi_k_a = alpha_pi_k_a / float(len(dictionary_urls))
            matrix_pi_k[i][j] = matrix_pi_k[i][j] * float(alpha)
            matrix_pi_k[i][j] = matrix_pi_k[i][j] + float(alpha_pi_k_a)
            alpha_pi_k_a = 0

k_steps is the number of iterations required.

dictionary_urls contains all the URLs.

After executing the code, matrix_pi_k should hold the Pi vector for every iteration.

I calculated all the necessary variables. The runtime using H turned out to be almost equal to the runtime using G, although in theory it should be different.

Why is that, and what should I change to reduce the runtime?

Thanks.

2 answers

The problem is that you are multiplying a sparse matrix by a dense vector using the ordinary dense matrix-vector multiplication algorithm. You won't see any speedup that way.

Suppose you have an n x n matrix A (dense or sparse) and an n-vector x. To compute y = Ax, we can write:

    y = [0]*n
    for i in range(n):
        for j in range(n):
            y[i] += A[i,j]*x[j]

This works whether the matrix A is dense or sparse. Suppose A is sparse. We still iterate over all columns of A to compute a single entry of y, even though most of those entries are zero. So the outer loop goes through n iterations, and the inner loop also goes through n iterations.

If we know which entries of A are nonzero, we can do much better. Suppose we have a list of the nonzero columns of row i; call it nonzero[i]. Then we can replace the inner loop with an iteration over this list:

    y = [0]*n
    for i in range(n):
        for j in nonzero[i]:
            y[i] += A[i,j]*x[j]

So although the outer loop still does n iterations, the inner loop only does as many iterations as there are nonzero entries.

This is where the speedup of sparse matrix-vector multiplication comes from.
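
Just to illustrate (this helper is not part of the answer itself), such a nonzero list could be built once up front, assuming A can be indexed as A[i, j] like in the snippets above:

    # For each row i, record the columns j whose entries are nonzero.
    nonzero = [[j for j in range(n) if A[i, j] != 0] for i in range(n)]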

Use numpy!

But you have another problem: you are doing the matrix multiplication in pure Python, which (due to type checking, non-contiguous data structures, etc.) is slow. The solution is to use numpy, which provides fast algorithms and data structures, and then scipy's sparse matrices, which implement the fast sparse matrix-vector product for you.
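
For instance, the power iteration from the question could look roughly like this with a sparse H (a sketch only, reusing the question's matrix_H, vector_a, alpha, k_steps and dictionary_urls; the uniform starting vector is an assumption):

    import numpy as np
    import scipy.sparse

    n = len(dictionary_urls)
    H = scipy.sparse.csr_matrix(matrix_H)   # hyperlink matrix in sparse CSR format
    a = np.asarray(vector_a, dtype=float)   # dangling-node vector
    pi = np.full(n, 1.0 / n)                # assumption: start from the uniform vector

    for _ in range(k_steps):
        # H.T.dot(pi) is pi*H computed with the fast sparse matrix-vector product;
        # the dangling and teleportation terms are cheap O(n) scalar corrections.
        pi = alpha * H.T.dot(pi) + (alpha * pi.dot(a) + (1 - alpha)) / n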

Experiment

We can demonstrate all of this with a quick experiment. First we create a dense 10,000 x 10,000 matrix A:

    >>> import numpy as np
    >>> n = 10000
    >>> A = np.random.sample((n,n))

Then we make a matrix B by thresholding A. B is the same size as A, but only about 10% of its entries are nonzero:

 >>> B = np.where(A < .1, A, 0).astype(float) 

Now we make a dense vector to multiply A and B by:

    >>> x = np.random.sample(n)
    >>> %timeit A.dot(x)
    10 loops, best of 3: 46.7 ms per loop
    >>> %timeit B.dot(x)
    10 loops, best of 3: 43.7 ms per loop

Computing Ax takes just as long as computing Bx, even though B is "sparse". Of course, it isn't really: it is stored as a dense matrix with a lot of zero entries. Let's make it actually sparse:

    >>> import scipy.sparse
    >>> sparse_B = scipy.sparse.csr_matrix(B)
    >>> %timeit sparse_B.dot(x)
    100 loops, best of 3: 12 ms per loop

There's our speedup! Now, just for fun, what if we store A as a sparse matrix, even though it is really dense?

    >>> sparse_A = scipy.sparse.csr_matrix(A)
    >>> %timeit sparse_A.dot(x)
    10 loops, best of 3: 112 ms per loop

Ouch! But this is to be expected: storing a dense matrix in a sparse format adds overhead to the multiplication.


Based on your formula, the calculation using matrix H does not look any faster than the one using matrix G.

Explanation:

Perhaps you should take a look at the introduction to Big O notation .

The right-most part of the formula (after the +) is just a simple calculation with no loops, and its complexity is O(1). That means it does not depend on the number of URLs you take into account.

The calculations involving H and G, on the other hand, are at least O(n^2) (with n the number of URLs).

Edit:

In the most deeply nested part of your code, you have two statements, one of which depends on whether matrix_H[k][j] is 0 or not. But even when it is 0, which is most of the time if H is a sparse matrix, the second statement is still executed, and you go through the loop iteration anyway.

This still gives you O(n^2) complexity, so traversing H is not (much) faster than traversing G.
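
For illustration, here is one way the question's loop could exploit the sparsity of H (a sketch; nonzero_in_col is a hypothetical helper, and the other names are taken from the question):

    n = len(dictionary_urls)
    # Hypothetical helper: for each column j, the rows k where matrix_H[k][j] != 0.
    nonzero_in_col = [[k for k in range(n) if matrix_H[k][j] != 0] for j in range(n)]

    for i in range(1, k_steps + 1):
        # Dangling + teleportation term: computed once per iteration, in O(n).
        dangling = sum(matrix_pi_k[i-1][k] * vector_a[k] for k in range(n))
        correction = (alpha * dangling + (1 - alpha)) / n
        for j in range(n):
            # Only touch the nonzero entries of column j.
            s = sum(matrix_pi_k[i-1][k] * matrix_H[k][j] for k in nonzero_in_col[j])
            matrix_pi_k[i][j] = alpha * s + correction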


Source: https://habr.com/ru/post/1209719/

