MapReduce, Python, and NetworkX

I applied an unweighted random walk function to a graph that I built in Python using NetworkX. Below is a snippet of my random walk program. Elsewhere in my program I have a method that creates a graph, and a method that runs my various custom graph tests. One of these tests selects two nodes at random from the graph and walks randomly between them. Two quantities are computed from this random walk: the hitting time (the number of edges traversed from the start point to the end point) and the commute time (the number of edges traversed from start to end and back to the start point).

    import random

    def unweighted_random_walk(starting_point, ending_point, graph):
        '''
        starting_point: String that represents the starting point in the graph
        ending_point: String that represents the ending point in the graph
        graph: A NetworkX Graph object
        '''
        # Begin the random walk
        current_point = starting_point
        hitting_time = 0
        # Count the edges traversed until we reach the ending point
        while current_point != ending_point:
            # Pick one of the edges out of the current node with equal probability
            neighbors = list(graph.neighbors(current_point))
            current_point = random.choice(neighbors)
            hitting_time += 1
        return hitting_time
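For illustration, here is one way this function might be exercised; the toy graph and node labels below are made up for this example and are not part of my actual program:

    import networkx as nx

    # A small toy graph; any connected NetworkX graph would do
    G = nx.cycle_graph(10)  # nodes 0 through 9 arranged in a ring

    hit = unweighted_random_walk(0, 5, G)            # start -> end
    commute = hit + unweighted_random_walk(5, 0, G)  # end -> start, summed
    print(hit, commute)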

My random walk code is pretty straightforward, because I just select random neighbors until I reach the endpoint. However, this implementation is very slow when I run many random walks (I think I will need to run a million at some point).

My question is: is there a way to use Hadoop MapReduce to parallelize some of the operations of this random walk? Is there a better way to make my random walk faster?

3 answers

To answer the question:

  • You need to see Ned's comment; he beat me to it. Explain your code; more on that later.

  • I can't see any way to run the walk algorithm itself in parallel. By its nature, it is a linear process; each step depends on the previous one. You cannot know which node to go to next without knowing the previous node (except for the starting node). If your code is not really a random walk, in the sense that the choices do not depend on the previous ones, you should explain that in your question.

  • Assuming that each random walk is independent, you can run many random walks at the same time. We call this scenario embarrassingly parallel, and that is a very good thing.

  • I have no idea why you want to use Hadoop specifically here. The first question should be: "Can I just write this as a basic program and use a qsub script (or its equivalent) to farm out a bunch of runs of this program to a server?" If the answer is no, the next question is: "Can I use the multiprocessing module?" If you go the multiprocessing route, you may want to look at Jesse Noller's multiprocessing presentation from PyCon 2009; a minimal sketch of that approach follows this list.
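As a rough sketch of the multiprocessing route (not the author's code: the node names, pool size, and walk count here are arbitrary, and the graph is assumed to be a module-level object that the worker processes can access):

    from multiprocessing import Pool

    def one_walk(_):
        # Each walk is independent, so the workers share no state
        return unweighted_random_walk('a', 'b', graph)

    if __name__ == '__main__':
        with Pool(processes=4) as pool:
            hitting_times = pool.map(one_walk, range(1000))
        print(sum(hitting_times) / len(hitting_times))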

Now, regarding your specific code ...

  • You need to explain what the nodes of your graph are. I am confused about why you are treating them like dictionaries (calling .keys() on them). If they are dictionaries, tell us what the keys and values are. I hope you are not storing the neighbors as keys there, because NetworkX already gives you those via the Graph.neighbors() method. If you are storing a node's neighbors in the node itself, you have a misunderstanding of the NetworkX library. Let the graph do the work for you.

  • You have the same logic twice in unweighted_random_walk(), once for the trip from the start node to the destination node, and then again from the destination node back to the start node. Why? All you need is the logic for one direction. Call the function twice: once with the start and destination nodes as arguments to walk one way, and then with the arguments swapped to walk back. Then you have two independent calls, and they can run in parallel.

  • Do not use while True: (not just here, but in general). You should always state the actual condition under which the loop continues, e.g.

     while current_point != ending_point: ... 
  • Do not return a string describing the information; return the information itself, e.g.

     return hitting_time 

    Note that if you follow my advice in the second bullet above, you only need to return the hitting time, and you can sum the hitting times of the outbound call and the return call to get the total commute time. Convenient, right? (See the sketch below.)
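A sketch of that idea, using the hypothetical node names 'a' and 'b':

    to_time = unweighted_random_walk('a', 'b', graph)
    from_time = unweighted_random_walk('b', 'a', graph)
    commute_time = to_time + from_time

The two calls are independent, so they could also be handed to separate worker processes.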


EDIT: Added links to Jesse Noller's presentation and to Disco.


I do not see how map-reduce can help you here. It is used where you have a two-part operation: the first part is a computation that can be performed independently on many different data elements, and the second part somehow combines all those results. There may be a clever way to use map-reduce to help with this random walk, but I do not see it.

Your random walk is completely unconstrained: it can wander in many cycles, even bouncing back and forth between the same two nodes, before continuing on. Perhaps you want to limit it in some way so that the search space is not so large?
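One simple way to bound the walk, purely as an illustration of that suggestion (the cap of 10000 steps is arbitrary):

    import random

    def bounded_random_walk(start, end, graph, max_steps=10000):
        current = start
        steps = 0
        while current != end and steps < max_steps:
            current = random.choice(list(graph.neighbors(current)))
            steps += 1
        # None signals that the walk was cut off before reaching the end
        return steps if current == end else None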


You do not need to take a random walk at all if you use the formula described in detail in this article.
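The linked article is not reproduced here, but for an unweighted random walk the expected hitting times can indeed be computed exactly rather than simulated: h(target) = 0, and for every other node u, h(u) = 1 plus the average of h over u's neighbors. A sketch of solving that linear system with numpy (assuming the graph is connected; otherwise the system is singular):

    import numpy as np

    def expected_hitting_times(graph, target):
        # Solve h(u) = 1 + (1/deg(u)) * sum of h(v) over neighbors v of u,
        # with h(target) = 0, as a linear system A h = b
        nodes = [n for n in graph.nodes() if n != target]
        index = {n: i for i, n in enumerate(nodes)}
        A = np.eye(len(nodes))
        b = np.ones(len(nodes))
        for u in nodes:
            deg = graph.degree(u)
            for v in graph.neighbors(u):
                if v != target:
                    A[index[u], index[v]] -= 1.0 / deg
        h = np.linalg.solve(A, b)
        return {u: h[index[u]] for u in nodes}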

