Multiprocessing.Pool with a global variable

Question

I am using the Pool class from Python's multiprocessing library to write a program that will run on an HPC cluster.

Here is an abstraction of what I'm trying to do:

    from multiprocessing import Pool

    def myFunction(x):
        # myObject is a global variable in this case
        return myFunction2(x, myObject)

    def myFunction2(x, myObject):
        myObject.modify()  # here I am calling some method that changes myObject
        return myObject.f(x)

    poolVar = Pool()
    argsArray = [ARGS ARRAY GOES HERE]
    output = poolVar.map(myFunction, argsArray)

The function f(x) lives in a *.so file, i.e. it calls a C function.

The problem I ran into is that the value of the output variable is different every time I run my program, even though myObject.f() is a deterministic function. (If I use only one process, the output is the same on every run.)

I tried creating the object inside the function instead of storing it as a global variable:

    def myFunction(x):
        myObject = createObject()
        return myFunction2(x, myObject)

However, in my program creating the object is expensive, so it is much cheaper to create myObject once and then modify it each time I call myFunction2(). I would therefore prefer not to create the object on every call.

Do you have any tips? I am very new to parallel programming, so I could be wrong. I decided to use the Pool class, because I wanted to start with something simple. But I am ready to try the best way to do this.

python multiprocessing

Hugh medal Sep 13 '13 at 4:12

1 answer

Bakuriu · Answer 1 · 2013-09-13T05:38:21+0000

"I am using the Pool class from Python's multiprocessing library to do some shared-memory processing on an HPC cluster."

Processes are not threads! You cannot just replace Thread with Process and expect everything to work the same way. Processes do not share memory, which means that global variables are copied into each process, so changes made in a worker never affect the value in the original process.
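To make the copying concrete, here is a minimal sketch. The names counter, bump, and run_demo are made up for illustration and are not from the question. Each worker increments its own copy of a global while the parent's copy stays untouched. This may also be what causes the varying output in the question: when a worker mutates its private copy, the value a task sees depends on which tasks that worker happened to handle earlier.

```python
import multiprocessing as mp

counter = 0  # module-level global; every worker process gets its own copy


def bump(_):
    """Increment this process's copy of the global and return the new value."""
    global counter
    counter += 1
    return counter


def run_demo(processes=2, tasks=4):
    """Run bump() in a pool and report what the parent and the workers saw."""
    with mp.Pool(processes) as pool:
        seen_in_workers = pool.map(bump, range(tasks))
    # The workers only changed their own copies, so the parent still sees 0.
    return counter, seen_in_workers


if __name__ == "__main__":
    parent_value, worker_values = run_demo()
    print("parent:", parent_value)
    print("workers:", worker_values)  # depends on how tasks were distributed
```

The values returned by the workers can differ from run to run, because the scheduler decides which worker receives which task.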

If you want to share memory between processes, you must use the multiprocessing data types such as Value and Array, or use a Manager to create shared lists, etc.
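As a rough sketch of Value and Array, assuming plain Process objects rather than a Pool to keep things short (record and run_demo are hypothetical names):

```python
import multiprocessing as mp


def record(total, squares, i):
    """Write i*i into slot i of the shared array and add i to the shared total."""
    squares[i] = i * i
    # += on a shared Value is a read-modify-write, so guard it with the lock.
    with total.get_lock():
        total.value += i


def run_demo(n=4):
    total = mp.Value("i", 0)    # shared C int, initially 0
    squares = mp.Array("i", n)  # shared array of n C ints, zero-initialized
    procs = [mp.Process(target=record, args=(total, squares, i)) for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return total.value, squares[:]


if __name__ == "__main__":
    print(run_demo())
```

Unlike an ordinary global, the changes made by the child processes are visible in the parent, because both objects live in shared memory.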

In particular, you might be interested in the Manager.register method, which lets a Manager create shared custom objects (although the arguments and return values of their methods must be picklable).
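A minimal sketch of register, assuming a subclass of BaseManager from multiprocessing.managers. Counter, CounterManager, bump, and run_demo are hypothetical names, and Counter merely stands in for the expensive object:

```python
import multiprocessing as mp
import threading
from multiprocessing.managers import BaseManager


class Counter:
    """Hypothetical stand-in for an expensive, stateful object."""

    def __init__(self):
        self._n = 0
        # The manager handles each client connection in its own thread.
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._n += 1

    def value(self):
        with self._lock:
            return self._n


class CounterManager(BaseManager):
    pass


# Register the class so the manager can create instances and hand out proxies.
CounterManager.register("Counter", Counter)


def bump(counter_proxy, times):
    """Call the shared object through its proxy from a child process."""
    for _ in range(times):
        counter_proxy.increment()


def run_demo(processes=3, times=5):
    with CounterManager() as manager:
        counter = manager.Counter()  # the real object lives in the manager process
        procs = [mp.Process(target=bump, args=(counter, times))
                 for _ in range(processes)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return counter.value()


if __name__ == "__main__":
    print(run_demo())
```

Every method call on the proxy is a round trip to the manager process, which is where the pickling overhead comes from.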

However, I'm not sure whether this will improve performance, since any communication between processes requires pickling, and pickling usually takes longer than simply instantiating the object.

Note that you can do some initialization of the worker processes by passing the initializer and initargs arguments when creating the Pool.

For example, in its simplest form, to create a global variable in each worker process:

    def initializer():
        global data
        data = createObject()

Used as:

    pool = Pool(4, initializer, ())

The worker functions can then simply use the data global variable.
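Put together, the pattern could look roughly like the sketch below. ExpensiveObject, init_worker, work, and run_demo are hypothetical names, and ExpensiveObject is only a stand-in for whatever createObject() returns in the question:

```python
import multiprocessing as mp

data = None  # per-process global, filled in by init_worker


class ExpensiveObject:
    """Hypothetical stand-in for the costly object from the question."""

    def __init__(self, scale):
        self.scale = scale

    def f(self, x):
        return self.scale * x


def init_worker(scale):
    """Runs once in every worker process when the pool starts."""
    global data
    data = ExpensiveObject(scale)


def work(x):
    """Use the object that this worker built during initialization."""
    return data.f(x)


def run_demo(values, processes=2, scale=10):
    with mp.Pool(processes, initializer=init_worker, initargs=(scale,)) as pool:
        return pool.map(work, values)


if __name__ == "__main__":
    print(run_demo([1, 2, 3]))
```

The object is built once per worker rather than once per task. If a task mutates it, as modify() does in the question, it is probably worth resetting the state at the start of each task so that the result does not depend on what the worker processed before.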

Style note: never use the name of a built-in for your variables or modules. In your case, object is a built-in. Otherwise you will end up with unexpected errors that can be obscure and hard to track down.
