How to map this python script using mpi4py?

I apologize if this has already been asked, but I read a bunch of documentation and still do not know how to do what I would like to do.

I would like to run a Python script on several cores at the same time.

I have 1800 .h5 files in a directory with the names "snapshots_s1.h5", "snapshots_s2.h5", etc., each about 30 MB in size. This Python script:

  • Reads the .h5 files from one directory.
  • Retrieves and processes the data in each file.
  • Creates graphs of the extracted data.

Once one file is done, the script reads in the next .h5 file from the directory and repeats the procedure. Therefore, none of the processes would need to exchange data with the others while performing this work.

The script looks like this:

import h5py
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import cmocean
import os

from mpi4py import MPI
import dedalus.public as de  # presumably the source of the 'de' handle below (Dedalus)

de.logging_setup.rootlogger.setLevel('ERROR')

# Plot writes

count = 1
for filename in os.listdir('directory'):  # loops over the ~1800 .h5 files
    with h5py.File('directory/{}'.format(filename), 'r') as file:

        ### Manipulate 'filename' data (each file is ~30 MB in size).
        ...

        ### Plot 'filename' data (figure files are written out here).
        ...
    count = count + 1

In my reading I came across both mpi4py and multiprocessing.Pool, but I could not work out how to apply either of them to this script.

So my question is: how do I map this script over the files with mpi4py? And if mpi4py is the wrong tool, what is the simplest way to parallelize the script?
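
For reference, a minimal sketch of what the mpi4py route could look like, assuming the simplest static split of the file list (each rank takes every size-th file, so the ranks never need to communicate); 'directory' is the placeholder path from the script above:

import os
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # index of this process, 0 .. size-1
size = comm.Get_size()  # total number of MPI processes

filenames = sorted(os.listdir('directory'))

# Round-robin split: rank r handles files r, r+size, r+2*size, ...
for filename in filenames[rank::size]:
    with h5py.File(os.path.join('directory', filename), 'r') as f:
        ...  # manipulate and plot, exactly as in the serial script

Launched with something like mpiexec -n 4 python your_script.py, every process runs the same code but sees a disjoint quarter of the files.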


As Javier suggested, multiprocessing is the easier fit here. Your workload is embarrassingly parallel: the files do not depend on one another, so each one can be processed separately.

All you need is a worker function that takes a filename and processes that one file:

def worker(fn):
    with h5py.File(fn, 'r') as f:
        # process data..
        return result
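
Since the script in the question also produces plots, the body of the worker would typically save one figure per file. A possible sketch, assuming a headless 'Agg' backend and a purely hypothetical dataset path 'tasks/u' (substitute whatever your files actually contain):

import os
import h5py
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe inside worker processes
import matplotlib.pyplot as plt

def worker(fn):
    with h5py.File(fn, 'r') as f:
        data = f['tasks/u'][:]  # hypothetical dataset name; adjust to your files
    fig, ax = plt.subplots()
    ax.imshow(data[-1])  # assumes the first axis is time; plot the last write
    fig.savefig(os.path.splitext(os.path.basename(fn))[0] + '.png')
    plt.close(fig)  # free the figure's memory before the next file
    return fn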

Either way, that is the per-file work. To feed the worker, collect the full paths of all the files in the directory:

full_fns = [os.path.join('directory', filename) for filename in 
            os.listdir('directory')]

Then create a pool of processes and map the worker over that list:

import multiprocessing as mp

pool = mp.Pool(4)  # pass the number of processes you want
results = pool.map(worker, full_fns)

# pool.map takes a worker function and the input data.
# You usually need to wait until all the subprocesses have finished their
# work before using the data, so you don't operate on partial results.

pool.close()
pool.join()
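
On Python 3 the pool can also be used as a context manager. Note that the pool's __exit__ calls terminate() rather than close()/join(), which is fine here only because map() blocks until every result is in; the __main__ guard matters on platforms that spawn fresh interpreters (e.g. Windows). A small equivalent sketch, with worker and full_fns as defined above:

import multiprocessing as mp

if __name__ == '__main__':                    # required with the 'spawn' start method
    with mp.Pool(4) as pool:                  # __exit__ calls terminate(); safe because
        results = pool.map(worker, full_fns)  # map() has already returned all results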

pool.map collects the workers' return values in results, in the same order as the inputs:

for r in results:
    print(r)

Hope this helps.


A variant using pool.imap, which yields the results lazily and in input order, so you can report on each file as soon as it finishes:

import os
import multiprocessing

import h5py

def process_one_file(fn):
    with h5py.File(fn, 'r') as f:
        ...  # process the data, setting is_successful accordingly
    return is_successful


fns = [os.path.join('directory', fn) for fn in os.listdir('directory')]
pool = multiprocessing.Pool()
for fn, is_successful in zip(fns, pool.imap(process_one_file, fns)):
    print(fn, "succeeded?", is_successful)
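
One caveat: an exception raised inside the worker propagates out of imap and aborts the loop, so if some of the 1800 files may be unreadable it is safer to catch errors inside the worker itself. A sketch, assuming OSError and KeyError cover the likely failure modes:

def process_one_file(fn):
    try:
        with h5py.File(fn, 'r') as f:
            ...  # process the data
        return True
    except (OSError, KeyError):  # unreadable file or missing dataset
        return False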

Yet another option is the thread pool from multiprocessing.dummy, which has the same API as multiprocessing.Pool but uses threads instead of processes. Because of the GIL, threads help mainly when the per-file work is I/O-bound rather than CPU-bound:

import glob
from multiprocessing.dummy import Pool  # thread-based Pool with the same interface

def processData(fn):  # receives one filename per call
    print(fn)
    ...
    return result

allFiles = glob.glob("<file path/file mask>")
pool = Pool(6)  # 6 threads, for example
results = pool.map(processData, allFiles)

Source: https://habr.com/ru/post/1687435/

