I am currently using a cluster that runs SGE. There I submit a .sh script that calls a Python script (which is multi-threaded using multiprocessing.Pool) to a parallel queue by calling qsub run.sh. The Python script prints some progress via print(...), which then appears in the output file created by SGE. Now a huge problem arises: when I run the script manually, everything works like a charm, but when I use the parallel queue, at some (random) iteration the worker pool seems to stop working, since no further progress shows up in the output file. Moreover, the processor load suddenly drops to 0% and all threads of the script are just idle.
What can I do to solve this problem? Or how can I debug it? Since there are no error messages in the output file, I am really confused.
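In case it helps with suggestions: I guess I could register a signal handler that dumps the stack of every thread, so that I can send the hanging job a signal from the compute node and see where it is stuck. A minimal sketch of what I have in mind (the choice of SIGUSR1 is arbitrary, and for the pool workers the same call would presumably have to go into the pool initializer as well):

import faulthandler
import signal
import sys

# Dump the traceback of all threads of this process to stderr when SIGUSR1
# arrives; "kill -USR1 <pid>" on the compute node should then show where each
# thread is stuck, and the dump ends up in the SGE error/output file.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)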
Edit: here are the relevant parts of the shell script that is submitted to the queue and of the necessary Python files.
main.sh:
#!/bin/bash
python mainscript.py
mainscript.py:
def mainFunction():
    worker = ClassWorker(...)
    worker.startparallel()

if __name__ == '__main__':
    mainFunction()
with ClassWorker being defined as follows:
from multiprocessing import Pool, Queue

class ClassWorker:

    def _get_score(data):
        params, fixed_params, trainingInput, trainingOutput, testingDataSequence, esnType = data
        [... (the calculation is performed)]
        dat = (test_mse, training_acc, params)
        # report the result to the parent process through the shared queue
        ClassWorker._get_score.q.put(dat)
        return dat

    def _get_score_init(q):
        # runs once in every pool worker: attach the queue as a function attribute
        ClassWorker._get_score.q = q

    def startparallel(self):
        queue = Queue()
        pool = Pool(processes=n_jobs, initializer=ClassWorker._get_score_init, initargs=[queue, ])
        [... (setup jobs)]
        [start async thread to watch for incoming results in the queue to update the progress]
        results = pool.map(ClassWorker._get_score, jobs)
        pool.close()
Perhaps this helps to identify the problem. I did not include the real part of the calculation, since it has not caused any problems on the cluster so far, so it should be safe to leave out.
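For reference, here is a stripped-down, self-contained sketch of the pattern I am using, with the real calculation replaced by a dummy square and placeholder names and numbers, in case that makes the structure clearer:

from multiprocessing import Pool, Queue
from threading import Thread

def _get_score_init(q):
    # attach the queue as a function attribute so the pool workers can reach it
    _get_score.q = q

def _get_score(data):
    # stand-in for the real calculation
    result = data * data
    _get_score.q.put(result)
    return result

def _watch_progress(q, total):
    # runs as a background thread in the parent process and prints progress
    for i in range(total):
        q.get()
        print("finished {} of {} jobs".format(i + 1, total), flush=True)

if __name__ == '__main__':
    jobs = list(range(20))
    queue = Queue()

    watcher = Thread(target=_watch_progress, args=(queue, len(jobs)), daemon=True)
    watcher.start()

    pool = Pool(processes=4, initializer=_get_score_init, initargs=(queue,))
    results = pool.map(_get_score, jobs)
    pool.close()
    pool.join()
    print(results)

The queue is handed to the workers through the pool initializer because a multiprocessing.Queue cannot be passed to the workers through pool.map itself.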