I am currently using a cluster that runs SGE. There I submit a .sh script that calls a Python script (which is multi-threaded using multiprocessing.Pool) to a parallel queue by calling qsub run.sh. The Python script prints some progress via print(...), which then appears in the output file created by SGE. Now a huge problem arises: when I run the script manually, everything works like a charm, but when I use the parallel queue, at some (random) iteration the worker pool seems to stop working, since no further progress shows up in the output file. Moreover, the processor load suddenly drops to 0% and all threads of the script are just idle.
What can I do to solve this problem? Or how can I debug it? Since there are no error messages in the output file, I am really confused.
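In case it helps with suggestions: I guess I could register a signal handler that dumps the stack of every thread, so that I can send the hanging job a signal from the compute node and see where it is stuck. A minimal sketch of what I have in mind (the choice of SIGUSR1 is arbitrary, and for the pool workers the same call would presumably have to go into the pool initializer as well):

import faulthandler
import signal
import sys

# Dump the traceback of all threads of this process to stderr when SIGUSR1
# arrives; "kill -USR1 <pid>" on the compute node should then show where each
# thread is stuck, and the dump ends up in the SGE error/output file.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)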
Edit: here are the relevant parts of the shell script that is submitted to the queue and of the necessary Python files.
main.sh:
#!/bin/bash
python mainscript.py
mainscript.py:
def mainFunction():
    worker = ClassWorker(...)
    worker.startparallel()

if __name__ == '__main__':
    mainFunction()
with ClassWorker being defined as follows:
from multiprocessing import Pool, Queue

class ClassWorker:

    def _get_score(data):
        params, fixed_params, trainingInput, trainingOutput, testingDataSequence, esnType = data
        [... (the calculation is performed)]
        dat = (test_mse, training_acc, params)
        # report the result to the parent process through the shared queue
        ClassWorker._get_score.q.put(dat)
        return dat

    def _get_score_init(q):
        # runs once in every pool worker: attach the queue as a function attribute
        ClassWorker._get_score.q = q

    def startparallel(self):
        queue = Queue()
        pool = Pool(processes=n_jobs, initializer=ClassWorker._get_score_init, initargs=[queue, ])
        [... (setup jobs)]
        [start async thread to watch for incoming results in the queue to update the progress]
        results = pool.map(ClassWorker._get_score, jobs)
        pool.close()
Perhaps this helps to identify the problem. I did not include the real part of the calculation, since it has not caused any problems on the cluster so far, so it should be safe to leave out.
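For reference, here is a stripped-down, self-contained sketch of the pattern I am using, with the real calculation replaced by a dummy square and placeholder names and numbers, in case that makes the structure clearer:

from multiprocessing import Pool, Queue
from threading import Thread

def _get_score_init(q):
    # attach the queue as a function attribute so the pool workers can reach it
    _get_score.q = q

def _get_score(data):
    # stand-in for the real calculation
    result = data * data
    _get_score.q.put(result)
    return result

def _watch_progress(q, total):
    # runs as a background thread in the parent process and prints progress
    for i in range(total):
        q.get()
        print("finished {} of {} jobs".format(i + 1, total), flush=True)

if __name__ == '__main__':
    jobs = list(range(20))
    queue = Queue()

    watcher = Thread(target=_watch_progress, args=(queue, len(jobs)), daemon=True)
    watcher.start()

    pool = Pool(processes=4, initializer=_get_score_init, initargs=(queue,))
    results = pool.map(_get_score, jobs)
    pool.close()
    pool.join()
    print(results)

The queue is handed to the workers through the pool initializer because a multiprocessing.Queue cannot be passed to the workers through pool.map itself.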