I am learning OpenMPI on a cluster. Here is my first example. I expected the output to show responses from different nodes, but all processes report from the same node, node062. Why is that, and how can I get reports from different nodes to confirm that MPI actually distributes the processes across nodes? Thanks and regards!
ex1.c
#include "mpi.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char idstr[2232];
    char buff[22128];
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int numprocs;
    int myid;
    int i;
    int namelen;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    if (myid == 0) {
        /* Rank 0 greets every other rank, then collects and prints the replies. */
        printf("WE have %d processors\n", numprocs);
        for (i = 1; i < numprocs; i++) {
            sprintf(buff, "Hello %d", i);
            MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD);
        }
        for (i = 1; i < numprocs; i++) {
            MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
            printf("%s\n", buff);
        }
    } else {
        /* Every other rank appends its rank and host name and sends the message back. */
        MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
        sprintf(idstr, " Processor %d at node %s ", myid, processor_name);
        strcat(buff, idstr);
        strcat(buff, "reporting for duty\n");
        MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
ex1.pbs
#!/bin/sh
Compile and run:
[tim@user1 examples]$ mpicc ./ex1.c -o ex1
[tim@user1 examples]$ qsub ex1.pbs
35540.mgt
[tim@user1 examples]$ nano ex1.o35540
----------------------------------------
Begin PBS Prologue Sat Jan 30 21:28:03 EST 2010 1264904883
Job ID: 35540.mgt
Username: tim
Group: Brown
Nodes: node062 node063 node169 node170 node171 node172 node174 node175 node176 node177
End PBS Prologue Sat Jan 30 21:28:03 EST 2010 1264904883
----------------------------------------
WE have 10 processors
Hello 1 Processor 1 at node node062 reporting for duty
Hello 2 Processor 2 at node node062 reporting for duty
Hello 3 Processor 3 at node node062 reporting for duty
Hello 4 Processor 4 at node node062 reporting for duty
Hello 5 Processor 5 at node node062 reporting for duty
Hello 6 Processor 6 at node node062 reporting for duty
Hello 7 Processor 7 at node node062 reporting for duty
Hello 8 Processor 8 at node node062 reporting for duty
Hello 9 Processor 9 at node node062 reporting for duty
----------------------------------------
Begin PBS Epilogue Sat Jan 30 21:28:11 EST 2010 1264904891
Job ID: 35540.mgt
Username: tim
Group: Brown
Job Name: ex1
Session: 15533
Limits: neednodes=10:ppn=1,nodes=10:ppn=1,walltime=01:10:00
Resources: cput=00:00:00,mem=420kb,vmem=8216kb,walltime=00:00:03
Queue: dque
Account:
Nodes: node062 node063 node169 node170 node171 node172 node174 node175 node176 node177
Killing leftovers...
End PBS Epilogue Sat Jan 30 21:28:11 EST 2010 1264904891
----------------------------------------
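One check that might help narrow this down: PBS writes the hosts allocated to a job into the file named by the `PBS_NODEFILE` environment variable, one line per slot. The sketch below summarizes that file from inside the job script, so the allocation can be compared with where the ranks actually report from. The helper name `count_slots` is made up for illustration and is not part of PBS or Open MPI.

```shell
#!/bin/sh
# Summarize a PBS node file: one output line per host with its slot count.
# count_slots is a hypothetical helper name, not part of PBS or Open MPI.
count_slots() {
    # $1: path to a node file (one host name per line, repeated once per slot)
    awk 'NF { n[$1]++ } END { for (h in n) print h, n[h] }' "$1" | sort
}

# Inside a PBS job, PBS_NODEFILE is set by the batch system.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Hosts allocated to this job:"
    count_slots "$PBS_NODEFILE"
else
    echo "PBS_NODEFILE is not set; probably not running inside a PBS job."
fi
```

If the node file lists all ten hosts while every rank still reports node062, the mpirun line in ex1.pbs may not be picking up the allocation. Passing the file explicitly, for example `mpirun -machinefile $PBS_NODEFILE ...`, is one thing to try; whether it is needed depends on how Open MPI was built on the cluster.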
UPDATE:
I would like to run several background jobs in one PBS script so that they run at the same time. For example, I added a second call to run ex1 to the script above and made both runs background jobs in ex1.pbs:
#!/bin/sh
(1) This is the result after submitting this script with qsub, using the previously compiled executable ex1:
The first job starts!
The first job ends!
The second job starts!
The second job ends!
WE have 5 processors
WE have 5 processors
Hello 1 Processor 1 at node node063 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 1 Processor 1 at node node063 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty
(2) However, I think ex1 finishes so quickly that the two background jobs probably barely overlap, which is not the case when I apply the same approach to my real project. Therefore, I added sleep(30) to ex1.c to extend the running time of ex1, so that the two background jobs running ex1 overlap for almost their entire run.
#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char idstr[2232];
    char buff[22128];
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int numprocs;
    int myid;
    int i;
    int namelen;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    if (myid == 0) {
        /* Rank 0 greets every other rank, then collects and prints the replies. */
        printf("WE have %d processors\n", numprocs);
        for (i = 1; i < numprocs; i++) {
            sprintf(buff, "Hello %d", i);
            MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD);
        }
        for (i = 1; i < numprocs; i++) {
            MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
            printf("%s\n", buff);
        }
    } else {
        /* Every other rank appends its rank and host name and sends the message back. */
        MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
        sprintf(idstr, " Processor %d at node %s ", myid, processor_name);
        strcat(buff, idstr);
        strcat(buff, "reporting for duty\n");
        MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    sleep(30); // newly added to extend the running time

    MPI_Finalize();
    return 0;
}
But after recompiling and submitting with qsub again, the results look wrong: the processes are terminated before they finish. In ex1.o35571:
The first job starts!
The first job ends!
The second job starts!
The second job ends!
WE have 5 processors
WE have 5 processors
Hello 1 Processor 1 at node node063 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty
Hello 1 Processor 1 at node node063 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty
4 additional processes aborted (not shown)
4 additional processes aborted (not shown)
In ex1.e35571:
mpirun: killing job...
mpirun noticed that job rank 0 with PID 25376 on node node062 exited on signal 15 (Terminated).
mpirun: killing job...
mpirun noticed that job rank 0 with PID 25377 on node node062 exited on signal 15 (Terminated).
Why are the processes being terminated? How do I run background jobs correctly in a PBS script?
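For context, a plain shell script does not wait for commands started with `&`. It carries on immediately and exits once it reaches its last line. PBS may then treat the job as finished and terminate whatever is still running. The "Killing leftovers..." line in the epilogue and the signal 15 messages would be consistent with that, though this is an assumption about how this cluster is configured. A minimal sketch of the background-job pattern with an explicit `wait` follows; `fake_job` and `run_both` are hypothetical stand-ins, with `fake_job` taking the place of an mpirun command line.

```shell
#!/bin/sh
# Sketch: start two jobs in the background and block until both finish.
# fake_job is a hypothetical stand-in for an mpirun command line.
fake_job() {
    # $1: label, $2: seconds to run
    sleep "$2"
    echo "job $1 done"
}

run_both() {
    fake_job first 1 &
    pid1=$!
    fake_job second 1 &
    pid2=$!

    # Without these waits the script would reach its end right away,
    # while both background jobs are still running.
    wait "$pid1"
    wait "$pid2"
    echo "all jobs finished"
}

run_both
```

In a real PBS script the two `fake_job` calls would be the two background mpirun lines, and any "job ends" message would go after the corresponding `wait` rather than directly after the `&` line.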