SLURM emulation on Ubuntu 16.04

I want to emulate SLURM on Ubuntu 16.04. I don't need serious resource management, I just want to try out some simple examples. I cannot install SLURM in the usual way, and I wonder whether there are other options. Here is what I have tried so far:

  • A Docker image. Unfortunately, docker pull agaveapi/slurm; docker run agaveapi/slurm gives me errors:

    /usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
      'Supervisord is running as root and it is searching '
    2017-10-29 15:27:45,436 CRIT Supervisor running as root (no user in config file)
    2017-10-29 15:27:45,437 INFO supervisord started with pid 1
    2017-10-29 15:27:46,439 INFO spawned: 'slurmd' with pid 9
    2017-10-29 15:27:46,441 INFO spawned: 'sshd' with pid 10
    2017-10-29 15:27:46,443 INFO spawned: 'munge' with pid 11
    2017-10-29 15:27:46,443 INFO spawned: 'slurmctld' with pid 12
    2017-10-29 15:27:46,452 INFO exited: munge (exit status 0; not expected)
    2017-10-29 15:27:46,452 CRIT reaped unknown pid 13)
    2017-10-29 15:27:46,530 INFO gave up: munge entered FATAL state, too many start retries too quickly
    2017-10-29 15:27:46,531 INFO exited: slurmd (exit status 1; not expected)
    2017-10-29 15:27:46,535 INFO gave up: slurmd entered FATAL state, too many start retries too quickly
    2017-10-29 15:27:46,536 INFO exited: slurmctld (exit status 0; not expected)
    2017-10-29 15:27:47,537 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2017-10-29 15:27:47,537 INFO gave up: slurmctld entered FATAL state, too many start retries too quickly

  • This guide to setting up a SLURM VM through Vagrant. I tried it, but copying over my munge key timed out (a possible workaround is sketched just after this list):

    sudo scp /etc/munge/munge.key vagrant@server:/home/vagrant/
    ssh: connect to host server port 22: Connection timed out
    lost connection
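If I were debugging that timeout, I would let Vagrant report the VM's real SSH endpoint rather than guessing the host and port, or sidestep scp through the synced folder. An untested sketch (the port, key path, and project path below are typical Vagrant defaults, not values from the guide):

    # Ask Vagrant how it actually reaches the VM named "server"
    vagrant ssh-config server

    # Point scp at whatever it reports, e.g.:
    sudo scp -P 2222 -i .vagrant/machines/server/virtualbox/private_key \
        /etc/munge/munge.key vagrant@127.0.0.1:/home/vagrant/

    # Or copy the key into the project directory, which appears as /vagrant inside the VM
    sudo cp /etc/munge/munge.key /path/to/vagrant/project/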

2 answers

So... we have an existing cluster, but it runs an older version of Ubuntu, which does not mesh well with my workstation running 17.04.

So, on my workstation, I just made sure slurmctld (the backend) and slurmd were installed, and then set up a trivial slurm.conf with

    ControlMachine=mybox
    # ...
    NodeName=DEFAULT CPUs=4 RealMemory=4000 TmpDisk=50000 State=UNKNOWN
    NodeName=mybox CPUs=4 RealMemory=16000

after which I restarted slurmctld and then slurmd. Now everything is fine:

    root@mybox:/etc/slurm-llnl$ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    demo         up   infinite      1   idle mybox
    root@mybox:/etc/slurm-llnl$
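For reference, nothing exotic was needed for that: just the stock packages and their systemd units. A minimal sketch of the install-and-restart steps (assuming the package and service names Ubuntu 17.04 ships, as in the dpkg listing further down):

    # Install the controller, the compute-node daemon and the client tools
    sudo apt-get install slurmctld slurmd slurm-client

    # Edit /etc/slurm-llnl/slurm.conf as above, then restart both daemons
    sudo systemctl restart slurmctld
    sudo systemctl restart slurmd

    # The node should now show up as idle
    sinfo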

This is a degenerate setup; our real one has a mix of dev and prod machines and corresponding partitions. But it should answer the question of whether one machine can be both the controller and a client. Also, my machine is not really called mybox, but that is not relevant to the issue in any way.

This uses Ubuntu 17.04, all stock packages, with munge for authentication (which is the default anyway).
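To confirm that munge itself is healthy before blaming SLURM (a quick check of my own, not part of the original setup), round-trip a credential:

    # Encode a credential and immediately decode it; "STATUS: Success" means munge works
    munge -n | unmunge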

Edit: to wit:

    me@mybox:~$ COLUMNS=90 dpkg -l '*slurm*' | grep ^ii
    ii  slurm-client      16.05.9-1ubun  amd64  SLURM client side commands
    ii  slurm-wlm-basic-  16.05.9-1ubun  amd64  SLURM basic plugins
    ii  slurmctld         16.05.9-1ubun  amd64  SLURM central management daemon
    ii  slurmd            16.05.9-1ubun  amd64  SLURM compute node daemon
    me@mybox:~$

I would rather have run SLURM natively, but I caved and deployed a Debian 9.2 virtual machine. See here for my troubleshooting efforts on the native installation. The installation process worked smoothly here, but I needed to make the following changes to slurm.conf (a quick way to check these values is sketched after the list below). Below, Debian64 is the hostname and wlandau is my user name.

  • ControlMachine=Debian64
  • SlurmUser=wlandau
  • NodeName=Debian64
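These three values simply have to match what the machine reports about itself; a quick way to double-check (a hypothetical session, assuming the host and user names above):

    $ hostname     # must match ControlMachine and NodeName
    Debian64
    $ id -un       # must match SlurmUser
    wlandau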

Here is the full slurm.conf. A similar slurm.conf did not work on my native Ubuntu 16.04.

    # slurm.conf file generated by configurator.html.
    # Put this file on all nodes of your cluster.
    # See the slurm.conf man page for more information.
    #
    ControlMachine=Debian64
    #ControlAddr=
    #BackupController=
    #BackupAddr=
    #
    AuthType=auth/munge
    #CheckpointType=checkpoint/none
    CryptoType=crypto/munge
    #DisableRootJobs=NO
    #EnforcePartLimits=NO
    #Epilog=
    #EpilogSlurmctld=
    #FirstJobId=1
    #MaxJobId=999999
    #GresTypes=
    #GroupUpdateForce=0
    #GroupUpdateTime=600
    #JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
    #JobCredentialPrivateKey=
    #JobCredentialPublicCertificate=
    #JobFileAppend=0
    #JobRequeue=1
    #JobSubmitPlugins=1
    #KillOnBadExit=0
    #LaunchType=launch/slurm
    #Licenses=foo*4,bar
    #MailProg=/usr/bin/mail
    #MaxJobCount=5000
    #MaxStepCount=40000
    #MaxTasksPerNode=128
    MpiDefault=none
    #MpiParams=ports=#-#
    #PluginDir=
    #PlugStackConfig=
    #PrivateData=jobs
    ProctrackType=proctrack/pgid
    #Prolog=
    #PrologFlags=
    #PrologSlurmctld=
    #PropagatePrioProcess=0
    #PropagateResourceLimits=
    #PropagateResourceLimitsExcept=
    #RebootProgram=
    ReturnToService=1
    #SallocDefaultCommand=
    SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
    SlurmdPort=6818
    SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
    SlurmUser=wlandau
    #SlurmdUser=root
    #SrunEpilog=
    #SrunProlog=
    StateSaveLocation=/var/lib/slurm-llnl/slurmctld
    SwitchType=switch/none
    #TaskEpilog=
    TaskPlugin=task/none
    #TaskPluginParam=
    #TaskProlog=
    #TopologyPlugin=topology/tree
    #TmpFS=/tmp
    #TrackWCKey=no
    #TreeWidth=
    #UnkillableStepProgram=
    #UsePAM=0
    #
    #
    # TIMERS
    #BatchStartTimeout=10
    #CompleteWait=0
    #EpilogMsgTime=2000
    #GetEnvTimeout=2
    #HealthCheckInterval=0
    #HealthCheckProgram=
    InactiveLimit=0
    KillWait=30
    #MessageTimeout=10
    #ResvOverRun=0
    MinJobAge=300
    #OverTimeLimit=0
    SlurmctldTimeout=120
    SlurmdTimeout=300
    #UnkillableStepTimeout=60
    #VSizeFactor=0
    Waittime=0
    #
    #
    # SCHEDULING
    #DefMemPerCPU=0
    FastSchedule=1
    #MaxMemPerCPU=0
    #SchedulerRootFilter=1
    #SchedulerTimeSlice=30
    SchedulerType=sched/backfill
    SchedulerPort=7321
    SelectType=select/linear
    #SelectTypeParameters=
    #
    #
    # JOB PRIORITY
    #PriorityFlags=
    #PriorityType=priority/basic
    #PriorityDecayHalfLife=
    #PriorityCalcPeriod=
    #PriorityFavorSmall=
    #PriorityMaxAge=
    #PriorityUsageResetPeriod=
    #PriorityWeightAge=
    #PriorityWeightFairshare=
    #PriorityWeightJobSize=
    #PriorityWeightPartition=
    #PriorityWeightQOS=
    #
    #
    # LOGGING AND ACCOUNTING
    #AccountingStorageEnforce=0
    #AccountingStorageHost=
    #AccountingStorageLoc=
    #AccountingStoragePass=
    #AccountingStoragePort=
    AccountingStorageType=accounting_storage/none
    #AccountingStorageUser=
    AccountingStoreJobComment=YES
    ClusterName=cluster
    #DebugFlags=
    #JobCompHost=
    #JobCompLoc=
    #JobCompPass=
    #JobCompPort=
    JobCompType=jobcomp/none
    #JobCompUser=
    #JobContainerType=job_container/none
    JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/none
    SlurmctldDebug=3
    SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
    SlurmdDebug=3
    SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
    #SlurmSchedLogFile=
    #SlurmSchedLogLevel=
    #
    #
    # POWER SAVE SUPPORT FOR IDLE NODES (optional)
    #SuspendProgram=
    #ResumeProgram=
    #SuspendTimeout=
    #ResumeTimeout=
    #ResumeRate=
    #SuspendExcNodes=
    #SuspendExcParts=
    #SuspendRate=
    #SuspendTime=
    #
    #
    # COMPUTE NODES
    NodeName=Debian64 CPUs=1 RealMemory=744 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
    PartitionName=debug Nodes=Debian64 Default=YES MaxTime=INFINITE State=UP
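With slurmctld and slurmd restarted against this slurm.conf, a couple of toy submissions are enough to try "some simple examples" as asked. A minimal sketch (the hello.sh script and its contents are made up for illustration):

    # Interactive test: run a command on the single node through SLURM
    srun hostname

    # Batch test: a trivial job script
    cat > hello.sh <<'EOF'
    #!/bin/bash
    #SBATCH --job-name=hello
    #SBATCH --output=hello.out
    echo "Hello from $(hostname)"
    EOF
    sbatch hello.sh
    squeue            # watch the job while it runs
    cat hello.out     # inspect the output afterwards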
