MQueue

MQUEUE - the manual queuing system

This is a set of Python scripts that maintain an external queue of job scripts. It is useful on the lxtccl2 cluster because the job scheduler on that machine is rather primitive and does not reorder jobs between different users. It is also useful on any other system where you have to manage more jobs than you can queue at one time.

NEWS:

2007-07-04: Added mqstat command. This works like qstat, but it also lists the pending jobs that MQueue hasn't submitted to the queue yet.

2006-11-30: Added the --top option to mqsub. This option causes the new job to be placed at the top of the queue.

2006-11-22: Python 2.5 is installed on all nodes of the cluster. It is therefore possible to use the mq programs directly from job scripts, without ssh to the frontend node.

2006-11-17: The scripts were updated to fix a problem where the slow update of 'qstat' could cause mq to think a job was still queued (and hence refuse to queue another) when in fact it had recently started running. This should no longer be possible. If you ever have jobs submitted to mqueue without at least one QUEUED job showing up in qstat, please let me know at ianmcc@physics.uq.edu.au

To use:

After installing (see below), the main commands are:

mqsub: this is a direct replacement for qsub. Any supplied options are passed through to qsub as necessary. Just like qsub, you can supply the filename of a job script; if none is supplied (or the filename is '-'), the script is read from standard input. mqsub copies the job script to a temporary location, so it is safe to overwrite or delete the original after mqsub has run. mqsub also modifies the copy slightly to insert some queue control commands, so that any pending jobs are automatically queued whenever possible. By default, jobs run in the order in which they are submitted. The --top option places the new job at the top of the queue instead.
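
The copy-then-enqueue behaviour described above can be sketched roughly as follows. This is not the actual mqsub code: the function names stage_script and enqueue, the "job-" filename prefix, and the one-path-per-line format of the pending list are all assumptions made for illustration, and the step that inserts queue control commands into the script is left out.

```python
import os
import shutil
import tempfile


def stage_script(script_path, script_dir):
    """Copy a job script into script_dir under a unique name; return the new path."""
    os.makedirs(script_dir, exist_ok=True)
    # mkstemp creates a new, uniquely named file and returns an open descriptor.
    fd, staged_path = tempfile.mkstemp(prefix="job-", suffix=".sh", dir=script_dir)
    os.close(fd)
    shutil.copyfile(script_path, staged_path)
    return staged_path


def enqueue(run_queue_file, staged_path, top=False):
    """Add staged_path to the pending list: last by default, first if top=True."""
    pending = []
    if os.path.exists(run_queue_file):
        with open(run_queue_file) as f:
            pending = [line.strip() for line in f if line.strip()]
    if top:
        pending.insert(0, staged_path)   # hypothetical equivalent of --top
    else:
        pending.append(staged_path)      # default: run in submission order
    with open(run_queue_file, "w") as f:
        f.write("".join(p + "\n" for p in pending))
    return pending
```

Because the pending list refers only to the staged copy, editing or deleting the original script afterwards does not affect the job that will eventually run.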

mqstat: this is just like qstat, but it also lists jobs that have been submitted with mqsub but have not yet been submitted to the PBS queue. These show up with 'mqueue:pending' as the jobid.

mq: this command looks through the pending jobs and, if there is room available, adds them to the PBS queue. Usually it is not necessary to run this program directly, as it is called automatically by the job script.
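
As a rough illustration of what "if there is room available" might mean, the loop below submits pending scripts until the number of QUEUED jobs reaches a limit. It is a sketch under stated assumptions, not the real mq: the function fill_queue and its callable parameters count_queued and submit are hypothetical stand-ins for whatever mq does with qstat and qsub.

```python
def fill_queue(pending, count_queued, submit, max_queue_length=1):
    """Submit pending scripts until max_queue_length jobs are QUEUED.

    pending: list of script paths, next job first (modified in place).
    count_queued: callable returning how many jobs are currently QUEUED.
    submit: callable taking a script path, returning True on success.
    Returns a (submitted, failed) pair of lists.
    """
    submitted, failed = [], []
    # Ask for the queue state once and count our own submissions locally,
    # so a slowly updating status command cannot cause over-submission.
    queued = count_queued()
    while pending and queued < max_queue_length:
        script = pending.pop(0)
        if submit(script):
            submitted.append(script)
            queued += 1
        else:
            failed.append(script)
    return submitted, failed
```

With the default limit of 1, a call made while one job is already QUEUED submits nothing and leaves the pending list untouched.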

There is also mqadd: this is a primitive version of mqsub that does not copy the job script, nor modify it to include any queue control commands. It just adds the script directly to the list of pending jobs. Normally, you would never use this command directly.

Installing

Grab the scripts from the Subversion repository:

svn co svn://lxtsma1.physik.rwth-aachen.de/ian/mptoolkit/trunk/scripts/mqueue

Or, from the web at http://www.physik.rwth-aachen.de/~ianmcc/mqueue.tar.gz

All of the files need to be accessible from $PATH. Typically, put them in $HOME/bin and make sure that directory is added to $PATH in your login scripts. (Alternatively, you could modify mqconfig.py to hardcode the locations, as needed.)

The files mq, mqadd, mqassert, mqassertadd, mqsub, mqstat should have execute permissions.
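
One way to check the installation is to confirm that each command resolves to an executable on the search path. The helper below is only a convenience sketch and is not part of the MQueue scripts; the name missing_commands is made up here, and shutil.which requires Python 3.3 or later.

```python
import shutil

# The executables listed above.
MQ_COMMANDS = ["mq", "mqadd", "mqassert", "mqassertadd", "mqsub", "mqstat"]


def missing_commands(commands=MQ_COMMANDS, path=None):
    """Return the commands that are not found as executables on the search path.

    path: optional search path string; None means use the PATH environment variable.
    """
    return [c for c in commands if shutil.which(c, path=path) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not found on PATH or not executable:", ", ".join(missing))
    else:
        print("All mq commands found.")
```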

Configuration

All of the configuration is handled by mqconfig.py, which can be modified as necessary. The defaults should work fine.

Some important configuration items (shown with their defaults):

  • ScriptDir = ~/.mqueue/scripts : this is where mqsub stores the scripts. Periodically, you will need to clean it out.
  • RunQueueFile = ~/.mqueue/runqueue : this file contains the list of pending jobs
  • QueuedJobsFile = ~/.mqueue/running : this file contains a list of scripts that have been submitted to the PBS queue. It does not (yet!) track when jobs have finished, so you will need to clean it out periodically.
  • FailedJobsFile = ~/.mqueue/failed : if there is some error starting the job (i.e. the qsub fails), the job script is removed from the RunQueueFile and added here. To attempt restarting the job, just copy the relevant line to the RunQueueFile.
  • MaxQueueLength = 1 : the maximum number of jobs we want in the QUEUED state
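
For orientation, the defaults above could be written in Python as shown below, together with a small helper for retrying a failed job. This is a sketch, not the contents of the real mqconfig.py: the use of os.path.expanduser, the helper name requeue_failed, and the one-entry-per-line file format are assumptions. The helper also removes the entry from the failed list, which goes one step beyond the plain copy described above.

```python
import os

# Defaults as listed above; the real mqconfig.py may define them differently.
ScriptDir = os.path.expanduser("~/.mqueue/scripts")
RunQueueFile = os.path.expanduser("~/.mqueue/runqueue")
QueuedJobsFile = os.path.expanduser("~/.mqueue/running")
FailedJobsFile = os.path.expanduser("~/.mqueue/failed")
MaxQueueLength = 1


def requeue_failed(entry, failed_file=FailedJobsFile, run_queue_file=RunQueueFile):
    """Move one line from the failed list to the end of the pending list."""
    with open(failed_file) as f:
        failed = [line.rstrip("\n") for line in f]
    if entry not in failed:
        raise ValueError("not in failed list: " + entry)
    failed.remove(entry)
    # Append to the pending list first, so the entry is never lost entirely.
    with open(run_queue_file, "a") as f:
        f.write(entry + "\n")
    with open(failed_file, "w") as f:
        f.write("".join(line + "\n" for line in failed))
```

Appending puts the retried job behind everything already pending; placing it first would be the analogue of the --top option.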

Discussion

  • When mqsub copies the job script, it ends up in ScriptDir with a unique (random) filename. This makes it a bit hard to figure out which original script corresponds to which pending script. Note that the job name (as it appears in qstat) is not affected, as mqsub goes to some lengths to get this correct.
  • It would be nice to track jobs that complete successfully, remove them from the QueuedJobsFile and add them to some sort of completed-jobs file. This would also facilitate removal of old scripts.
Page last modified on December 06, 2010, at 08:38 PM