Queued Analysis Tasks
How to submit jobs to the FMRIB cluster and how to monitor them
Introduction
WIN operate a compute cluster formed from rack-mounted multi-core computers. To ensure efficient use of the hardware, tasks are distributed amongst these computers using grid scheduling software. This software monitors the utilisation of the computers in the cluster and launches new jobs onto the least used computers, preventing overloading of machines whilst ensuring a fair share of compute resources amongst all users of the system. When you submit a job it will sit in a queue until the scheduler software identifies a viable empty slot and your job has reached the top of the queue. The fair share algorithm in use ensures that heavy users of the system are less likely to reach the top than users who rarely use the system (usage history is cleared on a regular basis so that you aren't deprioritised forever).
Grid Engine and the queues
WIN's cluster runs the Grid Engine (GE) queuing software (using the Son of Grid Engine distribution). To ease job submission we provide a helper called fsl_sub which sets some useful options to Grid Engine's built-in qsub command.
GE manages a set of queues representing the available resources. Tasks are submitted to GE queues for distribution across the execution hosts. These queues are designed to divide the resources according to usage profiles to ensure that the majority of tasks get done in a favourable time-frame (see Jalapeno Queues).
The Jalapeno Cluster
Submitting a job to the FMRIB cluster
Submitting jobs with fsl_sub
- Use the -R (--jobram) and -T (--jobtime) options to fsl_sub to specify the maximum memory and run-time requirements for your job (in GB and minutes of CPU time*) respectively. fsl_sub will then select the most appropriate queue for you. GPU tasks can be requested using the --coprocessor options (see the Running GPU Tasks section below).
- Select a specific queue with the -q (--queue) option. For further information on the available queues and when to use each of them, see the queues section.
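As a rough illustration of the two approaches above, the sketch below wraps fsl_sub in a small helper that only prints the command it would run unless DRY_RUN=0 is set. The helper name submit, the script name ./my_analysis.sh and the resource figures are assumptions made for this example; the -R, -T and -q options are the ones described above.

```shell
# Print the fsl_sub command by default; set DRY_RUN=0 to really submit.
# "submit" is a hypothetical helper, not part of fsl_sub.
submit() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "fsl_sub $*"
    else
        fsl_sub "$@"
    fi
}

# Hypothetical analysis script; substitute your own command.
job="./my_analysis.sh"

# Approach 1: state the requirements (8 GB RAM, 120 minutes of CPU time)
# and let fsl_sub choose the queue.
submit -R 8 -T 120 "$job"

# Approach 2: name the queue explicitly.
submit -q short.q "$job"
```

Running the sketch without changes only echoes the two command lines, which makes it easy to check the options before submitting anything.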
Specifying CPU Time
Requesting memory for automatically queued jobs
FSL_SUB OPTIONS
To see a full list of the available options use:
fsl_sub --help
In addition to the list of options this will also display a list of cluster queues available for use with descriptions of allowed runtimes and memory/parallel environment availability. For details on how to use these options see the Advanced Usage section.
LONG RUNNING TASKS
Whilst we provide queues with infinite run times (verylong.q, bigmem.q and the cuda.q queues) we strongly recommend that you attempt to break your task up into shorter components where possible. There are many more slots on the shorter queues, and tasks running for many weeks or months are at risk of loss due to power cuts or server faults. Where chunking the analysis is not possible you should investigate whether it is possible to save job state at regular points (often called checkpointing) in such a way that the job can be restarted at a checkpoint without losing the work carried out to that point. If the program supports this behaviour then you could submit several runs to finite queues with job holds in place to allow the job to run to completion with regular restarts.
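The chained restarts described above might be sketched as follows. This is only an outline under several assumptions: the program ./run_chunk.sh and its --resume flag are hypothetical stand-ins for a tool that checkpoints itself, the job hold is assumed to be requested with fsl_sub's -j option (confirm with fsl_sub --help), and in dry-run mode the job ids are simply invented so that the sketch can be run anywhere.

```shell
DRY_RUN="${DRY_RUN:-1}"
next_fake_id=5000      # invented ids, used in dry-run mode only
last_job_id=""
submitted=""

# Submit one job and remember its id in last_job_id.
submit() {
    if [ "$DRY_RUN" = "1" ]; then
        next_fake_id=$((next_fake_id + 1))
        last_job_id="$next_fake_id"
    else
        # fsl_sub prints the id of the job it has just queued
        last_job_id=$(fsl_sub "$@")
    fi
    submitted="${submitted}fsl_sub $* -> ${last_job_id}
"
}

# First run starts immediately.
submit -q long.q ./run_chunk.sh --resume

# Each later run is held until the previous one has finished,
# then restarts from the most recent checkpoint.
run=2
while [ "$run" -le 3 ]; do
    submit -q long.q -j "$last_job_id" ./run_chunk.sh --resume
    run=$((run + 1))
done

printf '%s' "$submitted"
```

Each run must write a checkpoint and exit before it reaches the queue's CPU time limit, otherwise the scheduler will kill it and the next run will resume from an older checkpoint.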
How to check on the progress of your jobs
Monitoring tasks
The Grid Engine software provides the qstat command for listing the state of the queues. By default it lists only the jobs that you have submitted. Use:
qstat -help
for a list of all the options. If you want to see the overall state of the queues, including everyone else's tasks, then use:
qstat -u \*
INTERPRETING QSTAT'S OUTPUT
The qstat listing has several columns indicating the jobid, priority, owner, taskname, status and which queue the task was submitted to or is running on. The most important details are described below:
- Priority - A floating point number in the second column of output indicates the relative priority of each task. The task with the highest priority, shown at the top of the pending list, will be the next one to get access to any available resources.
- State - A string of characters EhqwRrdTsS indicating the following conditions:
State characters | Meaning |
---|---|
E | The task is in the Error state. Contact IT Support Staff for help. |
h | Job is held until completion of some other task. Use qstat -j <jobid> to find out which task(s) it is dependent on. |
qw | Job is in the queued and waiting or pending state. This task will be submitted to an execution host as soon as one becomes available and the task priority is the highest of those in the pending state. |
r | This task is running. An extra field indicates which actual execution host the task is running on, e.g. long.q@jalapeno01.cluster.fmrib.ox.ac.uk etc. |
R | Re-scheduled. This will usually mean one of the operators has restarted the task, perhaps because a node crashed. Contact IT Support Staff for a full explanation. For some jobs this will result in failure or corrupted output so please check the output of these tasks carefully. |
d | Deleted. This job has been scheduled for deletion. |
dr | Deleted but still running. This happens when a job is deleted but the node which was running the task isn't responding to the request to remove the task. You should contact computing-help@win.ox.ac.uk. |
s/S/T | Suspended. This job has been temporarily suspended, probably because a machine has become overloaded with higher priority tasks. It will resume once the load reduces sufficiently. |
t | Transitioning. This job is starting. |
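Because the state column can combine several of these characters (for example hqw or Eqw), a small decoder can help when scripting around qstat. The function below is a minimal sketch based only on the table above; the names describe_state and add_part are made up for this example.

```shell
# Append one description to the comma-separated list in $parts.
add_part() {
    parts="${parts:+$parts, }$1"
}

# Translate a qstat state string such as "hqw" into words.
describe_state() {
    parts=""
    case "$1" in *E*)     add_part "error" ;; esac
    case "$1" in *h*)     add_part "held" ;; esac
    case "$1" in *qw*)    add_part "queued and waiting" ;; esac
    case "$1" in *d*)     add_part "deleted" ;; esac
    case "$1" in *r*)     add_part "running" ;; esac
    case "$1" in *R*)     add_part "re-scheduled" ;; esac
    case "$1" in *[sST]*) add_part "suspended" ;; esac
    case "$1" in *t*)     add_part "transitioning" ;; esac
    echo "${parts:-unknown}"
}

describe_state hqw
describe_state dr
```

The matching is case-sensitive, so r (running) and R (re-scheduled), or t (transitioning) and T (suspended), are kept apart.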
EXAMINING COMPLETED JOBS
Once a job completes, qstat will no longer be able to find the job id. You can now query the cluster software using the qacct command.
N.B. Because of the number of jobs submitted to our cluster, the database of completed jobs is regularly rotated for performance reasons. We provide a command, qacct-all, which calls qacct on all the archived job databases. Try this if qacct does not return information on your job.
qacct takes several options but the most useful one is '-j <jobid>' which returns information on the provided job id. Of the information this command provides the most useful entries are:
Entry | Purpose |
---|---|
qname | Name of queue this ran on |
hostname | Name of node job ran on - useful for IT Help to troubleshoot issues |
start/end_time | Start and end time (real) - useful if there was a known issue at a particular time |
slots | How many parallel environment slots/threads your job had |
failed & exit_status | Whether the cluster software thinks the job failed, and the exit status of the job. N.B. A job can fail in many ways that the cluster is not aware of, so a clean result here is not proof that the job completed successfully |
ru_wallclock | Real time run-time of job |
cpu | CPU time of job (seconds) |
maxvmem | Maximum memory job required (units given) |
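To pull just these entries out of a longer qacct report, the output can be filtered with awk. The sketch below assumes qacct prints one "name value" pair per line; the helper names and the sample values are invented for illustration and are not real accounting data.

```shell
# Keep only the entries listed in the table above, as name=value pairs.
summarise_qacct() {
    awk '$1 ~ /^(qname|hostname|slots|failed|exit_status|ru_wallclock|cpu|maxvmem)$/ { print $1 "=" $2 }'
}

# Invented sample standing in for the output of "qacct -j <jobid>".
sample_qacct() {
cat <<'EOF'
qname        long.q
hostname     jalapeno10.cluster.fmrib.ox.ac.uk
owner        apple
slots        1
failed       0
exit_status  0
ru_wallclock 3600
cpu          1500.000
maxvmem      2.100G
EOF
}

sample_qacct | summarise_qacct
```

On the cluster the same filter would be used as `qacct -j 12345 | summarise_qacct`. In the sample, an hour of wall-clock time (3600 seconds) corresponds to 1500 seconds, i.e. 25 minutes, of CPU time.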
QSTAT EXAMPLES
3 joebloggs@jalapeno $ qstat -u \*
job-ID  prior   name       user      state submit/start at     queue                                    slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------
28704   0.55002 Connectivi moonunit  r     04/23/2006 08:28:23 long.q@jalapeno10.cluster.fmrib.ox.ac.uk 1
28714   0.55002 feat_aoEky apple     r     04/23/2006 05:25:23 long.q@jalapeno23.cluster.fmrib.ox.ac.uk 1
23668   0.55013 feat_aftwD heavenly  dr    06/22/2005 22:28:16 long.q@jalapeno23.cluster.fmrib.ox.ac.uk 1
28706   0.55002 Connectivi moonunit  R     04/21/2006 20:19:35 long.q@jalapeno01.cluster.fmrib.ox.ac.uk 1
28707   0.55002 Connectivi moonunit  qw    04/21/2006 20:19:41                                          1
28673   0.55002 bedpost    brooklyn  qw    04/20/2006 15:49:43                                          1     1-42:1
27378   0.00000 STDIN      geronimo  hqw   02/10/2006 16:53:47                                          1
28544   0.00000 Franklin   fuschia   Eqw   04/06/2006 20:08:14                                          1
28674   0.00000 bp_postpro brooklyn  hqw   04/20/2006 15:49:43                                          1
TO SEE A LISTING OF JUST THE "RUNNING" JOBS:
4 joebloggs@jalapeno $ qstat -u \* -s r
job-ID  prior   name       user      state submit/start at     queue                                    slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------
28704   0.55002 Connectivi moonunit  r     04/23/2006 08:28:23 long.q@jalapeno10.fmrib.ox.ac.uk         1
28714   0.55002 feat_aoEky apple     r     04/23/2006 05:25:23 long.q@jalapeno23.cluster.fmrib.ox.ac.uk 1
28706   0.55002 Connectivi moonunit  R     04/21/2006 20:19:35 long.q@jalapeno01.cluster.fmrib.ox.ac.uk 1
TO SEE THE STATE OF ONLY THOSE JOBS BELONGING TO A PARTICULAR USER:
5 joebloggs@jalapeno $ qstat -u apple
job-ID  prior   name       user      state submit/start at     queue                                    slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------
28714   0.55002 feat_aoEky apple     r     04/23/2006 05:25:23 long.q@jalapeno01.cluster.fmrib.ox.ac.uk 1
28721   0.55002 feat_afpcn apple     r     04/23/2006 18:42:53 long.q@jalapeno1.cluster.fmrib.ox.ac.uk  1
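When the listing is long, a quick tally of jobs per state can be more useful than the full table. The following sketch assumes the state is the fifth whitespace-separated column, as in the listings above, and skips the two header lines. The function names are made up for this example, and the sample rows are abbreviated copies of the first listing.

```shell
# Count jobs per state from qstat output read on standard input.
count_states() {
    awk 'NR > 2 { n[$5]++ } END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# Abbreviated sample listing so the sketch can run away from the cluster.
sample_qstat() {
cat <<'EOF'
job-ID  prior   name       user      state submit/start at     queue slots ja-task-ID
-------------------------------------------------------------------------------------
28704   0.55002 Connectivi moonunit  r     04/23/2006 08:28:23 long.q@jalapeno10.cluster.fmrib.ox.ac.uk 1
28714   0.55002 feat_aoEky apple     r     04/23/2006 05:25:23 long.q@jalapeno23.cluster.fmrib.ox.ac.uk 1
23668   0.55013 feat_aftwD heavenly  dr    06/22/2005 22:28:16 long.q@jalapeno23.cluster.fmrib.ox.ac.uk 1
28706   0.55002 Connectivi moonunit  R     04/21/2006 20:19:35 long.q@jalapeno01.cluster.fmrib.ox.ac.uk 1
28707   0.55002 Connectivi moonunit  qw    04/21/2006 20:19:41 1
28673   0.55002 bedpost    brooklyn  qw    04/20/2006 15:49:43 1 1-42:1
27378   0.00000 STDIN      geronimo  hqw   02/10/2006 16:53:47 1
28544   0.00000 Franklin   fuschia   Eqw   04/06/2006 20:08:14 1
28674   0.00000 bp_postpro brooklyn  hqw   04/20/2006 15:49:43 1
EOF
}

sample_qstat | count_states
```

On the cluster the tally would come from `qstat -u \* | count_states`.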
TO SEE THE FULL STATE OF A PARTICULAR TASK:
qstat -j 14000
==============================================================
job_number: 14000
exec_file: job_scripts/14000
submission_time: Thu Apr 27 10:09:43 2006
owner: apple
uid: 123456
group: apple-group
gid: 987654
sge_o_home: /home/apple
sge_o_log_name: dcm
sge_o_path: /tmp/13999.1.short.q:/home/apple/bin:.....
sge_o_shell: /bin/bash
sge_o_workdir: /vols/Scratch/apple
sge_o_host: jalapeno02
account: sge
cwd: /vols/Scratch/apple
path_aliases: /private/automount/ * * /
stderr_path_list: /home/apple/.fsltmp
mail_list: apple@jalapeno02.cluster.fmrib.ox.ac.uk
notify: FALSE
job_name: feat
stdout_path_list: /home/apple/.fsltmp
jobshare: 0
hard_queue_list: long.q
env_list: MANPATH=/home/apple/man:/opt/sge/.....
job_args: /home/apple/.fsltmp/feat_4HlloR,1
script_file: /opt/fmrib/fsl/bin/feat
usage 1: cpu=00:00:00, mem=0.00000 GBs, io=0.00000, vmem=N/A, maxvmem=N/A
scheduling info:
AVAILABLE QUEUES
WHAT QUEUES DOES THE JALAPENO CLUSTER PROVIDE AND WHAT ARE THEY FOR?
The jalapeno cluster provides four primary queues: veryshort.q, short.q, long.q and verylong.q, and two special purpose queues, bigmem.q and interactive.q. By choosing the most appropriate queue you can gain access to more resources, so it pays to think about the right queue to use. Further, if you choose a queue that has resource limits and your job exceeds them (time or memory) then your task will be killed, wasting compute time.
NB To select a particular queue using the fsl_sub command use the -q <queue-name> option, or use the -T and -R or --coprocessor* options to have the queue selected automatically.
NB The time limits we specify below refer to CPU time, which is NOT real time. Because the compute cluster is shared, a job often gets only a fraction of the available time on the CPU, so a job that actually takes 1 hour to run may only have used 25 minutes of CPU time.
QUEUE DETAILS
- veryshort.q - 30 minutes max CPU. Provides a set of slots for very quick tasks, with plenty of highly available compute power on the cluster. Use these as much as possible to get your jobs off the shared login servers. RAM usage limited to 12GB.
- short.q - 4 hrs max CPU. Run brief tasks, i.e. less than 4 hours CPU run time, on this queue. The short queues take precedence over all other queues, so if your task fits on this queue it is in your best interests to run it here. RAM usage limited to 12GB.
- long.q - 48 hrs max CPU. The long.q is the default queue. Tasks can run for a maximum of 48 hours CPU time (see the note on CPU time above). RAM usage limited to 12GB. Most of the FSL software runs in this sort of time frame, with the possible exception of group FEAT tasks. In this case the FEAT scripts have been written to ensure the right queues get used.
- verylong.q - unlimited CPU time at low priority. An unlimited duration queue with the lowest priority. Tasks which will take longer than 48 hours must be run here. These tasks get the lowest priority under the assumption that there will be plenty of spare CPU (esp. overnight) to ensure they run in a sensible time frame. RAM usage limited to 12GB.
- bigmem.q - targets machines with larger memory capabilities. The bigmem.q is for running large memory footprint tasks. It targets machines with large amounts of RAM and should be picked if you feel your analysis task is going to need unusual amounts of RAM. Currently there isn't a simple way of determining the memory footprint in advance, so you'll have to learn the hard way, i.e. through jobs running out of memory on the other queues. Please seek assistance if this is the case.
- interactive.q - interactive tasks. Where you just can't run a task without interaction, for example you have to press a start button in a window, we offer an interactive queue. This cannot be used as an fsl_sub target. See interactive queue for further details.
- cuda.q - targets machines with NVIDIA GPU hardware. Use the --coprocessor options to configure this resource (see GPU tasks in the Advanced Usage section). This queue has no limits, but please limit long running tasks as this is a significantly more restricted resource.
- lcmodel - targets the host with the LCModel spectroscopy software installed.
How to request and use an interactive queued session
Where your program requires interaction we offer an interactive queue which can be used to get a terminal session on one of the cluster nodes.
Request & Warning: You MUST log out AS SOON AS you finish using the interactive session as this uses up a slot on the cluster and so prevents other users from using the cluster.
To request a terminal session, issue the following command on jalapeno.fmrib.ox.ac.uk:
qlogin -q interactive.q
There may be a delay whilst the system finds a suitable host. Once one becomes available, if this is the first time you have logged into a particular node you may be asked to accept the host key. Enter `yes` to accept this host key and then you will be presented with a terminal session.
At this point (assuming you enabled X11 tunnelling to jalapeno.fmrib.ox.ac.uk) you should be able to run graphical X11 programs as well as terminal commands.
What queues are available and what to use them for
Available Queues
The jalapeno cluster provides four primary queues: veryshort.q, short.q, long.q and verylong.q, and two special purpose queues, bigmem.q and interactive.q. By choosing the most appropriate queue you can gain access to more resources, so it pays to think about the right queue to use. Further, if you choose a queue that has resource limits and your job exceeds them (time or memory) then your task will be killed, wasting compute time.
N.B. To select a particular queue using the fsl_sub command use the -q <queue-name> option, or use the -T and -R or --coprocessor* options to have the queue selected automatically.
N.B. The time limits we specify below refer to CPU time, which is NOT real time. Because the compute cluster is shared, a job often gets only a fraction of the available time on the CPU, so a job that actually takes 1 hour to run may only have used 25 minutes of CPU time.
Queue | Max Runtime | Max RAM (GB) | Usage |
---|---|---|---|
veryshort.q | 30 mins | 16 | Very quick tasks. Largest number of slots. Use these as much as possible to get your jobs off the shared login servers |
short.q | 4h | 16 | Brief tasks. The short/veryshort queues take precedence over all other queues so if your task fits on this queue it would be in your best interests to run it here |
long.q | 48h | 16 | The default queue. Tasks can run for a maximum of 48 hours CPU time. Most of the FSL software runs in this sort of time frame with the possible exception of some large group FEAT tasks. |
verylong.q | infinite | 12 | Lowest priority and limited number of slots. Tasks which will take longer than 48 hours must be run here. These tasks get the lowest priority under the assumption that there will be plenty of spare CPU (esp. overnight) to ensure they run in a sensible time frame. |
bigmem.q | infinite | ~300 | The bigmem.q is for running large memory footprint tasks. There are very few of these slots, which may use any available RAM on a host. Avoid unless necessary, and please seek assistance before using. |
cuda.q | infinite | ~200 | Targets machines with NVIDIA GPU hardware. Use the --coprocessor options to configure this resource (see GPU tasks in the Advanced Usage section). This queue has no limits, but please limit long running tasks as this is a significantly more restricted resource. |
interactive.q | infinite | 16 | Where you just can't run a task without interaction, for example you have to press a start button in a window, then we offer an interactive queue. This cannot be used as a fsl_sub target. See interactive queue for further details. |
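The runtime column of the table can be read as a simple decision rule. The sketch below is only an illustration of that rule for the four primary queues; it is not fsl_sub's own selection logic, it ignores memory and GPU requirements, and the function name pick_queue is made up for this example.

```shell
# Suggest the shortest primary queue whose CPU time limit fits the job.
# Argument: expected CPU time in minutes (the unit used by fsl_sub -T).
pick_queue() {
    minutes=$1
    if [ "$minutes" -le 30 ]; then
        echo "veryshort.q"          # up to 30 minutes
    elif [ "$minutes" -le 240 ]; then
        echo "short.q"              # up to 4 hours
    elif [ "$minutes" -le 2880 ]; then
        echo "long.q"               # up to 48 hours
    else
        echo "verylong.q"           # no CPU time limit
    fi
}

pick_queue 25
pick_queue 600
```

Remember that the limits apply to CPU time rather than elapsed time, so estimates should be based on the cpu figure from a previous run where one is available.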
More advanced techniques for submitting jobs, e.g. GPU, array and MATLAB tasks and full fsl_sub usage information
If your task comprises a complicated pipeline of interconnected tasks there are several options for splitting it into dependent tasks or for parallelising independent portions across many cluster nodes. Information on these techniques and other advanced options is in this section.
How to terminate jobs and solve submission/runtime problems
Occasionally tasks will fail. Grid Engine provides some tools and logging information for debugging such tasks.
When tasks begin running they generate two files, jobname.ojobid (e.g. feat.o12345) (referred to as the .o file) and jobname.ejobid (referred to as the .e file), which by default are created in the folder from which fsl_sub was run. The .o file contains any text that the program writes to the console whilst running, for example:
fsl_sub ls my_folder
outputs the job id, for example 12345. The task would generate a file ls.o12345 containing the folder listing for my_folder. If your command produces a running commentary of its progress you could monitor this with the tail command:
tail -f command.o12345
This will continue displaying the contents of command.o12345, adding new content as it arrives until you exit (type CTRL-c). The .e file contains any error messages the command produced. If you still need help then please contact the IT Team.
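A first check after a job has run is often whether its .e file contains anything. The helper below is a minimal sketch that builds the two file names from the job name and job id as described above; the function name check_job_logs and the demonstration files are invented for this example.

```shell
# Report whether a job wrote anything to its .e file.
# Usage: check_job_logs <jobname> <jobid> [directory]
check_job_logs() {
    name=$1
    id=$2
    dir=${3:-.}
    out_file="$dir/$name.o$id"
    err_file="$dir/$name.e$id"
    if [ ! -e "$out_file" ] && [ ! -e "$err_file" ]; then
        echo "no log files yet for $name job $id in $dir"
        return 1
    fi
    if [ -s "$err_file" ]; then
        echo "errors reported in $err_file:"
        tail -n 5 "$err_file"
    else
        echo "no error output in $err_file"
    fi
}

# Demonstration with made-up log files in a temporary folder.
demo_dir=$(mktemp -d)
echo "file_a file_b" > "$demo_dir/ls.o12345"
echo "ls: cannot access 'missing': No such file or directory" > "$demo_dir/ls.e12345"
check_job_logs ls 12345 "$demo_dir"
rm -rf "$demo_dir"
```

An empty .e file does not prove the job succeeded, since some programs report problems in the .o file instead, so it is worth looking at both.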
KILLING JOBS
If, after submitting a job you realise that you have made a mistake or that the job can't complete (perhaps you have insufficient disk space), you can kill the job with the command:
qdel job_id
If the job is currently running there may be a short pause whilst the task is terminated.