Monitoring Tasks
How to check on the progress of your jobs
The Grid Engine software provides the qstat command for listing the state of the queues. By default it lists only the jobs that you have submitted. Use:
qstat -help
for a list of all the options. If you want to see the overall state of the queues, including everyone else's tasks, then use:
qstat -u \*
INTERPRETING QSTAT'S OUTPUT
The qstat listing has several columns indicating the jobid, priority, owner, taskname, status and which queue the task was submitted to or is running on. The most important details are described below:
- Priority - A floating point number in the second column of output indicates the relative priority of each task. The task with the highest priority, shown at the top of the pending list, will be the next one to get access to any available resources.
- State - A string of characters EhqwRrdTsS indicating the following conditions:
State characters | Meaning |
---|---|
E | The task is in the Error state. Contact IT Support Staff for help. |
h | Job is held until completion of some other task. Use qstat -j <jobid> to find out which task(s) it is dependent on. |
qw | Job is in the queued and waiting or pending state. This task will be submitted to an execution host as soon as one becomes available and the task priority is the highest of those in the pending state. |
r | This task is running. An extra field indicates which actual execution host the task is running on, e.g. long.q@jalapeno01.cluster.fmrib.ox.ac.uk etc. |
R | Re-scheduled. This will usually mean one of the operators has restarted the task, perhaps because a node crashed. Contact IT Support Staff for a full explanation. For some jobs this will result in failure or corrupted output so please check the output of these tasks carefully. |
d | Deleted. This job has been scheduled for deletion. |
dr | Deleted but still running. This happens when a job is deleted but the node which was running the task isn't responding to the request to remove the task. You should contact computing-help@win.ox.ac.uk. |
s/S/T | Suspended. This job has been temporarily suspended, probably because a machine became overloaded with higher priority tasks. It will resume once the load reduces sufficiently. |
t | Transitioning. This job is starting. |
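The state column often combines several of these characters, for example hqw or dr. As a minimal sketch of how the table can be applied, the function below splits a state string into the conditions it contains. describe_state is a hypothetical helper, not part of Grid Engine, and it only knows the characters listed in the table above.

```shell
# describe_state: hypothetical helper (not part of Grid Engine) that prints
# one line per condition found in a qstat state string such as "hqw" or "dr".
describe_state() {
    state=$1
    # "qw" is a two-character unit, so report it and strip it first.
    case $state in
        *qw*)
            echo "queued and waiting"
            state=$(printf '%s' "$state" | sed 's/qw//')
            ;;
    esac
    # Walk the remaining characters one at a time.
    while [ -n "$state" ]; do
        rest=${state#?}          # everything after the first character
        char=${state%"$rest"}    # the first character itself
        case $char in
            E) echo "error" ;;
            h) echo "held" ;;
            r) echo "running" ;;
            R) echo "re-scheduled" ;;
            d) echo "deleted" ;;
            s|S|T) echo "suspended" ;;
            t) echo "transitioning" ;;
            *) echo "unknown ($char)" ;;
        esac
        state=$rest
    done
}
```

For example, describe_state Eqw reports both the queued and the error condition, which matches the Eqw job in the listings further down.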
EXAMINING COMPLETED JOBS
Once a job completes, qstat will no longer be able to find the job id. You can now query the cluster software using the qacct command.
N.B. Because of the number of jobs submitted to our cluster, the database of completed jobs is regularly rotated for performance reasons. We provide a command, qacct-all, which calls qacct on all the archived job databases; try this if qacct does not return information on your job.
qacct takes several options but the most useful one is '-j <jobid>' which returns information on the provided job id. Of the information this command provides the most useful entries are:
Entry | Purpose |
---|---|
qname | Name of queue this ran on |
hostname | Name of node job ran on - useful for IT Help to troubleshoot issues |
start/end_time | Start and end time (real) - useful if there was a known issue at a particular time |
slots | How many parallel environment slots/threads your job had |
failed & exit_status | Whether the cluster software thinks the job failed, and the exit status of the job. N.B. A job can fail in many ways that the cluster is not aware of, so a clean result here is not necessarily proof that the job completed successfully |
ru_wallclock | Real time run-time of job |
cpu | CPU time of job (seconds) |
maxvmem | Maximum memory job required (units given) |
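qacct -j prints many more entries than the ones in this table. The sketch below filters the output down to just these entries. qacct_summary is a hypothetical helper name, and the filter assumes that qacct writes one "name value" pair per line, separated by whitespace.

```shell
# qacct_summary: hypothetical helper that reads `qacct -j <jobid>` output on
# standard input and keeps only the entries from the table above.
# Assumes one "name value" pair per line, separated by whitespace.
qacct_summary() {
    awk '
        $1 ~ /^(qname|hostname|start_time|end_time|slots|failed|exit_status|ru_wallclock|cpu|maxvmem)$/ {
            print
        }
    '
}
```

A typical call would be qacct -j 14000 | qacct_summary, or the same pipe from qacct-all if the job has already been rotated into an archive.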
QSTAT EXAMPLES
joebloggs@jalapeno $ qstat -u \*
job-ID prior   name       user     state submit/start at     queue                                    slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
28704  0.55002 Connectivi moonunit r     04/23/2006 08:28:23 long.q@jalapeno10.cluster.fmrib.ox.ac.uk 1
28714  0.55002 feat_aoEky apple    r     04/23/2006 05:25:23 long.q@jalapeno23.cluster.fmrib.ox.ac.uk 1
23668  0.55013 feat_aftwD heavenly dr    06/22/2005 22:28:16 long.q@jalapeno23.cluster.fmrib.ox.ac.uk 1
28706  0.55002 Connectivi moonunit R     04/21/2006 20:19:35 long.q@jalapeno01.cluster.fmrib.ox.ac.uk 1
28707  0.55002 Connectivi moonunit qw    04/21/2006 20:19:41                                          1
28673  0.55002 bedpost    brooklyn qw    04/20/2006 15:49:43                                          1     1-42:1
27378  0.00000 STDIN      geronimo hqw   02/10/2006 16:53:47                                          1
28544  0.00000 Franklin   fuschia  Eqw   04/06/2006 20:08:14                                          1
28674  0.00000 bp_postpro brooklyn hqw   04/20/2006 15:49:43                                          1
TO SEE A LISTING OF JUST THE "RUNNING" JOBS:
joebloggs@jalapeno $ qstat -u \* -s r
job-ID prior   name       user     state submit/start at     queue                                    slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
28704  0.55002 Connectivi moonunit r     04/23/2006 08:28:23 long.q@jalapeno10.cluster.fmrib.ox.ac.uk 1
28714  0.55002 feat_aoEky apple    r     04/23/2006 05:25:23 long.q@jalapeno23.cluster.fmrib.ox.ac.uk 1
28706  0.55002 Connectivi moonunit R     04/21/2006 20:19:35 long.q@jalapeno01.cluster.fmrib.ox.ac.uk 1
TO SEE THE STATE OF ONLY THOSE JOBS BELONGING TO A PARTICULAR USER:
joebloggs@jalapeno $ qstat -u apple
job-ID prior   name       user     state submit/start at     queue                                    slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
28714  0.55002 feat_aoEky apple    r     04/23/2006 05:25:23 long.q@jalapeno01.cluster.fmrib.ox.ac.uk 1
28721  0.55002 feat_afpcn apple    r     04/23/2006 18:42:53 long.q@jalapeno1.cluster.fmrib.ox.ac.uk  1
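When the listing is long, a count of jobs per state can be easier to read than the listing itself. The following is a rough sketch only: count_states is a hypothetical helper, and it assumes the layout shown in the listings above, i.e. two header lines followed by one job per line with the state in the fifth column.

```shell
# count_states: hypothetical helper that reads a qstat listing on standard
# input and prints how many jobs are in each state.
# Assumes two header lines, then one job per line with the state in column 5.
count_states() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}
```

It would be used as qstat -u \* | count_states. Job names containing spaces would shift the columns, so treat the result as a quick overview rather than an exact report.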
TO SEE THE FULL STATE OF A PARTICULAR TASK:
qstat -j 14000
==============================================================
job_number: 14000
exec_file: job_scripts/14000
submission_time: Thu Apr 27 10:09:43 2006
owner: apple
uid: 123456
group: apple-group
gid: 987654
sge_o_home: /home/apple
sge_o_log_name: dcm
sge_o_path: /tmp/13999.1.short.q:/home/apple/bin:.....
sge_o_shell: /bin/bash
sge_o_workdir: /vols/Scratch/apple
sge_o_host: jalapeno02
account: sge
cwd: /vols/Scratch/apple
path_aliases: /private/automount/ * * /
stderr_path_list: /home/apple/.fsltmp
mail_list: apple@jalapeno02.cluster.fmrib.ox.ac.uk
notify: FALSE
job_name: feat
stdout_path_list: /home/apple/.fsltmp
jobshare: 0
hard_queue_list: long.q
env_list: MANPATH=/home/apple/man:/opt/sge/.....
job_args: /home/apple/.fsltmp/feat_4HlloR,1
script_file: /opt/fmrib/fsl/bin/feat
usage 1: cpu=00:00:00, mem=0.00000 GBs, io=0.00000, vmem=N/A, maxvmem=N/A
scheduling info:
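If only one or two of these fields are of interest, they can be picked out of the output. The sketch below assumes each field is printed as "name: value" on a single line, as in the example above. job_field is a hypothetical helper name.

```shell
# job_field: hypothetical helper that reads `qstat -j <jobid>` output on
# standard input and prints the value of one named field.
# Assumes each field is written as "name:   value" on a single line.
job_field() {
    awk -v name="$1" '
        index($0, name ":") == 1 {
            value = substr($0, length(name) + 2)   # text after "name:"
            sub(/^[ \t]+/, "", value)              # drop the padding
            print value
        }
    '
}
```

For instance, qstat -j 14000 | job_field hard_queue_list would print long.q for the job shown above.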
AVAILABLE QUEUES
WHAT QUEUES DOES THE JALAPENO CLUSTER PROVIDE AND WHAT ARE THEY FOR?
The jalapeno cluster provides four primary queues, ''veryshort.q'', ''short.q'', ''long.q'' and ''verylong.q'', plus special purpose queues such as ''bigmem.q'' and ''interactive.q''. By choosing the most appropriate queue you can gain access to more resources, so it pays to think about the right queue to use. Further, if you choose a queue that has resource limits and your job exceeds them (time or memory) then your task will be killed, wasting compute time.
NB To select a particular queue using the ''fsl_sub'' command use the ''-q <queue-name>'' option or use the -T and -R or --coprocessor* options to automatically select the queue to use.
NB The time limits we specify below refer to CPU time - this is '''NOT''' real-time. Because the compute cluster is shared, a job often gets a fraction of the available time on the CPU so a job that actually takes 1 hour to run may only have used 25 minutes of CPU time.
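To see how CPU time and real time compare for a job that has already finished, the cpu and ru_wallclock values reported by qacct can be divided. The sketch below does that arithmetic for the example in the note (25 minutes of CPU in 1 hour of real time). cpu_share is a hypothetical helper, and it assumes both values are given in seconds.

```shell
# cpu_share: hypothetical helper that prints the percentage of wall-clock
# time a job spent on the CPU, given the cpu and ru_wallclock values
# (both assumed to be in seconds) reported by qacct.
cpu_share() {
    awk -v cpu="$1" -v wall="$2" 'BEGIN {
        if (wall <= 0) { print "n/a"; exit }
        printf "%.0f%%\n", 100 * cpu / wall
    }'
}

# 25 minutes of CPU (1500 s) in 1 hour of real time (3600 s)
cpu_share 1500 3600
```

A job that uses several slots can report more CPU time than real time, in which case the figure exceeds 100%.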
QUEUE DETAILS
- veryshort.q - 30 minutes max CPU. Provides a set of slots for very quick tasks. Provides plenty of highly available compute power on the cluster. Use these as much as possible to get your jobs off the shared login servers. RAM usage limited to 12GB.
- short.q - 4 hrs max CPU. Run brief tasks, i.e. less than 4 hours CPU run time, on this queue. The short queues take precedence over all other queues so if your task fits on this queue it would be in your best interests to run it here. RAM usage limited to 12GB.
- long.q - 24 hrs max CPU. The long.q is the default queue. Tasks can run for a maximum of 24 hours CPU time (see the note on CPU time above). RAM usage limited to 12GB. Most of the FSL software runs in this sort of time frame with the possible exception of group '''FEAT''' tasks. In this case the '''FEAT''' scripts have been written to ensure the right queues get used.
- verylong.q - unlimited CPU time at low priority. An unlimited duration queue with the lowest priority. Tasks which will take longer than 24 hours must be run here. These tasks get the lowest priority under the assumption that there will be plenty of spare CPU (esp. overnight) to ensure they run in a sensible time frame. RAM usage limited to 12GB.
- bigmem.q - targets machines with larger memory capabilities. The bigmem.q is for running large memory footprint tasks. It targets machines with large amounts of RAM and should be picked if you feel your analysis task is going to need unusual amounts of RAM. Currently there isn't a simple way of determining the memory footprint so you'll have to learn the hard way, i.e., through jobs otherwise running out of memory. Please seek assistance if this is the case.
- interactive.q - interactive tasks. Where you just can't run a task without interaction, for example you have to press a start button in a window, we offer an interactive queue. This cannot be used as a ''fsl_sub'' target. See interactive queue for further details.
- cuda.q - targets machines with NVIDIA GPU hardware. Use the --coprocessor options to configure this resource (see GPU tasks in the Advanced Usage section). This queue has no limits, but please limit long running tasks as this is a significantly more restricted resource.
- lcmodel - targets the host with the LCModel spectroscopy software installed.
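As a rough illustration of choosing between the four time-limited queues, the sketch below maps an estimated CPU time to a queue name. pick_queue is a hypothetical helper, not something provided on the cluster. It takes the 30 minute and 4 hour limits from the list above and assumes a 24 hour cut-off between long.q and verylong.q, which can be overridden with a second argument. It looks at CPU time only and ignores memory, GPU and interactive requirements.

```shell
# pick_queue: hypothetical helper that suggests a queue for an estimated CPU
# time given in minutes. The long.q cut-off defaults to an assumed 24 hours
# (1440 minutes); pass a second argument if your limit differs.
pick_queue() {
    minutes=$1
    long_limit=${2:-1440}   # assumed long.q limit in minutes
    if [ "$minutes" -le 30 ]; then
        echo veryshort.q
    elif [ "$minutes" -le 240 ]; then
        echo short.q
    elif [ "$minutes" -le "$long_limit" ]; then
        echo long.q
    else
        echo verylong.q
    fi
}
```

The result could then be passed to the -q option described earlier, for example fsl_sub -q "$(pick_queue 90)" ./my_analysis.sh, where my_analysis.sh stands in for your own script.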