Compute Cluster Usage
Submitting jobs to the FMRIB SLURM compute cluster
Please see our job submission and monitoring section.
The new FMRIB cluster, Ood, uses the SLURM scheduling software, and the fsl_sub module now submits jobs to SLURM.
SLURM is significantly different from Grid Engine; in particular, there are no RAM limits for jobs. We STRONGLY recommend that you specify RAM (with fsl_sub's -R option) to ensure efficient use of the cluster. Without this option, all jobs will default to requesting 15GB of RAM. This also means that the -S/--noramsplit option is meaningless.
To assist with converting scripts that use fsl_sub with Grid Engine queue names to the Ood cluster, we have provided a script. To enable it use:
module add queue_migration
Then call it with two arguments: your original Grid Engine based script, and the name of the new script to create with time-based partition selection.
queue_migration myscript.sh myscript_slurm.sh
If your script targets bigmem.q, the converted script will default to requesting 64GB of RAM. If you require more than this, use the --ram option to queue_migration.
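As a sketch of this (the exact --ram syntax and the script names here are illustrative assumptions, not confirmed by the queue_migration documentation):

```shell
# Convert a script that targeted bigmem.q, requesting 128GB instead of the 64GB default
# (assumes --ram takes the size in GB; script names are examples only)
queue_migration --ram 128 my_bigmem_script.sh my_bigmem_slurm.sh
```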
Queue mapping
Jalapeno Queue | Ood Queue
---|---
veryshort.q | short
short.q | short
long.q | long
verylong.q | long
bigmem.q | long (+ memory specifier)
interactive.q | Reserved for the remote desktop system; interactive tasks can be launched on any of the normal queues
gpu.q | gpu_short or gpu_long
Multi-Threaded Tasks
fsl_sub's native options remain the same, but note that SLURM does not support parallel environments, so when requesting slots for multi-threaded jobs you can simply use -s <number>. If you provide a parallel environment name it will be discarded, so existing scripts should continue to work as-is.
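For example (job names and resource values here are illustrative; the Grid Engine style `-s pe_name,slots` form is shown on the assumption that your existing scripts use it):

```shell
# Request 8 slots for a multi-threaded job under SLURM: a plain number is enough
fsl_sub -s 8 -T 120 -R 32 ./my_threaded_job

# An existing Grid Engine style call still works; the "openmp" name is simply discarded
fsl_sub -s openmp,8 -T 120 -R 32 ./my_threaded_job
```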
Interactive GUI apps
Interactive tasks should be run via the new Open OnDemand virtual desktop facility.
How to run tasks on the cluster queues
Auto-submitting software
Some FSL commands and/or GUIs automatically queue themselves where appropriate, i.e., you do not need to use 'fsl_sub' to submit these programs.
Please note that this list may not be exhaustive, so you may come across more commands which have been adapted to queue themselves. If you do submit one of these tools to the queues it will still run, but may not be able to make full use of the cluster resources (e.g. it may not be able to run multiple tasks in parallel).
Other commands run from the terminal command line will need to use the `fsl_sub` command, described below, to submit them to the queue.
First load the FSL module:

module add fsl

This line can be added to your .bash_profile to ensure it takes effect for every login session you have.
Submitting jobs with fsl_sub
Typing fsl_sub before the rest of your command will send the job to the cluster. By default, this will submit your task to the short partition. fsl_sub can automatically choose a queue for you if you provide information about your job's requirements - we would strongly recommend that you provide at least an estimated maximum run time (--jobtime) to allow SLURM to schedule jobs efficiently.
There are several ways to select a queue:
- Use the -R (--jobram) and -T (--jobtime) options to fsl_sub to specify the maximum memory and run-time requirements for your job (in GB and minutes of wall time*) respectively. fsl_sub will then select the most appropriate queue for you.
- Request GPU tasks using the --coprocessor options (see the Running GPU Tasks section).
- Specify a specific partition with the -q (--queue) option. For further information on the available queues and which to use when, see the queues section.
- The command you want to run on the queues must be in your path - this does NOT include the current folder. If it isn't then you must specify the path to the command; commands/scripts in the current folder must be prefixed with './', e.g. ./script.
- The FMRIB SLURM cluster does not have a 'verylong' or 'bigmem' equivalent queue. See Long Running Tasks below.
- Jobs submitted to the FMRIB SLURM cluster do NOT inherit the 'environment' of your login shell, e.g. environment variables such as FSLDIR are not copied over to your job. Load software configuration (such as FSL) from shell modules, or use the '--export' option to fsl_sub to copy the variables to your job (see Passing Environment Variables to Queued Jobs).
- Wall Time: Unlike the FMRIB Jalapeno cluster (which uses CPU time) the SLURM cluster measures job run-time in real time, often called wall time (as in the time on a clock on the wall).
To assess the time necessary for your job to complete you can look at the run-times of similar previous jobs using the 'sacct' command (see Monitoring Tasks).
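For example, sacct can report the elapsed time and peak memory of past jobs (the job ID below is illustrative):

```shell
# Elapsed wall time, peak memory and final state for a completed job
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State

# Or review all of your jobs from the last week
sacct --starttime=now-7days --format=JobID,JobName,Elapsed,MaxRSS,State
```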
Example Usage
To queue a job which requires 10GiB of memory and runs for 2 hours use:
fsl_sub -T 120 -R 10 ./myjob
FSLSUB_MEMORY_REQUIRED=32G feat mydesign.feat
would submit a FEAT task, informing fsl_sub that you expect to require 32GiB of memory. If no units are specified, the integer is assumed to be in the units given in the configuration file (GiB by default).
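If you prefer to pick the partition yourself, the -q option can be combined with the time and memory options (the partition name and resource values here are illustrative):

```shell
# Explicitly target the long partition with a 3-day run time and 64GiB of RAM
fsl_sub -q long -T 4320 -R 64 ./my_long_job
```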
The different partitions have different run-time and memory limits; when a task reaches these limits it will be terminated. Shorter queues also take precedence over longer ones, so it is advantageous to provide the scheduler with as much information as possible about your job's memory and time requirements.
The command you submit cannot run a graphical interface, as it will have nowhere to display its output.
If you want to run a non-interactive MATLAB task on the queues then see MATLAB jobs.
fsl_sub Options
To see a full list of the available options use:
fsl_sub --help
In addition to the list of options this will also display a list of partitions available for use with descriptions of allowed run times and memory availability. For details on how to use these options see the Advanced Usage section.
Long running tasks
Unlike the Jalapeno cluster, the SLURM cluster does not offer 'infinite' partitions (equivalent to verylong.q, bigmem.q and cuda.q on the Jalapeno cluster). You must break your task into shorter components, or regularly save state to allow restart, and submit these parts (or resubmit the job to continue where it left off) using job holds to prevent a task running before the previous one completes.
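A minimal sketch of chaining two parts with a job hold, assuming fsl_sub prints the job ID of the submitted job on stdout and that -j takes a job ID to hold on (script names are illustrative):

```shell
# Submit part 1 and capture its job ID
jid=$(fsl_sub -T 1440 -R 16 ./part1.sh)

# The -j job hold keeps part 2 queued until part 1 has completed
fsl_sub -T 1440 -R 16 -j "$jid" ./part2.sh
```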
How to monitor the progress of your submitted job
Please see BMRC documentation on job monitoring: Checking or deleting running jobs
How to pass environment variables to SLURM jobs
By default no environment variables from your current shell are passed to your job running on the cluster.
Where the important variables were set by loading an environment module, you do not need to do anything, as fsl_sub will automatically load your currently loaded modules in your cluster job. For other variables you can request that fsl_sub pass a subset of variables to your job with the --export option (pass this multiple times to export multiple variables). You can also use this option to set an environment variable in your job that is not already set in your shell.
For very complicated use cases or dynamic variable setting, create a script that sets up all your variables and then calls the software - submit this script to the cluster.
There are two ways to use --export:
- --export VARIABLENAME - This will copy the current environment variable setting into your job (specify multiple times for multiple variables)
- --export VARIABLENAME=VARIABLEVALUE - This will set the environment variable to the value after the '=' in the queued job only (without affecting your shell), so is ideal where you need to specify a job-specific value
fsl_sub will automatically transfer some internal variables and may have been configured to include some additional useful ones, see the 'exports:' option in the output of
fsl_sub --show_config
for the list of default exports. Any --export passed on the commandline will override these configured options. It is also possible to configure fsl_sub for your account to always copy over particular variables, see Configuring fsl_sub.
Option 2, where you provide the variable with a value is particularly useful if you are scheduling many similar tasks and need to specify a different value for an environment variable for each run, for example SUBJECTS_DIR which FreeSurfer uses to specify where your data sets reside.
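For example, to give each queued FreeSurfer run its own SUBJECTS_DIR (the paths and script names here are illustrative):

```shell
# Each job gets a different SUBJECTS_DIR without changing the current shell
fsl_sub --export SUBJECTS_DIR=/data/project/subj01 ./run_freesurfer.sh
fsl_sub --export SUBJECTS_DIR=/data/project/subj02 ./run_freesurfer.sh

# Copy an existing variable's current value from the shell instead
fsl_sub --export FSLOUTPUTTYPE ./myjob.sh
```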
Available SLURM partitions
There are currently five partitions (often called queues on other platforms) configured in fsl_sub:
Queue | Duration | Max Memory | Default Memory | Purpose
---|---|---|---|---
short | 1.2 days | 1TB | 15GB | batch jobs
long | 10 days | 1TB | 15GB | batch jobs
interactive | 10 days | 256GB | 15GB | interactive jobs
gpu_short | 4h | 94GB/GPU (A30), 48GB/GPU (H100) | 48GB | GPU batch jobs
gpu_long | 60h | 94GB/GPU (A30), 48GB/GPU (H100) | 48GB | GPU batch and interactive jobs
More advanced techniques for submitting jobs, e.g. GPU, array and MATLAB tasks and full fsl_sub usage information
If your task comprises a complicated pipeline of interconnected tasks, there are several options for splitting it into dependent tasks or parallelising independent portions across many cluster nodes. Information on these techniques and other advanced options is in this section.
Which tools automatically submit themselves to a cluster queue
The following programs/scripts are able to self-submit in a HPC cluster and should not be used in conjunction with fsl_sub.
Scripts that self-submit:
- FDT (bedpostx)
- FEAT
- FIRST
- FSLVBM
- POSSUM
- RANDOMISE
- TBSS

GUIs that self-submit:
- FDT
- FEAT
- FLIRT
- POSSUM
Note that all other FSL GUIs will only run jobs on the local machine; to submit to a cluster you must use the equivalent command-line call in conjunction with fsl_sub.
How to troubleshoot failed jobs
Occasionally tasks will fail. When tasks begin running they generate two files, jobname.ojobid (referred to as the .o file) and jobname.ejobid (referred to as the .e file), which by default are created in the folder from which fsl_sub was run (or wherever you specified on the command line; FEAT tasks will create a logs folder within the .(g)feat folder).
The .o file contains any text that the program writes to the console whilst running. For example, suppose
fsl_sub ls my_folder
outputs the job id 12345. The task would generate a file ls.o12345 containing the folder listing for my_folder.
If your command produces a running commentary of its progress you could monitor this with the tail command:
tail -f command.o12345
This will continue displaying the contents of command.o12345, showing new content as it arrives, until you exit (type CTRL-c).
The .e file contains any error messages the command produced. If you still need help then please contact the IT Team.
Killing jobs
If, after submitting a job, you realise that there is a problem, you can kill the job with the command
scancel job_id
If the job is currently running there may be a short pause whilst the task is terminated.