Submitting jobs to the FMRIB SLURM compute cluster

The new FMRIB cluster uses the SLURM cluster software, and the fsl_sub module now submits jobs to the SLURM cluster.

SLURM is significantly different from Grid Engine; in particular, there are no RAM limits for jobs. We STRONGLY recommend that you specify RAM (with fsl_sub's -R option) to ensure efficient use of the cluster; without it, all jobs will default to requesting 15GB of RAM. This also means that the -S/--noramsplit option is meaningless.

fsl_sub's native options remain the same. Note, however, that SLURM does not support parallel environments, so to request slots for a multi-threaded job use -s <number>. If you provide a parallel environment name it will be discarded, so existing scripts should continue to work as is.
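As a minimal sketch of a multi-threaded submission, the wrapper below combines -s with the -R and -T options described on this page. The function name submit_threaded and the job ./my_threaded_job are hypothetical, and the resource values are placeholders rather than recommendations.

```shell
# Hypothetical helper: submit a multi-threaded command with explicit resources.
# Arguments: <threads> <RAM in GB> <wall time in minutes> <command> [args...]
submit_threaded() {
    local threads="$1"
    local ram_gb="$2"
    local minutes="$3"
    shift 3
    # -s requests the slots (threads), -R the memory, -T the run time.
    fsl_sub -s "$threads" -R "$ram_gb" -T "$minutes" "$@"
}

# Example (needs fsl_sub on the cluster):
#   submit_threaded 4 16 60 ./my_threaded_job
```

Your program must still be told how many threads to use; how that is done depends on the software.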

Interactive tasks should be run via the new Open OnDemand virtual desktop facility.

How to run tasks on the cluster queues

Auto-submitting software

Some FSL commands and/or GUIs automatically queue themselves where appropriate, i.e., you do not need to use 'fsl_sub' to submit these programs.

Please note that this list may not be exhaustive, so you may come across more commands which have been adapted to queue themselves. If you do submit one of these tools to the queues with fsl_sub it will still run, but it may not be able to make full use of the cluster resources (e.g. it may not be able to run multiple tasks in parallel).

Other commands run from the terminal command line will need to use the `fsl_sub` command, described below, to submit them to the queue.

Before submitting any tasks, make sure you have loaded any shell modules you require (fsl_sub and its required configuration information, fsl_sub_config, are loaded for you automatically). For example, to use FSL:
module add fsl

These lines can be added to your .bash_profile to ensure they take effect for every login session you have.

Submitting jobs with fsl_sub

Typing fsl_sub before the rest of your command will send the job to the cluster. By default, this will submit your task to the short partition. fsl_sub can automatically choose a queue for you if you provide information about your job's requirements. We strongly recommend that you provide at least an estimated maximum run time (--jobtime) to allow SLURM to schedule your job efficiently.


There are several ways to select a queue:

  1. Use the -R (--jobram) and -T (--jobtime) options to fsl_sub to specify the maximum memory and run-time requirements for your job (in GB and minutes of wall time*) respectively. fsl_sub will then select the most appropriate queue for you.
    GPU tasks can be requested using the --coprocessor options (see the Running GPU Tasks section).
  2. Request a specific partition with the -q (--queue) option. For further information on the available queues and when to use each one, see the queues section.
Notes:
  • The command you want to run on the queues must be in your path - this does NOT include the current folder. If it isn't then you must specify the path to the command; commands/scripts in the current folder must be prefixed with './', e.g. './script'.
  • The FMRIB SLURM cluster does not have a 'verylong' or 'bigmem' equivalent queue. See Long Running Tasks below.
  • Jobs submitted to the FMRIB SLURM cluster do NOT inherit the 'environment' of your login shell, e.g. environment variables such as FSLDIR are not copied over to your job. Load software configuration (such as FSL) from shell modules or use the '--export' option to fsl_sub to copy the variables to your job (see Passing Environment Variables to Queued Jobs).
  • Wall Time: Unlike the FMRIB Jalapeno cluster (which uses CPU time) the SLURM cluster measures job run-time in real time, often called wall time (as in the time on a clock on the wall).
    To assess the time necessary for your job to complete you can look at the run-times of similar previous jobs using the 'sacct' command (see Monitoring Tasks).
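Because -T expects minutes while sacct reports elapsed time as a clock string, a small conversion helper can be handy. The sketch below assumes the elapsed time is in the form [DD-]HH:MM:SS and rounds up to whole minutes; the function name elapsed_to_minutes is hypothetical.

```shell
# Hypothetical helper: convert an elapsed time of the form [DD-]HH:MM:SS
# into whole minutes (rounded up), suitable as a starting point for -T.
elapsed_to_minutes() {
    echo "$1" | awk -F'[-:]' '{
        # Counting from the right: seconds, minutes, hours, optional days.
        s = $NF; m = $(NF-1); h = $(NF-2)
        d = (NF >= 4) ? $(NF-3) : 0
        total = d*86400 + h*3600 + m*60 + s
        print int((total + 59) / 60)
    }'
}

# Example (needs SLURM): list the run time and peak memory of a finished job,
# then convert its Elapsed value.
#   sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State
#   elapsed_to_minutes 01:30:00
```

It is sensible to add a safety margin to the converted value, since a job that reaches its time limit is terminated.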

Example Usage

To queue a job which requires 10GiB of memory and runs for 2 hours use:

fsl_sub -T 120 -R 10 ./myjob
This will result in a job being put on the short partition.
If your software task automatically queues then you can also specify the memory you expect the task to require with the environment variable FSLSUB_MEMORY_REQUIRED, for example:
FSLSUB_MEMORY_REQUIRED=32G feat mydesign.feat

would submit a FEAT task informing fsl_sub that you expect to require 32GiB of memory. If units aren't specified then the integer is assumed to be in the units specified in the configuration file (default GiB).

The different partitions have different run-time and memory limits; when a task reaches these limits it will be terminated. Shorter queues also take precedence over longer ones, so it is advantageous to provide the scheduler with as much information as possible about your job's memory and time requirements.

The command you submit cannot run any graphical interface, as it will have nowhere to display its output.

If you want to run a non-interactive MATLAB task on the queues then see MATLAB jobs. 

fsl_sub Options

To see a full list of the available options use:

fsl_sub --help

In addition to the list of options this will also display a list of partitions available for use with descriptions of allowed run times and memory availability. For details on how to use these options see the Advanced Usage section.

Long running tasks

Unlike the Jalapeno cluster, the SLURM cluster does not offer 'infinite' partitions (equivalent to verylong.q, bigmem.q and cuda.q on the Jalapeno cluster). You must either break your task up into shorter components, or have it save its state regularly so that it can be restarted. Then submit these parts (or resubmit the job so that it continues where it left off), using job holds to prevent each task from running before the previous one completes.
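One way to chain the parts of a long task is sketched below. It relies on fsl_sub printing the new job ID on standard output, and on fsl_sub's -j (--jobhold) option, which is not described on this page, so please confirm it with fsl_sub --help. The function name submit_chain and the part scripts are hypothetical.

```shell
# Hypothetical helper: submit scripts so that each one starts only after the
# previous one has completed.
# Arguments: <wall time in minutes> <RAM in GB> <script> [script...]
submit_chain() {
    local minutes="$1"
    local ram_gb="$2"
    local prev=""
    local script
    shift 2
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # The first part has nothing to wait for.
            prev=$(fsl_sub -T "$minutes" -R "$ram_gb" "$script") || return 1
        else
            # -j holds this part until job $prev has completed.
            prev=$(fsl_sub -T "$minutes" -R "$ram_gb" -j "$prev" "$script") || return 1
        fi
        echo "$script -> job $prev"
    done
}

# Example (needs fsl_sub on the cluster):
#   submit_chain 600 8 ./part1.sh ./part2.sh ./part3.sh
```

Each part gets the same time and memory request here for simplicity; parts with different needs would be submitted individually in the same way.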

How to monitor the progress of your submitted job

Please see BMRC documentation on job monitoring: Checking or deleting running jobs

How to pass environment variables to SLURM jobs

By default no environment variables from your current shell are passed to your job running on the cluster.

Where the important variables were set by loading an environment module, you do not need to do anything, as fsl_sub will automatically load the currently loaded modules in your cluster job. For other variables, you can ask fsl_sub to pass a subset of variables to your job with the --export option (pass it multiple times to export multiple variables). You can also use this option to set an environment variable in your job that is not already set in your shell.

For very complicated use cases or dynamic variable setting, create a script that sets up all your variables and then calls the software - submit this script to the cluster.

There are two ways to use --export:

  1. --export VARIABLENAME {--export VARIABLENAME}    This will copy the current environment variable setting into your job (specify multiple times for multiple variables)
  2. --export VARIABLENAME=VARIABLEVALUE     This will set the environment variable to the value after the '=' in the queued job only (not affecting your shell), so it is ideal where you need to specify a job-specific value

fsl_sub will automatically transfer some internal variables and may have been configured to include some additional useful ones; see the 'exports:' option in the output of

fsl_sub --show_config

for the list of default exports. Any --export passed on the command line will override these configured options. It is also possible to configure fsl_sub for your account to always copy over particular variables; see Configuring fsl_sub.

Option 2, where you provide the variable with a value is particularly useful if you are scheduling many similar tasks and need to specify a different value for an environment variable for each run, for example SUBJECTS_DIR which FreeSurfer uses to specify where your data sets reside.
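The two forms of --export can be combined in one submission, as in the sketch below. The function name submit_per_study, the script ./process_study.sh and the variable LOG_LEVEL are hypothetical, and the time and memory values are placeholders.

```shell
# Hypothetical helper: submit one job per study folder, giving each job its
# own SUBJECTS_DIR. ./process_study.sh stands in for your own script.
submit_per_study() {
    local study
    for study in "$@"; do
        # Form 2 sets SUBJECTS_DIR in the job only.
        # Form 1 copies the current value of LOG_LEVEL from this shell.
        fsl_sub -T 240 -R 8 \
            --export SUBJECTS_DIR="$study" \
            --export LOG_LEVEL \
            ./process_study.sh
    done
}

# Example (needs fsl_sub on the cluster):
#   export LOG_LEVEL=debug
#   submit_per_study /data/study_a /data/study_b
```

Your login shell's own SUBJECTS_DIR is left untouched, so the loop can be re-run with different folders without any clean-up.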

Available SLURM partitions

There are currently five partitions (often called queues on other platforms) configured in fsl_sub:

Queue         Duration   Max-memory   Purpose
short         1.2 days   1TB          batch jobs
long          10 days    1TB          batch jobs
interactive   10 days    256GB        interactive jobs
gpu_short*    4h         61GB/gpu*    GPU batch jobs
gpu_long*     60h        61GB/gpu*    GPU batch and interactive jobs

* For the transition period there is a very limited number of GPUs available. New GPU hardware is on order and this will increase the available memory per GPU.
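fsl_sub chooses a partition for you when you give it a run time, so the following is only an illustration of how the duration limits of the two CPU batch partitions translate into the minutes used by -T (1.2 days is 1728 minutes, 10 days is 14400 minutes). The function name pick_partition is hypothetical, and the configured limits may change, so treat the output of fsl_sub --help as authoritative.

```shell
# Hypothetical helper: name the CPU batch partition whose duration limit
# (taken from the table on this page) can hold a job of the given length.
pick_partition() {
    local minutes="$1"
    if [ "$minutes" -le 1728 ]; then        # 1.2 days
        echo short
    elif [ "$minutes" -le 14400 ]; then     # 10 days
        echo long
    else
        echo "no partition allows $minutes minutes; split the job" >&2
        return 1
    fi
}

# Example:
#   fsl_sub -q "$(pick_partition 3000)" -T 3000 -R 32 ./myjob
```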

More advanced techniques for submitting jobs, e.g. GPU, array and MATLAB tasks and full fsl_sub usage information

If your task comprises a complicated pipeline of interconnected tasks there are several options for splitting it into dependent tasks or parallelising independent portions across many cluster nodes. Information on these techniques and other advanced options is in this section.

Cluster advanced usage

Which tools automatically submit themselves to a cluster queue

Which tools automatically use a cluster? 

The following programs/scripts are able to self-submit on an HPC cluster and should not be used in conjunction with fsl_sub.

Scripts that self-submit
FDT
  • bedpostx
FEAT
  • feat
  • feat_gm_prepare
FIRST
  • run_first_all
FSLVBM
  • fslvbm_2_template
  • fslvbm_3_proc
POSSUM
  • possumX
RANDOMISE
  • randomise_parallel
TBSS
  • tbss_2_reg
GUIs that self-submit
FDT
  • FDT GUI ( probtrackx only )
FEAT
  • FEAT GUI
FLIRT
  • FLIRT GUI
POSSUM
  • POSSUM GUI

Note that all other FSL GUIs will only run jobs on the local machine. To submit to a cluster you must use the equivalent command-line call in conjunction with fsl_sub.
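A wrapper script can check a command name against the list of self-submitting scripts before deciding whether to use fsl_sub. The sketch below hard-codes the names given on this page, which may not be exhaustive; the function name self_submits is hypothetical.

```shell
# Hypothetical helper: succeed if the named command is one of the
# self-submitting scripts listed on this page.
self_submits() {
    case "$(basename "$1")" in
        bedpostx|feat|feat_gm_prepare|run_first_all|fslvbm_2_template|\
        fslvbm_3_proc|possumX|randomise_parallel|tbss_2_reg)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Example: run self-submitting tools directly, everything else via fsl_sub.
#   if self_submits "$tool"; then "$tool" "$@"; else fsl_sub -T 60 -R 8 "$tool" "$@"; fi
```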

How to troubleshoot failed jobs

Occasionally tasks will fail. When tasks begin running they generate two files, jobname.ojobid (referred to as the .o file) and jobname.ejobid (referred to as the .e file). By default these are created in the folder from which fsl_sub was run, or in the location you specified on the command line (FEAT tasks will create a logs folder within the .(g)feat folder).

The .o file contains any text that the program writes to the console whilst running. For example:

fsl_sub ls my_folder

outputs the job ID '12345'. The task would generate a file ls.o12345 containing the folder listing for my_folder.

If your command produces a running commentary of its progress you could monitor this with the tail command:

tail -f command.o12345

This will continue displaying the contents of command.o12345, adding new content as it arrives, until you exit (type Ctrl-C).

The .e file contains any error messages the command produced. If you still need help then please contact the IT Team.
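When you know a job ID but not the job name, the log files can be found by their suffix. The sketch below prints the last lines of every .o and .e file for a job in a given folder; the function name show_job_logs is hypothetical, and it assumes the default jobname.ojobid / jobname.ejobid naming described above.

```shell
# Hypothetical helper: show the tail of the .o and .e files for one job.
# Arguments: <job id> [folder, default: current folder]
show_job_logs() {
    local job_id="$1"
    local dir="${2:-.}"
    local found=1
    local file
    for file in "$dir"/*.o"$job_id" "$dir"/*.e"$job_id"; do
        [ -e "$file" ] || continue   # skip patterns that matched nothing
        found=0
        echo "== $file =="
        tail -n 20 "$file"
    done
    # Non-zero exit status if no log files were found.
    return "$found"
}

# Example:
#   show_job_logs 12345
```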

Killing jobs

If, after submitting a job, you realise that there is a problem, you can kill the job with the command

scancel job_id

If the job is currently running there may be a short pause whilst the task is terminated.