Compute Cluster
Submitting jobs to the compute cluster
Submitting and monitoring tasks
For details on how to submit and monitor the progress of your jobs, see the BMRC cluster page.
The BMRC cluster has transitioned to the SLURM cluster software, and the fsl_sub module now submits jobs via SLURM.
SLURM is significantly different from Grid Engine; in particular, there are no RAM limits for jobs. We STRONGLY recommend that you specify RAM (with fsl_sub's -R option) to ensure efficient use of the cluster; without it, all jobs will default to requesting 15GB of RAM. This also means that the -S/--noramsplit option is meaningless.
fsl_sub's native options remain the same, but note that SLURM does not support parallel environments, so when requesting slots for multi-threaded jobs you can simply use -s <number>. If you provide a parallel environment name it will be discarded, so existing scripts should continue to work as is.
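For example, a minimal sketch of a submission requesting RAM and threads explicitly (the command name is purely illustrative):

# request 32GB of RAM and 4 threads for a multi-threaded job
fsl_sub -R 32 -s 4 ./my_multithreaded_tool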
Interactive tasks are started in a completely different manner - see BMRC's documentation.
BMRC's documentation: https://www.medsci.ox.ac.uk/divisional-services/support-services-1/bmrc/using-the-bmrc-cluster-with-slurm
GPU hardware information and usage is available at: https://www.medsci.ox.ac.uk/for-staff/resources/bmrc/gpu-resources-slurm
How to run tasks on the cluster
Auto-submitting software
SUBMITTING JOBS WITH FSL_SUB
There are several ways to select a queue:
- Use the -R (--jobram) and -T (--jobtime) options to fsl_sub to specify the maximum memory and run-time requirements for your job (in GB and minutes of wall time*) respectively. fsl_sub will then select the most appropriate queue for you.
GPU tasks can be requested using the --coprocessor options (see the Running GPU Tasks section).
- Specify a particular partition with the -q (--queue) option. For further information on the available queues and which to use when, see the queues section.
- The command you want to run on the queues must be in your path - this does NOT include the current folder. If it isn't, you must specify the path to the command; commands/scripts in the current folder must be prefixed with './', e.g. './script'.
- The BMRC cluster does not have a 'verylong' equivalent queue. See Long Running Tasks
- Jobs submitted to the BMRC cluster do NOT inherit the 'environment' of your login shell, e.g. environment variables such as FSLDIR are not copied over to your job. Load software configuration (such as FSL) from shell modules or use the '--export' option to fsl_sub to copy the variables to your job (see Passing Environment Variables to Queued Jobs).
- Wall Time: Unlike the FMRIB cluster (which uses CPU time) the BMRC cluster measures job run-time in real time, often called wall time (as in the time on a clock on the wall).
To assess the time necessary for your job to complete you can look at the run-times of similar previous jobs using the 'sacct' command (see Monitoring Tasks).
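For example, a minimal sketch using standard SLURM accounting fields to review your recent jobs:

# show elapsed wall time, peak memory and final state for your recent jobs
sacct -u $USER --format=JobID,JobName,Elapsed,MaxRSS,State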
Example Usage
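A minimal sketch of the two approaches to queue selection (the command names and values are illustrative):

# let fsl_sub pick the partition: request 16GB of RAM and 2 hours (120 minutes) of wall time
fsl_sub -R 16 -T 120 ./my_analysis.sh

# or target a specific partition directly
fsl_sub -q short ./my_analysis.sh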
FSL_SUB OPTIONS
To see a full list of the available options use:
fsl_sub --help
In addition to the list of options, this will also display a list of the partitions available for use, with descriptions of allowed run times and memory availability. For details on how to use these options see the Advanced Usage section.
Long running tasks
Unlike the Jalapeno cluster, the BMRC cluster does not offer 'infinite' partitions (equivalent to verylong.q, bigmem.q and cuda.q on Jalapeno). You must break your task up into shorter components, or regularly save state so the job can be restarted, and submit these parts (or resubmit the job continuing where it left off) using job holds to prevent a task running before the previous one completes.
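As a sketch of this approach (the script names are illustrative, and the hold is assumed to use fsl_sub's -j job hold option - check fsl_sub --help):

# submit the first stage, capturing the job id that fsl_sub prints
jid=$(fsl_sub -T 1440 ./stage1.sh)
# submit the second stage, held until the first has completed
fsl_sub -T 1440 -j $jid ./stage2.sh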
Available SLURM partitions
There are currently six partitions (often called queues on other platforms) configured in fsl_sub:
Queue | Duration | Max-memory | Purpose |
---|---|---|---|
short | 1.2 days | 385GB | batch and interactive jobs |
long | 10 days | 385GB | batch and interactive jobs |
epyc | 10 days | 515GB | batch and interactive jobs - more cores, but less RAM per core, allows for more threads than other nodes |
win | 10 days | 1.5TB | batch and interactive jobs - targets the WIN private hosts; prefer these for MATLAB etc. |
gpu_short | 4h | 750GB/gpu | GPU tasks |
gpu_long | 60h | 750GB/gpu | GPU tasks |
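A sketch of targeting particular partitions from the table above (the commands are illustrative, and the coprocessor name 'cuda' is an assumption - see the Running GPU Tasks section for the configured names):

# a long job on the WIN private hosts (preferred for MATLAB etc.)
fsl_sub -q win -R 64 -T 2880 ./my_matlab_wrapper.sh

# a short GPU task
fsl_sub -q gpu_short --coprocessor cuda ./my_gpu_tool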
How to pass environment variables to SLURM jobs
The BMRC cluster uses software libraries optimised for each hardware generation in use, with a shell module loading the correct library for the hardware the job runs on. This means that fsl_sub cannot pass your current shell environment variables to the job (as it would on Jalapeno).
You can request that a subset of the environment variables in your session are passed to your job with the --export option (pass it multiple times to export multiple variables). You can also use this option to set an environment variable in your job that is not already set in your session.
fsl_sub will automatically load any currently loaded shell modules in your job's shell, so where possible use shell modules to configure software rather than setting environment variables. For very complicated use cases or dynamic variable setting, create a script that sets up all your variables and then calls the software, and submit this script to the cluster.
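A minimal sketch of such a wrapper script (the module and variable names are hypothetical):

#!/bin/bash
# wrapper.sh - set up the environment, then run the software
module load MyPackage              # hypothetical module providing the software
export MY_SETTING=/path/to/config  # hypothetical variable the software needs
./my_tool                          # the actual command to run

which you would then submit with:

fsl_sub ./wrapper.sh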
There are two ways to use --export:
- --export VARIABLENAME [--export VARIABLENAME ...] - this copies the current setting of the environment variable into your job (specify --export multiple times for multiple variables)
- --export VARIABLENAME=VARIABLEVALUE - this sets the environment variable to the value after the '=' in the queued job only (not affecting your shell), so it is ideal where you need to specify a job-specific value
fsl_sub will automatically transfer some internal variables and may have been configured to include some additional useful ones, see the 'exports:' option in the output of
fsl_sub --show_config
for the list of default exports. Any --export passed on the command line will override these configured options. It is also possible to configure fsl_sub for your account to always copy over particular variables; see Configuring fsl_sub.
Option 2, where you provide the variable with a value, is particularly useful if you are scheduling many similar tasks and need to specify a different value for an environment variable for each run, for example SUBJECTS_DIR, which FreeSurfer uses to specify where your data sets reside.
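A sketch of both forms (the paths and script names are illustrative):

# copy the current value of FSLDIR into the job
fsl_sub --export FSLDIR ./run_analysis.sh

# give each job its own SUBJECTS_DIR without changing your shell
fsl_sub --export SUBJECTS_DIR=/data/studyA ./run_freesurfer.sh
fsl_sub --export SUBJECTS_DIR=/data/studyB ./run_freesurfer.sh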
How to monitor the progress of your submitted job
Please see BMRC documentation on job monitoring: Checking or deleting running jobs
More advanced techniques for submitting jobs, e.g. GPU, array and MATLAB tasks and full fsl_sub usage information
If your task comprises a complicated pipeline of interconnected tasks, there are several options for splitting it into dependent tasks or parallelising independent portions across many cluster nodes. Information on these techniques and other advanced options is in this section.
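As one example, independent commands can be parallelised as an array task by listing them in a file, one per line, and submitting that file (this assumes fsl_sub's -t/--array_task option - check fsl_sub --help; the script names are illustrative):

# commands.txt - one independent command per line
./process_subject.sh subj01
./process_subject.sh subj02
./process_subject.sh subj03

# submit every line as a sub-task of a single array job
fsl_sub -t commands.txt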
How to troubleshoot failed jobs
Occasionally tasks will fail. When tasks begin running they generate two files, jobname.ojobid (referred to as the .o file) and jobname.ejobid (referred to as the .e file), which by default are created in the folder from which fsl_sub was run (or wherever you specified on the command line - FEAT tasks will create a logs folder within the .(g)feat folder).
The .o file contains any text that the program writes to the console whilst running. For example:
fsl_sub ls my_folder
outputs the job id '12345'. The task would generate a file ls.o12345 containing the folder listing for my_folder.
If your command produces a running commentary of its progress you could monitor this with the tail command:
tail -f command.o12345
This will continue displaying the contents of command.o12345, adding new content as it arrives, until you exit (type CTRL-c).
The .e file contains any error messages the command produced. If you still need help then please contact the IT Team.
KILLING JOBS
If, after submitting a job, you realise that there is a problem, you can kill the job with the command
scancel job_id
If the job is currently running there may be a short pause whilst the task is terminated.
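If you no longer have the job id to hand, you can list your own jobs first with the standard SLURM squeue command, for example:

# list your queued and running jobs, then cancel the offending one by id
squeue -u $USER
scancel 12345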