
How to request a GPU for your job

Whilst GPU tasks can simply be submitted to the GPU queues directly, fsl_sub also provides helper options which can automatically select a GPU queue and the appropriate CUDA toolkit for you.

  • -c|--coprocessor <coprocessor name>: This selects the coprocessor with the given name (see fsl_sub --help for details of available coprocessors)
  • --coprocessor_multi <number>: This allows you to request multiple GPUs. On the FMRIB cluster you can select no more than two GPUs. You will automatically be given a two-slot openmp parallel environment
  • --coprocessor_class <class>: This would allow you to select which GPU hardware model you require, see fsl_sub --help for details
  • --coprocessor_class_strict: If a class is requested you will normally be allocated a card at least as capable as the model requested. By adding this option you ensure that you only get the GPU model you asked for
  • --coprocessor_toolkit <toolkit version>: This allows you to select the API toolkit your software needs. This will automatically make available the requested CUDA libraries where these haven't been compiled into the software
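The options above can be combined on a single submission. As a hedged sketch (the script name and toolkit version are placeholders; run fsl_sub --help to see the coprocessors, classes and toolkit versions actually configured on your cluster):

```shell
# Placeholder example: submit a script on two CUDA GPUs with a specific
# toolkit version. "my_gpu_script.sh" and "10.2" are illustrative only --
# substitute the values listed by `fsl_sub --help` for your cluster.
fsl_sub -c cuda \
    --coprocessor_multi 2 \
    --coprocessor_toolkit 10.2 \
    ./my_gpu_script.sh
```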
There are three CUDA coprocessor definitions configured for fsl_sub: cuda, cuda_all and cuda_ml.
  • cuda selects GPUs capable of high-performance double-precision workloads and would normally be used for queued tasks such as Eddy and BedpostX.
  • cuda_all selects all GPUs.
  • cuda_ml selects GPUs more suited to machine learning tasks. These typically have very poor double-precision performance, being optimised instead for single-, half- and quarter-precision workloads. Use these for ML inference and development; training may still perform better on the general-purpose GPUs depending on the task, so ask the developer of the software for advice.
GPU and queue aware tools will automatically select the cuda queue if they detect it.
Although the V100 and A100 cards reside in the cuda sub-queue they are suitable for machine learning tasks if there are capacity issues with the cuda_ml devices.
When indicating RAM requirements with -R you should consider the following quotation from the BMRC pages:
When submitting jobs, the total memory requirement for your job should be equal to the compute memory + GPU memory, i.e. you will need to request a sufficient number of slots to cover this total memory requirement.
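As an illustration of that calculation, the sketch below works out how many slots cover a combined compute-plus-GPU memory requirement. All three figures are assumptions for the example; check the BMRC pages for the real per-slot RAM on your cluster:

```shell
# Illustrative slot calculation -- every figure here is an assumption,
# not a documented cluster value.
RAM_PER_SLOT=16        # GB of RAM per slot (assumed)
COMPUTE_MEM=20         # GB your program itself needs (example)
GPU_MEM=32             # GB of memory on the requested GPU (example)

# Total requirement is compute memory + GPU memory, per the BMRC guidance.
TOTAL=$((COMPUTE_MEM + GPU_MEM))

# Round up: slots = ceil(TOTAL / RAM_PER_SLOT), done with integer maths.
SLOTS=$(( (TOTAL + RAM_PER_SLOT - 1) / RAM_PER_SLOT ))
echo "$SLOTS"          # prints 4 -- four 16 GB slots cover 52 GB
```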


Where your program requires interaction, we offer an interactive queue which can be used to get a terminal session on one of the cluster nodes.

To request a terminal session for GPU tasks, issue the following command on a rescomp head node:

srun -p gpu_short --pty bash

There may be a delay whilst the system finds a suitable host. Once one becomes available, if this is the first time you have logged into a particular node, you may be asked to accept the host key. Enter `yes` to accept it and you will then be presented with a terminal session. Your job will be subject to the same limits as a batch job, and if you expect your session to be interrupted then you should start screen or tmux on rescomp before requesting your interactive session.
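A minimal tmux workflow for this looks like the following (the session name is arbitrary):

```shell
# On the head node, start a named tmux session so the interactive job
# survives a dropped SSH connection. "gpuwork" is just an example name.
tmux new -s gpuwork

# Inside the tmux session, request the interactive GPU job as before:
srun -p gpu_short --pty bash

# Detach with Ctrl-b then d; reattach to the same session later with:
tmux attach -t gpuwork
```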

For example, if you wish to use a deep/machine learning optimised RTX8000 card use:

srun -p gpu_short --gres gpu:quadro-rtx8000:1 --pty bash

The RTX8000 nodes are deployed in pairs, so if your software is multi-GPU aware you can request two GPUs to double your available GPU compute power and GPU memory. You can do this by increasing the number of GPUs requested with:

srun -p gpu_short --gres gpu:quadro-rtx8000:2 --pty bash
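Once the two-GPU session starts, you can confirm that both cards are visible to your software:

```shell
# nvidia-smi -L lists each visible GPU on its own line; with the request
# above you should see two RTX8000 entries.
nvidia-smi -L
```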

The pairs of cards share a high-speed interconnect, so although this distant GPU memory is slower than local GPU memory, its performance is significantly higher than in a traditional multi-GPU setup.