Array Jobs

How to submit independent 'clone' tasks for running in parallel

An array task is a set of closely related tasks, none of which relies on the output of any other member of the set. For example, you might need to process each slice of a brain volume without needing to know or affect the content of any other slice (array sub-tasks cannot communicate with each other to advise of changes to data). Array tasks let you submit large numbers of discrete jobs and manage them under one job ID; each sub-task is allocated a unique task ID and can run in parallel if enough compute slots are available.

You can submit an array task with the -t/--array_task option or with the --array_native option:

TEXT FILE ARRAY TASKS

The -t (or --array_task) option needs the name of a text file that contains the array task commands, one per line. Sub-tasks will be generated from these lines, with the task ID being equivalent to the line number in the file (starting from 1). e.g.

fsl_sub -q short.q -t ./myparalleljobs
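
A task file such as ./myparalleljobs is simply one command per line. As a minimal sketch, it could be generated in bash as follows; the per-slice script ./process_slice.sh and the slice count of ten are hypothetical placeholders and not part of fsl_sub:

```shell
#!/bin/bash
# Build a task file with one command per line.
# ./process_slice.sh and n_slices are hypothetical placeholders.
task_file=./myparalleljobs
n_slices=10

: > "$task_file"                      # create or truncate the file
for slice in $(seq 1 "$n_slices"); do
    # The command on line N of the file becomes sub-task N
    echo "./process_slice.sh $slice" >> "$task_file"
done

# The file can then be submitted with:
#   fsl_sub -q short.q -t ./myparalleljobs
```

Because the task ID follows the line number, keeping the file in a predictable order makes it easy to map a failed sub-task back to its command.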

The array task has a parent job ID which can be used to control or delete all of the sub-tasks. Individual sub-tasks may be specified as job ID:sub-task ID, e.g. '12345:10' for sub-task 10 of job 12345.

NATIVE ARRAY TASKS

The --array_native option requires an argument n[-m[:s]] which specifies the array:

n provided alone will run the command n times in parallel
n-m will run the command once for each number in the range, with task IDs equal to the position in this range
n-m:s does the same, but with s specifying the increment in task ID.
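
To make the n[-m[:s]] notation concrete, the following bash sketch prints the task IDs a given specification would produce. It is an illustration only, not fsl_sub code: expand_array_spec is a hypothetical helper, and it assumes that a lone n yields IDs 1 to n and that n-m[:s] yields n, n+s, ... up to m, which is how Grid Engine and Slurm number array tasks.

```shell
#!/bin/bash
# Print the task IDs implied by an n[-m[:s]] specification, one per line.
# Assumption: a lone n means IDs 1..n; n-m[:s] means n, n+s, ... up to m.
expand_array_spec() {
    local spec=$1 range step start end
    step=1
    range=$spec
    if [[ $spec == *:* ]]; then
        step=${spec##*:}      # text after the colon is the step
        range=${spec%%:*}     # text before the colon is n or n-m
    fi
    if [[ $range == *-* ]]; then
        start=${range%%-*}
        end=${range##*-}
    else
        start=1
        end=$range
    fi
    seq "$start" "$step" "$end"
}

expand_array_spec 4        # four tasks
expand_array_spec 5-8      # one task per number from 5 to 8
expand_array_spec 1-10:3   # every third ID between 1 and 10
```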

The cluster software will set environment variables that the script/binary can use to determine which task it needs to carry out. For example, this might be used to represent the brain volume slice to process. As these environment variables differ between cluster software packages, fsl_sub sets several environment variables of its own, each holding the name of the environment variable the script can read to obtain the corresponding value (such as its task ID) from the cluster software:

Environment variable	...points to variable containing
FSLSUB_JOBID_VAR	job id
FSLSUB_ARRAYTASKID_VAR	task id
FSLSUB_ARRAYSTARTID_VAR	first task id
FSLSUB_ARRAYENDID_VAR	last task id
FSLSUB_ARRAYSTEPSIZE_VAR	step between task ids
FSLSUB_ARRAYCOUNT_VAR	number of tasks in array (not supported in Grid Engine)

To use these you need to look up the variable name and then read the value from the variable, for example in BASH use ${!FSLSUB_ARRAYTASKID_VAR} to get the value of the task id.
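
The two-step lookup can be sketched in bash as below. The helper get_task_id and the stand-in variable EXAMPLE_TASK_ID are hypothetical and exist only so the sketch runs outside a cluster; in a real job, fsl_sub and the cluster software set these variables for you.

```shell
#!/bin/bash
# Return this sub-task's ID by following the pointer variable set by fsl_sub.
get_task_id() {
    # FSLSUB_ARRAYTASKID_VAR holds the NAME of the scheduler's variable;
    # ${!name} (bash indirect expansion) reads the value stored under it.
    if [[ -z ${FSLSUB_ARRAYTASKID_VAR:-} ]]; then
        echo "not running as an fsl_sub array task" >&2
        return 1
    fi
    echo "${!FSLSUB_ARRAYTASKID_VAR}"
}

# Stand-in values so the sketch runs outside a cluster (illustration only).
FSLSUB_ARRAYTASKID_VAR=EXAMPLE_TASK_ID
EXAMPLE_TASK_ID=7

slice=$(get_task_id)
echo "processing slice $slice"
```

Reading the ID through the FSLSUB_*_VAR pointer rather than a scheduler-specific name keeps the same script usable whichever cluster software is behind fsl_sub.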

Important: The tasks must be truly independent, i.e. they must not write to the same file(s) or rely on calculations in other array jobs in this set; otherwise you may get unpredictable results (or sub-tasks may crash).
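
One common way to keep sub-tasks from writing to the same file is to build each output path from the task ID. The sketch below shows the idea; task_output_path, the slice_NNNN.out naming scheme and the temporary directory are all hypothetical choices, and the task ID is hard-coded here where a real job would obtain it from the cluster software.

```shell
#!/bin/bash
# Give every sub-task its own output file so no two tasks share a path.
# The directory layout and file naming scheme are hypothetical.
task_output_path() {
    local out_dir=$1 task_id=$2
    printf '%s/slice_%04d.out\n' "$out_dir" "$task_id"
}

# Stand-in task ID; a real job would read it from the cluster software.
task_id=3
out_dir=$(mktemp -d)
out_file=$(task_output_path "$out_dir" "$task_id")

# Each sub-task writes only to its own file.
echo "result for task $task_id" > "$out_file"
```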

LIMITING CONCURRENT ARRAY TASKS

Sometimes it may be necessary to limit the number of array sub-tasks running at any one time. You can do this by providing the -x (or --array_limit) option, which takes an integer, e.g.:

fsl_sub -T10 -x 10 -t ./myparalleljobs

This will limit sub-tasks to ten running at any one time.

ARRAY TASKS WITH THE SHELL RUNNER

If running without a cluster backend or when fsl_sub is called from within an already scheduled task, the shell backend is capable of running array tasks in parallel. If running as a cluster job, the shell plugin will run no more than the number of threads selected in your parallel environment (if one is specified, default is one task at a time).

If you are not running on a cluster then by default fsl_sub will use all of the CPUs on your system. You can control this either by using the -x (or --array_limit) option or by setting the environment variable FSLSUB_PARALLEL to the maximum number of array tasks to run at once. It is also possible to configure this in your own personal fsl_sub configuration file (see below).
