srun - Run parallel jobs
srun [OPTIONS...] executable [args...]
Run a parallel job on a cluster managed by Slurm. If necessary, srun will first create a resource allocation in which to run the parallel job. The following document describes the influence of various options on the allocation of CPUs to jobs and tasks: http://slurm.schedmd.com/cpu_management.html
--accel-bind=<options>
Control how tasks are bound to generic resources of type gpu,
mic and nic. Multiple options may be specified. Supported
options include:
g Bind each task to GPUs which are closest to the allocated
CPUs.
m Bind each task to MICs which are closest to the allocated
CPUs.
n Bind each task to NICs which are closest to the allocated
CPUs.
v Verbose mode. Log how tasks are bound to GPU and NIC
devices.
This option applies to job allocations.
-A, --account=<account>
Charge resources used by this job to specified account. The
account is an arbitrary string. The account name may be changed
after job submission using the scontrol command. This option
applies to job allocations.
--acctg-freq
Define the job accounting and profiling sampling intervals.
This can be used to override the JobAcctGatherFrequency
parameter in Slurm's configuration file, slurm.conf. The
supported format is as follows:
--acctg-freq=<datatype>=<interval>
where <datatype>=<interval> specifies the task
sampling interval for the jobacct_gather plugin or a
sampling interval for a profiling type by the
acct_gather_profile plugin. Multiple, comma-
separated <datatype>=<interval> intervals may be
specified. Supported datatypes are as follows:
task=<interval>
where <interval> is the task sampling
interval in seconds for the jobacct_gather
plugins and for task profiling by the
acct_gather_profile plugin. NOTE: This
frequency is used to monitor memory usage. If
memory limits are enforced, the highest
frequency a user can request is the one
configured in the slurm.conf file. The user
also cannot turn it off (=0).
energy=<interval>
where <interval> is the sampling interval in
seconds for energy profiling using the
acct_gather_energy plugin.
network=<interval>
where <interval> is the sampling interval in
seconds for InfiniBand profiling using the
acct_gather_infiniband plugin.
filesystem=<interval>
where <interval> is the sampling interval in
seconds for filesystem profiling using the
acct_gather_filesystem plugin.
The default value for the task sampling interval is 30.
The default value for all other intervals is 0. An
interval of 0 disables sampling of the specified type. If the
task sampling interval is 0, accounting information is collected
only at job termination (reducing Slurm interference with the
job).
Smaller (non-zero) values have a greater impact upon job
performance, but a value of 30 seconds is not likely to be
noticeable for applications having fewer than 10,000 tasks. This
option applies to job allocations.
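As a rough illustration of the format described above, the following Python sketch assembles an --acctg-freq value from a mapping of datatypes to intervals. The helper name build_acctg_freq is hypothetical and not part of Slurm. It only accepts the four datatypes listed above and does not check the limits configured in slurm.conf.

```python
# Hypothetical helper (not part of Slurm): builds the value for --acctg-freq.
VALID_DATATYPES = ("task", "energy", "network", "filesystem")

def build_acctg_freq(intervals):
    """Return an --acctg-freq option string from {datatype: seconds}."""
    parts = []
    for datatype, seconds in intervals.items():
        if datatype not in VALID_DATATYPES:
            raise ValueError(f"unsupported datatype: {datatype}")
        if seconds < 0:
            raise ValueError("interval must be >= 0 (0 disables sampling)")
        parts.append(f"{datatype}={seconds}")
    # Multiple <datatype>=<interval> pairs are comma-separated.
    return "--acctg-freq=" + ",".join(parts)

print(build_acctg_freq({"task": 30, "energy": 10}))
```

The resulting string can be passed to srun as a single argument.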
-B, --extra-node-info=<sockets[:cores[:threads]]>
Request a specific allocation of resources with details as to
the number and type of computational resources within a cluster:
number of sockets (or physical processors) per node, cores per
socket, and threads per core. The total amount of resources
being requested is the product of all of the terms. Each value
specified is considered a minimum. An asterisk (*) can be used
as a placeholder indicating that all available resources of that
type are to be utilized. As with nodes, the individual levels
can also be specified in separate options if desired:
--sockets-per-node=<sockets>
--cores-per-socket=<cores>
--threads-per-core=<threads>
If the task/affinity plugin is enabled, then specifying an
allocation in this manner also sets a default --cpu_bind option
of threads if the -B option specifies a thread count, otherwise
an option of cores if a core count is specified, otherwise an
option of sockets. If SelectType is configured to
select/cons_res, it must have a parameter of CR_Core,
CR_Core_Memory, CR_Socket, or CR_Socket_Memory for this option
to be honored. This option is not supported on BlueGene systems
(select/bluegene plugin is configured). If not specified,
scontrol show job will display 'ReqS:C:T=*:*:*'. This option
applies to job allocations.
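The statement that the total request is the product of all terms can be sketched in Python as follows. The function names are hypothetical and the code only parses the option value. It does not reproduce how Slurm itself evaluates the request, and a total is reported as unknown (None) whenever an asterisk placeholder is present.

```python
# Hypothetical helpers (not part of Slurm) for the -B value syntax.
def parse_extra_node_info(spec):
    """Split 'sockets[:cores[:threads]]' into a dict; '*' is stored as None."""
    names = ("sockets", "cores", "threads")
    fields = spec.split(":")
    if not 1 <= len(fields) <= 3:
        raise ValueError("expected sockets[:cores[:threads]]")
    return {name: (None if field == "*" else int(field))
            for name, field in zip(names, fields)}

def requested_total(parsed):
    """Product of all specified terms, or None if any term is '*'."""
    total = 1
    for value in parsed.values():
        if value is None:
            return None  # '*' means "all available", so no fixed product
        total *= value
    return total

parsed = parse_extra_node_info("2:4:2")
print(parsed, requested_total(parsed))
```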
--bb=<spec>
Burst buffer specification. The form of the specification is
system dependent. Also see --bbf. This option applies to job
allocations.
--bbf=<file_name>
Path of file containing burst buffer specification. The form of
the specification is system dependent. Also see --bb. This
option applies to job allocations.
--bcast[=<dest_path>]
Copy executable file to allocated compute nodes. If a file name
is specified, copy the executable to the specified destination
file path. If no path is specified, copy the file to a file
named "slurm_bcast_<job_id>.<step_id>" in the current working directory.
For example, "srun --bcast=/tmp/mine -N3 a.out" will copy the
file "a.out" from your current directory to the file "/tmp/mine"
on each of the three allocated compute nodes and execute that
file. This option applies to step allocations.
--begin=<time>
Defer initiation of this job until the specified time. It
accepts times of the form HH:MM:SS to run a job at a specific
time of day (seconds are optional). (If that time is already
past, the next day is assumed.) You may also specify midnight,
noon, fika (3 PM) or teatime (4 PM) and you can have a
time-of-day suffixed with AM or PM for running in the morning or
the evening. You can also say what day the job will be run, by
specifying a date of the form MMDDYY, MM/DD/YY or YYYY-MM-DD.
Combine date and time using the following format
YYYY-MM-DD[THH:MM[:SS]]. You can also give times like now +
count time-units, where the time-units can be seconds (default),
minutes, hours, days, or weeks and you can tell Slurm to run the
job today with the keyword today and to run the job tomorrow
with the keyword tomorrow. The value may be changed after job
submission using the scontrol command. For example:
--begin=16:00
--begin=now+1hour
--begin=now+60 (seconds by default)
--begin=2010-01-20T12:34:00
Notes on date/time specifications:
- Although the 'seconds' field of the HH:MM:SS time
specification is allowed by the code, note that the poll time of
the Slurm scheduler is not precise enough to guarantee dispatch
of the job on the exact second. The job will be eligible to
start on the next poll following the specified time. The exact
poll interval depends on the Slurm scheduler (e.g., 60 seconds
with the default sched/builtin).
- If no time (HH:MM:SS) is specified, the default is
(00:00:00).
- If a date is specified without a year (e.g., MM/DD) then the
current year is assumed, unless the combination of MM/DD and
HH:MM:SS has already passed for that year, in which case the
next year is used.
This option applies to job allocations.
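Two of the rules above (a time of day that has already passed rolls over to the next day, and "now+count" defaults to seconds) can be sketched in Python. This is only an illustration of the documented behavior, not Slurm's parser. The function names are hypothetical, and accepting singular unit names such as "hour" is inferred from the "now+1hour" example.

```python
import re
from datetime import datetime, timedelta

# Seconds per time unit; "seconds" is the default unit for "now+count".
UNIT_SECONDS = {"seconds": 1, "minutes": 60, "hours": 3600,
                "days": 86400, "weeks": 604800}

def resolve_time_of_day(spec, now):
    """Resolve 'HH:MM[:SS]' to the next matching datetime."""
    fields = [int(f) for f in spec.split(":")]
    if len(fields) == 2:
        fields.append(0)  # seconds are optional
    hour, minute, second = fields
    candidate = now.replace(hour=hour, minute=minute, second=second,
                            microsecond=0)
    if candidate < now:
        candidate += timedelta(days=1)  # already past: next day is assumed
    return candidate

def resolve_relative(spec, now):
    """Resolve 'now+<count>[unit]' to a datetime."""
    match = re.fullmatch(r"now\+(\d+)([a-z]*)", spec)
    if match is None:
        raise ValueError(f"not a relative time: {spec}")
    unit = match.group(2) or "seconds"
    if not unit.endswith("s"):
        unit += "s"  # accept singular forms such as "hour"
    if unit not in UNIT_SECONDS:
        raise ValueError(f"unknown time unit: {unit}")
    return now + timedelta(seconds=int(match.group(1)) * UNIT_SECONDS[unit])

now = datetime(2010, 1, 20, 17, 0, 0)
print(resolve_time_of_day("16:00", now))
print(resolve_relative("now+1hour", now))
```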
--checkpoint=<time>
Specifies the interval between creating checkpoints of the job
step. By default, the job step will have no checkpoints
created. Acceptable time formats include "minutes",
"minutes:seconds", "hours:minutes:seconds", "days-hours",
"days-hours:minutes" and "days-hours:minutes:seconds". This
option applies to job and step allocations.
--checkpoint-dir=<directory>
Specifies the directory into which the job or job step's
checkpoint should be written (used by the checkpoint/blcr and
checkpoint/xlch plugins only). The default value is the current
working directory. Checkpoint files will be of the form
"<job_id>.ckpt" for jobs and "<job_id>.<step_id>.ckpt" for job
steps. This option applies to job and step allocations.
--comment=<string>
An arbitrary comment. This option applies to job allocations.
--compress[=type]
Compress file before sending it to compute hosts. The optional
argument specifies the data compression library to be used.
Supported values are "lz4" (default) and "zlib". Some
compression libraries may be unavailable on some systems. For
use with the --bcast option. This option applies to step
allocations.
-C, --constraint=<list>
Nodes can have features assigned to them by the Slurm
administrator. Users can specify which of these features are
required by their job using the constraint option. Only nodes
having features matching the job constraints will be used to
satisfy the request. Multiple constraints may be specified with
AND, OR, matching OR, resource counts, etc. (some operators are
not supported on all system types). Supported constraint
options include:
Single Name
Only nodes which have the specified feature will be used.
For example, --constraint="intel"
Node Count
A request can specify the number of nodes needed with
some feature by appending an asterisk and count after the
feature name. For example "--nodes=16
--constraint=graphics*4 ..." indicates that the job
requires 16 nodes and that at least four of those nodes
must have the feature "graphics."
AND Only nodes with all of the specified features will be
used. The ampersand is used for an AND operator. For
example, --constraint="intel&gpu"
OR Only nodes with at least one of the specified features
will be used. The vertical bar is used for an OR
operator. For example, --constraint="intel|amd"
Matching OR
If only one of a set of possible options should be used
for all allocated nodes, then use the OR operator and
enclose the options within square brackets. For example:
"--constraint=[rack1|rack2|rack3|rack4]" might be used to
specify that all nodes must be allocated on a single rack
of the cluster, but any of those four racks can be used.
Multiple Counts
Specific counts of multiple resources may be specified by
using the AND operator and enclosing the options within
square brackets. For example:
"--constraint=[rack1*2&rack2*4]" might be used to specify
that two nodes must be allocated from nodes with the
feature of "rack1" and four nodes must be allocated from
nodes with the feature "rack2".
WARNING: When srun is executed from within salloc or sbatch,
the constraint value can only contain a single feature name.
None of the other operators are currently supported for job
steps.
This option applies to job and step allocations.
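The constraint operators above are plain string syntax, so they can be composed mechanically. The following Python sketch uses hypothetical helper names and only builds the expressions shown in the examples. It does not check whether a given system supports each operator.

```python
# Hypothetical helpers (not part of Slurm) that build --constraint values.
def all_of(features):
    """AND: every listed feature is required."""
    return "&".join(features)

def any_of(features):
    """OR: at least one listed feature is required."""
    return "|".join(features)

def matching_or(features):
    """Matching OR: one feature from the set must hold for all nodes."""
    return "[" + "|".join(features) + "]"

def multiple_counts(counts):
    """Multiple counts: {feature: node_count} joined with AND in brackets."""
    return "[" + "&".join(f"{name}*{n}" for name, n in counts.items()) + "]"

print("--constraint=" + all_of(["intel", "gpu"]))
print("--constraint=" + matching_or(["rack1", "rack2", "rack3", "rack4"]))
print("--constraint=" + multiple_counts({"rack1": 2, "rack2": 4}))
```

When such a value is typed in a shell, it should be quoted, since "&", "|", "*" and square brackets are shell metacharacters.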
--contiguous
If set, then the allocated nodes must form a contiguous set.
Not honored with the topology/tree or topology/3d_torus plugins,
both of which can modify the node ordering. This option applies
to job allocations.
--cores-per-socket=<cores>
Restrict node selection to nodes with at least the specified
number of cores per socket. See additional information under -B
option above when task/affinity plugin is enabled. This option
applies to job allocations.
--cpu_bind=[{quiet,verbose},]type
Bind tasks to CPUs. Used only when the task/affinity or
task/cgroup plugin is enabled. NOTE: To have Slurm always
report on the selected CPU binding for all commands executed in
a shell, you can enable verbose mode by setting the
SLURM_CPU_BIND environment variable value to "verbose".
The following informational environment variables are set when
--cpu_bind is in use:
SLURM_CPU_BIND_VERBOSE
SLURM_CPU_BIND_TYPE
SLURM_CPU_BIND_LIST
See the ENVIRONMENT VARIABLES section for a more detailed
description of the individual SLURM_CPU_BIND variables. These
variables are available only if the task/affinity plugin is
configured.
When using --cpus-per-task to run multithreaded tasks, be aware
that CPU binding is inherited from the parent of the process.
This means that the multithreaded task should either specify or
clear the CPU binding itself to avoid having all threads of the
multithreaded task use the same mask/CPU as the parent.
Alternatively, fat masks (masks which specify more than one
allowed CPU) could be used for the tasks in order to provide
multiple CPUs for the multithreaded tasks.
By default, a job step has access to every CPU allocated to the
job. To ensure that distinct CPUs are allocated to each job
step, use the --exclusive option.
Note that a job step can be allocated different numbers of CPUs
on each node or be allocated CPUs not starting at location zero.
Therefore one of the options which automatically generate the
task binding is recommended. Explicitly specified masks or
bindings are only honored when the job step has been allocated
every available CPU on the node.
Binding a task to a NUMA locality domain means to bind the task
to the set of CPUs that belong to the NUMA locality domain or
"NUMA node". If NUMA locality domain options are used on
systems with no NUMA support, then each socket is considered a
locality domain.
Auto Binding
Applies only when task/affinity is enabled. If the job
step allocation includes an allocation with a number of
sockets, cores, or threads equal to the number of tasks
times cpus-per-task, then the tasks will by default be
bound to the appropriate resources (auto binding).
Disable this mode of operation by explicitly setting
"--cpu_bind=none". Use
TaskPluginParam=autobind=[threads|cores|sockets] to set a
default cpu binding in case "auto binding" doesn't find a
match.
Supported options include:
q[uiet]
Quietly bind before task runs (default)
v[erbose]
Verbosely report binding before task runs
no[ne] Do not bind tasks to CPUs (default unless auto
binding is applied)
rank Automatically bind by task rank. The lowest
numbered task on each node is bound to socket (or
core or thread) zero, etc. Not supported unless
the entire node is allocated to the job.
map_cpu:<list>
Bind by mapping CPU IDs to tasks as specified
where <list> is <cpuid1>,<cpuid2>,...<cpuidN>.
The mapping is specified for a node and identical
mapping is applied to the tasks on every node
(i.e. the lowest task ID on each node is mapped to
the first CPU ID specified in the list, etc.).
CPU IDs are interpreted as decimal values unless
they are preceded with '0x' in which case they are
interpreted as hexadecimal values. Not supported
unless the entire node is allocated to the job.
mask_cpu:<list>
Bind by setting CPU masks on tasks as specified
where <list> is <mask1>,<mask2>,...<maskN>. The
mapping is specified for a node and identical
mapping is applied to the tasks on every node
(i.e. the lowest task ID on each node is mapped to
the first mask specified in the list, etc.). CPU
masks are always interpreted as hexadecimal values
but can be preceded with an optional '0x'. Not
supported unless the entire node is allocated to
the job.
rank_ldom
Bind to a NUMA locality domain by rank. Not
supported unless the entire node is allocated to
the job.
map_ldom:<list>
Bind by mapping NUMA locality domain IDs to tasks
as specified where <list> is
<ldom1>,<ldom2>,...<ldomN>. The locality domain
IDs are interpreted as decimal values unless they
are preceded with '0x' in which case they are
interpreted as hexadecimal values. Not supported
unless the entire node is allocated to the job.
mask_ldom:<list>
Bind by setting NUMA locality domain masks on
tasks as specified where <list> is
<mask1>,<mask2>,...<maskN>. NUMA locality domain
masks are always interpreted as hexadecimal values
but can be preceded with an optional '0x'. Not
supported unless the entire node is allocated to
the job.
sockets
Automatically generate masks binding tasks to
sockets. Only the CPUs on the socket which have
been allocated to the job will be used. If the
number of tasks differs from the number of
allocated sockets this can result in sub-optimal
binding.
cores Automatically generate masks binding tasks to
cores. If the number of tasks differs from the
number of allocated cores this can result in
sub-optimal binding.
threads
Automatically generate masks binding tasks to
threads. If the number of tasks differs from the
number of allocated threads this can result in
sub-optimal binding.
ldoms Automatically generate masks binding tasks to NUMA
locality domains. If the number of tasks differs
from the number of allocated locality domains this
can result in sub-optimal binding.
boards Automatically generate masks binding tasks to
boards. If the number of tasks differs from the
number of allocated boards this can result in
sub-optimal binding. This option is supported by
the task/cgroup plugin only.
help Show help message for cpu_bind
This option applies to job and step allocations.
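To make the mask_cpu format and the notion of "fat masks" more concrete, here is a small Python sketch that derives hexadecimal masks from lists of CPU IDs. It assumes the conventional CPU affinity encoding in which bit N of the mask stands for CPU ID N. The function names are hypothetical and not part of Slurm.

```python
# Hypothetical helpers (not part of Slurm). Assumption: bit N of a mask
# corresponds to CPU ID N, as in conventional CPU affinity masks.
def cpu_mask(cpu_ids):
    """Return a hexadecimal mask with one bit set per CPU ID."""
    mask = 0
    for cpu in cpu_ids:
        mask |= 1 << cpu
    return hex(mask)

def mask_cpu_option(cpus_per_task):
    """Build a --cpu_bind=mask_cpu:<list> value, one mask per task on a node."""
    masks = [cpu_mask(cpus) for cpus in cpus_per_task]
    return "--cpu_bind=mask_cpu:" + ",".join(masks)

# Two tasks per node, each with a "fat mask" covering two CPUs.
print(mask_cpu_option([[0, 1], [2, 3]]))
```

As noted above, explicitly specified masks are only honored when the job step has been allocated every available CPU on the node.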
--cpu-freq=<p1[-p2[:p3]]>
Request that the job step initiated by this srun command be run
at some requested frequency if possible, on the CPUs selected
for the step on the compute node(s).
p1 can be [#### | low | medium | high | highm1] which will set
the frequency scaling_speed to the corresponding value, and set
the frequency scaling_governor to UserSpace. See below for
definition of the values.
p1 can be [Conservative | OnDemand | Performance | PowerSave]
which will set the scaling_governor to the corresponding value.
The governor has to be in the list set by the slurm.conf option
CpuFreqGovernors.
When p2 is present, p1 will be the minimum scaling frequency and
p2 will be the maximum scaling frequency.
p2 can be [#### | medium | high | highm1]. p2 must be greater
than p1.
p3 can be [Conservative | OnDemand | Performance | PowerSave |
UserSpace] which will set the governor to the corresponding
value.
If p3 is UserSpace, the frequency scaling_speed will be set by a
power or energy aware scheduling strategy to a value between p1
and p2 that lets the job run within the site's power goal. The
job may be delayed if p1 is higher than a frequency that allows
the job to run within the goal.
If the current frequency is < min, it will be set to min.
Likewise, if the current frequency is > max, it will be set to
max.
Acceptable values at present include:
#### frequency in kilohertz
Low the lowest available frequency
High the highest available frequency
HighM1 (high minus one) will select the next highest
available frequency
Medium attempts to set a frequency in the middle of the
available range
Conservative attempts to use the Conservative CPU governor
OnDemand attempts to use the OnDemand CPU governor (the
default value)
Performance attempts to use the Performance CPU governor
PowerSave attempts to use the PowerSave CPU governor
UserSpace attempts to use the UserSpace CPU governor
The following informational environment variable is set in the job
step when the --cpu-freq option is requested:
SLURM_CPU_FREQ_REQ
This environment variable can also be used to supply the value
for the CPU frequency request if it is set when the 'srun'
command is issued. The --cpu-freq on the command line will
override the environment variable value. The form of the
environment variable value is the same as the command line. See the
ENVIRONMENT VARIABLES section for a description of the
SLURM_CPU_FREQ_REQ variable.
NOTE: This parameter is treated as a request, not a requirement.
If the job step's node does not support setting the CPU
frequency, or the requested value is outside the bounds of the
legal frequencies, an error is logged, but the job step is
allowed to continue.
NOTE: Setting the frequency for just the CPUs of the job step
implies that the tasks are confined to those CPUs. If task
confinement (i.e., TaskPlugin=task/affinity or
TaskPlugin=task/cgroup with the "ConstrainCores" option) is not
configured, this parameter is ignored.
NOTE: When the step completes, the frequency and governor of
each selected CPU is reset to the previous values.
NOTE: Submitting jobs with the --cpu-freq option while linuxproc is
the ProctrackType can cause jobs to run too quickly, finishing
before accounting is able to poll for job information. As a result,
not all of the accounting information will be present.
This option applies to job and step allocations.
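The p1[-p2[:p3]] syntax can be split into its parts as sketched below in Python. The function name parse_cpu_freq is hypothetical. The sketch only separates the fields and does not validate frequencies or governor names against what a node supports.

```python
# Hypothetical helper (not part of Slurm) for the --cpu-freq value syntax.
def parse_cpu_freq(spec):
    """Split 'p1[-p2[:p3]]' into a dict; absent parts are None."""
    freq_range, colon, p3 = spec.partition(":")
    p1, dash, p2 = freq_range.partition("-")
    if colon and not dash:
        raise ValueError("p3 is only allowed after p2 (p1[-p2[:p3]])")
    if not p1 or (dash and not p2) or (colon and not p3):
        raise ValueError(f"malformed value: {spec}")
    return {"p1": p1, "p2": p2 or None, "p3": p3 or None}

print(parse_cpu_freq("low-high:UserSpace"))
print(parse_cpu_freq("Performance"))
```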
-c, --cpus-per-task=<ncpus>
Request that ncpus be allocated per process. This may be useful
if the job is multithreaded and requires more than one CPU per
task for optimal performance. The default is one CPU per
process. If -c is specified without -n, as many tasks will be
allocated per node as possible while satisfying the -c
restriction. For instance on a cluster with 8 CPUs per node, a
job request for 4 nodes and 3 CPUs per task may be allocated 3
or 6 CPUs per node (1 or 2 tasks per node) depending upon
resource consumption by other jobs. Such a job may be unable to
execute more than a total of 4 tasks. This option may also be
useful to spawn tasks without allocating resources to the job
step from the job's allocation when running multiple job steps
with the --exclusive option.
WARNING: There are configurations and options interpreted
differently by job and job step requests which can result in
inconsistencies for this option. For example srun -c2
--threads-per-core=1 prog may allocate two cores for the job,
but if each of those cores contains two threads, the job
allocation will include four CPUs. The job step allocation will
then launch two threads per CPU for a total of two tasks.
WARNING: When srun is executed from within salloc or sbatch,
there are configurations and options which can result in
inconsistent allocations when -c has a value greater than -c on
salloc or sbatch.
This option applies to job allocations.
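The arithmetic in the 8-CPU example above can be reproduced with a short Python sketch. The function names are hypothetical. The code only shows why 3 or 6 CPUs per node are the possible allocations and does not model resource consumption by other jobs.

```python
# Hypothetical helpers (not part of Slurm) illustrating the -c arithmetic.
def max_tasks_per_node(cpus_per_node, cpus_per_task):
    """Whole tasks that fit on one node; a task cannot span nodes."""
    return cpus_per_node // cpus_per_task

def possible_cpu_allocations(cpus_per_node, cpus_per_task):
    """CPU counts per node that hold a whole number of tasks (at least one)."""
    limit = max_tasks_per_node(cpus_per_node, cpus_per_task)
    return [tasks * cpus_per_task for tasks in range(1, limit + 1)]

# 8 CPUs per node, 3 CPUs per task: 1 or 2 tasks fit, using 3 or 6 CPUs.
print(max_tasks_per_node(8, 3))
print(possible_cpu_allocations(8, 3))
```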
--deadline=<OPT>
Remove the job if no ending is possible before this deadline
(start > (deadline - time[-min])). Default is no deadline.
Valid time formats are:
HH:MM[:SS] [AM|PM]
MMDD[YY] or MM/DD[/YY] or MM.DD[.YY]
MM/DD[/YY]-HH:MM[:SS]
YYYY-MM-DD[THH:MM[:SS]]
This option applies to job allocations.
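The removal condition start > (deadline - time[-min]) can be written out as a small Python check. This sketch assumes that "time[-min]" refers to the job's time limit, or to its minimum time limit when one is given. That reading, and the function name, are assumptions rather than a description of Slurm's implementation.

```python
from datetime import datetime, timedelta

# Hypothetical helper (not part of Slurm). Assumption: the minimum time
# limit is used in place of the time limit when it is provided.
def misses_deadline(start, deadline, time_limit, time_min=None):
    """True if start > deadline - time[-min], i.e. no ending is possible."""
    required = time_min if time_min is not None else time_limit
    return start > deadline - required

deadline = datetime(2010, 1, 20, 12, 0, 0)
start = datetime(2010, 1, 20, 11, 30, 0)
print(misses_deadline(start, deadline, timedelta(hours=1)))
print(misses_deadline(start, deadline, timedelta(hours=1),
                      time_min=timedelta(minutes=20)))
```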
-d, --dependency=<dependency_list>
Defer the start of this job until the specified dependencies
have been satisfied. This option does not apply to job
steps (executions of srun within an existing salloc or sbatch
allocation) only to job allocations. <dependency_list> is of
the form <type:job_id[:job_id][,type:job_id[:job_id]]> or
<type:job_id[:job_id][?type:job_id[:job_id]]>. All dependencies
must be satisfied if the "," separator is used. Any dependency
may be satisfied if the "?" separator is used. Many jobs can
share the same dependency and these jobs may even belong to
different users. The value may be changed after job submission
using the scontrol command. Once a job dependency fails due to
the termination state of a preceding job, the dependent job will
never be run, even if the preceding job is requeued and has a
different termination state in a subsequent execution. This
option applies to job allocations.
after:job_id[:jobid...]
This job can begin execution after the specified jobs
have begun execution.
afterany:job_id[:jobid...]
This job can begin execution after the specified jobs
have terminated.
aftercorr:job_id[:jobid...]
A task of this job array can begin execution after the
corresponding task ID in the specified job has completed
successfully (ran to completion with an exit code of
zero).
afternotok:job_id[:jobid...]
This job can begin execution after the specified jobs
have terminated in some failed state (non-zero exit code,
node failure, timed out, etc.).
afterok:job_id[:jobid...]
This job can begin execution after the specified jobs
have successfully executed (ran to completion with an
exit code of zero).
expand:job_id
Resources allocated to this job should be used to expand
the specified job. The job to expand must share the same
QOS (Quality of Service) and partition. Gang scheduling
of resources in the partition is also not supported.
singleton
This job can begin execution after any previously
launched jobs sharing the same job name and user have
terminated.
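A dependency list in the form described above can be assembled as in the following Python sketch. The helper name build_dependency is hypothetical. It joins terms with "," when all dependencies must be satisfied and with "?" when any one suffices, and it does not verify that the dependency types or job IDs are valid.

```python
# Hypothetical helper (not part of Slurm) that builds a --dependency value.
def build_dependency(deps, any_one=False):
    """deps is a list of (type, [job_ids]); types such as 'singleton' take no IDs."""
    separator = "?" if any_one else ","  # "?" = any may be satisfied, "," = all
    terms = []
    for dep_type, job_ids in deps:
        term = dep_type
        for job_id in job_ids:
            term += f":{job_id}"
        terms.append(term)
    return "--dependency=" + separator.join(terms)

print(build_dependency([("afterok", [123, 124]), ("afterany", [200])]))
print(build_dependency([("afterok", [123]), ("afternotok", [124])], any_one=True))
```

The job IDs used here are placeholders.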
-D, --chdir=<path>
Have the remote processes do a chdir to path before beginning
execution. The default is to chdir to the current working
directory of the srun process. The path can be specified as full
path or relative path to the directory where the command is
executed. This option applies to job allocations.
-e, --error=<mode>
Specify how stderr is to be redirected. By default in
interactive mode, srun redirects stderr to the same file as
stdout, if one is specified. The --error option is provided to
allow stdout and stderr to be redirected to different locations.
See IO Redirection below for more options. If the specified
file already exists, it will be overwritten. This option applies
to job and step allocations.
-E, --preserve-env
Pass the current values of environment variables SLURM_NNODES
and SLURM_NTASKS through to the executable, rather than
computing them from commandline parameters. This option applies
to job allocations.
--epilog=<executable>
srun will run executable just after the job step completes. The
command line arguments for executable will be the command and
arguments of the job step. If executable is "none", then no
srun epilog will be run. This parameter overrides the SrunEpilog
parameter in slurm.conf. This parameter is completely
independent from the Epilog parameter in slurm.conf. This option
applies to job allocations.
--exclusive[=user|mcs]
This option applies to job and job step allocations, and has two
slightly different meanings for each one. When used to initiate
a job, the job allocation cannot share nodes with other running
jobs (or just other users with the "=user" option or "=mcs"
option). The default shared/exclusive behavior depends on
system configuration and the partition's OverSubscribe option
takes precedence over the job's option.
This option can also be used when initiating more than one job
step within an existing resource allocation, where you want
separate processors to be dedicated to each job step. If
sufficient processors are not available to initiate the job
step, it will be deferred. This can be thought of as providing a
mechanism for resource management to the job within its
allocation.
The exclusive allocation of CPUs only applies to job steps
explicitly invoked with the --exclusive option. For example, a
job might be allocated one node with four CPUs and a remote
shell invoked on the allocated node. If that shell is not
invoked with the --exclusive option, then it may create a job
step with four tasks using the --exclusive option and not
conflict with the remote shell's resource allocation. Use the
--exclusive option to invoke every job step to ensure distinct
resources for each step.
Note that all CPUs allocated to a job are available to each job
step unless the --exclusive option is used plus task affinity is
configured. Since resource management is provided by processor,
the --ntasks option must be specified, but the following options
should NOT be specified: --relative, --distribution=arbitrary.
See EXAMPLE below.
--export=<environment variables | NONE>
Identify which environment variables are propagated to the
launched application. Multiple environment variable names
should be comma separated. Environment variable names may be
specified to propagate the current value of those variables
(e.g. "--export=EDITOR") or specific values for the variables
may be exported (e.g. "--export=EDITOR=/bin/vi") in addition to
the environment variables that would otherwise be set. By
default all environment variables are propagated. With
"--export=NONE" no environment variables will be propagated
unless explicitly listed (e.g.,
"--export=NONE,PATH=/bin,SHELL=/bin/bash"). Regardless of this
setting, the appropriate "SLURM_*" task environment variables
are always exported to the environment. This option applies to
job allocations.
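The three forms of the --export value described above (names to propagate, explicit assignments, and the NONE keyword) can be combined as in this Python sketch. The function name build_export is hypothetical, and the sketch only builds the option string.

```python
# Hypothetical helper (not part of Slurm) that builds an --export value.
def build_export(propagate=(), assign=None, none=False):
    """Combine NONE, variable names, and NAME=value pairs, comma separated."""
    items = ["NONE"] if none else []
    items.extend(propagate)                  # propagate current values
    items.extend(f"{name}={value}"           # export specific values
                 for name, value in (assign or {}).items())
    if not items:
        raise ValueError("nothing to export")
    return "--export=" + ",".join(items)

print(build_export(propagate=["EDITOR"]))
print(build_export(assign={"PATH": "/bin", "SHELL": "/bin/bash"}, none=True))
```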
--gid=<group>
If srun is run as root, and the --gid option is used, submit the
job with group's group access permissions. group may be the
group name or the numerical group ID. This option applies to job
allocations.
--gres=<list>
Specifies a comma delimited list of generic consumable
resources. The format of each entry on the list is
"name[[:type]:count]". The name is that of the consumable
resource. The count is the number of those resources with a
default value of 1. The specified resources will be allocated
to the job on each node. The available generic consumable
resources are configurable by the system administrator. A list
of available generic consumable resources will be printed and
the command will exit if the option argument is "help".
Examples of use include "--gres=gpu:2,mic:1",
"--gres=gpu:kepler:2", and "--gres=help". NOTE: This option
applies to job and step allocations. By default, a job step is
allocated all of the generic resources that have been allocated to
the job. To change the behavior so that each job step is
allocated no generic resources, explicitly set the value of
--gres to specify zero counts for each generic resource OR set
"--gres=none" OR set the SLURM_STEP_GRES environment variable to
"none".
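The entry format "name[[:type]:count]" can be decoded as in the Python sketch below. The function name parse_gres is hypothetical. The sketch follows the stated format literally (a two-field entry is read as name and count) and does not handle the special values "help" and "none".

```python
# Hypothetical helper (not part of Slurm) for the --gres entry format
# "name[[:type]:count]"; the count defaults to 1.
def parse_gres(spec):
    """Return a list of (name, type_or_None, count) tuples."""
    entries = []
    for item in spec.split(","):
        fields = item.split(":")
        if len(fields) == 1:
            name, gres_type, count = fields[0], None, 1
        elif len(fields) == 2:
            name, gres_type, count = fields[0], None, int(fields[1])
        elif len(fields) == 3:
            name, gres_type, count = fields[0], fields[1], int(fields[2])
        else:
            raise ValueError(f"malformed gres entry: {item}")
        entries.append((name, gres_type, count))
    return entries

print(parse_gres("gpu:kepler:2"))
print(parse_gres("gpu:2"))
```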
--gres-flags=enforce-binding
If set, the only CPUs available to the job will be those bound
to the selected GRES (i.e. the CPUs identified in the gres.conf
file will be strictly enforced rather than advisory). This
option may result in delayed initiation of a job. For example a
job requiring two GPUs and one CPU will be delayed until both
GPUs on a single socket are available rather than using GPUs
bound to separate sockets, however the application performance
may be improved due to improved communication speed. Requires
the node to be configured with more than one socket and resource
filtering will be performed on a per-socket basis. This option
applies to job allocations.
-H, --hold
Specify the job is to be submitted in a held state (priority of
zero). A held job can now be released using scontrol to reset
its priority (e.g. "scontrol release <job_id>"). This option
applies to job allocations.
-h, --help
Display help information and exit.
--hint=<type>
Bind tasks according to application hints.
compute_bound
Select settings for compute bound applications: use all
cores in each socket, one thread per core.
memory_bound
Select settings for memory bound applications: use only
one core in each socket, one thread per core.
[no]multithread
[don't] use extra threads with in-core multi-threading
which can benefit communication intensive applications.
Only supported with the task/affinity plugin.
help show this help message
This option applies to job allocations.
-I, --immediate[=<seconds>]
Exit if resources are not available within the time period
specified. If no argument is given, resources must be available
immediately for the request to succeed. By default, --immediate
is off, and the command will block until resources become
available. Since this option's argument is optional, for proper
parsing the single letter option must be followed immediately
with the value and not include a space between them. For example
"-I60" and not "-I 60". This option applies to job and step
allocations.
-i, --input=<mode>
Specify how stdin is to be redirected. By default, srun redirects
stdin from the terminal to all tasks. See IO Redirection below for
more options. For OS X, the poll() function does not support
stdin, so input from a terminal is not possible. This option
applies to job and step allocations.
-J, --job-name=<jobname>
Specify a name for the job. The specified name will appear along
with the job id number when querying running jobs on the system.
The default is the supplied executable program's name. NOTE:
This information may be written to the slurm_jobacct.log file.
This file is space delimited so if a space is used in the
job name it will cause problems in properly displaying the
contents of the slurm_jobacct.log file when the sacct command is
used. This option applies to job and step allocations.
--jobid=<jobid>
Initiate a job step under an already allocated job with job id
jobid. Using this option will cause srun to behave exactly as if
the SLURM_JOB_ID environment variable was set. This option
applies to job and step allocations. NOTE: For job allocations,
this is only valid for users root and SlurmUser.
-K, --kill-on-bad-exit[=0|1]
Controls whether or not to terminate a job if any task exits
with a non-zero exit code. If this option is not specified, the
default action will be based upon the Slurm configuration
parameter of KillOnBadExit. If this option is specified, it will
take precedence over KillOnBadExit. An option argument of zero
will not terminate the job. A non-zero argument or no argument
will terminate the job. Note: This option takes precedence over
the -W, --wait option to terminate the job immediately if a task
exits with a non-zero exit code. Since this option's argument
is optional, for proper parsing the single letter option must be
followed immediately with the value and not include a space
between them. For example "-K1" and not "-K 1". This option
applies to job allocations.
-k, --no-kill
Do not automatically terminate a job if one of the nodes it has
been allocated fails. This option applies to job and step
allocations. The job will assume all responsibilities for
fault-tolerance. Tasks launched using this option will not be
considered terminated (e.g. -K, --kill-on-bad-exit and -W,
--wait options will have no effect upon the job step). The
active job step (MPI job) will likely suffer a fatal error, but
subsequent job steps may be run if this option is specified.
The default action is to terminate the job upon node failure.
--launch-cmd
Print external launch command instead of running job normally
through Slurm. This option is only valid if using something
other than the launch/slurm plugin. This option applies to step
allocations.
--launcher-opts=<options>
Options for the external launcher if using something other than
the launch/slurm plugin. This option applies to step
allocations.
-l, --label
Prepend task number to lines of stdout/err. The --label option
will prepend lines of output with the remote task id. This
option applies to step allocations.
-L, --licenses=<license>
Specification of licenses (or other resources available on all
nodes of the cluster) which must be allocated to this job.
License names can be followed by a colon and count (the default
count is one). Multiple license names should be comma separated
(e.g. "--licenses=foo:4,bar"). This option applies to job
allocations.
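As an illustration of the license syntax described above, the following is a minimal Python sketch (not part of Slurm) that splits a --licenses string into names and counts. The function name parse_licenses is hypothetical; the only behavior taken from the text is the comma separator, the optional ":count" suffix, and the default count of one.

```python
def parse_licenses(spec):
    """Parse a --licenses style string such as "foo:4,bar" into a dict."""
    licenses = {}
    for item in spec.split(","):
        name, sep, count = item.strip().partition(":")
        # A license name without an explicit count defaults to one.
        licenses[name] = int(count) if sep else 1
    return licenses


print(parse_licenses("foo:4,bar"))
```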
-m, --distribution=
*|block|cyclic|arbitrary|plane=<options>
[:*|block|cyclic|fcyclic[:*|block|
cyclic|fcyclic]][,Pack|NoPack]
Specify alternate distribution methods for remote processes.
This option controls the distribution of tasks to the nodes on
which resources have been allocated, and the distribution of
those resources to tasks for binding (task affinity). The first
distribution method (before the first ":") controls the
distribution of tasks to nodes. The second distribution method
(after the first ":") controls the distribution of allocated
CPUs across sockets for binding to tasks. The third distribution
method (after the second ":") controls the distribution of
allocated CPUs across cores for binding to tasks. The second
and third distributions apply only if task affinity is enabled.
The third distribution is supported only if the task/cgroup
plugin is configured. The default value for each distribution
type is specified by *.
Note that with select/cons_res, the number of CPUs allocated on
each socket and node may be different. Refer to
http://slurm.schedmd.com/mc_support.html for more information on
resource allocation, distribution of tasks to nodes, and binding
of tasks to CPUs.
First distribution method (distribution of tasks across nodes):
* Use the default method for distributing tasks to nodes
(block).
block The block distribution method will distribute tasks to a
node such that consecutive tasks share a node. For
example, consider an allocation of three nodes each with
two cpus. A four-task block distribution request will
distribute those tasks to the nodes with tasks one and
two on the first node, task three on the second node, and
task four on the third node. Block distribution is the
default behavior if the number of tasks exceeds the
number of allocated nodes.
cyclic The cyclic distribution method will distribute tasks to a
node such that consecutive tasks are distributed over
consecutive nodes (in a round-robin fashion). For
example, consider an allocation of three nodes each with
two cpus. A four-task cyclic distribution request will
distribute those tasks to the nodes with tasks one and
four on the first node, task two on the second node, and
task three on the third node. Note that when SelectType
is select/cons_res, the same number of CPUs may not be
allocated on each node. Task distribution will be
round-robin among all the nodes with CPUs yet to be
assigned to tasks. Cyclic distribution is the default
behavior if the number of tasks is no larger than the
number of allocated nodes.
plane The tasks are distributed in blocks of a specified size.
The options include a number representing the size of the
task block. This is followed by an optional
specification of the task distribution scheme within a
block of tasks and between the blocks of tasks. The
number of tasks distributed to each node is the same as
for cyclic distribution, but the taskids assigned to each
node depend on the plane size. For more details
(including examples and diagrams), please see
http://slurm.schedmd.com/mc_support.html
and
http://slurm.schedmd.com/dist_plane.html
arbitrary
The arbitrary method of distribution will allocate
processes in-order as listed in the file designated by the
environment variable SLURM_HOSTFILE. If this variable is
set it will override any other method specified. If
not set the method will default to block. The
hostfile must contain at minimum the number of hosts
requested, listed one per line or comma separated. If
specifying a task count (-n, --ntasks=<number>), your
tasks will be laid out on the nodes in the order of the
file.
NOTE: The arbitrary distribution option on a job
allocation only controls the nodes to be allocated to the
job and not the allocation of CPUs on those nodes. This
option is meant primarily to control a job step's task
layout in an existing job allocation for the srun
command.
Second distribution method (distribution of CPUs across sockets
for binding):
* Use the default method for distributing CPUs across
sockets (cyclic).
block The block distribution method will distribute allocated
CPUs consecutively from the same socket for binding to
tasks, before using the next consecutive socket.
cyclic The cyclic distribution method will distribute allocated
CPUs for binding to a given task consecutively from the
same socket, and from the next consecutive socket for the
next task, in a round-robin fashion across sockets.
fcyclic
The fcyclic distribution method will distribute allocated
CPUs for binding to tasks from consecutive sockets in a
round-robin fashion across the sockets.
Third distribution method (distribution of CPUs across cores for
binding):
* Use the default method for distributing CPUs across cores
(inherited from second distribution method).
block The block distribution method will distribute allocated
CPUs consecutively from the same core for binding to
tasks, before using the next consecutive core.
cyclic The cyclic distribution method will distribute allocated
CPUs for binding to a given task consecutively from the
same core, and from the next consecutive core for the
next task, in a round-robin fashion across cores.
fcyclic
The fcyclic distribution method will distribute allocated
CPUs for binding to tasks from consecutive cores in a
round-robin fashion across the cores.
Optional control for task distribution over nodes:
Pack Rather than distributing a job step's tasks evenly
across its allocated nodes, pack them as tightly as
possible on the nodes.
NoPack Rather than packing a job step's tasks as tightly as
possible on the nodes, distribute them evenly. This user
option will supersede the SelectTypeParameters
CR_Pack_Nodes configuration parameter.
This option applies to job and step allocations.
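The block and cyclic examples given for the first distribution method (four tasks on three nodes) can be reproduced with a small Python model. This is only a sketch under a simplifying assumption: per-node task counts are obtained by dealing tasks out one node at a time, and CPU counts, plane sizes and select plugin behavior are ignored. The function names are hypothetical and the model is not Slurm's actual algorithm.

```python
def tasks_per_node(ntasks, nnodes):
    # Assumption: tasks are dealt out one per node in turn, so the
    # earliest nodes receive any remainder.
    base, extra = divmod(ntasks, nnodes)
    return [base + (1 if i < extra else 0) for i in range(nnodes)]


def block_layout(ntasks, nnodes):
    """Consecutive task numbers share a node."""
    layout, next_task = [], 1
    for count in tasks_per_node(ntasks, nnodes):
        layout.append(list(range(next_task, next_task + count)))
        next_task += count
    return layout


def cyclic_layout(ntasks, nnodes):
    """Consecutive task numbers go to consecutive nodes (round-robin)."""
    layout = [[] for _ in range(nnodes)]
    for task in range(1, ntasks + 1):
        layout[(task - 1) % nnodes].append(task)
    return layout


print(block_layout(4, 3))   # tasks 1,2 on node 1; 3 on node 2; 4 on node 3
print(cyclic_layout(4, 3))  # tasks 1,4 on node 1; 2 on node 2; 3 on node 3
```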
--mail-type=<type>
Notify user by email when certain event types occur. Valid type
values are NONE, BEGIN, END, FAIL, REQUEUE, ALL (equivalent to
BEGIN, END, FAIL, REQUEUE, and STAGE_OUT), STAGE_OUT (burst
buffer stage out and teardown completed), TIME_LIMIT,
TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80
(reached 80 percent of time limit), and TIME_LIMIT_50 (reached
50 percent of time limit). Multiple type values may be
specified in a comma separated list. The user to be notified is
indicated with --mail-user. This option applies to job
allocations.
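The expansion of ALL described above can be sketched in Python as follows. The helper name expand_mail_types is hypothetical, and the sketch only models the stated equivalence (ALL stands for BEGIN, END, FAIL, REQUEUE and STAGE_OUT); it does not validate type names.

```python
# Per the description above, ALL is equivalent to these five types.
ALL_EQUIVALENT = ["BEGIN", "END", "FAIL", "REQUEUE", "STAGE_OUT"]


def expand_mail_types(spec):
    """Expand a comma separated --mail-type value, replacing ALL."""
    expanded = []
    for name in spec.split(","):
        name = name.strip()
        for mail_type in (ALL_EQUIVALENT if name == "ALL" else [name]):
            if mail_type not in expanded:  # keep each type only once
                expanded.append(mail_type)
    return expanded


print(expand_mail_types("ALL,TIME_LIMIT_90"))
```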
--mail-user=<user>
User to receive email notification of state changes as defined
by --mail-type. The default value is the submitting user. This
option applies to job allocations.
--mcs-label=<mcs>
Used only when the mcs/group plugin is enabled. This parameter
is a group among the groups of the user. Default value is
calculated by the mcs plugin if it is enabled. This option
applies to job allocations.
--mem=<MB>
Specify the real memory required per node in MegaBytes. Default
value is DefMemPerNode and the maximum value is MaxMemPerNode.
If configured, both parameters can be seen using the scontrol
show config command. This parameter would generally be used if
whole nodes are allocated to jobs (SelectType=select/linear).
Specifying a memory limit of zero for a job step will restrict
the job step to the amount of memory allocated to the job, but
not remove any of the job's memory allocation from being
available to other job steps. Also see --mem-per-cpu. --mem
and --mem-per-cpu are mutually exclusive.
NOTE: A memory size specification of zero is treated as a
special case and grants the job access to all of the memory on
each node. If the job is allocated multiple nodes in a
heterogeneous cluster, the memory limit on each node will be
that of the node in the allocation with the smallest memory size
(same limit will apply to every node in the job's allocation).
NOTE: Enforcement of memory limits currently relies upon the
task/cgroup plugin or enabling of accounting, which samples
memory use on a periodic basis (data need not be stored, just
collected). In both cases memory use is based upon the job's
Resident Set Size (RSS). A task may exceed the memory limit
until the next periodic accounting sample.
This option applies to job and step allocations.
--mem-per-cpu=<MB>
Minimum memory required per allocated CPU in MegaBytes. Default
value is DefMemPerCPU and the maximum value is MaxMemPerCPU (see
exception below). If configured, both parameters can be seen
using the scontrol show config command. Note that if the job's
--mem-per-cpu value exceeds the configured MaxMemPerCPU, then
the user's limit will be treated as a memory limit per task;
--mem-per-cpu will be reduced to a value no larger than
MaxMemPerCPU; --cpus-per-task will be set and the value of
--cpus-per-task multiplied by the new --mem-per-cpu value will
equal the original --mem-per-cpu value specified by the user.
This parameter would generally be used if individual processors
are allocated to jobs (SelectType=select/cons_res). If
resources are allocated by the core, socket or whole nodes, the
number of CPUs allocated to a job may be higher than the task
count and the value of --mem-per-cpu should be adjusted
accordingly. Specifying a memory limit of zero for a job step
will restrict the job step to the amount of memory allocated to
the job, but not remove any of the job's memory allocation from
being available to other job steps. Also see --mem. --mem and
--mem-per-cpu are mutually exclusive. This option applies to job
and step allocations.
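The adjustment applied when --mem-per-cpu exceeds MaxMemPerCPU can be illustrated with a short Python sketch. It assumes the job starts with one CPU per task and that the CPU count is rounded up; the exact rounding Slurm uses is not described here, so for values that do not divide evenly the product may differ slightly from the original request. The function name is hypothetical.

```python
import math


def adjust_mem_per_cpu(mem_per_cpu, max_mem_per_cpu):
    """Return (new_mem_per_cpu, cpus_per_task) for a request in megabytes.

    cpus_per_task is None when the request is within the limit and
    nothing needs to change.
    """
    if mem_per_cpu <= max_mem_per_cpu:
        return mem_per_cpu, None
    # Assumption: use the smallest CPU count that brings the per-CPU
    # value down to MaxMemPerCPU or below.
    cpus_per_task = math.ceil(mem_per_cpu / max_mem_per_cpu)
    new_mem_per_cpu = mem_per_cpu // cpus_per_task
    return new_mem_per_cpu, cpus_per_task


# A 4096 MB request with MaxMemPerCPU=2048 becomes 2 CPUs x 2048 MB.
print(adjust_mem_per_cpu(4096, 2048))
```

In this example the product of the two returned values (2 x 2048) equals the 4096 MB originally requested, which is the invariant the description states.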
--mem_bind=[{quiet,verbose},]type
Bind tasks to memory. Used only when the task/affinity plugin is
enabled and the NUMA memory functions are available. Note that
the resolution of CPU and memory binding may differ on some
architectures. For example, CPU binding may be performed at the
level of the cores within a processor while memory binding will
be performed at the level of nodes, where the definition of
"nodes" may differ from system to system. The use of any type
other than "none" or "local" is not recommended. If you want
greater control, try running a simple test code with the options
"--cpu_bind=verbose,none --mem_bind=verbose,none" to determine
the specific configuration.
NOTE: To have Slurm always report on the selected memory binding
for all commands executed in a shell, you can enable verbose
mode by setting the SLURM_MEM_BIND environment variable value to
"verbose".
The following informational environment variables are set when
--mem_bind is in use:
SLURM_MEM_BIND_VERBOSE
SLURM_MEM_BIND_TYPE
SLURM_MEM_BIND_LIST
See the ENVIRONMENT VARIABLES section for a more detailed
description of the individual SLURM_MEM_BIND* variables.
Supported options include:
q[uiet]
quietly bind before task runs (default)
v[erbose]
verbosely report binding before task runs
no[ne] don't bind tasks to memory (default)
rank bind by task rank (not recommended)
local Use memory local to the processor in use
map_mem:<list>
bind by mapping a node's memory to tasks as specified
where <list> is <cpuid1>,<cpuid2>,...<cpuidN>. CPU IDs
are interpreted as decimal values unless they are
preceded with '0x' in which case they are interpreted as
hexadecimal values (not recommended)
mask_mem:<list>
bind by setting memory masks on tasks as specified where
<list> is <mask1>,<mask2>,...<maskN>. Memory masks are
always interpreted as hexadecimal values. Note that
masks must be preceded with a '0x' if they don't begin
with [0-9] so they are seen as numerical values by srun.
help show this help message
This option applies to job and step allocations.
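The different number bases used by map_mem and mask_mem are easy to confuse. The following Python sketch (hypothetical helper names, not srun's parser) interprets the two list forms as described above: map_mem entries are decimal unless prefixed with '0x', while mask_mem entries are always hexadecimal and must begin with a digit.

```python
def parse_map_mem(spec):
    """Interpret a map_mem list: decimal unless prefixed with '0x'."""
    ids = []
    for item in spec.split(","):
        item = item.strip()
        base = 16 if item.startswith("0x") else 10
        ids.append(int(item, base))
    return ids


def parse_mask_mem(spec):
    """Interpret a mask_mem list: always hexadecimal."""
    masks = []
    for item in spec.split(","):
        item = item.strip()
        # A mask that does not begin with [0-9] needs a leading '0x'
        # to be seen as a number.
        if not item[:1].isdigit():
            raise ValueError("mask must start with a digit or 0x: " + item)
        masks.append(int(item, 16))
    return masks


print(parse_map_mem("0,1,0x10"))    # 0x10 is read as hexadecimal
print(parse_mask_mem("0x3,10,0xf"))  # "10" is hexadecimal here
```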
--mincpus=<n>
Specify a minimum number of logical cpus/processors per node.
This option applies to job allocations.
--msg-timeout=<seconds>
Modify the job launch message timeout. The default value is
MessageTimeout in the Slurm configuration file slurm.conf.
Changes to this are typically not recommended, but could be
useful to diagnose problems. This option applies to job
allocations.
--mpi=<mpi_type>
Identify the type of MPI to be used. May result in unique
initiation procedures.
list Lists available mpi types to choose from.
lam Initiates one 'lamd' process per node and establishes
necessary environment variables for LAM/MPI.
mpich1_shmem
Initiates one process per node and establishes necessary
environment variables for mpich1 shared memory model.
This also works for mvapich built for shared memory.
mpichgm
For use with Myrinet.
mvapich
For use with Infiniband.
openmpi
For use with OpenMPI.
pmi2 To enable PMI2 support. The PMI2 support in Slurm works
only if the MPI implementation supports it, in other
words if the MPI has the PMI2 interface implemented. The
--mpi=pmi2 will load the library lib/slurm/mpi_pmi2.so
which provides the server side functionality but the
client side must implement PMI2_Init() and the other
interface calls.
pmix To enable PMIx support (http://pmix.github.io/master).
The PMIx support in Slurm can be used to launch parallel
applications (e.g. MPI) if it supports PMIx, PMI2 or
PMI1. Slurm must be configured with pmix support by
passing "--with-pmix=<PMIx installation path>" option to
its "./configure" script.
At the time of writing PMIx is supported in Open MPI
starting from version 2.0. PMIx also supports backward
compatibility with PMI1 and PMI2 and can be used if MPI
was configured with PMI2/PMI1 support pointing to the
PMIx library ("libpmix"). If MPI supports PMI1/PMI2 but
doesn't provide the way to point to a specific
implementation, a hack'ish solution leveraging LD_PRELOAD
can be used to force "libpmix" usage.
none No special MPI processing. This is the default and works
with many other versions of MPI.
This option applies to step allocations.
--multi-prog
Run a job with different programs and different arguments for
each task. In this case, the executable program specified is
actually a configuration file specifying the executable and
arguments for each task. See MULTIPLE PROGRAM CONFIGURATION
below for details on the configuration file contents. This
option applies to step allocations.
-N, --nodes=<minnodes[-maxnodes]>
Request that a minimum of minnodes nodes be allocated to this
job. A maximum node count may also be specified with maxnodes.
If only one number is specified, this is used as both the
minimum and maximum node count. The partition's node limits
supersede those of the job. If a job's node limits are outside
of the range permitted for its associated partition, the job
will be left in a PENDING state. This permits possible
execution at a later time, when the partition limit is changed.
If a job node limit exceeds the number of nodes configured in
the partition, the job will be rejected. Note that the
environment variable SLURM_JOB_NUM_NODES (and SLURM_NNODES for
backwards compatibility) will be set to the count of nodes
actually allocated to the job. See the ENVIRONMENT VARIABLES
section for more information. If -N is not specified, the
default behavior is to allocate enough nodes to satisfy the
requirements of the -n and -c options. The job will be
allocated as many nodes as possible within the range specified
and without delaying the initiation of the job. The node count
specification may include a numeric value followed by a suffix
of "k" (multiplies numeric value by 1,024) or "m" (multiplies
numeric value by 1,048,576). This option applies to job and step
allocations.
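The node count syntax, including the "k" and "m" suffixes, can be illustrated with a small Python sketch. The function names are hypothetical, and only the lowercase suffixes named above are handled.

```python
SUFFIXES = {"k": 1024, "m": 1048576}


def parse_count(text):
    """Parse one node count, applying an optional k or m suffix."""
    text = text.strip()
    if text[-1:] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)


def parse_nodes(spec):
    """Parse minnodes[-maxnodes] into a (minnodes, maxnodes) pair."""
    low, sep, high = spec.partition("-")
    minnodes = parse_count(low)
    # A single number serves as both the minimum and the maximum.
    maxnodes = parse_count(high) if sep else minnodes
    return minnodes, maxnodes


print(parse_nodes("2-4"))
print(parse_nodes("1k"))
```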
-n, --ntasks=<number>
Specify the number of tasks to run. Request that srun allocate
resources for ntasks tasks. The default is one task per node,
but note that the --cpus-per-task option will change this
default. This option applies to job and step allocations.
--network=<type>
Specify information pertaining to the switch or network. The
interpretation of type is system dependent. This option is
supported when running Slurm on a Cray natively. It is used to
request using Network Performance Counters. Only one value per
request is valid. All options are case insensitive. In this
configuration supported values include:
system
Use the system-wide network performance counters. Only
nodes requested will be marked in use for the job
allocation. If the job does not fill up the entire system,
the rest of the nodes cannot be used by other
jobs using NPC; if idle, their state will appear as
PerfCnts. These nodes are still available for other jobs
not using NPC.
blade Use the blade network performance counters. Only nodes
requested will be marked in use for the job allocation.
If the job does not fill up the entire blade(s) allocated
to the job, those blade(s) cannot be used by other
jobs using NPC; if idle, their state will appear as
PerfCnts. These nodes are still available for other jobs
not using NPC.
In all cases the job or step allocation request must specify the
--exclusive option. Otherwise the request will be denied.
Also with any of these options steps are not allowed to share
blades, so resources would remain idle inside an allocation if
the step running on a blade does not take up all the nodes on
the blade.
The network option is also supported on systems with IBM's
Parallel Environment (PE). See IBM's LoadLeveler job command
keyword documentation about the keyword "network" for more
information. Multiple values may be specified in a comma
separated list. All options are case insensitive. Supported
values include:
BULK_XFER[=<resources>]
Enable bulk transfer of data using Remote Direct-
Memory Access (RDMA). The optional resources
specification is a numeric value which can have a
suffix of "k", "K", "m", "M", "g" or "G" for
kilobytes, megabytes or gigabytes. NOTE: The
resources specification is not supported by the
underlying IBM infrastructure as of Parallel
Environment version 2.2 and no value should be
specified at this time. The devices allocated to a
job must all be of the same type. The default value
depends upon what hardware is available
and in order of preferences is IPONLY (which is not
considered in User Space mode), HFI, IB, HPCE, and
KMUX.
CAU=<count> Number of Collective Acceleration Units (CAU)
required. Applies only to IBM Power7-IH processors.
Default value is zero. Independent CAU will be
allocated for each programming interface (MPI, LAPI,
etc.)
DEVNAME=<name>
Specify the device name to use for communications
(e.g. "eth0" or "mlx4_0").
DEVTYPE=<type>
Specify the device type to use for communications.
The supported values of type are: "IB" (InfiniBand),
"HFI" (P7 Host Fabric Interface), "IPONLY" (IP-Only
interfaces), "HPCE" (HPC Ethernet), and "KMUX"
(Kernel Emulation of HPCE). The devices allocated
to a job must all be of the same type. The default
value depends upon what hardware is
available and in order of preferences is IPONLY
(which is not considered in User Space mode), HFI,
IB, HPCE, and KMUX.
IMMED=<count>
Number of immediate send slots per window required.
Applies only to IBM Power7-IH processors. Default
value is zero.
INSTANCES=<count>
Specify number of network connections for each task
on each network connection. The default instance
count is 1.
IPV4 Use Internet Protocol (IP) version 4 communications
(default).
IPV6 Use Internet Protocol (IP) version 6 communications.
LAPI Use the LAPI programming interface.
MPI Use the MPI programming interface. MPI is the
default interface.
PAMI Use the PAMI programming interface.
SHMEM Use the OpenSHMEM programming interface.
SN_ALL Use all available switch networks (default).
SN_SINGLE Use one available switch network.
UPC Use the UPC programming interface.
US Use User Space communications.
Some examples of network specifications:
Instances=2,US,MPI,SN_ALL
Create two user space connections for MPI
communications on every switch network for each
task.
US,MPI,Instances=3,Devtype=IB
Create three user space connections for MPI
communications on every InfiniBand network for each
task.
IPV4,LAPI,SN_Single
Create an IP version 4 connection for LAPI
communications on one switch network for each task.
Instances=2,US,LAPI,MPI
Create two user space connections each for LAPI and
MPI communications on every switch network for each
task. Note that SN_ALL is the default option so
every switch network is used. Also note that
Instances=2 specifies that two connections are
established for each protocol (LAPI and MPI) and
each task. If there are two networks and four tasks
on the node then a total of 32 connections are
established (2 instances x 2 protocols x 2 networks
x 4 tasks).
This option applies to job and step allocations.
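The connection count in the last network example (Instances=2,US,LAPI,MPI) follows from a simple product, sketched here in Python. The function name is hypothetical; the arithmetic is the one given in the example.

```python
def connections_per_node(instances, protocols, networks, tasks):
    """Connections = instances x protocols x networks x tasks."""
    return instances * len(protocols) * networks * tasks


# Two instances, two protocols, two switch networks, four tasks per node.
print(connections_per_node(2, ["LAPI", "MPI"], 2, 4))  # 32
```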
--nice[=adjustment]
Run the job with an adjusted scheduling priority within Slurm.
With no adjustment value the scheduling priority is decreased by
100. The adjustment range is from -10000 (highest priority) to
10000 (lowest priority). Only privileged users can specify a
negative adjustment. NOTE: This option is presently ignored if
SchedulerType=sched/wiki or SchedulerType=sched/wiki2. This
option applies to job allocations.
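The rules for --nice can be summarized in a short Python sketch. The function name and the choice to raise exceptions are assumptions made for illustration; how Slurm itself reports an out-of-range or unprivileged request is not described here.

```python
def resolve_nice(adjustment=None, privileged=False):
    """Return the priority adjustment implied by --nice[=adjustment]."""
    if adjustment is None:
        # With no value, the scheduling priority is decreased by 100.
        adjustment = 100
    if not -10000 <= adjustment <= 10000:
        raise ValueError("adjustment must be between -10000 and 10000")
    if adjustment < 0 and not privileged:
        # Only privileged users can specify a negative adjustment.
        raise PermissionError("negative adjustment requires privileges")
    return adjustment


print(resolve_nice())           # 100
print(resolve_nice(-50, True))  # -50
```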
--ntasks-per-core=<ntasks>
Request the maximum ntasks be invoked on each core. This option
applies to the job allocation, but not to step allocations.
Meant to be used with the --ntasks option. Related to
--ntasks-per-node except at the core level instead of the node
level. Masks will automatically be generated to bind the tasks
to specific core unless --cpu_bind=none is specified. NOTE:
This option is not supported unless SelectTypeParameters=CR_Core
or SelectTypeParameters=CR_Core_Memory is configured. This
option applies to job allocations.
--ntasks-per-node=<ntasks>
Request that ntasks be invoked on each node. If used with the
--ntasks option, the --ntasks option will take precedence and
the --ntasks-per-node will be treated as a maximum count of
tasks per node. Meant to be used with the --nodes option. This
is related to --cpus-per-task=ncpus, but does not require
knowledge of the actual number of cpus on each node. In some
cases, it is more convenient to be able to request that no more
than a specific number of tasks be invoked on each node.
Examples of this include submitting a hybrid MPI/OpenMP app
where only one MPI "task/rank" should be assigned to each node
while allowing the OpenMP portion to utilize all of the
parallelism present in the node, or submitting a single
setup/cleanup/monitoring job to each node of a pre-existing
allocation as one step in a larger job script. This option
applies to job allocations.
--ntasks-per-socket=<ntasks>
Request the maximum ntasks be invoked on each socket. This
option applies to the job allocation, but not to step
allocations. Meant to be used with the --ntasks option.
Related to --ntasks-per-node except at the socket level instead
of the node level. Masks will automatically be generated to
bind the tasks to specific sockets unless --cpu_bind=none is
specified. NOTE: This option is not supported unless
SelectTypeParameters=CR_Socket or
SelectTypeParameters=CR_Socket_Memory is configured. This option
applies to job allocations.
-O, --overcommit
Overcommit resources. This option applies to job and step
allocations. When applied to job allocation, only one CPU is
allocated to the job per node and options used to specify the
number of tasks per node, socket, core, etc. are ignored. When
applied to job step allocations (the srun command when executed
within an existing job allocation), this option can be used to
launch more than one task per CPU. Normally, srun will not
allocate more than one process per CPU. By specifying
--overcommit you are explicitly allowing more than one process
per CPU. However no more than MAX_TASKS_PER_NODE tasks are
permitted to execute per node. NOTE: MAX_TASKS_PER_NODE is
defined in the file slurm.h and is not a variable; it is set at
Slurm build time.
-o, --output=<mode>
Specify the mode for stdout redirection. By default in
interactive mode, srun collects stdout from all tasks and sends
this output via TCP/IP to the attached terminal. With --output
stdout may be redirected to a file, to one file per task, or to
/dev/null. See section IO Redirection below for the various
forms of mode. If the specified file already exists, it will be
overwritten.
If --error is not also specified on the command line, both
stdout and stderr will be directed to the file specified by
--output. This option applies to job and step allocations.
--open-mode=<append|truncate>
Open the output and error files using append or truncate mode as
specified. The default value is specified by the system
configuration parameter JobFileAppend. This option applies to
job allocations.
-p, --partition=<partition_names>
Request a specific partition for the resource allocation. If
not specified, the default behavior is to allow the slurm
controller to select the default partition as designated by the
system administrator. If the job can use more than one
partition, specify their names in a comma separated list and the
one offering earliest initiation will be used with no regard
given to the partition name ordering (although higher priority
partitions will be considered first). When the job is
initiated, the name of the partition used will be placed first
in the job record partition string. This option applies to job
allocations.
--power=<flags>
Comma separated list of power management plugin options.
Currently available flags include: level (all nodes allocated to
the job should have identical power caps, may be disabled by the
Slurm configuration option PowerParameters=job_no_level). This
option applies to job allocations.
--priority=<value>
Request a specific job priority. May be subject to
configuration specific constraints. Only Slurm operators and
administrators can set the priority of a job. This option
applies to job allocations.
--profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
Enables detailed data collection by the acct_gather_profile
plugin. Detailed data are typically time-series that are stored
in an HDF5 file for the job.
All All data types are collected. (Cannot be combined with
other values.)
None No data types are collected. This is the default.
(Cannot be combined with other values.)
Energy Energy data is collected.
Task Task (I/O, Memory, ...) data is collected.
Filesystem
Filesystem data is collected.
Network Network (InfiniBand) data is collected.
This option applies to job and step allocations.
--prolog=<executable>
srun will run executable just before launching the job step.
The command line arguments for executable will be the command
and arguments of the job step. If executable is "none", then no
srun prolog will be run. This parameter overrides the SrunProlog
parameter in slurm.conf. This parameter is completely
independent from the Prolog parameter in slurm.conf. This option
applies to job allocations.
--propagate[=rlimits]
Allows users to specify which of the modifiable (soft) resource
limits to propagate to the compute nodes and apply to their
jobs. If rlimits is not specified, then all resource limits
will be propagated. The following rlimit names are supported by
Slurm (although some options may not be supported on some
systems):
ALL All limits listed below
AS The maximum address space for a process
CORE The maximum size of core file
CPU The maximum amount of CPU time
DATA The maximum size of a process's data segment
FSIZE The maximum size of files created. Note that if the
user sets FSIZE to less than the current size of the
slurmd.log, job launches will fail with a 'File size
limit exceeded' error.
MEMLOCK The maximum size that may be locked into memory
NOFILE The maximum number of open files
NPROC The maximum number of processes available
RSS The maximum resident set size
STACK The maximum stack size
This option applies to job allocations.
--pty Execute task zero in pseudo terminal mode. Implicitly sets
--unbuffered. Implicitly sets --error and --output to /dev/null
for all tasks except task zero, which may cause those tasks to
exit immediately (e.g. shells will typically exit immediately in
that situation). Not currently supported on AIX platforms. This
option applies to step allocations.
-Q, --quiet
Suppress informational messages from srun. Errors will still be
displayed. This option applies to job and step allocations.
-q, --quit-on-interrupt
Quit immediately on single SIGINT (Ctrl-C). Use of this option
disables the status feature normally available when srun
receives a single Ctrl-C and causes srun to instead immediately
terminate the running job. This option applies to step
allocations.
--qos=<qos>
Request a quality of service for the job. QOS values can be
defined for each user/cluster/account association in the Slurm
database. Users will be limited to their association's defined
set of QOS values when the Slurm configuration parameter,
AccountingStorageEnforce, includes "qos" in its definition.
This option applies to job allocations.
-r, --relative=<n>
Run a job step relative to node n of the current allocation.
This option may be used to spread several job steps out among
the nodes of the current job. If -r is used, the current job
step will begin at node n of the allocated nodelist, where the
first node is considered node 0. The -r option is not permitted
with the -w or -x option and will result in a fatal error when not
running within a prior allocation (i.e. when SLURM_JOB_ID is not
set). The default for n is 0. If the value of --nodes exceeds
the number of nodes identified with the --relative option, a
warning message will be printed and the --relative option will
take precedence. This option applies to step allocations.
--reboot
Force the allocated nodes to reboot before starting the job.
This is only supported with some system configurations and will
otherwise be silently ignored. This option applies to job
allocations.
--resv-ports
Reserve communication ports for this job. Users can specify the
number of ports they want to reserve. The parameter
MpiParams=ports=12000-12999 must be specified in slurm.conf. If
not specified, the default number of reserved ports equals the
number of tasks. If the number of reserved ports is zero, no
ports are reserved. Used for OpenMPI. This option applies to job
and step allocations.
--reservation=<name>
Allocate resources for the job from the named reservation. This
option applies to job allocations.
--restart-dir=<directory>
Specifies the directory from which the job or job step's
checkpoint should be read (used by the checkpoint/blcrm and
checkpoint/xlch plugins only). This option applies to job
allocations.
--share The --share option has been replaced by the
--oversubscribe option described below.
-s, --oversubscribe
The job allocation can over-subscribe resources with other
running jobs. The resources to be over-subscribed can be nodes,
sockets, cores, and/or hyperthreads depending upon
configuration. The default over-subscribe behavior depends on
system configuration and the partition's OverSubscribe option
takes precedence over the job's option. This option may result
in the allocation being granted sooner than if the
--oversubscribe option was not set and allow higher system
utilization, but application performance will likely suffer due
to competition for resources. Also see the --exclusive option.
This option applies to step allocations.
-S, --core-spec=<num>
Count of specialized cores per node reserved by the job for
system operations and not used by the application. The
application will not use these cores, but will be charged for
their allocation. Default value is dependent upon the node's
configured CoreSpecCount value. If a value of zero is
designated and the Slurm configuration option
AllowSpecResourcesUsage is enabled, the job will be allowed to
override CoreSpecCount and use the specialized resources on
nodes it is allocated. This option can not be used with the
--thread-spec option. This option applies to job allocations.
--signal=<sig_num>[@<sig_time>]
When a job is within sig_time seconds of its end time, send it
the signal sig_num. Due to the resolution of event handling by
Slurm, the signal may be sent up to 60 seconds earlier than
specified. sig_num may either be a signal number or name (e.g.
"10" or "USR1"). sig_time must have an integer value between 0
and 65535. By default, no signal is sent before the job's end
time. If a sig_num is specified without any sig_time, the
default time will be 60 seconds. This option applies to job
allocations.
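The <sig_num>[@<sig_time>] syntax described above can be illustrated with a small Python sketch. This is not Slurm's own parser; the function name parse_signal_spec is hypothetical, and the only rules it applies are the ones stated above: a signal number or name, an optional time between 0 and 65535, and a default of 60 seconds.

```python
import signal


def parse_signal_spec(spec, default_time=60):
    """Split '<sig_num>[@<sig_time>]' into (signal_number, seconds).

    Hypothetical helper for illustration only; it is not part of Slurm.
    """
    name, _, time_part = spec.partition("@")
    if name.isdigit():
        sig_num = int(name)
    else:
        # Accept both "USR1" and "SIGUSR1" style names.
        full = name.upper()
        if not full.startswith("SIG"):
            full = "SIG" + full
        sig_num = int(signal.Signals[full])
    # Without an explicit time, the documented default of 60 seconds applies.
    sig_time = int(time_part) if time_part else default_time
    if not 0 <= sig_time <= 65535:
        raise ValueError("sig_time must be between 0 and 65535")
    return sig_num, sig_time
```

For example, parse_signal_spec("10@120") yields signal 10 with a 120 second lead time, while a bare name such as "TERM" falls back to the 60 second default.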
--slurmd-debug=<level>
Specify a debug level for slurmd(8). The level may be specified
either as an integer value between 0 [quiet, only errors are
displayed] and 4 [verbose operation] or as one of the SlurmdDebug
tags.
quiet Log nothing
fatal Log only fatal errors
error Log only errors
info Log errors and general informational messages
verbose Log errors and verbose informational messages
The slurmd debug information is copied onto the stderr of
the job. By default only errors are displayed. This option
applies to job and step allocations.
--sockets-per-node=<sockets>
Restrict node selection to nodes with at least the specified
number of sockets. See additional information under -B option
above when task/affinity plugin is enabled. This option applies
to job allocations.
--switches=<count>[@<max-time>]
When a tree topology is used, this defines the maximum count of
switches desired for the job allocation and optionally the
maximum time to wait for that number of switches. If Slurm finds
an allocation containing more switches than the count specified,
the job remains pending until it either finds an allocation with
desired switch count or the time limit expires. If there is no
switch count limit, there is no delay in starting the job.
Acceptable time formats include "minutes", "minutes:seconds",
"hours:minutes:seconds", "days-hours", "days-hours:minutes" and
"days-hours:minutes:seconds". The job's maximum time delay may
be limited by the system administrator using the
SchedulerParameters configuration parameter with the
max_switch_wait parameter option. The default max-time is the
max_switch_wait SchedulerParameters. This option applies to job
allocations.
-T, --threads=<nthreads>
Allows limiting the number of concurrent threads used to send
the job request from the srun process to the slurmd processes on
the allocated nodes. Default is to use one thread per allocated
node up to a maximum of 60 concurrent threads. Specifying this
option limits the number of concurrent threads to nthreads (less
than or equal to 60). This should only be used to set a low
thread count for testing on very small memory computers. This
option applies to job allocations.
-t, --time=<time>
Set a limit on the total run time of the job allocation. If the
requested time limit exceeds the partition's time limit, the job
will be left in a PENDING state (possibly indefinitely). The
default time limit is the partition's default time limit. When
the time limit is reached, each task in each job step is sent
SIGTERM followed by SIGKILL. The interval between signals is
specified by the Slurm configuration parameter KillWait. The
OverTimeLimit configuration parameter may permit the job to run
longer than scheduled. Time resolution is one minute and second
values are rounded up to the next minute.
A time limit of zero requests that no time limit be imposed.
Acceptable time formats include "minutes", "minutes:seconds",
"hours:minutes:seconds", "days-hours", "days-hours:minutes" and
"days-hours:minutes:seconds". This option applies to job and
step allocations.
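The acceptable time formats and the round-up-to-a-minute rule can be sketched in Python as follows. This is an illustrative reading of the formats listed above, not Slurm's actual parser, and the function name time_limit_minutes is hypothetical.

```python
def time_limit_minutes(text):
    """Convert an srun-style time string to whole minutes.

    Hypothetical helper for illustration only. Seconds are rounded up
    to the next minute, matching the one-minute resolution described above.
    """
    days = 0
    if "-" in text:
        # "days-hours", "days-hours:minutes", "days-hours:minutes:seconds"
        day_part, rest = text.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError("too many fields: %r" % text)
        hours, minutes, seconds = (fields + [0, 0])[:3]
    else:
        # "minutes", "minutes:seconds", "hours:minutes:seconds"
        fields = [int(f) for f in text.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:
            hours, minutes, seconds = fields
        else:
            raise ValueError("too many fields: %r" % text)
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    # Integer ceiling division rounds any leftover seconds up to a minute.
    return (total_seconds + 59) // 60
```

Under this reading, "1:30" (one minute and thirty seconds) becomes 2 minutes, and "2-12" (two days and twelve hours) becomes 3600 minutes.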
--task-epilog=<executable>
The slurmstepd daemon will run executable just after each task
terminates. This will be executed before any TaskEpilog
parameter in slurm.conf is executed. This is meant to be a very
short-lived program. If it fails to terminate within a few
seconds, it will be killed along with any descendant processes.
This option applies to step allocations.
--task-prolog=<executable>
The slurmstepd daemon will run executable just before launching
each task. This will be executed after any TaskProlog parameter
in slurm.conf is executed. Besides the normal environment
variables, this has SLURM_TASK_PID available to identify the
process ID of the task being started. Standard output from this
program of the form "export NAME=value" will be used to set
environment variables for the task being spawned. This option
applies to step allocations.
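A task prolog can be any executable. As a minimal sketch, the Python script below prints one "export NAME=value" line built from SLURM_TASK_PID. The variable name MY_TASK_TAG and the helper prolog_lines are hypothetical and chosen only for this example.

```python
import os


def prolog_lines(environ):
    """Return the lines this example prolog writes to standard output."""
    pid = environ.get("SLURM_TASK_PID", "unknown")
    # MY_TASK_TAG is a made-up variable name used only for illustration.
    return ["export MY_TASK_TAG=task-%s" % pid]


if __name__ == "__main__":
    # Lines of the form "export NAME=value" on stdout are picked up
    # and set in the environment of the task being spawned.
    for line in prolog_lines(os.environ):
        print(line)
```

Saved as an executable file, a script like this would be passed with --task-prolog=<executable>. It should finish quickly, since the prolog runs before every task launch.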
--test-only
Returns an estimate of when a job would be scheduled to run
given the current job queue and all the other srun arguments
specifying the job. This limits srun's behavior to just return
information; no job is actually submitted. EXCEPTION: On
Bluegene/Q systems, when running within an existing job
allocation, this disables the use of "runjob" to launch tasks.
The program will be executed directly by the slurmd daemon. This
option applies to job allocations.
--thread-spec=<num>
Count of specialized threads per node reserved by the job for
system operations and not used by the application. The
application will not use these threads, but will be charged for
their allocation. This option can not be used with the
--core-spec option. This option applies to job allocations.
--threads-per-core=<threads>
Restrict node selection to nodes with at least the specified
number of threads per core. NOTE: "Threads" refers to the
number of processing units on each core rather than the number
of application tasks to be launched per core. See additional
information under -B option above when task/affinity plugin is
enabled. This option applies to job allocations.
--time-min=<time>
Set a minimum time limit on the job allocation. If specified,
the job may have its --time limit lowered to a value no lower
than --time-min if doing so permits the job to begin execution
earlier than otherwise possible. The job's time limit will not
be changed after the job is allocated resources. This is
performed by a backfill scheduling algorithm to allocate
resources otherwise reserved for higher priority jobs.
Acceptable time formats include "minutes", "minutes:seconds",
"hours:minutes:seconds", "days-hours", "days-hours:minutes" and
"days-hours:minutes:seconds". This option applies to job
allocations.
--tmp=<MB>
Specify a minimum amount of temporary disk space. This option
applies to job allocations.
-u, --unbuffered
By default the connection between slurmstepd and the user
launched application is over a pipe. The stdio output written by
the application is buffered by glibc until it is flushed or the
output is set as unbuffered. See setbuf(3). If this option
is specified the tasks are executed with a pseudo terminal so
that the application output is unbuffered. This option applies
to step allocations.
--usage
Display brief help message and exit.
--uid=<user>
Attempt to submit and/or run a job as user instead of the
invoking user id. The invoking user's credentials will be used
to check access permissions for the target partition. User root
may use this option to run jobs as a normal user in a RootOnly
partition for example. If run as root, srun will drop its
permissions to the uid specified after node allocation is
successful. user may be the user name or numerical user ID. This
option applies to job and step allocations.
-V, --version
Display version information and exit.
-v, --verbose
Increase the verbosity of srun's informational messages.
Multiple -v's will further increase srun's verbosity. By
default only errors will be displayed. This option applies to
job and step allocations.
-W, --wait=<seconds>
Specify how long to wait after the first task terminates before
terminating all remaining tasks. A value of 0 indicates an
unlimited wait (a warning will be issued after 60 seconds). The
default value is set by the WaitTime parameter in the Slurm
configuration file (see slurm.conf(5)). This option can be
useful to ensure that a job is terminated in a timely fashion in
the event that one or more tasks terminate prematurely. Note:
The -K, --kill-on-bad-exit option takes precedence over -W,
--wait to terminate the job immediately if a task exits with a
non-zero exit code. This option applies to job allocations.
-w, --nodelist=<host1,host2,... or filename>
Request a specific list of hosts. The job will contain all of
these hosts and possibly additional hosts as needed to satisfy
resource requirements. The list may be specified as a
comma-separated list of hosts, a range of hosts (host[1-5,7,...]
for example), or a filename. The host list will be assumed to
be a filename if it contains a "/" character. If you specify a
minimum node or processor count larger than can be satisfied by
the supplied host list, additional resources will be allocated
on other nodes as needed. Rather than repeating a host name
multiple times, an asterisk and a repetition count may be
appended to a host name. For example "host1,host1" and "host1*2"
are equivalent. This option applies to job and step allocations.
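The host list notation above (comma-separated names, bracketed ranges, and the "*" repetition count) can be expanded with a short Python sketch. It is a simplified illustration, not Slurm's hostlist implementation. The function names are hypothetical, and filename arguments (values containing "/") are not handled.

```python
import re


def split_top_level(text):
    """Split on commas that are not inside [...] brackets."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def expand_hostlist(text):
    """Expand e.g. 'host[1-3,7]' or 'host1*2' into a list of host names."""
    hosts = []
    for item in split_top_level(text):
        # An asterisk and a count repeat the host name that many times.
        name, _, count = item.partition("*")
        repeat = int(count) if count else 1
        match = re.fullmatch(r"(.*)\[(.*)\](.*)", name)
        if match is None:
            expanded = [name]
        else:
            prefix, body, suffix = match.groups()
            expanded = []
            for piece in body.split(","):
                low, _, high = piece.partition("-")
                high = high or low
                width = len(low)  # preserve zero padding such as "08"
                for n in range(int(low), int(high) + 1):
                    expanded.append("%s%0*d%s" % (prefix, width, n, suffix))
        hosts.extend(expanded * repeat)
    return hosts
```

With this sketch, "host1,host1" and "host1*2" expand to the same two-entry list, in line with the equivalence stated above.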
--wckey=<wckey>
Specify wckey to be used with job. If TrackWCKey=no (default)
in the slurm.conf this value is ignored. This option applies to
job allocations.
-X, --disable-status
Disable the display of task status when srun receives a single
SIGINT (Ctrl-C). Instead immediately forward the SIGINT to the
running job. Without this option a second Ctrl-C in one second
is required to forcibly terminate the job and srun will
immediately exit. May also be set via the environment variable
SLURM_DISABLE_STATUS. This option applies to job allocations.
-x, --exclude=<host1,host2,... or filename>
Request that a specific list of hosts not be included in the
resources allocated to this job. The host list will be assumed
to be a filename if it contains a "/" character. This option
applies to job allocations.
-Z, --no-allocate
Run the specified tasks on a set of nodes without creating a
Slurm "job" in the Slurm queue structure, bypassing the normal
resource allocation step. The list of nodes must be specified
with the -w, --nodelist option. This is a privileged option
only available for the users "SlurmUser" and "root". This option
applies to job allocations.
The following options support Blue Gene systems, but may be applicable
to other systems as well.
--blrts-image=<path>
Path to blrts image for bluegene block. BGL only. Default from
bluegene.conf if not set. This option applies to job allocations.
--cnload-image=<path>
Path to compute node image for bluegene block. BGP only.
Default from bluegene.conf if not set. This option applies to job
allocations.
--conn-type=<type>
Require the block connection type to be of a certain type. On
Blue Gene the acceptable types are MESH, TORUS and NAV. If NAV
is given, or if the option is not set, Slurm will try to fit
whatever DefaultConnType is set to in bluegene.conf; if that is
not set, the default is TORUS. You should not normally set this
option. If running on a BGP system and wanting to run in HTC
mode (only for 1 midplane and below), you can use HTC_S for SMP,
HTC_D for Dual, HTC_V for virtual node mode, and HTC_L for Linux
mode. For systems that allow a different connection type per
dimension, a comma-separated list of connection types may be
specified, one for each dimension (e.g. M,T,T,T will give you a
torus connection in all dimensions except the first). This
option applies to job allocations.
-g, --geometry=<XxYxZ> | <AxXxYxZ>
Specify the geometry requirements for the job. On BlueGene/L and
BlueGene/P systems there are three numbers giving dimensions in
the X, Y and Z directions, while on BlueGene/Q systems there are
four numbers giving dimensions in the A, X, Y and Z directions
and can not be used to allocate sub-blocks. For example,
"--geometry=1x2x3x4" specifies a block of nodes having 1 x 2 x
3 x 4 = 24 nodes (actually midplanes on BlueGene). This option
applies to job allocations.
--ioload-image=<path>
Path to io image for bluegene block. BGP only. Default from
bluegene.conf if not set. This option applies to job allocations.
--linux-image=<path>
Path to linux image for bluegene block. BGL only. Default from
bluegene.conf if not set. This option applies to job allocations.
--mloader-image=<path>
Path to mloader image for bluegene block. Default from
bluegene.conf if not set. This option applies to job allocations.
-R, --no-rotate
Disables rotation of the job's requested geometry in order to
fit an appropriate block. By default the specified geometry can
rotate in three dimensions. This option applies to job
allocations.
--ramdisk-image=<path>
Path to ramdisk image for bluegene block. BGL only. Default
from bluegene.conf if not set. This option applies to job
allocations.
srun will submit the job request to the Slurm job controller, then
initiate all processes on the remote nodes. If the request cannot be
met immediately, srun will block until the resources are free to run
the job. If the -I (--immediate) option is specified srun will
terminate if resources are not immediately available.
When initiating remote processes srun will propagate the current
working directory, unless --chdir=<path> is specified, in which case
path will become the working directory for the remote processes.
The -n, -c, and -N options control how CPUs and nodes will be
allocated to the job. When specifying only the number of processes to
run with -n, a default of one CPU per process is allocated. By
specifying the number of CPUs required per task (-c), more than one CPU
may be allocated per process. If the number of nodes is specified with
-N, srun will attempt to allocate at least the number of nodes
specified.
Combinations of the above three options may be used to change how
processes are distributed across nodes and cpus. For instance, by
specifying both the number of processes and number of nodes on which to
run, the number of processes per node is implied. However, if the
number of CPUs per process is more important, then the number of
processes (-n) and the number of CPUs per process (-c) should be
specified.
srun will refuse to allocate more than one process per CPU unless
--overcommit (-O) is also specified.
srun will attempt to meet the above specifications "at a minimum." That
is, if 16 nodes are requested for 32 processes, and some nodes do not
have 2 CPUs, the allocation of nodes will be increased in order to meet
the demand for CPUs. In other words, a minimum of 16 nodes are being
requested. However, if 16 nodes are requested for 15 processes, srun
will consider this an error, as 15 processes cannot run across 16
nodes.
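The "at a minimum" behavior can be illustrated with a small Python model. It assumes a simple greedy, in-order walk over candidate nodes, which is a deliberate simplification and not how Slurm's select plugins actually choose nodes. The function name nodes_needed and its cpus_per_node argument are hypothetical.

```python
def nodes_needed(cpus_per_node, ntasks, min_nodes, cpus_per_task=1):
    """Return how many nodes a greedy, in-order pick would allocate.

    cpus_per_node is a hypothetical list of CPU counts, one entry per
    candidate node. Illustration only; not Slurm's selection logic.
    """
    if ntasks < min_nodes:
        # For example, 15 processes cannot run across 16 nodes.
        raise ValueError("fewer tasks than requested nodes")
    cpus_required = ntasks * cpus_per_task
    cpus_found = 0
    for used, cpus in enumerate(cpus_per_node, start=1):
        cpus_found += cpus
        # Keep adding nodes until both the node minimum and the
        # CPU demand are satisfied.
        if used >= min_nodes and cpus_found >= cpus_required:
            return used
    raise ValueError("not enough CPUs on the candidate nodes")
```

With sixteen 2-CPU nodes, a request for 16 nodes and 32 processes fits exactly. If only fourteen of the candidates have 2 CPUs and the rest have 1, the same request grows to 18 nodes in this model.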
IO Redirection
By default, stdout and stderr will be redirected from all tasks to the
stdout and stderr of srun, and stdin will be redirected from the
standard input of srun to all remote tasks. If stdin is only to be
read by a subset of the spawned tasks, specifying a file to read from
rather than forwarding stdin from the srun command may be preferable as
it avoids moving and storing data that will never be read.
For OS X, the poll() function does not support stdin, so input from a
terminal is not possible.
For BGQ, srun only supports stdin to 1 task running on the system. By
default it is taskid 0, but this can be changed with the -i<taskid>
option as described below, or --launcher-opts="--stdinrank=<taskid>".
This behavior may be changed with the --output, --error, and --input
(-o, -e, -i) options. Valid format specifications for these options are
all stdout and stderr are redirected from all tasks to srun. stdin is
broadcast to all remote tasks. (This is the default
behavior)
none stdout and stderr are not received from any task. stdin is
not sent to any task (stdin is closed).
taskid stdout and/or stderr are redirected from only the task with
relative id equal to taskid, where 0 <= taskid < ntasks,
where ntasks is the total number of tasks in the current job
step. stdin is redirected from the stdin of srun to this
same task. This file will be written on the node executing
the task.
filename srun will redirect stdout and/or stderr to the named file
from all tasks. stdin will be redirected from the named file
and broadcast to all tasks in the job. filename refers to a
path on the host that runs srun. Depending on the cluster's
file system layout, this may result in the output appearing
in different places depending on whether the job is run in
batch mode.
format string
srun allows for a format string to be used to generate the
named IO file described above. The following list of format
specifiers may be used in the format string to generate a
filename that will be unique to a given jobid, stepid, node,
or task. In each case, the appropriate number of files are
opened and associated with the corresponding tasks. Note that
any format string containing %t, %n, and/or %N will be
written on the node executing the task rather than the node
where srun executes. These format specifiers are not
supported on a BGQ system.
\\ Do not process any of the replacement symbols.
%% The character "%".
%A Job array's master job allocation number.
%a Job array ID (index) number.
%J jobid.stepid of the running job. (e.g. "128.0")
%j jobid of the running job.
%s stepid of the running job.
%N short hostname. This will create a separate IO file
per node.
%n Node identifier relative to current job (e.g. "0" is
the first node of the running job) This will create a
separate IO file per node.
%t task identifier (rank) relative to current job. This
will create a separate IO file per task.
%u User name.
A number placed between the percent character and format
specifier may be used to zero-pad the result in the IO
filename. This number is ignored if the format specifier
corresponds to non-numeric data (%N for example).
Some examples of how the format string may be used for a 4
task job step with a Job ID of 128 and step id of 0 are
included below:
job%J.out job128.0.out
job%4j.out job0128.out
job%j-%2t.out job128-00.out, job128-01.out, ...
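The expansion rules for these format strings can be mimicked in a few lines of Python. This is a sketch of the behavior described above, not Slurm's implementation. The function name expand_io_pattern is hypothetical, and the "\\" escape is deliberately not handled.

```python
import re


def expand_io_pattern(pattern, values):
    """Expand srun-style IO filename specifiers.

    values maps specifier letters (for example "j", "s", "t", "N") to
    their values. Hypothetical helper for illustration only.
    """
    def replace(match):
        width, spec = match.group(1), match.group(2)
        if spec == "%":
            return "%"
        if spec == "J":
            # %J is jobid.stepid, built from the %j and %s values.
            return "%s.%s" % (values["j"], values["s"])
        value = values[spec]
        # The zero-padding width only applies to numeric data.
        if width and isinstance(value, int):
            return str(value).zfill(int(width))
        return str(value)

    return re.sub(r"%(\d*)([%AaJjsNntu])", replace, pattern)
```

Calling it with a job ID of 128, a step ID of 0 and a task rank of 1 turns "job%j-%2t.out" into "job128-01.out", consistent with the examples above.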
Some srun options may be set via environment variables. These
environment variables, along with their corresponding options, are
listed below. Note: Command line options will always override these
settings.
PMI_FANOUT This is used exclusively with PMI (MPICH2 and
MVAPICH2) and controls the fanout of data
communications. The srun command sends messages
to application programs (via the PMI library) and
those applications may be called upon to forward
that data to up to this number of additional
tasks. Higher values offload work from the srun
command to the applications and likely increase
the vulnerability to failures. The default value
is 32.
PMI_FANOUT_OFF_HOST This is used exclusively with PMI (MPICH2 and
MVAPICH2) and controls the fanout of data
communications. The srun command sends messages
to application programs (via the PMI library) and
those applications may be called upon to forward
that data to additional tasks. By default, srun
sends one message per host and one task on that
host forwards the data to other tasks on that
host up to PMI_FANOUT. If PMI_FANOUT_OFF_HOST is
defined, the user task may be required to forward
the data to tasks on other hosts. Setting
PMI_FANOUT_OFF_HOST may increase performance.
Since more work is performed by the PMI library
loaded by the user application, failures also can
be more common and more difficult to diagnose.
PMI_TIME This is used exclusively with PMI (MPICH2 and
MVAPICH2) and controls how much the
communications from the tasks to the srun are
spread out in time in order to avoid overwhelming
the srun command with work. The default value is
500 (microseconds) per task. On relatively slow
processors or systems with very large processor
counts (and large PMI data sets), higher values
may be required.
SLURM_CONF The location of the Slurm configuration file.
SLURM_ACCOUNT Same as -A, --account
SLURM_ACCTG_FREQ Same as --acctg-freq
SLURM_BCAST Same as --bcast
SLURM_BLRTS_IMAGE Same as --blrts-image
SLURM_BURST_BUFFER Same as --bb
SLURM_CHECKPOINT Same as --checkpoint
SLURM_CHECKPOINT_DIR Same as --checkpoint-dir
SLURM_CNLOAD_IMAGE Same as --cnload-image
SLURM_COMPRESS Same as --compress
SLURM_CONN_TYPE Same as --conn-type
SLURM_CORE_SPEC Same as --core-spec
SLURM_CPU_BIND Same as --cpu_bind
SLURM_CPU_FREQ_REQ Same as --cpu-freq.
SLURM_CPUS_PER_TASK Same as -c, --cpus-per-task
SLURM_DEBUG Same as -v, --verbose
SLURMD_DEBUG Same as -d, --slurmd-debug
SLURM_DEPENDENCY Same as -P, --dependency=<jobid>
SLURM_DISABLE_STATUS Same as -X, --disable-status
SLURM_DIST_PLANESIZE Same as -m plane
SLURM_DISTRIBUTION Same as -m, --distribution
SLURM_EPILOG Same as --epilog
SLURM_EXCLUSIVE Same as --exclusive
SLURM_EXIT_ERROR Specifies the exit code generated when a Slurm
error occurs (e.g. invalid options). This can be
used by a script to distinguish application exit
codes from various Slurm error conditions. Also
see SLURM_EXIT_IMMEDIATE.
SLURM_EXIT_IMMEDIATE Specifies the exit code generated when the
--immediate option is used and resources are not
currently available. This can be used by a
script to distinguish application exit codes from
various Slurm error conditions. Also see
SLURM_EXIT_ERROR.
SLURM_GEOMETRY Same as -g, --geometry
SLURM_GRES_FLAGS Same as --gres-flags
SLURM_HINT Same as --hint
SLURM_GRES Same as --gres. Also see SLURM_STEP_GRES
SLURM_IMMEDIATE Same as -I, --immediate
SLURM_IOLOAD_IMAGE Same as --ioload-image
SLURM_JOB_ID (and SLURM_JOBID for backwards compatibility)
Same as --jobid
SLURM_JOB_NAME Same as -J, --job-name except within an existing
allocation, in which case it is ignored to avoid
using the batch job's name as the name of each
job step.
SLURM_JOB_NUM_NODES (and SLURM_NNODES for backwards compatibility)
Total number of nodes in the job's resource
allocation.
SLURM_KILL_BAD_EXIT Same as -K, --kill-on-bad-exit
SLURM_LABELIO Same as -l, --label
SLURM_LINUX_IMAGE Same as --linux-image
SLURM_MEM_BIND Same as --mem_bind
SLURM_MEM_PER_CPU Same as --mem-per-cpu
SLURM_MEM_PER_NODE Same as --mem
SLURM_MLOADER_IMAGE Same as --mloader-image
SLURM_MPI_TYPE Same as --mpi
SLURM_NETWORK Same as --network
SLURM_NNODES Same as -N, --nodes
SLURM_NO_ROTATE Same as -R, --no-rotate
SLURM_NTASKS (and SLURM_NPROCS for backwards compatibility)
Same as -n, --ntasks
SLURM_NTASKS_PER_CORE Same as --ntasks-per-core
SLURM_NTASKS_PER_NODE Same as --ntasks-per-node
SLURM_NTASKS_PER_SOCKET
Same as --ntasks-per-socket
SLURM_OPEN_MODE Same as --open-mode
SLURM_OVERCOMMIT Same as -O, --overcommit
SLURM_PARTITION Same as -p, --partition
SLURM_PMI_KVS_NO_DUP_KEYS
If set, then PMI key-pairs will contain no
duplicate keys. MPI can use this variable to
inform the PMI library that it will not use
duplicate keys so PMI can skip the check for
duplicate keys. This is the case for MPICH2 and
reduces overhead in testing for duplicates for
improved performance
SLURM_POWER Same as --power
SLURM_PROFILE Same as --profile
SLURM_PROLOG Same as --prolog
SLURM_QOS Same as --qos
SLURM_RAMDISK_IMAGE Same as --ramdisk-image
SLURM_REMOTE_CWD Same as -D, --chdir
SLURM_REQ_SWITCH When a tree topology is used, this defines the
maximum count of switches desired for the job
allocation and optionally the maximum time to
wait for that number of switches. See --switches
SLURM_RESERVATION Same as --reservation
SLURM_RESTART_DIR Same as --restart-dir
SLURM_RESV_PORTS Same as --resv-ports
SLURM_SIGNAL Same as --signal
SLURM_STDERRMODE Same as -e, --error
SLURM_STDINMODE Same as -i, --input
SLURM_SRUN_REDUCE_TASK_EXIT_MSG
If set and non-zero, successive task exit
messages with the same exit code will be printed
only once.
SLURM_STEP_GRES Same as --gres (only applies to job steps, not to
job allocations). Also see SLURM_GRES
SLURM_STEP_KILLED_MSG_NODE_ID=ID
If set, only the specified node will log when the
job or step are killed by a signal.
SLURM_STDOUTMODE Same as -o, --output
SLURM_TASK_EPILOG Same as --task-epilog
SLURM_TASK_PROLOG Same as --task-prolog
SLURM_TEST_EXEC If defined, then verify existence of the
executable program on the local computer before
attempting to launch it on compute nodes.
SLURM_THREAD_SPEC Same as --thread-spec
SLURM_THREADS Same as -T, --threads
SLURM_TIMELIMIT Same as -t, --time
SLURM_UNBUFFEREDIO Same as -u, --unbuffered
SLURM_WAIT Same as -W, --wait
SLURM_WAIT4SWITCH Max time waiting for requested switches. See
--switches
SLURM_WCKEY Same as --wckey
SLURM_WORKING_DIR Same as -D, --chdir
srun will set some environment variables in the environment of the
executing tasks on the remote compute nodes. These environment
variables are:
SLURM_CHECKPOINT_IMAGE_DIR
Directory into which checkpoint images should be
written if specified on the execute line.
SLURM_CLUSTER_NAME Name of the cluster on which the job is
executing.
SLURM_CPU_BIND_VERBOSE
--cpu_bind verbosity (quiet,verbose).
SLURM_CPU_BIND_TYPE --cpu_bind type (none,rank,map_cpu:,mask_cpu:).
SLURM_CPU_BIND_LIST --cpu_bind map or mask list (list of Slurm CPU
IDs or masks for this node, CPU_ID = Board_ID x
threads_per_board + Socket_ID x
threads_per_socket + Core_ID x threads_per_core +
Thread_ID).
SLURM_CPU_FREQ_REQ Contains the value requested for cpu frequency on
the srun command as a numerical frequency in
kilohertz, or a coded value for a request of low,
medium, highm1 or high for the frequency. See the
description of the --cpu-freq option or the
SLURM_CPU_FREQ_REQ input environment variable.
SLURM_CPUS_ON_NODE Count of processors available to the job on this
node. Note the select/linear plugin allocates
entire nodes to jobs, so the value indicates the
total count of CPUs on the node. For the
select/cons_res plugin, this number indicates the
number of cores on this node allocated to the
job.
SLURM_CPUS_PER_TASK Number of cpus requested per task. Only set if
the --cpus-per-task option is specified.
SLURM_DISTRIBUTION Distribution type for the allocated jobs. Set the
distribution with -m, --distribution.
SLURM_GTIDS Global task IDs running on this node. Zero
origin and comma separated.
SLURM_JOB_CPUS_PER_NODE
Number of CPUs per node.
SLURM_JOB_DEPENDENCY Set to value of the --dependency option.
SLURM_JOB_ID (and SLURM_JOBID for backwards compatibility)
Job id of the executing job.
SLURM_JOB_NAME Set to the value of the --job-name option or the
command name when srun is used to create a new
job allocation. Not set when srun is used only to
create a job step (i.e. within an existing job
allocation).
SLURM_JOB_PARTITION Name of the partition in which the job is
running.
SLURM_LAUNCH_NODE_IPADDR
IP address of the node from which the task launch
was initiated (where the srun command ran from).
SLURM_LOCALID Node local task ID for the process within a job.
SLURM_MEM_BIND_VERBOSE
--mem_bind verbosity (quiet,verbose).
SLURM_MEM_BIND_TYPE --mem_bind type (none,rank,map_mem:,mask_mem:).
SLURM_MEM_BIND_LIST --mem_bind map or mask list (<list of IDs or
masks for this node>).
SLURM_NNODES Total number of nodes in the job's resource
allocation.
SLURM_NODE_ALIASES Sets of node name, communication address and
hostname for nodes allocated to the job from the
cloud. Each element in the set is colon separated
and each set is comma separated. For example:
SLURM_NODE_ALIASES=
ec0:1.2.3.4:foo,ec1:1.2.3.5:bar
SLURM_NODEID The relative node ID of the current node.
SLURM_NODELIST List of nodes allocated to the job.
SLURM_NTASKS (and SLURM_NPROCS for backwards compatibility)
Total number of processes in the current job.
SLURM_PRIO_PROCESS The scheduling priority (nice value) at the time
of job submission. This value is propagated to
the spawned processes.
SLURM_PROCID The MPI rank (or relative process ID) of the
current process.
SLURM_SRUN_COMM_HOST IP address of srun communication host.
SLURM_SRUN_COMM_PORT srun communication port.
SLURM_STEP_LAUNCHER_PORT
Step launcher port.
SLURM_STEP_NODELIST List of nodes allocated to the step.
SLURM_STEP_NUM_NODES Number of nodes allocated to the step.
SLURM_STEP_NUM_TASKS Number of processes in the step.
SLURM_STEP_TASKS_PER_NODE
Number of processes per node within the step.
SLURM_STEP_ID (and SLURM_STEPID for backwards compatibility)
The step ID of the current job.
SLURM_SUBMIT_DIR The directory from which srun was invoked.
SLURM_SUBMIT_HOST The hostname of the computer from which srun
was invoked.
SLURM_TASK_PID The process ID of the task being started.
SLURM_TASKS_PER_NODE Number of tasks to be initiated on each node.
Values are comma separated and in the same order
as SLURM_NODELIST. If two or more consecutive
nodes are to have the same task count, that count
is followed by "(x#)" where "#" is the repetition
count. For example,
"SLURM_TASKS_PER_NODE=2(x3),1" indicates that the
first three nodes will each execute two tasks
and the fourth node will execute one task.
SLURM_TOPOLOGY_ADDR This is set only if the system has the
topology/tree plugin configured. The value will
be set to the names of network switches which may be
involved in the job's communications from the
system's top level switch down to the leaf switch
and ending with node name. A period is used to
separate each hardware component name.
SLURM_TOPOLOGY_ADDR_PATTERN
This is set only if the system has the
topology/tree plugin configured. The value will
be set to the component types listed in
SLURM_TOPOLOGY_ADDR. Each component will be
identified as either "switch" or "node". A
period is used to separate each hardware
component type.
SLURM_UMASK The umask in effect when the job was submitted.
SLURMD_NODENAME Name of the node running the task. In the case of
a parallel job executing on multiple compute
nodes, the various tasks will have this
environment variable set to different values on
each compute node.
SRUN_DEBUG Set to the logging level of the srun command.
Default value is 3 (info level). The value is
incremented or decremented based upon the
--verbose and --quiet options.
MPIRUN_NOALLOCATE Do not allocate a block (Blue Gene systems
only).
MPIRUN_NOFREE Do not free a block (Blue Gene systems only).
MPIRUN_PARTITION The block name (Blue Gene systems only).
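Among the variables listed above, SLURM_TASKS_PER_NODE uses a compact "count(x#)" notation that a task may need to expand. The following Python sketch shows one way to do so. The function name expand_tasks_per_node is hypothetical and not part of any Slurm library.

```python
import re


def expand_tasks_per_node(value):
    """Expand e.g. '2(x3),1' into one task count per node.

    Hypothetical helper for illustration only.
    """
    counts = []
    for item in value.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item)
        if match is None:
            raise ValueError("unexpected entry: %r" % item)
        count = int(match.group(1))
        # "(x#)" gives the number of consecutive nodes sharing this count.
        repeat = int(match.group(2) or 1)
        counts.extend([count] * repeat)
    return counts
```

Here "2(x3),1" expands to 2, 2, 2, 1: two tasks on each of the first three nodes and one task on the fourth. In a running task the input would typically come from os.environ["SLURM_TASKS_PER_NODE"].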
Signals sent to the srun command are automatically forwarded to the
tasks it is controlling with a few exceptions. The escape sequence
<control-c> will report the state of all tasks associated with the srun
command. If <control-c> is entered twice within one second, then the
associated SIGINT signal will be sent to all tasks and a termination
sequence will be entered sending SIGCONT, SIGTERM, and SIGKILL to all
spawned tasks. If a third <control-c> is received, the srun program
will be terminated without waiting for remote tasks to exit or their
I/O to complete.
The escape sequence <control-z> is presently ignored. Our intent is for
this to put the srun command into a mode where various special actions
may be invoked.
MPI use depends upon the type of MPI being used. There are three fundamentally different modes of operation used by these various MPI implementations.
1. Slurm directly launches the tasks and performs initialization of communications (Quadrics MPI, MPICH2, MPICH-GM, MVAPICH, MVAPICH2 and some MPICH1 modes). For example: "srun -n16 a.out".
2. Slurm creates a resource allocation for the job and then mpirun launches tasks using Slurm's infrastructure (OpenMPI, LAM/MPI, HP-MPI and some MPICH1 modes).
3. Slurm creates a resource allocation for the job and then mpirun launches tasks using some mechanism other than Slurm, such as SSH or RSH (BlueGene MPI and some MPICH1 modes). These tasks are initiated outside of Slurm's monitoring or control. Slurm's epilog should be configured to purge these tasks when the job's allocation is relinquished.
See http://slurm.schedmd.com/mpi_guide.html for more information on the use of these various MPI implementations with Slurm.
Comments in the configuration file must have a "#" in column one. The
configuration file contains the following fields separated by white
space:
Task rank
One or more task ranks to use this configuration. Multiple
values may be comma separated. Ranges may be indicated with two
numbers separated with a '-' with the smaller number first (e.g.
"0-4" and not "4-0"). To indicate all tasks not otherwise
specified, specify a rank of '*' as the last line of the file.
If an attempt is made to initiate a task for which no executable
program is defined, the following error message will be produced
"No executable program specified for this task".
Executable
The name of the program to execute. May be a fully qualified
pathname if desired.
Arguments
Program arguments. The expression "%t" will be replaced with
the task's number. The expression "%o" will be replaced with
the task's offset within this range (e.g. a configured task rank
value of "1-5" would have offset values of "0-4"). Single
quotes may be used to avoid having the enclosed values
interpreted. This field is optional. Any arguments for the
program entered on the command line will be added to the
arguments specified in the configuration file.
For example:
###################################################################
# srun multiple program configuration file
#
# srun -n8 -l --multi-prog silly.conf
###################################################################
4-6 hostname
1,7 echo task:%t
0,2-3 echo offset:%o
> srun -n8 -l --multi-prog silly.conf
0: offset:0
1: task:1
2: offset:1
3: offset:2
4: linux15.llnl.gov
5: linux16.llnl.gov
6: linux17.llnl.gov
7: task:7
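The "%t" and "%o" substitutions described above can be imitated with a small script. This is only a sketch of the substitution rule for one contiguous rank range, not how srun implements it; the function name expand_range and its argument layout are hypothetical.

```shell
#!/bin/sh
# Expand "%t" (task number) and "%o" (offset within the rank range)
# in an argument template, imitating the substitution described above.
#   $1 = first rank of a contiguous range
#   $2 = last rank of that range
#   $3 = argument template
expand_range() {
    first=$1
    last=$2
    template=$3
    rank=$first
    while [ "$rank" -le "$last" ]; do
        offset=$((rank - first))
        printf '%s\n' "$template" |
            sed -e "s/%t/$rank/g" -e "s/%o/$offset/g"
        rank=$((rank + 1))
    done
}

# A configured task rank value of "1-5" yields offsets 0 through 4.
expand_range 1 5 'task:%t offset:%o'
```

The sketch covers a single range only. In a real configuration line a comma-separated rank list such as "0,2-3" is numbered as one sequence, as the output above shows.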
This simple example demonstrates the execution of the command hostname
in eight tasks. At least eight processors will be allocated to the job
(the same as the task count) on however many nodes are required to
satisfy the request. The output of each task will be preceded by its
task number. (The machine "dev" in the example below has a total of
two CPUs per node.)
> srun -n8 -l hostname
0: dev0
1: dev0
2: dev1
3: dev1
4: dev2
5: dev2
6: dev3
7: dev3
The srun -r option is used within a job script to run two job steps on
disjoint nodes in the following example. The script is run using
allocate mode instead of as a batch job in this case.
> cat test.sh
#!/bin/sh
echo $SLURM_NODELIST
srun -lN2 -r2 hostname
srun -lN2 hostname
> salloc -N4 test.sh
dev[7-10]
0: dev9
1: dev10
0: dev7
1: dev8
The following script runs two job steps in parallel within an allocated
set of nodes.
> cat test.sh
#!/bin/bash
srun -lN2 -n4 -r 2 sleep 60 &
srun -lN2 -r 0 sleep 60 &
sleep 1
squeue
squeue -s
wait
> salloc -N4 test.sh
JOBID PARTITION NAME USER ST TIME NODES NODELIST
65641 batch test.sh grondo R 0:01 4 dev[7-10]
STEPID PARTITION USER TIME NODELIST
65641.0 batch grondo 0:01 dev[7-8]
65641.1 batch grondo 0:01 dev[9-10]
This example demonstrates how one executes a simple MPICH job. We use
srun to build a list of machines (nodes) to be used by mpirun in its
required format. A sample command line and the script to be executed
follow.
> cat test.sh
#!/bin/sh
MACHINEFILE="nodes.$SLURM_JOB_ID"
# Generate Machinefile for mpich such that hosts are in the same
# order as if run via srun
#
srun -l /bin/hostname | sort -n | awk '{print $2}' > $MACHINEFILE
# Run using generated Machine file:
mpirun -np $SLURM_NTASKS -machinefile $MACHINEFILE mpi-app
rm $MACHINEFILE
> salloc -N2 -n4 test.sh
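The sort and awk pipeline in the script above can be tried on its own, without Slurm. In the sketch below the function name labelled_to_machinefile and the sample input lines are hypothetical; the input stands in for what "srun -l /bin/hostname" might print when tasks report out of order.

```shell
#!/bin/sh
# Turn labelled "rank: hostname" lines into a machinefile ordered by rank.
# Reads labelled lines on stdin and writes one hostname per line on stdout.
labelled_to_machinefile() {
    # sort -n orders by the leading task number, so rank 10 follows rank 9;
    # awk then keeps only the hostname column.
    sort -n | awk '{print $2}'
}

# Hypothetical sample input, deliberately out of order.
printf '%s\n' '2: dev1' '0: dev0' '3: dev1' '1: dev0' |
    labelled_to_machinefile
```

Sorting numerically matters once there are ten or more tasks, because a plain lexical sort would place "10:" before "2:".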
This simple example demonstrates the execution of different jobs on
different nodes in the same srun. You can do this for any number of
nodes or any number of jobs. The executables are placed on the nodes
identified by the SLURM_NODEID environment variable, whose values start
at 0 and run up to one less than the number of nodes specified on the
srun command line.
> cat test.sh
case $SLURM_NODEID in
0) echo "I am running on "
hostname ;;
1) hostname
echo "is where I am running" ;;
esac
> srun -N2 test.sh
dev0
is where I am running
I am running on
dev1
This example demonstrates use of multi-core options to control layout
of tasks. We request that four sockets per node and two cores per
socket be dedicated to the job.
> srun -N2 -B 4-4:2-2 a.out
This example shows a script in which Slurm is used to provide resource
management for a job by executing the various job steps as processors
become available for their dedicated use.
> cat my.script
#!/bin/bash
srun --exclusive -n4 prog1 &
srun --exclusive -n3 prog2 &
srun --exclusive -n1 prog3 &
srun --exclusive -n1 prog4 &
wait
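The script above uses a bare wait, which does not report whether any step failed. One possible variation, sketched below, waits on each background step individually and counts the failures. The helper run_step is a hypothetical stand-in for an "srun --exclusive ..." line so that the sketch runs without Slurm; run_all is likewise an illustrative name.

```shell
#!/bin/sh
# Hypothetical stand-in for one "srun --exclusive ..." job step:
# it simply exits with the status given as its first argument.
run_step() {
    return "$1"
}

# Start one background step per argument, then wait for each one
# and print how many of them exited with a non-zero status.
run_all() {
    pids=
    for code in "$@"; do
        run_step "$code" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
}

# Four steps, one of which fails with status 3.
run_all 0 0 3 0
```

In a real job script each run_step call would be replaced by the corresponding srun command, and the failure count could decide the script's own exit status.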
Copyright (C) 2006-2007 The Regents of the University of California. Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER). Copyright (C) 2008-2010 Lawrence Livermore National Security. Copyright (C) 2010-2015 SchedMD LLC. This file is part of Slurm, a resource management program. For details, see <http://slurm.schedmd.com/>. Slurm is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
salloc(1), sattach(1), sbatch(1), sbcast(1), scancel(1), scontrol(1), squeue(1), slurm.conf(5), sched_setaffinity(2), numa(3), getrlimit(2)