slurm.conf - Slurm configuration file
slurm.conf is an ASCII file which describes general Slurm configuration information, the nodes to be managed, information about how those nodes are grouped into partitions, and various scheduling parameters associated with those partitions. This file should be consistent across all nodes in the cluster. The file location can be modified at system build time using the DEFAULT_SLURM_CONF parameter or at execution time by setting the SLURM_CONF environment variable. The Slurm daemons also allow you to override both the built-in and environment-provided location using the "-f" option on the command line.

The contents of the file are case insensitive except for the names of nodes and partitions. Any text following a "#" in the configuration file is treated as a comment through the end of that line. Changes to the configuration file take effect upon restart of Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the command "scontrol reconfigure" unless otherwise noted.

If a line begins with the word "Include" followed by whitespace and then a file name, that file will be included inline with the current configuration file. For large or complex systems, multiple configuration files may prove easier to manage and enable reuse of some files (See INCLUDE MODIFIERS for more details).

Note on file permissions: The slurm.conf file must be readable by all users of Slurm, since it is used by many of the Slurm commands. Other files that are defined in the slurm.conf file, such as log files and job accounting files, may need to be created/owned by the user "SlurmUser" to be successfully accessed. Use the "chown" and "chmod" commands to set the ownership and permissions appropriately. See the section FILE AND DIRECTORY PERMISSIONS for information about the various files and directories used by Slurm.
The overall configuration parameters available include:
AccountingStorageBackupHost
The name of the backup machine hosting the accounting storage
database. If used with the accounting_storage/slurmdbd plugin,
this is where the backup slurmdbd would be running. Only used
for database type storage plugins, ignored otherwise.
AccountingStorageEnforce
This controls what level of association-based enforcement to
impose on job submissions. Valid options are any combination of
associations, limits, nojobs, nosteps, qos, safe, and wckeys, or
all for all things (except nojobs and nosteps, which must be
requested explicitly).
If limits, qos, or wckeys are set, associations will
automatically be set.
If wckeys is set, TrackWCKey will automatically be set.
If safe is set, limits and associations will automatically be
set.
If nojobs is set, nosteps will automatically be set.
By enforcing associations, no new job is allowed to run unless a
corresponding association exists in the system. If limits are
enforced, users can be limited by association to whatever job
size or run time limits are defined.
If nojobs is set, Slurm will not account for any jobs or steps
on the system. Likewise, if nosteps is set, Slurm will not
account for any steps that have run, but limits will still be
enforced.
If safe is enforced, a job will only be launched against an
association or QOS that has a GrpCPUMins limit set if the job
will be able to run to completion. Without this option set,
jobs will be launched as long as their usage hasn't reached the
cpu-minutes limit, which can lead to jobs being launched but
then killed when the limit is reached.
With qos and/or wckeys enforced jobs will not be scheduled
unless a valid qos and/or workload characterization key is
specified.
When AccountingStorageEnforce is changed, a restart of the
slurmctld daemon is required (not just a "scontrol reconfig").
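For example, a site that wants association, limit and QOS enforcement with the safety check against GrpCPUMins might use the following illustrative setting (not a recommendation):
    AccountingStorageEnforce=limits,qos,safe
    # associations is implied by limits, qos and safe; nojobs and nosteps are not requested here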
AccountingStorageHost
The name of the machine hosting the accounting storage database.
Only used for database type storage plugins, ignored otherwise.
Also see DefaultStorageHost.
AccountingStorageLoc
The fully qualified file name where accounting records are
written when the AccountingStorageType is
"accounting_storage/filetxt" or else the name of the database
where accounting records are stored when the
AccountingStorageType is a database. Also see
DefaultStorageLoc.
AccountingStoragePass
The password used to gain access to the database to store the
accounting data. Only used for database type storage plugins,
ignored otherwise. In the case of Slurm DBD (Database Daemon)
with MUNGE authentication this can be configured to use a MUNGE
daemon specifically configured to provide authentication between
clusters while the default MUNGE daemon provides authentication
within a cluster. In that case, AccountingStoragePass should
specify the named port to be used for communications with the
alternate MUNGE daemon (e.g. "/var/run/munge/global.socket.2").
The default value is NULL. Also see DefaultStoragePass.
AccountingStoragePort
The listening port of the accounting storage database server.
Only used for database type storage plugins, ignored otherwise.
Also see DefaultStoragePort.
AccountingStorageTRES
Comma separated list of resources you wish to track on the
cluster. These are the resources requested by the sbatch/srun
job when it is submitted. Currently this consists of any GRES,
BB (burst buffer) or license along with CPU, Memory, Node, and
Energy. By default CPU, Energy, Memory, and Node are tracked.
AccountingStorageTRES=gres/craynetwork,license/iop1 will track
cpu, energy, memory, nodes along with a gres called craynetwork
as well as a license called iop1. Whenever these resources are
used on the cluster they are recorded. The TRES are
automatically set up in the database on the start of the
slurmctld.
AccountingStorageType
The accounting storage mechanism type. Acceptable values at
present include "accounting_storage/filetxt",
"accounting_storage/mysql", "accounting_storage/none" and
"accounting_storage/slurmdbd". The "accounting_storage/filetxt"
value indicates that accounting records will be written to the
file specified by the AccountingStorageLoc parameter. The
"accounting_storage/mysql" value indicates that accounting
records will be written to a MySQL or MariaDB database specified
by the AccountingStorageLoc parameter. The
"accounting_storage/slurmdbd" value indicates that accounting
records will be written to the Slurm DBD, which manages an
underlying MySQL database. See "man slurmdbd" for more
information. The default value is "accounting_storage/none" and
indicates that account records are not maintained. Note: The
filetxt plugin records only a limited subset of accounting
information and will prevent some sacct options from proper
operation. Also see DefaultStorageType.
AccountingStorageUser
The user account for accessing the accounting storage database.
Only used for database type storage plugins, ignored otherwise.
Also see DefaultStorageUser.
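As a hedged sketch of how these parameters fit together when the accounting_storage/slurmdbd plugin is used (the hostname is hypothetical, and 6819 is the customary slurmdbd port but should be verified against your slurmdbd.conf):
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbd.example.com   # hypothetical host running slurmdbd
    AccountingStoragePort=6819              # slurmdbd listening port (site dependent)
With slurmdbd, the database name, user and password are normally configured in slurmdbd.conf rather than in slurm.conf.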
AccountingStoreJobComment
If set to "YES" then include the job's comment field in the job
complete message sent to the Accounting Storage database. The
default is "YES".
AcctGatherNodeFreq
The AcctGather plugins' sampling interval for node accounting.
For AcctGather plugin values of none, this parameter is ignored.
For all other values this parameter is the number of seconds
between node accounting samples. For the acct_gather_energy/rapl
plugin, set a value less than 300 because the counters may
overflow beyond this rate. The default value is zero, which
disables accounting sampling for nodes. Note: The
accounting sampling interval for jobs is determined by the value
of JobAcctGatherFrequency.
AcctGatherEnergyType
Identifies the plugin to be used for energy consumption
accounting. The jobacct_gather plugin and slurmd daemon call
this plugin to collect energy consumption data for jobs and
nodes. Energy consumption data is collected at the node level,
so the measurements will only reflect a job's real consumption
when the job has exclusive use of the node. When nodes are
shared between jobs, the reported energy consumed per job
(through sstat or sacct) will not reflect the energy actually
consumed by each job.
Configurable values at present are:
acct_gather_energy/none
No energy consumption data is collected.
acct_gather_energy/ipmi
Energy consumption data is collected from
the Baseboard Management Controller (BMC)
using the Intelligent Platform Management
Interface (IPMI).
acct_gather_energy/rapl
Energy consumption data is collected from
hardware sensors using the Running Average
Power Limit (RAPL) mechanism. Note that
enabling RAPL may require the execution of
the command "sudo modprobe msr".
AcctGatherInfinibandType
Identifies the plugin to be used for infiniband network traffic
accounting. The plugin is activated only when HDF5 profiling is
enabled and the user requests network data collection for a job
through --profile=Network (or =All). Network traffic data is
collected at the node level, so the collected values will only
reflect a job's real traffic when the job has exclusive use of
the node. All network traffic data is logged to per-job HDF5
files on each node; nothing is stored in the Slurm database.
Configurable values at present are:
acct_gather_infiniband/none
No infiniband network data are collected.
acct_gather_infiniband/ofed
Infiniband network traffic data are
collected from the hardware monitoring
counters of Infiniband devices through the
OFED library.
AcctGatherFilesystemType
Identifies the plugin to be used for filesystem traffic
accounting. The plugin is activated only when HDF5 profiling is
enabled and the user requests filesystem data collection for a
job through --profile=Lustre (or =All). Filesystem traffic data
is collected at the node level, so the collected values will
only reflect a job's real traffic when the job has exclusive use
of the node. All filesystem traffic data is logged to per-job
HDF5 files on each node; nothing is stored in the Slurm
database.
Configurable values at present are:
acct_gather_filesystem/none
No filesystem data are collected.
acct_gather_filesystem/lustre
Lustre filesystem traffic data are collected
from the counters found in /proc/fs/lustre/.
AcctGatherProfileType
Identifies the plugin to be used for detailed job profiling.
The jobacct_gather plugin and slurmd daemon call this plugin to
collect detailed data such as I/O counts, memory usage, or
energy consumption for jobs and nodes. There are interfaces in
this plugin to collect data at step start and completion, task
start and completion, and at the account gather frequency. Data
collected at the node level can be attributed to a job only when
the job has exclusive use of the node.
Configurable values at present are:
acct_gather_profile/none
No profile data is collected.
acct_gather_profile/hdf5
This enables the HDF5 plugin. The directory
where the profile files are stored and which
values are collected are configured in the
acct_gather.conf file.
AllowSpecResourcesUsage
If set to 1, Slurm allows individual jobs to override a node's
configured CoreSpecCount value. For a job to take advantage of
this feature, a command line option of --core-spec must be
specified. The default value for this option is 1 for Cray
systems and 0 for other system types.
AuthInfo
Additional information to be used for authentication of
communications between the Slurm daemons (slurmctld and slurmd)
and the Slurm clients. The interpretation of this option is
specific to the configured AuthType. Multiple options may be
specified in a comma delimited list. If not specified, the
default authentication information will be used.
cred_expire Default job step credential lifetime, in seconds
(e.g. "cred_expire=1200"). It must be
long enough to load the user environment, run the
prolog, handle the slurmd getting paged out
of memory, etc. This also controls how long a
requeued job must wait before starting again. The
default value is 120 seconds.
socket Path name to a MUNGE daemon socket to use (e.g.
"socket=/var/run/munge/munge.socket.2"). The
default value is "/var/run/munge/munge.socket.2".
Used by auth/munge and crypto/munge.
ttl Credential lifetime, in seconds (e.g. "ttl=300").
The default value is dependent upon the Munge
installation, but is typically 300 seconds.
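A minimal sketch combining these options with MUNGE authentication (the socket path shown is the documented default; the cred_expire value is illustrative):
    AuthType=auth/munge
    AuthInfo=socket=/var/run/munge/munge.socket.2,cred_expire=300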
AuthType
The authentication method for communications between Slurm
components. Acceptable values at present include "auth/none"
and "auth/munge". The default value is "auth/munge".
"auth/none" includes the UID in each communication, but it is
not verified. This may be fine for testing purposes, but do not
use "auth/none" if you desire any security. "auth/munge"
indicates that LLNL's MUNGE is to be used (this is the best
supported authentication mechanism for Slurm, see
"http://munge.googlecode.com/" for more information). All Slurm
daemons and commands must be terminated prior to changing the
value of AuthType and later restarted (Slurm jobs can be
preserved).
BackupAddr
The name by which BackupController should be referred to when
establishing a communications path. This name will be used as an
argument to the gethostbyname() function for identification. For
example, "elx0000" might be used to designate the Ethernet
address for node "lx0000". By default the BackupAddr will be
identical in value to BackupController.
BackupController
The short, or long, name of the machine where Slurm control
functions are to be executed in the event that ControlMachine
fails (i.e. the name returned by the command "hostname -s").
This node may also be used as a compute server if so desired. It
will come into service as a controller only upon the failure of
ControlMachine and will revert to a "standby" mode when the
ControlMachine becomes available once again.
The backup controller recovers state information from the
StateSaveLocation directory, which must be readable and writable
from both the primary and backup controllers. While not
essential, it is recommended that you specify a backup
controller. See the RELOCATING CONTROLLERS section if you
change this.
BatchStartTimeout
The maximum time (in seconds) that a batch job is permitted for
launching before being considered missing and releasing the
allocation. The default value is 10 (seconds). Larger values may
be required if more time is required to execute the Prolog, load
user environment variables (for Moab spawned jobs), or if the
slurmd daemon gets paged from memory.
Note: The test for a job being successfully launched is only
performed when the Slurm daemon on the compute node registers
state with the slurmctld daemon on the head node, which happens
fairly rarely. Therefore a job will not necessarily be
terminated if its start time exceeds BatchStartTimeout. This
configuration parameter is also applied to launch tasks and
avoid aborting srun commands due to long running Prolog scripts.
BurstBufferType
The plugin used to manage burst buffers. Acceptable values at
present include "burst_buffer/none". More information later...
CheckpointType
The system-initiated checkpoint method to be used for user jobs.
The slurmctld daemon must be restarted for a change in
CheckpointType to take effect. Supported values presently
include:
checkpoint/aix for IBM AIX systems only
checkpoint/blcr Berkeley Lab Checkpoint Restart (BLCR). NOTE:
If a file is found at sbin/scch (relative to
the Slurm installation location), it will be
executed upon completion of the checkpoint.
This can be a script used for managing the
checkpoint files. NOTE: Slurm's BLCR logic
only supports batch jobs.
checkpoint/none no checkpoint support (default)
checkpoint/ompi OpenMPI (version 1.3 or higher)
checkpoint/poe for use with IBM POE (Parallel Operating
Environment) only
ChosLoc
If configured, then any processes invoked on the user's behalf
(namely the SPANK prolog/epilog scripts and the slurmstepd
processes, which in turn spawn the user batch script and
applications) are not directly executed by the slurmd daemon,
but instead the ChosLoc program is executed. Both are spawned
with the same user ID as the configured SlurmdUser (typically
user root). That program's arguments are the program and
arguments that would otherwise be invoked directly by the slurmd
daemon. The intent of this feature is to be able to run a user
application in some sort of container. This option specifies
the fully qualified pathname of the chos command (see
https://github.com/scanon/chos for details).
ClusterName
The name by which this Slurm managed cluster is known in the
accounting database. This is needed to distinguish accounting
records when multiple clusters report to the same database.
Because of limitations in some databases, any upper case letters
in the name will be silently mapped to lower case. In order to
avoid confusion, it is recommended that the name be lower case.
CompleteWait
The time, in seconds, given for a job to remain in COMPLETING
state before any additional jobs are scheduled. If set to zero,
pending jobs will be started as soon as possible. Since a
COMPLETING job's resources are released for use by other jobs as
soon as the Epilog completes on each individual node, this can
result in very fragmented resource allocations. To provide jobs
with the minimum response time, a value of zero is recommended
(no waiting). To minimize fragmentation of resources, a value
equal to KillWait plus two is recommended. In that case,
setting KillWait to a small value may be beneficial. The
default value of CompleteWait is zero seconds. The value may
not exceed 65533.
ControlAddr
The name by which ControlMachine should be referred to when
establishing a communications path. This name will be used as an
argument to
the gethostbyname() function for identification. For example,
"elx0000" might be used to designate the Ethernet address for
node "lx0000". By default the ControlAddr will be identical in
value to ControlMachine.
ControlMachine
The short, or long, hostname of the machine where Slurm control
functions are executed (i.e. the name returned by the command
"hostname -s"). This value must be specified. In order to
support some high availability architectures, multiple hostnames
may be listed with comma separators and one ControlAddr must be
specified. The high availability system must insure that the
slurmctld daemon is running on only one of these hosts at a
time. See the RELOCATING CONTROLLERS section if you change
this.
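For example, a primary/backup controller pair might be described as follows (the hostnames and the state directory are hypothetical; StateSaveLocation must reside on storage accessible to both controllers, as noted under BackupController):
    ControlMachine=ctl1          # hypothetical primary controller
    ControlAddr=ctl1-mgmt        # hypothetical management-network name for ctl1
    BackupController=ctl2        # hypothetical backup controller
    BackupAddr=ctl2-mgmt
    StateSaveLocation=/shared/slurm/state   # must be readable/writable by both controllers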
CoreSpecPlugin
Identifies the plugins to be used for enforcement of core
specialization. The slurmd daemon must be restarted for a
change in CoreSpecPlugin to take effect. Acceptable values at
present include:
core_spec/cray used only for Cray systems
core_spec/none used for all other system types
CpuFreqDef
Default CPU frequency governor to use when running a job step if
it has not been explicitly set with the --cpu-freq option.
Acceptable values at present include:
Conservative attempts to use the Conservative CPU governor
OnDemand attempts to use the OnDemand CPU governor
Performance attempts to use the Performance CPU governor
PowerSave attempts to use the PowerSave CPU governor
There is no default value. If unset, no attempt to set the governor is
made if the --cpu-freq option has not been set.
CpuFreqGovernors
List of CPU frequency governors allowed to be set with the
salloc, sbatch, or srun option --cpu-freq. Acceptable values
at present include:
Conservative attempts to use the Conservative CPU governor
OnDemand attempts to use the OnDemand CPU governor (the
default value)
Performance attempts to use the Performance CPU governor (the
default value)
PowerSave attempts to use the PowerSave CPU governor
UserSpace attempts to use the UserSpace CPU governor
The default is OnDemand, Performance.
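For example, to default job steps to the OnDemand governor while also allowing users to request the Performance or UserSpace governors with --cpu-freq:
    CpuFreqDef=OnDemand
    CpuFreqGovernors=OnDemand,Performance,UserSpace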
CryptoType
The cryptographic signature tool to be used in the creation of
job step credentials. The slurmctld daemon must be restarted
for a change in CryptoType to take effect. Acceptable values at
present include "crypto/munge" and "crypto/openssl". The
default value is "crypto/munge".
DebugFlags
Defines specific subsystems which should provide more detailed
event logging. Multiple subsystems can be specified with comma
separators. Most DebugFlags will result in verbose logging for
the identified subsystems and could impact performance. The
below DB_* flags are only useful when writing directly to the
database. If using the DBD put these debug flags in the
slurmdbd.conf. Valid subsystems available today (with more to
come) include:
Backfill Backfill scheduler details
BackfillMap Backfill scheduler to log a very verbose map of
reserved resources through time. Combine with
Backfill for a verbose and complete view of the
backfill scheduler's work.
BGBlockAlgo BlueGene block selection details
BGBlockAlgoDeep BlueGene block selection, more details
BGBlockPick BlueGene block selection for jobs
BGBlockWires BlueGene block wiring (switch state details)
BurstBuffer Burst Buffer plugin
CPU_Bind CPU binding details for jobs and steps
CpuFrequency Cpu frequency details for jobs and steps using
the --cpu-freq option.
DB_ASSOC SQL statements/queries when dealing with
associations in the database.
DB_EVENT SQL statements/queries when dealing with (node)
events in the database.
DB_JOB SQL statements/queries when dealing with jobs
in the database.
DB_QOS SQL statements/queries when dealing with QOS in
the database.
DB_QUERY SQL statements/queries when dealing with
transactions and such in the database.
DB_RESERVATION SQL statements/queries when dealing with
reservations in the database.
DB_RESOURCE SQL statements/queries when dealing with
resources like licenses in the database.
DB_STEP SQL statements/queries when dealing with steps
in the database.
DB_USAGE SQL statements/queries when dealing with usage
queries and inserts in the database.
DB_WCKEY SQL statements/queries when dealing with wckeys
in the database.
Elasticsearch Elasticsearch debug info
Energy AcctGatherEnergy debug info
ExtSensors External Sensors debug info
FrontEnd Front end node details
Gres Generic resource details
Gang Gang scheduling details
JobContainer Job container plugin details
License License management details
NodeFeatures Node Features plugin debug info
NO_CONF_HASH Do not log when the slurm.conf file differs
between Slurm daemons
Power Power management plugin
Priority Job prioritization
Protocol Communication protocol details
Reservation Advanced reservations
SelectType Resource selection plugin
Steps Slurmctld resource allocation for job steps
Switch Switch plugin
TraceJobs Trace jobs in slurmctld. It will print detailed
job information including state, job ids and
allocated node count.
Triggers Slurmctld triggers
Wiki Sched/wiki and wiki2 communications
DefMemPerCPU
Default real memory size available per allocated CPU in
MegaBytes. Used to avoid over-subscribing memory and causing
paging. DefMemPerCPU would generally be used if individual
processors are allocated to jobs (SelectType=select/cons_res).
The default value is 0 (unlimited). Also see DefMemPerNode and
MaxMemPerCPU. DefMemPerCPU and DefMemPerNode are mutually
exclusive.
NOTE: Enforcement of memory limits currently requires enabling
of accounting, which samples memory use on a periodic basis
(data need not be stored, just collected).
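A hedged sketch of per-CPU memory management with the cons_res selector (the megabyte values are purely illustrative):
    SelectType=select/cons_res
    DefMemPerCPU=2048    # jobs that do not request memory get 2048 MB per allocated CPU
    MaxMemPerCPU=4096    # no job may be allocated more than 4096 MB per CPU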
DefMemPerNode
Default real memory size available per allocated node in
MegaBytes. Used to avoid over-subscribing memory and causing
paging. DefMemPerNode would generally be used if whole nodes
are allocated to jobs (SelectType=select/linear) and resources
are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
The default value is 0 (unlimited). Also see DefMemPerCPU and
MaxMemPerNode. DefMemPerCPU and DefMemPerNode are mutually
exclusive.
NOTE: Enforcement of memory limits currently requires enabling
of accounting, which samples memory use on a periodic basis
(data need not be stored, just collected).
DefaultStorageHost
The default name of the machine hosting the accounting storage
and job completion databases. Only used for database type
storage plugins and when the AccountingStorageHost and
JobCompHost have not been defined.
DefaultStorageLoc
The fully qualified file name where accounting records and/or
job completion records are written when the DefaultStorageType
is "filetxt" or the name of the database where accounting
records and/or job completion records are stored when the
DefaultStorageType is a database. Also see AccountingStorageLoc
and JobCompLoc.
DefaultStoragePass
The password used to gain access to the database to store the
accounting and job completion data. Only used for database type
storage plugins, ignored otherwise. Also see
AccountingStoragePass and JobCompPass.
DefaultStoragePort
The listening port of the accounting storage and/or job
completion database server. Only used for database type storage
plugins, ignored otherwise. Also see AccountingStoragePort and
JobCompPort.
DefaultStorageType
The accounting and job completion storage mechanism type.
Acceptable values at present include "filetxt", "mysql" and
"none". The value "filetxt" indicates that records will be
written to a file. The value "mysql" indicates that accounting
records will be written to a MySQL or MariaDB database. The
default value is "none", which means that records are not
maintained. Also see AccountingStorageType and JobCompType.
DefaultStorageUser
The user account for accessing the accounting storage and/or job
completion database. Only used for database type storage
plugins, ignored otherwise. Also see AccountingStorageUser and
JobCompUser.
DisableRootJobs
If set to "YES" then user root will be prevented from running
any jobs. The default value is "NO", meaning user root will be
able to execute jobs. DisableRootJobs may also be set by
partition.
EioTimeout
The number of seconds srun waits for slurmstepd to close the
TCP/IP connection used to relay data between the user
application and srun when the user application terminates. The
default value is 60 seconds. May not exceed 65533.
EnforcePartLimits
If set to "ALL" then jobs which exceed a partition's size and/or
time limits will be rejected at submission time. If job is
submitted to multiple partitions, the job must satisfy the
limits on all the requested partitions. If set to "NO" then the
job will be accepted and remain queued until the partition
limits are altered (time and node limits). If set to "ANY" or
"YES" the job need only satisfy the limits of one of the
requested partitions to be submitted. The default value is
"NO". NOTE: If set, then a
job's QOS can not be used to exceed partition limits.
Epilog Fully qualified pathname of a script to execute as user root on
every node when a user's job completes (e.g.
"/usr/local/slurm/epilog"). A glob pattern (See glob (7)) may
also be used to run more than one epilog script (e.g.
"/etc/slurm/epilog.d/*"). The Epilog script or scripts may be
used to purge files, disable user login, etc. By default there
is no epilog. See Prolog and Epilog Scripts for more
information.
EpilogMsgTime
The number of microseconds that the slurmctld daemon requires to
process an epilog completion message from the slurmd daemons.
This parameter can be used to prevent a burst of epilog
completion messages from being sent at the same time which
should help prevent lost messages and improve throughput for
large jobs. The default value is 2000 microseconds. For a 1000
node job, this spreads the epilog completion messages out over
two seconds.
EpilogSlurmctld
Fully qualified pathname of a program for the slurmctld to
execute upon termination of a job allocation (e.g.
"/usr/local/slurm/epilog_controller"). The program executes as
SlurmUser, which gives it permission to drain nodes and requeue
the job if a failure occurs (See scontrol(1)). Exactly what the
program does and how it accomplishes this is completely at the
discretion of the system administrator. Information about the
job being initiated, its allocated nodes, etc. are passed to
the program using environment variables. See Prolog and Epilog
Scripts for more information.
ExtSensorsFreq
The external sensors plugin sampling interval. If
ExtSensorsType=ext_sensors/none, this parameter is ignored. For
all other values of ExtSensorsType, this parameter is the number
of seconds between external sensors samples for hardware
components (nodes, switches, etc.). The default value is zero,
which disables external sensors sampling. Note: This
parameter does not affect external sensors data collection for
jobs/steps.
ExtSensorsType
Identifies the plugin to be used for external sensors data
collection. Slurmctld calls this plugin to collect external
sensors data for jobs/steps and hardware components. In case of
node sharing between jobs the reported values per job/step
(through sstat or sacct) may not be accurate. See also "man
ext_sensors.conf".
Configurable values at present are:
ext_sensors/none No external sensors data is collected.
ext_sensors/rrd External sensors data is collected from the
RRD database.
FairShareDampeningFactor
Dampen the effect of exceeding a user or group's fair share of
allocated resources. Higher values will provide greater ability
to differentiate between exceeding the fair share at high levels
(e.g. a value of 1 results in almost no difference between
overconsumption by a factor of 10 and 100, while a value of 5
will result in a significant difference in priority). The
default value is 1.
FastSchedule
Controls how a node's configuration specifications in slurm.conf
are used. If the number of node configuration entries in the
configuration file is significantly lower than the number of
nodes, setting FastSchedule to 1 will permit much faster
scheduling decisions to be made. (The scheduler can just check
the values in a few configuration records instead of possibly
thousands of node records.) Note that on systems with
hyper-threading, the processor count reported by the node will
be twice the actual processor count. Consider which value you
want to be used for scheduling purposes.
0 Base scheduling decisions upon the actual configuration of
each individual node except that the node's processor count
in Slurm's configuration must match the actual hardware
configuration if PreemptMode=suspend,gang or
SelectType=select/cons_res are configured (both of those
plugins maintain resource allocation information using
bitmaps for the cores in the system and must remain static,
while the node's memory and disk space can be established
later).
1 (default)
Consider the configuration of each node to be that
specified in the slurm.conf configuration file and any node
with less than the configured resources will be set to
DRAIN.
2 Consider the configuration of each node to be that
specified in the slurm.conf configuration file and any node
with less than the configured resources will not be set
DRAIN. This option is generally only useful for testing
purposes.
FirstJobId
The job id to be used for the first job submitted to Slurm
without a specific requested value. Generated job id values
will be incremented by 1 for each subsequent job. This may be used to
provide a meta-scheduler with a job id space which is disjoint
from the interactive jobs. The default value is 1. Also see
MaxJobId
GetEnvTimeout
Used for Moab scheduled jobs only. Controls how long a job
should wait, in seconds, for loading the user's environment
before
attempting to load it from a cache file. Applies when the srun
or sbatch --get-user-env option is used. If set to 0 then always
load the user's environment from the cache file. The default
value is 2 seconds.
GresTypes
A comma delimited list of generic resources to be managed.
These generic resources may have an associated plugin available
to provide additional functionality. No generic resources are
managed by default. Insure this parameter is consistent across
all nodes in the cluster for proper operation. The slurmctld
daemon must be restarted for changes to this parameter to become
effective.
GroupUpdateForce
If set to a non-zero value, then information about which users
are members of groups allowed to use a partition will be updated
periodically, even when there have been no changes to the
/etc/group file. Otherwise group member information will be
updated periodically only after the /etc/group file is updated.
The default value is 1. Also see the GroupUpdateTime parameter.
GroupUpdateTime
Controls how frequently information about which users are
members of groups allowed to use a partition will be updated,
and how long user group membership lists will be cached. The
time interval is given in seconds with a default value of 600
seconds and a maximum value of 4095 seconds. A value of zero
will prevent periodic updating of group membership information.
Also see the GroupUpdateForce parameter.
HealthCheckInterval
The interval in seconds between executions of
HealthCheckProgram. The default value is zero, which disables
execution.
HealthCheckNodeState
Identify what node states should execute the HealthCheckProgram.
Multiple state values may be specified with a comma separator.
The default value is ANY to execute on nodes in any state.
ALLOC Run on nodes in the ALLOC state (all CPUs
allocated).
ANY Run on nodes in any state.
CYCLE Rather than running the health check program on all
nodes at the same time, cycle through running on all
compute nodes through the course of the
HealthCheckInterval. May be combined with the
various node state options.
IDLE Run on nodes in the IDLE state.
MIXED Run on nodes in the MIXED state (some CPUs idle and
other CPUs allocated).
HealthCheckProgram
Fully qualified pathname of a script to execute as user root
periodically on all compute nodes that are not in the
NOT_RESPONDING state. This program may be used to verify the
node is fully operational and DRAIN the node or send email if a
problem is detected. Any action to be taken must be explicitly
performed by the program (e.g. execute "scontrol update
NodeName=foo State=drain Reason=tmp_file_system_full" to drain a
node). The execution interval is controlled using the
HealthCheckInterval parameter. Note that the HealthCheckProgram
will be executed at the same time on all nodes to minimize its
impact upon parallel programs. This program will be killed
if it does not terminate normally within 60 seconds. By
default, no program will be executed.
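For example, to run a site-provided health check script (the path is hypothetical) every five minutes, cycling through the nodes rather than running on all of them at once:
    HealthCheckProgram=/usr/local/sbin/node_health.sh   # hypothetical site script
    HealthCheckInterval=300
    HealthCheckNodeState=ANY,CYCLE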
InactiveLimit
The interval, in seconds, after which a non-responsive job
allocation command (e.g. srun or salloc) will result in the job
being terminated. If the node on which the command is executed
fails or the command abnormally terminates, this will terminate
its job allocation. This option has no effect upon batch jobs.
When setting a value, take into consideration that a debugger
using srun to launch an application may leave the srun command
in a stopped state for extended periods of time. This limit is
ignored for jobs running in partitions with the RootOnly flag
set (the scheduler running as root will be responsible for the
job). The default value is unlimited (zero) and may not exceed
65533 seconds.
JobAcctGatherType
The job accounting mechanism type. Acceptable values at present
include "jobacct_gather/aix" (for AIX operating system),
"jobacct_gather/linux" (for Linux operating system),
"jobacct_gather/cgroup" and "jobacct_gather/none" (no accounting
data collected). The default value is "jobacct_gather/none".
"jobacct_gather/cgroup" is a plugin for the Linux operating
system that uses cgroups to collect accounting statistics. The
plugin collects the following statistics: From the cgroup memory
subsystem: memory.usage_in_bytes (reported as 'pages') and rss
from memory.stat (reported as 'rss'). From the cgroup cpuacct
subsystem: user cpu time and system cpu time. No value is
provided by cgroups for virtual memory size ('vsize'). In order
to use the sstat tool, "jobacct_gather/aix",
"jobacct_gather/linux", or "jobacct_gather/cgroup" must be
configured.
NOTE: Changing this configuration parameter changes the contents
of the messages between Slurm daemons. Any previously running
job steps are managed by a slurmstepd daemon that will persist
through the lifetime of that job step and not change its
communication protocol. Only change this configuration parameter
when there are no running job steps.
JobAcctGatherFrequency
The job accounting and profiling sampling intervals. The
supported format is as follows:
JobAcctGatherFrequency=<datatype>=<interval>
where <datatype>=<interval> specifies the task
sampling interval for the jobacct_gather plugin or a
sampling interval for a profiling type by the
acct_gather_profile plugin. Multiple, comma-
separated <datatype>=<interval> intervals may be
specified. Supported datatypes are as follows:
task=<interval>
where <interval> is the task sampling
interval in seconds for the jobacct_gather
plugins and for task profiling by the
acct_gather_profile plugin.
energy=<interval>
where <interval> is the sampling interval in
seconds for energy profiling using the
acct_gather_energy plugin
network=<interval>
where <interval> is the sampling interval in
seconds for infiniband profiling using the
acct_gather_infiniband plugin.
filesystem=<interval>
where <interval> is the sampling interval in
seconds for filesystem profiling using the
acct_gather_filesystem plugin.
The default value for task sampling interval
is 30 seconds. The default value for all other intervals is 0.
An interval of 0 disables sampling of the specified type. If
the task sampling interval is 0, accounting information is
collected only at job termination (reducing Slurm interference
with the job).
Smaller (non-zero) values have a greater impact upon job
performance, but a value of 30 seconds is not likely to be
noticeable for applications having less than 10,000 tasks.
Users can independently override each interval on a per job
basis using the --acctg-freq option when submitting the job.
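As an illustrative sketch (the interval values are arbitrary choices, not recommendations):
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=task=30,energy=60,network=60,filesystem=60
A user could still lower the task interval for one job, for example with "sbatch --acctg-freq=task=15 job.sh" (the script name is hypothetical).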
JobAcctGatherParams
Arbitrary parameters for the job account gather plugin.
Acceptable values at present include:
NoShared Exclude shared memory from accounting.
UsePss Use PSS value instead of RSS to calculate
real usage of memory. The PSS value will be
saved as RSS.
NoOverMemoryKill Do not kill processes that use more than the
requested memory. This parameter should be
used with caution, since if a job exceeds its
memory allocation it may affect other
processes and/or machine health.
JobCheckpointDir
Specifies the default directory for storing or reading job
checkpoint information. The data stored here is only a few
thousand bytes per job and includes information needed to
resubmit the job request, not the job's memory image. The directory
must be readable and writable by SlurmUser, but not writable by
regular users. The job memory images may be in a different
location as specified by --checkpoint-dir option at job submit
time or scontrol's ImageDir option.
JobCompHost
The name of the machine hosting the job completion database.
Only used for database type storage plugins, ignored otherwise.
Also see DefaultStorageHost.
JobCompLoc
The fully qualified file name where job completion records are
written when the JobCompType is "jobcomp/filetxt" or the
database where job completion records are stored when the
JobCompType is a database, or a URL of the form
http://yourelasticserver:port where job completion records are
indexed when the JobCompType is "jobcomp/elasticsearch". Also
see DefaultStorageLoc.
JobCompPass
The password used to gain access to the database to store the
job completion data. Only used for database type storage
plugins, ignored otherwise. Also see DefaultStoragePass.
JobCompPort
The listening port of the job completion database server. Only
used for database type storage plugins, ignored otherwise. Also
see DefaultStoragePort.
JobCompType
The job completion logging mechanism type. Acceptable values at
present include "jobcomp/none", "jobcomp/elasticsearch",
"jobcomp/filetxt", "jobcomp/mysql" and "jobcomp/script"". The
default value is "jobcomp/none", which means that upon job
completion the record of the job is purged from the system. If
using the accounting infrastructure this plugin may not be of
interest since the information here is redundant. The value
"jobcomp/elasticsearch" indicates that a record of the job
should be written to an Elasticsearch server specified by the
JobCompLoc parameter. The value "jobcomp/filetxt" indicates
that a record of the job should be written to a text file
specified by the JobCompLoc parameter. The value
"jobcomp/mysql" indicates that a record of the job should be
written to a MySQL or MariaDB database specified by the
JobCompLoc parameter. The value "jobcomp/script" indicates that
a script specified by the JobCompLoc parameter is to be executed
with environment variables indicating the job information.
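For example, to index job completion records in an Elasticsearch server (the server name and port are hypothetical and follow the URL format described under JobCompLoc):
    JobCompType=jobcomp/elasticsearch
    JobCompLoc=http://yourelasticserver:9200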
JobCompUser
The user account for accessing the job completion database.
Only used for database type storage plugins, ignored otherwise.
Also see DefaultStorageUser.
JobContainerType
Identifies the plugin to be used for job tracking. The slurmd
daemon must be restarted for a change in JobContainerType to
take effect. NOTE: The JobContainerType applies to a job
allocation, while ProctrackType applies to job steps.
Acceptable values at present include:
job_container/cncu used only for Cray systems (CNCU = Compute
Node Clean Up)
job_container/none used for all other system types
JobCredentialPrivateKey
Fully qualified pathname of a file containing a private key used
for authentication by Slurm daemons. This parameter is ignored
if CryptoType=crypto/munge.
JobCredentialPublicCertificate
Fully qualified pathname of a file containing a public key used
for authentication by Slurm daemons. This parameter is ignored
if CryptoType=crypto/munge.
JobFileAppend
This option controls what to do if a job's output or error file
exists when the job is started. If JobFileAppend is set to a
value of 1, then append to the existing file. By default, any
existing file is truncated.
JobRequeue
This option controls the default ability for batch jobs to be
requeued. Jobs may be requeued explicitly by a system
administrator, after node failure, or upon preemption by a
higher priority job. If JobRequeue is set to a value of 1, then
batch jobs may be requeued unless explicitly disabled by the
user. If JobRequeue is set to a value of 0, then batch jobs will
not be requeued unless explicitly enabled by the user. Use the
sbatch --no-requeue or --requeue option to change the default
behavior for individual jobs. The default value is 1.
JobSubmitPlugins
A comma delimited list of job submission plugins to be used.
The specified plugins will be executed in the order listed.
These are intended to be site-specific plugins which can be used
to set default job parameters and/or logging events. Sample
plugins available in the distribution include "all_partitions",
"defaults", "logging", "lua", and "partition". For examples of
use, see the Slurm code in "src/plugins/job_submit" and
"contribs/lua/job_submit*.lua" then modify the code to satisfy
your needs. Slurm can be configured to use multiple job_submit
plugins if desired, however the lua plugin will only execute one
lua script named "job_submit.lua" located in the default script
directory (typically the subdirectory "etc" of the installation
directory). No job submission plugins are used by default.
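For example, to enable only the lua plugin (remembering that it executes the single script "job_submit.lua" located in the default script directory, typically the "etc" subdirectory of the installation directory):
    JobSubmitPlugins=lua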
KeepAliveTime
Specifies how long socket communications used between the srun
command and its slurmstepd process are kept alive after
disconnect. Longer values can be used to improve reliability of
communications in the event of network failures. The default
value leaves the system default in place. The value may not exceed
65533.
KillOnBadExit
If set to 1, the job will be terminated immediately when one of
the processes crashes or aborts. With the default value of
0, if one of the processes crashes or aborts, the other
processes will continue to run. The user can override this
configuration parameter by using srun's -K, --kill-on-bad-exit.
KillWait
The interval, in seconds, given to a job's processes between the
SIGTERM and SIGKILL signals upon reaching its time limit. If
the job fails to terminate gracefully in the interval specified,
it will be forcibly terminated. The default value is 30
seconds. The value may not exceed 65533.
NodeFeaturesPlugins
Identifies the plugins to be used for support of node features
which can change through time. For example, a node which might
be booted with various BIOS settings. This is supported through
the use of a node's active_features and available_features
information. Acceptable values at present include:
node_features/knl_cray
used only for Intel Knights Landing
processors (KNL) on Cray systems
LaunchParameters
Identifies options to the job launch plugin. Acceptable values
include:
test_exec Validate the executable command's existence prior to
attempting launch on the compute nodes
LaunchType
Identifies the mechanism to be used to launch application tasks.
Acceptable values include:
launch/aprun For use with Cray systems with ALPS and the
default value for those systems
launch/poe For use with IBM Parallel Environment (PE) and
the default value for systems with the IBM NRT
library installed.
launch/runjob For use with IBM BlueGene/Q systems and the
default value for those systems
launch/slurm For all other systems and the default value for
those systems
Licenses
Specification of licenses (or other resources available on all
nodes of the cluster) which can be allocated to jobs. License
names can optionally be followed by a colon and count with a
default count of one. Multiple license names should be comma
separated (e.g. "Licenses=foo:4,bar"). Note that Slurm
prevents jobs from being scheduled if their required license
specification is not available. Slurm does not prevent jobs
from using licenses that are not explicitly listed in the job
submission specification.
LogTimeFormat
Format of the timestamp in slurmctld and slurmd log files.
Accepted values are "iso8601", "iso8601_ms", "rfc5424",
"rfc5424_ms", "clock", "short" and "thread_id". The values
ending in "_ms" differ from the ones without in that fractional
seconds with millisecond precision are printed. The default
value is "iso8601_ms". The "rfc5424" formats are the same as the
"iso8601" formats except that the timezone value is also shown.
The "clock" format shows a timestamp in microseconds retrieved
with the C standard clock() function. The "short" format is a
short date and time format. The "thread_id" format shows the
timestamp in the C standard ctime() function form without the
year but including the microseconds, the daemon's process ID and
the current thread name and ID.
MailProg
Fully qualified pathname to the program used to send email per
user request. The default value is "/usr/bin/mail".
MaxArraySize
The maximum job array size. The maximum job array task index
value will be one less than MaxArraySize to allow for an index
value of zero. Configure MaxArraySize to 0 in order to disable
job array use. The value may not exceed 4000001. The value of
MaxJobCount should be much larger than MaxArraySize. The
default value is 1001.
MaxJobCount
The maximum number of jobs Slurm can have in its active database
at one time. Set the values of MaxJobCount and MinJobAge to
insure the slurmctld daemon does not exhaust its memory or other
resources. Once this limit is reached, requests to submit
additional jobs will fail. The default value is 10000 jobs.
NOTE: Each task of a job array counts as one job even though
they will not occupy separate job records until modified or
initiated. Performance can suffer with more than a few hundred
thousand jobs. Setting MaxSubmitJobs per user is generally
valuable to prevent a single user from filling the system with
jobs. This is accomplished using Slurm's database and
configuring enforcement of resource limits. This value may not
be reset via "scontrol reconfig". It only takes effect upon
restart of the slurmctld daemon.
MaxJobId
The maximum job id to be used for jobs submitted to Slurm
without a specific requested value EXCEPT for jobs visible
between clusters. Generated job id values will be incremented by 1
for each subsequent job. Once MaxJobId is reached, the next job
will be assigned FirstJobId. The default value is 2,147,418,112
(0x7fff0000). Jobs visible across clusters will always have a
job ID of 2,147,483,648 or higher. Also see FirstJobId.
MaxMemPerCPU
Maximum real memory size available per allocated CPU in
MegaBytes. Used to avoid over-subscribing memory and causing
paging. MaxMemPerCPU would generally be used if individual
processors are allocated to jobs (SelectType=select/cons_res).
The default value is 0 (unlimited). Also see DefMemPerCPU and
MaxMemPerNode. MaxMemPerCPU and MaxMemPerNode are mutually
exclusive.
NOTE: Enforcement of memory limits currently requires enabling
of accounting, which samples memory use on a periodic basis
(data need not be stored, just collected).
NOTE: If a job specifies a memory per CPU limit that exceeds
this system limit, that job's count of CPUs per task will
automatically be increased. This may result in the job failing
due to CPU count limits.
MaxMemPerNode
Maximum real memory size available per allocated node in
MegaBytes. Used to avoid over-subscribing memory and causing
paging. MaxMemPerNode would generally be used if whole nodes
are allocated to jobs (SelectType=select/linear) and resources
are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
The default value is 0 (unlimited). Also see DefMemPerNode and
MaxMemPerCPU. MaxMemPerCPU and MaxMemPerNode are mutually
exclusive.
NOTE: Enforcement of memory limits currently requires enabling
of accounting, which samples memory use on a periodic basis
(data need not be stored, just collected).
MaxStepCount
The maximum number of steps that any job can initiate. This
parameter is intended to limit the effect of bad batch scripts.
The default value is 40000 steps.
MaxTasksPerNode
Maximum number of tasks Slurm will allow a job step to spawn on
a single node. The default MaxTasksPerNode is 512. May not
exceed 65533.
MCSParameters
MCS = Multi-Category Security MCS Plugin Parameters. The
supported parameters are specific to the MCSPlugin. Changes to
this value take effect when the Slurm daemons are reconfigured.
More information about MCS is available here
<http://slurm.schedmd.com/mcs_Plugins.html>.
MCSPlugin
MCS = Multi-Category Security : associate a security label to
jobs and ensure that nodes can only be shared among jobs using
the same security label. Acceptable values include:
mcs/none is the default value. No security label associated
with jobs, no particular security restriction when
sharing nodes among jobs.
mcs/group only users with the same group can share the nodes.
mcs/user a node cannot be shared with other users.
MemLimitEnforce
If set to "no" then Slurm will not terminate the job or the job
step if they exceeds the value requested using the --mem-per-cpu
option of salloc/sbatch/srun. This is useful if jobs need to
specify --mem-per-cpu for scheduling but they should not be
terminate if they exceed the estimated value. The default value
is 'yes', terminate the job/step if exceed the requested memory.
MessageTimeout
Time permitted for a round-trip communication to complete in
seconds. Default value is 10 seconds. For systems with shared
nodes, the slurmd daemon could be paged out and necessitate
higher values.
MinJobAge
The minimum age of a completed job before its record is purged
from Slurm's active database. Set the values of MaxJobCount and
MinJobAge to insure the slurmctld daemon does not exhaust its
memory or
other resources. The default value is 300 seconds. A value of
zero prevents any job record purging. In order to eliminate
some possible race conditions, the minimum non-zero value for
MinJobAge recommended is 2.
MpiDefault
Identifies the default type of MPI to be used. Srun may
override this configuration parameter in any case. Currently
supported versions include: lam, mpich1_p4, mpich1_shmem,
mpichgm, mpichmx, mvapich, none (default, which works for many
other versions of MPI), openmpi and pmi2. More information
about MPI use is available here
<http://slurm.schedmd.com/mpi_guide.html>.
MpiParams
MPI parameters. Used to identify ports used by OpenMPI only and
the input format is "ports=12000-12999" to identify a range of
communication ports to be used.
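A brief sketch restricting OpenMPI communications to a fixed port range (the range is illustrative):
    MpiDefault=none
    MpiParams=ports=12000-12999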
MsgAggregationParams
Message aggregation parameters. Message aggregation is an
optional feature that may improve system performance by reducing
the number of separate messages passed between nodes. The
feature works by routing messages through one or more message
collector nodes between their source and destination nodes. At
each collector node, messages with the same destination received
during a defined message collection window are packaged into a
single composite message. When the window expires, the composite
message is sent to the next collector node on the route to its
destination. The route between each source and destination node
is provided by the Route plugin. When a composite message is
received at its destination node, the original messages are
extracted and processed as if they had been sent directly.
Currently, the only message types supported by message
aggregation are the node registration, batch script completion,
step completion, and epilog complete messages.
The format for this parameter is as follows:
MsgAggregationParams=<option>=<value>
where <option>=<value> specify a particular control
variable. Multiple, comma-separated <option>=<value>
pairs may be specified. Supported options are as
follows:
WindowMsgs=<number>
where <number> is the maximum number of
messages in each message collection window.
WindowTime=<time>
where <time> is the maximum elapsed time in
milliseconds of each message collection
window.
A window expires when either WindowMsgs or WindowTime is
reached. By default, message aggregation is disabled. To enable
the feature, set WindowMsgs to a value greater than 1. The
default value for WindowTime is 100 milliseconds.
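For example, to enable message aggregation with a window of up to 10 messages or 100 milliseconds, whichever is reached first (the values are illustrative):
    MsgAggregationParams=WindowMsgs=10,WindowTime=100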
OverTimeLimit
Number of minutes by which a job can exceed its time limit
before being canceled. The configured job time limit is treated
as a soft limit. Adding OverTimeLimit to the soft limit
provides a hard limit, at which point the job is canceled. This
is particularly useful for backfill scheduling, which schedules
based upon each job's soft time limit. The default value is
zero. May not exceed 65533 minutes. A value of "UNLIMITED" is
also
supported.
PluginDir
Identifies the places in which to look for Slurm plugins. This
is a colon-separated list of directories, like the PATH
environment variable. The default value is
"/usr/local/lib/slurm".
PlugStackConfig
Location of the config file for Slurm stackable plugins that use
the Stackable Plugin Architecture for Node job (K)control
(SPANK). This provides support for a highly configurable set of
plugins to be called before and/or after execution of each task
spawned as part of a user's job step. Default location is
"plugstack.conf" in the same directory as the system slurm.conf.
For more information on SPANK plugins, see the spank(8) manual.
PowerParameters
System power management parameters. The supported parameters
are specific to the PowerPlugin. Changes to this value take
effect when the Slurm daemons are reconfigured. More
information about system power management is available here
<http://slurm.schedmd.com/power_mgmt.html>. Options current
supported by any plugins are listed below.
balance_interval=#
Specifies the time interval, in seconds, between attempts
to rebalance power caps across the nodes. This also
controls the frequency at which Slurm attempts to collect
current power consumption data (old data may be used
until new data is available from the underlying
infrastructure and values below 10 seconds are not
recommended for Cray systems). The default value is 30
seconds. Supported by the power/cray plugin.
capmc_path=
Specifies the absolute path of the capmc command. The
default value is "/opt/cray/capmc/default/bin/capmc".
Supported by the power/cray plugin.
cap_watts=#
Specifies the total power limit to be established across
all compute nodes managed by Slurm. A value of 0 sets
every compute node to have an unlimited cap. The default
value is 0. Supported by the power/cray plugin.
decrease_rate=#
Specifies the maximum rate of change in the power cap for
a node where the actual power usage is below the power
cap by an amount greater than lower_threshold (see
below). Value represents a percentage of the difference
between a node's minimum and maximum power consumption.
The default value is 50 percent. Supported by the
power/cray plugin.
get_timeout=#
Amount of time allowed to get power state information in
milliseconds. The default value is 5,000 milliseconds or
5 seconds. Supported by the power/cray plugin and
represents the time allowed for the capmc command to
respond to various "get" options.
increase_rate=#
Specifies the maximum rate of change in the power cap for
a node where the actual power usage is within
upper_threshold (see below) of the power cap. Value
represents a percentage of the difference between a
node's minimum and maximum power consumption. The
default value is 20 percent. Supported by the power/cray
plugin.
job_level
All nodes associated with every job will have the same
power cap, to the extent possible. Also see the
--power=level option on the job submission commands.
job_no_level
Disable the user's ability to set every node associated
with a job to the same power cap. Each node will have
its power cap set independently. This disables the
--power=level option on the job submission commands.
lower_threshold=#
Specify a lower power consumption threshold. If a node's
current power consumption is below this percentage of its
current cap, then its power cap will be reduced. The
default value is 90 percent. Supported by the power/cray
plugin.
recent_job=#
If a job has started or resumed execution (from suspend)
on a compute node within this number of seconds from the
current time, the node's power cap will be increased to
the maximum. The default value is 300 seconds.
Supported by the power/cray plugin.
set_timeout=#
Amount of time allowed to set power state information in
milliseconds. The default value is 30,000 milliseconds
or 30 seconds. Supported by the power/cray plugin and
represents the time allowed for the capmc command to
respond to various "set" options.
set_watts=#
Specifies the power limit to be set on every compute
node managed by Slurm. Every node gets this same power
cap and there is no variation through time based upon
actual power usage on the node. Supported by the
power/cray plugin.
upper_threshold=#
Specify an upper power consumption threshold. If a
node's current power consumption is above this percentage
of its current cap, then its power cap will be increased
to the extent possible. The default value is 95 percent.
Supported by the power/cray plugin.
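The following is a hypothetical combination of these options for a power/cray configuration; the specific values are illustrative only and not a recommendation:
PowerPlugin=power/cray
PowerParameters=balance_interval=60,cap_watts=500000,decrease_rate=40,increase_rate=10,lower_threshold=85,upper_threshold=95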
PowerPlugin
Identifies the plugin used for system power management.
Currently supported plugins include: cray and none. Changes to
this value require restarting Slurm daemons to take effect.
More information about system power management is available here
<http://slurm.schedmd.com/power_mgmt.html>. By default, no
power plugin is loaded.
PreemptMode
Enables gang scheduling and/or controls the mechanism used to
preempt jobs. When the PreemptType parameter is set to enable
preemption, the PreemptMode selects the default mechanism used
to preempt the lower priority jobs for the cluster. PreemptMode
may be specified on a per partition basis to override this
default value if PreemptType=preempt/partition_prio, but a valid
default PreemptMode value must be specified for the cluster as a
whole when preemption is enabled. The GANG option is used to
enable gang scheduling independent of whether preemption is
enabled (the PreemptType setting). The GANG option can be
specified in addition to a PreemptMode setting with the two
options comma separated. The SUSPEND option requires that gang
scheduling be enabled (i.e., "PreemptMode=SUSPEND,GANG").
OFF is the default value and disables job preemption and
gang scheduling. This is the only option compatible
with SchedulerType=sched/wiki or
SchedulerType=sched/wiki2 (used by Maui and Moab
respectively, which provide their own job preemption
functionality).
CANCEL always cancels the job.
CHECKPOINT preempts jobs by checkpointing them (if possible) or
canceling them.
GANG enables gang scheduling (time slicing) of jobs in
the same partition. NOTE: Gang scheduling is
performed independently for each partition, so
configuring partitions with overlapping nodes and
gang scheduling is generally not recommended.
REQUEUE preempts jobs by requeuing them (if possible) or
canceling them. For jobs to be requeued they must
have the --requeue sbatch option set or the cluster
wide JobRequeue parameter in slurm.conf must be set
to one.
SUSPEND If PreemptType=preempt/partition_prio is configured
then suspend and automatically resume the low
priority jobs. If PreemptType=preempt/qos is
configured, then the jobs sharing resources will
always time slice rather than one job remaining
suspended. The SUSPEND option may only be used with the
GANG option (the gang scheduler module performs the
job resume operation).
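For example, a cluster that preempts by partition priority tier and suspends (rather than cancels) lower priority jobs could use the following illustrative settings:
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG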
PreemptType
This specifies the plugin used to identify which jobs can be
preempted in order to start a pending job.
preempt/none
Job preemption is disabled. This is the default.
preempt/partition_prio
Job preemption is based upon partition priority tier.
Jobs in higher priority partitions (queues) may preempt
jobs from lower priority partitions. This is not
compatible with PreemptMode=OFF.
preempt/qos
Job preemption rules are specified by Quality Of Service
(QOS) specifications in the Slurm database. This option
is not compatible with PreemptMode=OFF. A configuration
of PreemptMode=SUSPEND is only supported by the
select/cons_res plugin.
PriorityDecayHalfLife
This controls how long prior resource use is considered in
determining how over- or under-serviced an association is (user,
bank account and cluster) in determining job priority. The
record of usage will be decayed over time, with half of the
original value cleared at age PriorityDecayHalfLife. If set to
0 no decay will be applied. This is helpful if you want to
enforce hard time limits per association. If set to 0
PriorityUsageResetPeriod must be set to some interval.
Applicable only if PriorityType=priority/multifactor. The unit
is a time string (i.e. min, hr:min:00, days-hr:min:00, or
days-hr). The default value is 7-0 (7 days).
PriorityCalcPeriod
The period of time in minutes in which the half-life decay will
be re-calculated. Applicable only if
PriorityType=priority/multifactor. The default value is 5
(minutes).
PriorityFavorSmall
Specifies that small jobs should be given preferential
scheduling priority. Applicable only if
PriorityType=priority/multifactor. Supported values are "YES"
and "NO". The default value is "NO".
PriorityFlags
Flags to modify priority behavior. Applicable only if
PriorityType=priority/multifactor. The keywords below have no
associated value (e.g.
"PriorityFlags=ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME").
ACCRUE_ALWAYS If set, priority age factor will be increased
despite job dependencies or holds.
CALCULATE_RUNNING
If set, priorities will be recalculated not
only for pending jobs, but also running and
suspended jobs.
FAIR_TREE If set, priority will be calculated in such a
way that if accounts A and B are siblings and A
has a higher fairshare factor than B, all
children of A will have higher fairshare
factors than all children of B.
DEPTH_OBLIVIOUS If set, priority will be calculated in a manner
similar to the normal multifactor calculation,
but the depth of the associations in the tree
does not adversely affect their priority.
SMALL_RELATIVE_TO_TIME
If set, the job's size component will be based
upon not the job size alone, but the job's size
divided by its time limit.
PriorityParameters
Arbitrary string used by the PriorityType plugin.
PriorityMaxAge
Specifies the job age which will be given the maximum age factor
in computing priority. For example, a value of 30 minutes would
result in all jobs over 30 minutes old receiving the same
age-based priority. Applicable only if
PriorityType=priority/multifactor. The unit is a time string
(i.e. min, hr:min:00, days-hr:min:00, or days-hr). The default
value is 7-0 (7 days).
PriorityUsageResetPeriod
At this interval the usage of associations will be reset to 0.
This is used if you want to enforce hard limits of time usage
per association. If PriorityDecayHalfLife is set to be 0 no
decay will happen and this is the only way to reset the usage
accumulated by running jobs. By default this is turned off and
it is advised to use the PriorityDecayHalfLife option instead to
avoid a situation in which nothing can run on your cluster; but
if your scheme only allows associations a certain amount of time
on the system, this is the way to enforce it. Applicable only if
PriorityType=priority/multifactor.
NONE Never clear historic usage. The default value.
NOW Clear the historic usage now. Executed at startup
and reconfiguration time.
DAILY Cleared every day at midnight.
WEEKLY Cleared every week on Sunday at time 00:00.
MONTHLY Cleared on the first day of each month at time
00:00.
QUARTERLY Cleared on the first day of each quarter at time
00:00.
YEARLY Cleared on the first day of each year at time 00:00.
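As an illustrative sketch of the hard-limit scheme described above (values hypothetical), disabling decay and resetting usage on the first of each month could be configured as:
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=MONTHLY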
PriorityType
This specifies the plugin to be used in establishing a job's
scheduling priority. Supported values are "priority/basic" (jobs
are prioritized by order of arrival, also suitable for
sched/wiki and sched/wiki2), "priority/multifactor" (jobs are
prioritized based upon size, age, fair-share of allocation,
etc). Also see PriorityFlags for configuration options. The
default value is "priority/basic".
When not using FIFO scheduling, jobs are prioritized in the following
order:
1. Jobs that can preempt
2. Jobs with an advanced reservation
3. Partition Priority Tier
4. Job Priority
5. Job Id
PriorityWeightAge
An integer value that sets the degree to which the queue wait
time component contributes to the job's priority. Applicable
only if PriorityType=priority/multifactor. The default value is
0.
PriorityWeightFairshare
An integer value that sets the degree to which the fair-share
component contributes to the job's priority. Applicable only if
PriorityType=priority/multifactor. The default value is 0.
PriorityWeightJobSize
An integer value that sets the degree to which the job size
component contributes to the job's priority. Applicable only if
PriorityType=priority/multifactor. The default value is 0.
PriorityWeightPartition
Partition factor used by priority/multifactor plugin in
calculating job priority. Applicable only if
PriorityType=priority/multifactor. The default value is 0.
PriorityWeightQOS
An integer value that sets the degree to which the Quality Of
Service component contributes to the job's priority. Applicable
only if PriorityType=priority/multifactor. The default value is
0.
PriorityWeightTRES
A comma separated list of TRES Types and weights that sets the
degree that each TRES Type contributes to the job's priority.
e.g.
PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
Applicable only if PriorityType=priority/multifactor and if
AccountingStorageTRES is configured with each TRES Type. The
default values are 0.
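The following hypothetical combination of the multifactor options described above is illustrative only; appropriate weights are site specific and should be tuned to local policy:
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityMaxAge=7-0
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=2000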
PrivateData
This controls what type of information is hidden from regular
users. By default, all information is visible to all users.
User SlurmUser and root can always view all information.
Multiple values may be specified with a comma separator.
Acceptable values include:
accounts
(NON-SlurmDBD ACCOUNTING ONLY) Prevents users from
viewing any account definitions unless they are
coordinators of them.
cloud Powered down nodes in the cloud are visible.
jobs Prevents users from viewing jobs or job steps belonging
to other users. (NON-SlurmDBD ACCOUNTING ONLY) Prevents
users from viewing job records belonging to other users
unless they are coordinators of the association running
the job when using sacct.
nodes Prevents users from viewing node state information.
partitions
Prevents users from viewing partition state information.
reservations
Prevents regular users from viewing reservations which
they can not use.
usage Prevents users from viewing usage of any other user, this
applies to sshare. (NON-SlurmDBD ACCOUNTING ONLY)
Prevents users from viewing usage of any other user, this
applies to sreport.
users (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from
viewing information of any user other than themselves,
this also makes it so users can only see associations
they deal with. Coordinators can see associations of all
users they are coordinator of, but can only see
themselves when listing users.
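For example, to hide job, usage and user information from regular users (an illustrative selection of values only), one might specify:
PrivateData=jobs,usage,users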
ProctrackType
Identifies the plugin to be used for process tracking on a job
step basis. The slurmd daemon uses this mechanism to identify
all processes which are children of processes it spawns for a
user job step. The slurmd daemon must be restarted for a change
in ProctrackType to take effect. NOTE: "proctrack/linuxproc"
and "proctrack/pgid" can fail to identify all processes
associated with a job since processes can become a child of the
init process (when the parent process terminates) or change
their process group. To reliably track all processes, one of
the other mechanisms utilizing kernel modifications is
preferable. NOTE: The JobContainerType applies to a job
allocation, while ProctrackType applies to job steps.
Acceptable values at present include:
proctrack/aix which uses an AIX kernel extension and is
the default for AIX systems
proctrack/cgroup which uses linux cgroups to constrain and
track processes. NOTE: see "man
cgroup.conf" for configuration details NOTE:
This plugin writes to disk often and can
impact performance. If you are running lots
of short running jobs (less than a couple of
seconds) this plugin slows down performance
dramatically. It should probably be avoided
in an HTC environment.
proctrack/cray which uses Cray proprietary process tracking
proctrack/linuxproc which uses linux process tree using parent
process IDs
proctrack/lua which uses a site-specific LUA script to
track processes
proctrack/sgi_job which uses SGI's Process Aggregates (PAGG)
kernel module, see
http://oss.sgi.com/projects/pagg/ for more
information
proctrack/pgid which uses process group IDs and is the
default for all other systems
Prolog Fully qualified pathname of a program for the slurmd to execute
whenever it is asked to run a job step from a new job allocation
(e.g. "/usr/local/slurm/prolog"). A glob pattern (See glob(7))
may also be used to specify more than one program to run (e.g.
"/etc/slurm/prolog.d/*"). The slurmd executes the prolog before
starting the first job step. The prolog script or scripts may
be used to purge files, enable user login, etc. By default
there is no prolog. Any configured script is expected to
complete execution quickly (in less time than MessageTimeout).
If the prolog fails (returns a non-zero exit code), this will
result in the node being set to a DRAIN state and the job being
requeued in a held state, unless nohold_on_prolog_fail is
configured in SchedulerParameters. See Prolog and Epilog
Scripts for more information.
PrologEpilogTimeout
The interval in seconds Slurm waits for Prolog and Epilog
before terminating them. The default behavior is to wait
indefinitely. This interval applies to the Prolog and Epilog run
by slurmd daemon before and after the job, the PrologSlurmctld
and EpilogSlurmctld run by slurmctld daemon, and the SPANK
plugins run by the slurmstepd daemon.
PrologFlags
Flags to control the Prolog behavior. By default no flags are
set. Multiple flags may be specified in a comma-separated list.
Currently supported options are:
Alloc If set, the Prolog script will be executed at job
allocation. By default, Prolog is executed just before
the task is launched. Therefore, when salloc is started,
no Prolog is executed. Alloc is useful for preparing
things before a user starts to use any allocated
resources. In particular, this flag is needed on a Cray
system when cluster compatibility mode is enabled.
NOTE: Use of the Alloc flag will increase the time
required to start jobs.
Contain At job allocation time, use the ProcTrack plugin to
create a job container on all allocated compute nodes.
This container may be used for user processes not
launched under Slurm control, for example the PAM module
may place processes launched through a direct user login
into this container. Setting the Contain flag implicitly
sets the Alloc flag.
NoHold If set, the Alloc flag should also be set. This will
allow for salloc to not block until the prolog is
finished on each node. The blocking will happen when
steps reach the slurmd and before any execution has
happened in the step. This is a much faster way to work
and if using srun to launch your tasks you should use
this flag.
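For example, to run the Prolog at allocation time and create a job container on the allocated nodes, one might specify (illustrative only):
PrologFlags=Alloc,Contain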
PrologSlurmctld
Fully qualified pathname of a program for the slurmctld daemon
to execute before granting a new job allocation (e.g.
"/usr/local/slurm/prolog_controller"). The program executes as
SlurmUser on the same node where the slurmctld daemon executes,
giving it permission to drain nodes and requeue the job if a
failure occurs or cancel the job if appropriate. The program
can be used to reboot nodes or perform other work to prepare
resources for use. Exactly what the program does and how it
accomplishes this is completely at the discretion of the system
administrator. Information about the job being initiated, its
allocated nodes, etc. are passed to the program using
environment variables. While this program is running, the nodes
associated with the job will have a POWER_UP/CONFIGURING flag
set in their state, which can be readily viewed. The slurmctld
daemon will wait indefinitely for this program to complete.
Once the program completes with an exit code of zero, the nodes
will be considered ready for use and the job will be started.
If some node can not be made available for use, the
program should drain the node (typically using the scontrol
command) and terminate with a non-zero exit code. A non-zero
exit code will result in the job being requeued (where possible)
or killed. Note that only batch jobs can be requeued. See
Prolog and Epilog Scripts for more information.
PropagatePrioProcess
Controls the scheduling priority (nice value) of user spawned
tasks.
0 The tasks will inherit the scheduling priority from the
slurm daemon. This is the default value.
1 The tasks will inherit the scheduling priority of the
command used to submit them (e.g. srun or sbatch). Unless
the job is submitted by user root, the tasks will have a
scheduling priority no higher than the slurm daemon
spawning them.
2 The tasks will inherit the scheduling priority of the
command used to submit them (e.g. srun or sbatch) with the
restriction that their nice value will always be one higher
than the slurm daemon (i.e. the tasks scheduling priority
will be lower than the slurm daemon).
PropagateResourceLimits
A list of comma separated resource limit names. The slurmd
daemon uses these names to obtain the associated (soft) limit
values from the user's process environment on the submit node.
These limits are then propagated and applied to the jobs that
will run on the compute nodes. This parameter can be useful
when system limits vary among nodes. Any resource limits that
do not appear in the list are not propagated. However, the user
can override this by specifying which resource limits to
propagate with the srun commands "--propagate" option. If
neither of the 'propagate resource limit' parameters are
specified, then the default action is to propagate all limits.
Only one of the parameters, either PropagateResourceLimits or
PropagateResourceLimitsExcept, may be specified. The user
limits can not exceed hard limits under which the slurmd daemon
operates. If the user limits are not propagated, the limits from
the slurmd daemon will be propagated to the user's job. The
limits used for the Slurm daemons can be set in the
/etc/sysconfig/slurm file. For more information, see:
http://slurm.schedmd.com/faq.html#memlock The following limit
names are supported by Slurm (although some options may not be
supported on some systems):
ALL All limits listed below
NONE No limits listed below
AS The maximum address space for a process
CORE The maximum size of core file
CPU The maximum amount of CPU time
DATA The maximum size of a process's data segment
FSIZE The maximum size of files created. Note that if the
user sets FSIZE to less than the current size of the
slurmd.log, job launches will fail with a 'File size
limit exceeded' error.
MEMLOCK The maximum size that may be locked into memory
NOFILE The maximum number of open files
NPROC The maximum number of processes available
RSS The maximum resident set size
STACK The maximum stack size
PropagateResourceLimitsExcept
A list of comma separated resource limit names. By default, all
resource limits will be propagated, (as described by the
PropagateResourceLimits parameter), except for the limits
appearing in this list. The user can override this by
specifying which resource limits to propagate with the srun
commands "--propagate" option. See PropagateResourceLimits
above for a list of valid limit names.
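The two parameters are mutually exclusive; an illustrative configuration (limit names chosen only as examples) would use one or the other, for instance:
PropagateResourceLimits=MEMLOCK,NOFILE
or
PropagateResourceLimitsExcept=MEMLOCK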
RebootProgram
Program to be executed on each compute node to reboot it.
Invoked on each node once it becomes idle after the command
"scontrol reboot_nodes" is executed by an authorized user or a
job is submitted with the "--reboot" option. After being
rebooted, the node is returned to normal use. NOTE: This
configuration option does not apply to IBM BlueGene systems.
ReconfigFlags
Flags to control various actions that may be taken when an
"scontrol reconfig" command is issued. Currently the options
are:
KeepPartInfo If set, an "scontrol reconfig" command will
maintain the in-memory value of partition
"state" and other parameters that may have been
dynamically updated by "scontrol update".
Partition information in the slurm.conf file
will be merged with in-memory data. This flag
supersedes the KeepPartState flag.
KeepPartState If set, an "scontrol reconfig" command will
preserve only the current "state" value of
in-memory partitions and will reset all other
parameters of the partitions that may have been
dynamically updated by "scontrol update" to the
values from the slurm.conf file. Partition
information in the slurm.conf file will be
merged with in-memory data.
The default for the above flags is not set, and the "scontrol
reconfig" will rebuild the partition information using only the
definitions in the slurm.conf file.
RequeueExit
Enables automatic job requeue for jobs which exit with the
specified values. Separate multiple exit codes with a comma
and/or specify numeric ranges using a "-" separator (e.g.
"RequeueExit=1-9,18"). Jobs will be put back into pending state
and later scheduled again. Restarted jobs will have the
environment variable SLURM_RESTART_COUNT set to the number of
times the job has been restarted.
RequeueExitHold
Enables automatic requeue of jobs into a held pending state,
meaning their priority is zero. Separate multiple exit codes
with a comma and/or specify numeric ranges using a "-" separator
(e.g. "RequeueExitHold=10-12,16"). These jobs are put in the
JOB_SPECIAL_EXIT exit state. Restarted jobs will have the
environment variable SLURM_RESTART_COUNT set to the number of
times the job has been restarted.
ResumeProgram
Slurm supports a mechanism to reduce power consumption on nodes
that remain idle for an extended period of time. This is
typically accomplished by reducing voltage and frequency or
powering the node down. ResumeProgram is the program that will
be executed when a node in power save mode is assigned work to
perform. For reasons of reliability, ResumeProgram may execute
more than once for a node when the slurmctld daemon crashes and
is restarted. If ResumeProgram is unable to restore a node to
service, it should requeue any job associated with the node and
set the node state to DRAIN. The program executes as SlurmUser.
The argument to the program will be the names of nodes to be
removed from power savings mode (using Slurm's hostlist
expression format). By default no program is run. Related
configuration options include ResumeTimeout, ResumeRate,
SuspendRate, SuspendTime, SuspendTimeout, SuspendProgram,
SuspendExcNodes, and SuspendExcParts. More information is
available at the Slurm web site (
http://slurm.schedmd.com/power_save.html ).
ResumeRate
The rate at which nodes in power save mode are returned to
normal operation by ResumeProgram. The value is number of nodes
per minute and it can be used to prevent power surges if a large
number of nodes in power save mode are assigned work at the same
time (e.g. a large job starts). A value of zero results in no
limits being imposed. The default value is 300 nodes per
minute. Related configuration options include ResumeTimeout,
ResumeProgram, SuspendRate, SuspendTime, SuspendTimeout,
SuspendProgram, SuspendExcNodes, and SuspendExcParts.
ResumeTimeout
Maximum time permitted (in seconds) between when a node resume
request is issued and when the node is actually available for
use. Nodes which fail to respond in this time frame may be
marked DOWN and the jobs scheduled on the node requeued. The
default value is 60 seconds. Related configuration options
include ResumeProgram, ResumeRate, SuspendRate, SuspendTime,
SuspendTimeout, SuspendProgram, SuspendExcNodes and
SuspendExcParts. More information is available at the Slurm web
site ( http://slurm.schedmd.com/power_save.html ).
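A minimal, illustrative power saving configuration combining these options with the related Suspend* options referenced above might look like the following; the script paths and values are hypothetical:
SuspendProgram=/usr/local/slurm/node_suspend.sh
ResumeProgram=/usr/local/slurm/node_resume.sh
SuspendTime=1800
ResumeRate=100
ResumeTimeout=300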
ResvEpilog
Fully qualified pathname of a program for the slurmctld to
execute when a reservation ends. The program can be used to
cancel jobs, modify partition configuration, etc. The
reservation name will be passed as an argument to the program.
By default there is no epilog.
ResvOverRun
Describes how long a job already running in a reservation should
be permitted to execute after the end time of the reservation
has been reached. The time period is specified in minutes and
the default value is 0 (kill the job immediately). The value
may not exceed 65533 minutes, although a value of "UNLIMITED" is
supported to permit a job to run indefinitely after its
reservation is terminated.
ResvProlog
Fully qualified pathname of a program for the slurmctld to
execute when a reservation begins. The program can be used to
cancel jobs, modify partition configuration, etc. The
reservation name will be passed as an argument to the program.
By default there is no prolog.
ReturnToService
Controls when a DOWN node will be returned to service. The
default value is 0. Supported values include
0 A node will remain in the DOWN state until a system
administrator explicitly changes its state (even if the
slurmd daemon registers and resumes communications).
1 A DOWN node will become available for use upon registration
with a valid configuration only if it was set DOWN due to
being non-responsive. If the node was set DOWN for any
other reason (low memory, unexpected reboot, etc.), its
state will not automatically be changed. A node registers
with a valid configuration if its memory, GRES, CPU count,
etc. are equal to or greater than the values configured in
slurm.conf.
2 A DOWN node will become available for use upon registration
with a valid configuration. The node could have been set
DOWN for any reason. A node registers with a valid
configuration if its memory, GRES, CPU count, etc. are equal
to or greater than the values configured in slurm.conf.
(Disabled on Cray ALPS systems.)
RoutePlugin
Identifies the plugin to be used for defining which nodes will
be used for message forwarding and message aggregation.
route/default
default, use TreeWidth.
route/topology
use the switch hierarchy defined in a topology.conf file.
TopologyPlugin=topology/tree is required.
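For example, on a system with a topology.conf file describing the switch hierarchy, message forwarding can follow that hierarchy with (illustrative only):
TopologyPlugin=topology/tree
RoutePlugin=route/topology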
SallocDefaultCommand
Normally, salloc(1) will run the user's default shell when a
command to execute is not specified on the salloc command line.
If SallocDefaultCommand is specified, salloc will instead run
the configured command. The command is passed to '/bin/sh -c',
so shell metacharacters are allowed, and commands with multiple
arguments should be quoted. For instance:
SallocDefaultCommand = "$SHELL"
would run the shell specified by the user's $SHELL environment variable,
and
SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
would spawn the user's default shell on the allocated
resources, but not consume any of the CPU or memory resources,
configure it as a pseudo-terminal, and preserve all of the job's
environment variables (i.e. and not over-write them with the job
step's allocation information).
For systems with generic resources (GRES) defined, the
SallocDefaultCommand value should explicitly specify a zero
count for the configured GRES. Failure to do so will result in
the launched shell consuming those GRES and preventing
subsequent srun commands from using them. For example, on Cray
systems add "--gres=craynetwork:0" as shown below:
SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0 --gres=craynetwork:0 --pty --preserve-env --mpi=none $SHELL"
For systems with TaskPlugin set, adding an option of
"--cpu_bind=no" is recommended if the default shell should have
access to all of the CPUs allocated to the job on that node,
otherwise the shell may be limited to a single cpu or core.
SchedulerParameters
The interpretation of this parameter varies by SchedulerType.
Multiple options may be comma separated.
assoc_limit_stop
If set and a job cannot start due to association limits,
then do not attempt to initiate any lower priority jobs
in that partition. Setting this can decrease system
throughput and utilization, but avoids potentially starving
larger jobs by preventing their launch from being delayed
indefinitely.
batch_sched_delay=#
How long, in seconds, the scheduling of batch jobs can be
delayed. This can be useful in a high-throughput
environment in which batch jobs are submitted at a very
high rate (i.e. using the sbatch command) and one wishes
to reduce the overhead of attempting to schedule each job
at submit time. The default value is 3 seconds.
bf_busy_nodes
When selecting resources for pending jobs to reserve for
future execution (i.e. the job can not be started
immediately), then preferentially select nodes that are
in use. This will tend to leave currently idle resources
available for backfilling longer running jobs, but may
result in allocations having less than optimal network
topology. This option is currently only supported by the
select/cons_res plugin (or select/cray with
SelectTypeParameters set to "OTHER_CONS_RES", which
layers the select/cray plugin over the select/cons_res
plugin).
bf_continue
The backfill scheduler periodically releases locks in
order to permit other operations to proceed rather than
blocking all activity for what could be an extended
period of time. Setting this option will cause the
backfill scheduler to continue processing pending jobs
from its original job list after releasing locks even if
job or node state changes. This can result in lower
priority jobs being backfill scheduled instead of newly
arrived higher priority jobs, but will permit more queued
jobs to be considered for backfill scheduling.
bf_interval=#
The number of seconds between iterations. Higher values
result in less overhead and better responsiveness. The
backfill scheduler will start over after reaching this
time limit (including time spent sleeping), even if the
maximum job counts have not been reached. This option
applies only to SchedulerType=sched/backfill. The
default value is 30 seconds.
bf_max_job_array_resv=#
The maximum number of tasks from a job array for which to
reserve resources in the future. Since job arrays can
potentially have millions of tasks, the overhead in
reserving resources for all tasks can be prohibitive. In
addition various limits may prevent all the jobs from
starting at the expected times. This has no impact upon
the number of tasks from a job array that can be started
immediately, only those tasks expected to start at some
future time. The default value is 20 tasks.
bf_max_job_part=#
The maximum number of jobs per partition to attempt
starting with the backfill scheduler. This can be
especially helpful for systems with large numbers of
partitions and jobs. The default value is 0, which means
no limit. This option applies only to
SchedulerType=sched/backfill. Also see the
partition_job_depth and bf_max_job_test options. Set
bf_max_job_test to a value much higher than
bf_max_job_part.
bf_max_job_start=#
The maximum number of jobs which can be initiated in a
single iteration of the backfill scheduler. The default
value is 0, which means no limit. This option applies
only to SchedulerType=sched/backfill.
bf_max_job_test=#
The maximum number of jobs to attempt backfill scheduling
for (i.e. the queue depth). Higher values result in more
overhead and less responsiveness. Until an attempt is
made to backfill schedule a job, its expected initiation
time value will not be set. The default value is 100.
In the case of large clusters, configuring a relatively
small value may be desirable. This option applies only
to SchedulerType=sched/backfill.
bf_max_job_user=#
The maximum number of jobs per user to attempt starting
with the backfill scheduler. One can set this limit to
prevent users from flooding the backfill queue with jobs
that cannot start and that prevent jobs from other users
to start. This is similar to the MAXIJOB limit in Maui.
The default value is 0, which means no limit. This
option applies only to SchedulerType=sched/backfill.
Also see the bf_max_job_part and bf_max_job_test options.
Set bf_max_job_test to a value much higher than
bf_max_job_user.
bf_min_age_reserve=#
The backfill and main scheduling logic will not reserve
resources for pending jobs until they have been pending
and runnable for at least the specified number of
seconds. In addition, jobs waiting for less than the
specified number of seconds will not prevent a newly
submitted job from starting immediately, even if the
newly submitted job has a lower priority. This can be
valuable if jobs lack time limits or all time limits have
the same value. The default value is zero, which will
reserve resources for any pending job and delay
initiation of lower priority jobs. Also see
bf_min_prio_reserve.
bf_min_prio_reserve=#
The backfill and main scheduling logic will not reserve
resources for pending jobs unless they have a priority
equal to or higher than the specified value. In
addition, jobs with a lower priority will not prevent a
newly submitted job from starting immediately, even if
the newly submitted job has a lower priority. This can
be valuable if one wished to maximize system utilization
without regard for job priority below a certain
threshold. The default value is zero, which will reserve
resources for any pending job and delay initiation of
lower priority jobs. Also see bf_min_age_reserve.
bf_resolution=#
The number of seconds in the resolution of data
maintained about when jobs begin and end. Higher values
result in less overhead and better responsiveness. The
default value is 60 seconds. This option applies only to
SchedulerType=sched/backfill.
bf_window=#
The number of minutes into the future to look when
considering jobs to schedule. Higher values result in
more overhead and less responsiveness. The default value
is 1440 minutes (one day). A value at least as long as
the highest allowed time limit is generally advisable to
prevent job starvation. In order to limit the amount of
data managed by the backfill scheduler, if the value of
bf_window is increased, then it is generally advisable to
also increase bf_resolution. This option applies only to
SchedulerType=sched/backfill.
bf_yield_interval=#
The backfill scheduler will periodically relinquish locks
in order for other pending operations to take place.
This specifies the time between lock releases, in
microseconds. The default value is 2,000,000
microseconds (2 seconds). Smaller values may be helpful
for high throughput computing when used in conjunction
with the bf_continue option. Also see the bf_yield_sleep
option.
bf_yield_sleep=#
The backfill scheduler will periodically relinquish locks
in order for other pending operations to take place.
This specifies the length of time for which the locks are
relinquished, in microseconds. The default value is 500,000
microseconds (0.5 seconds). Also see the
bf_yield_interval option.
build_queue_timeout=#
Defines the maximum time that can be devoted to building
a queue of jobs to be tested for scheduling. If the
system has a huge number of jobs with dependencies, just
building the job queue can take so much time as to
adversely impact overall system performance and this
parameter can be adjusted as needed. The default value
is 2,000,000 microseconds (2 seconds).
default_queue_depth=#
The default number of jobs to attempt scheduling (i.e.
the queue depth) when a running job completes or other
routine actions occur, however the frequency with which
the scheduler is run may be limited by using the defer or
sched_min_interval parameters described below. The full
queue will be tested on a less frequent basis as defined
by the sched_interval option described below. The default
value is 100. See the partition_job_depth option to
limit depth by partition.
defer Setting this option will avoid attempting to schedule
each job individually at job submit time, but defer it
until a later time when scheduling multiple jobs
simultaneously may be possible. This option may improve
system responsiveness when large numbers of jobs (many
hundreds) are submitted at the same time, but it will
delay the initiation time of individual jobs. Also see
default_queue_depth above.
disable_user_top
Disable use of the "scontrol top" command by non-
privileged users.
Ignore_NUMA
Some processors (e.g. AMD Opteron 6000 series) contain
multiple NUMA nodes per socket. This is a configuration
which does not map into the hardware entities that Slurm
optimizes resource allocation for (PU/thread, core,
socket, baseboard, node and network switch). In order to
optimize resource allocations on such hardware, Slurm
will consider each NUMA node within the socket as a
separate socket by default. Use the Ignore_NUMA option to
report the correct socket count, but not optimize
resource allocations on the NUMA nodes.
inventory_interval=#
On a Cray system using Slurm on top of ALPS this limits
the number of times a Basil Inventory call is made.
Normally this call happens every scheduling consideration
to attempt to close a node state change window with
respect to what ALPS has. This call is rather slow, so
making it less frequently improves performance
dramatically, but in the situation where a node changes
state the window is as large as this setting. In an HTC
environment this setting is a must and we advise around
10 seconds.
kill_invalid_depend
If a job has an invalid dependency and it can never run,
terminate it and set its state to JOB_CANCELLED. By
default the job stays pending with reason
DependencyNeverSatisfied.
max_depend_depth=#
Maximum number of jobs to test for a circular job
dependency. Stop testing after this number of job
dependencies have been tested. The default value is 10
jobs.
max_rpc_cnt=#
If the number of active threads in the slurmctld daemon
is equal to or larger than this value, defer scheduling
of jobs. This can improve Slurm's ability to process
requests at a cost of initiating new jobs less
frequently. The default value is zero, which disables
this option. If a value is set, then a value of 10 or
higher is recommended.
max_sched_time=#
How long, in seconds, that the main scheduling loop will
execute for before exiting. If a value is configured, be
aware that all other Slurm operations will be deferred
during this time period. Make certain the value is lower
than MessageTimeout. If a value is not explicitly
configured, the default value is half of MessageTimeout
with a minimum default value of 1 second and a maximum
default value of 2 seconds. For example if
MessageTimeout=10, the time limit will be 2 seconds (i.e.
MIN(10/2, 2) = 2).
max_script_size=#
Specify the maximum size of a batch script, in bytes.
The default value is 4 megabytes. Larger values may
adversely impact system performance.
max_switch_wait=#
Maximum number of seconds that a job can delay execution
waiting for the specified desired switch count. The
default value is 300 seconds.
no_backup_scheduling
If used, the backup controller will not schedule jobs
when it takes over. The backup controller will allow jobs
to be submitted, modified and cancelled but won't
schedule new jobs. This is useful in Cray environments
when the backup controller resides on an external Cray
node. A restart is required to alter this option. This
is explicitly set on a Cray/ALPS system.
no_env_cache
If used, any job started on a node that fails to load the
environment will fail instead of using the cached
environment. This also implicitly sets the
requeue_setup_env_fail option.
pack_serial_at_end
If used with the select/cons_res plugin then put serial
jobs at the end of the available nodes rather than using
a best fit algorithm. This may reduce resource
fragmentation for some workloads.
partition_job_depth=#
The default number of jobs to attempt scheduling (i.e.
the queue depth) from each partition/queue in Slurm's
main scheduling logic. The functionality is similar to
that provided by the bf_max_job_part option for the
backfill scheduling logic. The default value is 0 (no
limit). Jobs excluded from attempted scheduling based
upon partition will not be counted against the
default_queue_depth limit. Also see the bf_max_job_part
option.
preempt_reorder_count=#
Specify how many attempts should be made in reordering
preemptable jobs to minimize the count of jobs preempted.
The default value is 1. High values may adversely impact
performance. The logic to support this option is only
available in the select/cons_res plugin.
preempt_strict_order
If set, then execute extra logic in an attempt to preempt
only the lowest priority jobs. It may be desirable to
set this configuration parameter when there are multiple
priorities of preemptable jobs. The logic to support
this option is only available in the select/cons_res
plugin.
nohold_on_prolog_fail
By default if the Prolog exits with a non-zero value the
job is requeued in held state. By specifying this
parameter the job will be requeued but not held so that
the scheduler can dispatch it to another host.
requeue_setup_env_fail
By default if a job environment setup fails the job keeps
running with a limited environment. By specifying this
parameter the job will be requeued in held state and the
execution node drained.
sched_interval=#
How frequently, in seconds, the main scheduling loop will
execute and test all pending jobs. The default value is
60 seconds.
sched_max_job_start=#
The maximum number of jobs that the main scheduling logic
will start in any single execution. The default value is
zero, which imposes no limit.
sched_min_interval=#
How frequently, in microseconds, the main scheduling loop
will execute and test any pending jobs. The scheduler
runs in a limited fashion every time that any event
happens which could enable a job to start (e.g. job
submit, job terminate, etc.). If these events happen at
a high frequency, the scheduler can run very frequently
and consume significant resources if not throttled by
this option. This option specifies the minimum time
between the end of one scheduling cycle and the beginning
of the next scheduling cycle. A value of zero will
disable throttling of the scheduling logic interval. The
default value is 1,000,000 microseconds on Cray/ALPS
systems and zero microseconds (throttling is disabled) on
other systems.
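The following hypothetical combination of the options described above is illustrative only; appropriate values depend heavily upon the site's workload and should be tuned accordingly:
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=60,bf_max_job_test=500,bf_window=2880,default_queue_depth=200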
SchedulerPort
The port number on which slurmctld should listen for connection
requests. This value is only used by the Maui Scheduler (see
SchedulerType). The default value is 7321.
SchedulerRootFilter
Identifies whether or not RootOnly partitions should be filtered
from any external scheduling activities. If set to 0, then
RootOnly partitions are treated like any other partition. If set
to 1, then RootOnly partitions are exempt from any external
scheduling activities. The default value is 1. Currently only
used by the built-in backfill scheduling module "sched/backfill"
(see SchedulerType).
SchedulerTimeSlice
Number of seconds in each time slice when gang scheduling is
enabled (PreemptMode=SUSPEND,GANG). The value must be between 5
seconds and 65533 seconds. The default value is 30 seconds.
SchedulerType
Identifies the type of scheduler to be used. Note the slurmctld
daemon must be restarted for a change in scheduler type to
become effective (reconfiguring a running daemon has no effect
for this parameter). The scontrol command can be used to
manually change job priorities if desired. Acceptable values
include:
sched/backfill
For a backfill scheduling module to augment the default
FIFO scheduling. Backfill scheduling will initiate
lower-priority jobs if doing so does not delay the
expected initiation time of any higher priority job.
Effectiveness of backfill scheduling is dependent upon
users specifying job time limits, otherwise all jobs will
have the same time limit and backfilling is impossible.
Note documentation for the SchedulerParameters option
above. This is the default configuration.
sched/builtin
This is the FIFO scheduler which initiates jobs in
priority order. If any job in the partition can not be
scheduled, no lower priority job in that partition will
be scheduled. An exception is made for jobs that can not
run due to partition constraints (e.g. the time limit) or
down/drained nodes. In that case, lower priority jobs
can be initiated and not impact the higher priority job.
sched/hold
To hold all newly arriving jobs if a file
"/etc/slurm.hold" exists otherwise use the built-in FIFO
scheduler
sched/wiki
For the Wiki interface to the Maui Scheduler
sched/wiki2
For the Wiki interface to the Moab Cluster Suite
SelectType
Identifies the type of resource selection algorithm to be used.
Changing this value can only be done by restarting the slurmctld
daemon and will result in the loss of all job information
(running and pending) since the job state save format used by
each plugin is different. Acceptable values include
select/bluegene
for a three-dimensional BlueGene system. The default
value is "select/bluegene" for BlueGene systems.
select/cons_res
The resources within a node are individually allocated as
consumable resources. Note that whole nodes can be
allocated to jobs for selected partitions by using the
OverSubscribe=Exclusive option. See the partition
OverSubscribe parameter for more information.
select/cray
for a Cray system. The default value is "select/cray"
for all Cray systems.
select/linear
for allocation of entire nodes assuming a one-dimensional
array of nodes in which sequentially ordered nodes are
preferable. This is the default value for non-BlueGene
systems.
select/serial
for allocating resources to single CPU jobs only. Highly
optimized for maximum throughput. NOTE: SPANK
environment variables are NOT propagated to the job's
Epilog program.
SelectTypeParameters
The permitted values of SelectTypeParameters depend upon the
configured value of SelectType. SelectType=select/bluegene
supports no SelectTypeParameters. The only supported options
for SelectType=select/linear are CR_ONE_TASK_PER_CORE and
CR_Memory, which treats memory as a consumable resource and
prevents memory over subscription with job preemption or gang
scheduling. By default SelectType=select/linear allocates whole
nodes to jobs without considering their memory consumption. By
default SelectType=select/cons_res, SelectType=select/cray, and
SelectType=select/serial use CR_CPU, which allocates CPU to jobs
without considering their memory consumption.
The following options are supported for SelectType=select/cray:
OTHER_CONS_RES
Layer the select/cons_res plugin under the
select/cray plugin, the default is to layer on
select/linear. This also allows all the options
for SelectType=select/cons_res.
NHC_NO_STEPS
Do not run the node health check after each step.
Default is to run after each step.
NHC_NO Do not run the node health check after each
allocation. Default is to run after each
allocation. This also sets NHC_NO_STEPS, so the
NHC will never run.
The following options are supported for
SelectType=select/cons_res:
CR_CPU CPUs are consumable resources. Configure the
number of CPUs on each node, which may be equal to
the count of cores or hyper-threads on the node
depending upon the desired minimum resource
allocation. The node's Boards, Sockets,
CoresPerSocket and ThreadsPerCore may optionally
be configured and result in job allocations which
have improved locality; however doing so will
prevent more than one job from being
allocated on each core.
CR_CPU_Memory
CPUs and memory are consumable resources.
Configure the number of CPUs on each node, which
may be equal to the count of cores or
hyper-threads on the node depending upon the
desired minimum resource allocation. The node's
Boards, Sockets, CoresPerSocket and ThreadsPerCore
may optionally be configured and result in job
allocations which have improved locality; however
doing so will prevent more than one job from
being allocated on each core. Setting a value for
DefMemPerCPU is strongly recommended.
CR_Core
Cores are consumable resources. On nodes with
hyper-threads, each thread is counted as a CPU to
satisfy a job's resource requirement, but multiple
jobs are not allocated threads on the same core.
The count of CPUs allocated to a job may be
rounded up to account for every CPU on an
allocated core.
CR_Core_Memory
Cores and memory are consumable resources. On
nodes with hyper-threads, each thread is counted
as a CPU to satisfy a job's resource requirement,
but multiple jobs are not allocated threads on the
same core. The count of CPUs allocated to a job
may be rounded up to account for every CPU on an
allocated core. Setting a value for DefMemPerCPU
is strongly recommended.
CR_ONE_TASK_PER_CORE
Allocate one task per core by default. Without
this option, by default one task will be allocated
per thread on nodes with more than one
ThreadsPerCore configured.
CR_CORE_DEFAULT_DIST_BLOCK
Allocate cores within a node using block
distribution by default. This is a
pseudo-best-fit algorithm that minimizes the
number of boards and minimizes the number of
sockets (within minimum boards) used for the
allocation. This default behavior can be
overridden specifying a particular "-m" parameter
with srun/salloc/sbatch. Without this option,
cores will be allocated cyclically across the
sockets.
CR_LLN Schedule resources to jobs on the least loaded
nodes (based upon the number of idle CPUs). This
is generally only recommended for an environment
with serial jobs as idle resources will tend to be
highly fragmented, resulting in parallel jobs
being distributed across many nodes. Also see the
partition configuration parameter LLN to use the
least loaded nodes in selected partitions.
CR_Pack_Nodes
If a job allocation contains more resources than
will be used for launching tasks (e.g. if whole
nodes are allocated to a job), then rather than
distributing a job's tasks evenly across its
allocated nodes, pack them as tightly as possible
on these nodes. For example, consider a job
allocation containing two entire nodes with eight
CPUs each. If the job starts ten tasks across
those two nodes without this option, it will start
five tasks on each of the two nodes. With this
option, eight tasks will be started on the first
node and two tasks on the second node.
CR_Socket
Sockets are consumable resources. On nodes with
multiple cores, each core or thread is counted as
a CPU to satisfy a job's resource requirement, but
multiple jobs are not allocated resources on the
same socket.
CR_Socket_Memory
Memory and sockets are consumable resources. On
nodes with multiple cores, each core or thread is
counted as a CPU to satisfy a job's resource
requirement, but multiple jobs are not allocated
resources on the same socket. Setting a value for
DefMemPerCPU is strongly recommended.
CR_Memory
Memory is a consumable resource. NOTE: This
implies OverSubscribe=YES or OverSubscribe=FORCE
for all partitions. Setting a value for
DefMemPerCPU is strongly recommended.
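For example, a cluster allocating individual cores and memory as consumable resources might use the following illustrative settings (the DefMemPerCPU value is hypothetical and site specific):
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=2048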
SlurmUser
The name of the user that the slurmctld daemon executes as. For
security purposes, a user other than "root" is recommended.
This user must exist on all nodes of the cluster for
authentication of communications between Slurm components. The
default value is "root".
SlurmdUser
The name of the user that the slurmd daemon executes as. This
user must exist on all nodes of the cluster for authentication
of communications between Slurm components. The default value
is "root".
SlurmctldDebug
The level of detail to provide slurmctld daemon's logs. The
default value is info. If the slurmctld daemon is initiated
with -v or --verbose options, that debug level will be preserved
or restored upon reconfiguration.
quiet Log nothing
fatal Log only fatal errors
error Log only errors
info Log errors and general informational messages
verbose Log errors and verbose informational messages
debug Log errors and verbose informational messages and
debugging messages
debug2 Log errors and verbose informational messages and more
debugging messages
debug3 Log errors and verbose informational messages and even
more debugging messages
debug4 Log errors and verbose informational messages and even
more debugging messages
debug5 Log errors and verbose informational messages and even
more debugging messages
SlurmctldLogFile
Fully qualified pathname of a file into which the slurmctld
daemon's logs are written. The default value is none (performs
logging via syslog).
See the section LOGGING if a pathname is specified.
SlurmctldPidFile
Fully qualified pathname of a file into which the slurmctld
daemon may write its process id. This may be used for automated
signal processing. The default value is
"/var/run/slurmctld.pid".
SlurmctldPlugstack
A comma delimited list of Slurm controller plugins to be started
when the daemon begins and terminated when it ends. Only the
plugin's init and fini functions are called.
SlurmctldPort
The port number that the Slurm controller, slurmctld, listens to
for work. The default value is SLURMCTLD_PORT as established at
system build time. If none is explicitly specified, it will be
set to 6817. SlurmctldPort may also be configured to support a
range of port numbers in order to accept larger bursts of
incoming messages by specifying two numbers separated by a dash
(e.g. SlurmctldPort=6817-6818). NOTE: Either the slurmctld and
slurmd daemons must not execute on the same nodes, or the values
of SlurmctldPort and SlurmdPort must be different.
Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
automatically try to interact with anything opened on ports
8192-60000. Configure SlurmctldPort to use a port outside of
the configured SrunPortRange and RSIP's port range.
SlurmctldTimeout
The interval, in seconds, that the backup controller waits for
the primary controller to respond before assuming control. The
default value is 120 seconds. May not exceed 65533.
SlurmdDebug
The level of detail to provide slurmd daemon's logs. The
default value is info.
quiet Log nothing
fatal Log only fatal errors
error Log only errors
info Log errors and general informational messages
verbose Log errors and verbose informational messages
debug Log errors and verbose informational messages and
debugging messages
debug2 Log errors and verbose informational messages and more
debugging messages
debug3 Log errors and verbose informational messages and even
more debugging messages
debug4 Log errors and verbose informational messages and even
more debugging messages
debug5 Log errors and verbose informational messages and even
more debugging messages
SlurmdLogFile
Fully qualified pathname of a file into which the slurmd
daemon's logs are written. The default value is none (performs
logging via syslog). Any "%h" within the name is replaced with
the hostname on which the slurmd is running. Any "%n" within
the name is replaced with the Slurm node name on which the
slurmd is running.
See the section LOGGING if a pathname is specified.
SlurmdPidFile
Fully qualified pathname of a file into which the slurmd daemon
may write its process id. This may be used for automated signal
processing. Any "%h" within the name is replaced with the
hostname on which the slurmd is running. Any "%n" within the
name is replaced with the Slurm node name on which the slurmd is
running. The default value is "/var/run/slurmd.pid".
SlurmdPlugstack
A comma delimited list of Slurm compute node plugins to be
started when the daemon begins and terminated when it ends.
Only the plugin's init and fini functions are called.
SlurmdPort
The port number that the Slurm compute node daemon, slurmd,
listens to for work. The default value is SLURMD_PORT as
established at system build time. If none is explicitly
specified, its value will be 6818. NOTE: Either the slurmctld and
slurmd daemons must not execute on the same nodes, or the values
of SlurmctldPort and SlurmdPort must be different.
Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
automatically try to interact with anything opened on ports
8192-60000. Configure SlurmdPort to use a port outside of the
configured SrunPortRange and RSIP's port range.
SlurmdSpoolDir
Fully qualified pathname of a directory into which the slurmd
daemon's state information and batch job script information are
written. This must be a common pathname for all nodes, but
should represent a directory which is local to each node
(reference a local file system). The default value is
"/var/spool/slurmd". Any "%h" within the name is replaced with
the hostname on which the slurmd is running. Any "%n" within
the name is replaced with the Slurm node name on which the
slurmd is running.
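For example, when multiple slurmd daemons are run on a single host
(e.g. for testing), the "%n" substitution can give each daemon a
distinct spool directory (path shown is illustrative only):
SlurmdSpoolDir=/var/spool/slurmd.%n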
SlurmdTimeout
The interval, in seconds, that the Slurm controller waits for
slurmd to respond before configuring that node's state to DOWN.
A value of zero indicates the node will not be tested by
slurmctld to confirm the state of slurmd, the node will not be
automatically set to a DOWN state indicating a non-responsive
slurmd, and some other tool will take responsibility for
monitoring the state of each compute node and its slurmd daemon.
Slurm's hierarchical communication mechanism is used to ping the
slurmd daemons in order to minimize system noise and overhead.
The default value is 300 seconds. The value may not exceed
65533 seconds.
SlurmSchedLogFile
Fully qualified pathname of the scheduling event logging file.
The syntax of this parameter is the same as for
SlurmctldLogFile. In order to configure scheduler logging, set
both the SlurmSchedLogFile and SlurmSchedLogLevel parameters.
SlurmSchedLogLevel
The initial level of scheduling event logging, similar to the
SlurmctldDebug parameter used to control the initial level of
slurmctld logging. Valid values for SlurmSchedLogLevel are "0"
(scheduler logging disabled) and "1" (scheduler logging
enabled). If this parameter is omitted, the value defaults to
"0" (disabled). In order to configure scheduler logging, set
both the SlurmSchedLogFile and SlurmSchedLogLevel parameters.
The scheduler logging level can be changed dynamically using
scontrol.
SrunEpilog
Fully qualified pathname of an executable to be run by srun
following the completion of a job step. The command line
arguments for the executable will be the command and arguments
of the job step. This configuration parameter may be overridden
by srun's --epilog parameter. Note that while the other "Epilog"
executables (e.g., TaskEpilog) are run by slurmd on the compute
nodes where the tasks are executed, the SrunEpilog runs on the
node where the "srun" is executing.
SrunPortRange
srun creates a set of listening ports to communicate with
the controller and the slurmstepd, and to handle the application
I/O. By default these ports are ephemeral, meaning the port
numbers are selected by the kernel. Using this parameter allows
sites to configure a range of ports from which srun ports will
be selected. This is useful if sites want to allow only a
certain port range on their network.
Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
automatically try to interact with anything opened on ports
8192-60000. Configure SrunPortRange to use a range of ports
above those used by RSIP, ideally 1000 or more ports, for
example "SrunPortRange=60001-63000".
Note: A sufficient number of ports must be configured based on
the estimated number of srun on the submission nodes considering
that srun opens 3 listening ports plus 2 more for every 48
hosts. Example:
srun -N 48 will use 5 listening ports.
srun -N 50 will use 7 listening ports.
srun -N 200 will use 13 listening ports.
SrunProlog
Fully qualified pathname of an executable to be run by srun
prior to the launch of a job step. The command line arguments
for the executable will be the command and arguments of the job
step. This configuration parameter may be overridden by srun's
--prolog parameter. Note that while the other "Prolog"
executables (e.g., TaskProlog) are run by slurmd on the compute
nodes where the tasks are executed, the SrunProlog runs on the
node where the "srun" is executing.
StateSaveLocation
Fully qualified pathname of a directory into which the Slurm
controller, slurmctld, saves its state (e.g.
"/usr/local/slurm/checkpoint"). Slurm state will saved here to
recover from system failures. SlurmUser must be able to create
files in this directory. If you have a BackupController
configured, this location should be readable and writable by
both systems. Since all running and pending job information is
stored here, the use of a reliable file system (e.g. RAID) is
recommended. The default value is "/var/spool". If any slurm
daemons terminate abnormally, their core files will also be
written into this directory.
SuspendExcNodes
Specifies the nodes which are not to be placed in power save
mode, even if the node remains idle for an extended period of
time. Use Slurm's hostlist expression to identify nodes. By
default no nodes are excluded. Related configuration options
include ResumeTimeout, ResumeProgram, ResumeRate,
SuspendProgram, SuspendRate, SuspendTime, SuspendTimeout, and
SuspendExcParts.
SuspendExcParts
Specifies the partitions whose nodes are not to be placed in
power save mode, even if the node remains idle for an extended
period of time. Multiple partitions can be identified and
separated by commas. By default no nodes are excluded. Related
configuration options include ResumeTimeout, ResumeProgram,
ResumeRate, SuspendProgram, SuspendRate, SuspendTime,
SuspendTimeout, and SuspendExcNodes.
SuspendProgram
SuspendProgram is the program that will be executed when a node
remains idle for an extended period of time. This program is
expected to place the node into some power save mode. This can
be used to reduce the frequency and voltage of a node or
completely power the node off. The program executes as
SlurmUser. The argument to the program will be the names of
nodes to be placed into power savings mode (using Slurm's
hostlist expression format). By default, no program is run.
Related configuration options include ResumeTimeout,
ResumeProgram, ResumeRate, SuspendRate, SuspendTime,
SuspendTimeout, SuspendExcNodes, and SuspendExcParts.
SuspendRate
The rate at which nodes are placed into power save mode by
SuspendProgram. The value is the number of nodes per minute and it
can be used to prevent a large drop in power consumption (e.g.
after a large job completes). A value of zero results in no
limits being imposed. The default value is 60 nodes per minute.
Related configuration options include ResumeTimeout,
ResumeProgram, ResumeRate, SuspendProgram, SuspendTime,
SuspendTimeout, SuspendExcNodes, and SuspendExcParts.
SuspendTime
Nodes which remain idle for this number of seconds will be
placed into power save mode by SuspendProgram. A value of -1
disables power save mode and is the default. Related
configuration options include ResumeTimeout, ResumeProgram,
ResumeRate, SuspendProgram, SuspendRate, SuspendTimeout,
SuspendExcNodes, and SuspendExcParts.
SuspendTimeout
Maximum time permitted (in seconds) between when a node suspend
request is issued and when the node shuts down. At that time the
node must be ready for a resume request to be issued as needed
for new work. The default value is 30 seconds. Related
configuration options include ResumeProgram, ResumeRate,
ResumeTimeout, SuspendRate, SuspendTime, SuspendProgram,
SuspendExcNodes and SuspendExcParts. More information is
available at the Slurm web site (
http://slurm.schedmd.com/power_save.html ).
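As an illustration only, a minimal power saving configuration
combining the parameters above might resemble the following; the
program pathnames are hypothetical and must point to site-provided
scripts:
SuspendTime=1800
SuspendRate=20
SuspendTimeout=60
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeRate=20
ResumeTimeout=300
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendExcParts=debug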
SwitchType
Identifies the type of switch or interconnect used for
application communications. Acceptable values include
"switch/none" for switches not requiring special processing for
job launch or termination (Myrinet, Ethernet, and InfiniBand)
and "switch/nrt" for IBM's Network Resource Table API. The
default value is "switch/none". All Slurm daemons, commands and
running jobs must be restarted for a change in SwitchType to
take effect. If running jobs exist at the time slurmctld is
restarted with a new value of SwitchType, records of all jobs in
any state may be lost.
TaskEpilog
Fully qualified pathname of a program to be executed as the
Slurm job's owner after termination of each task. See TaskProlog for
execution order details.
TaskPlugin
Identifies the type of task launch plugin, typically used to
provide resource management within a node (e.g. pinning tasks to
specific processors). More than one task plugin can be specified
in a comma separated list. The prefix of "task/" is optional.
Acceptable values include:
task/affinity enables resource containment using CPUSETs. This
enables the --cpu_bind and/or --mem_bind srun
options. If you use "task/affinity" and
encounter problems, it may be due to the variety
of system calls used to implement task affinity
on different operating systems.
task/cgroup enables resource containment using Linux control
cgroups. This enables the --cpu_bind and/or
--mem_bind srun options. NOTE: see "man
cgroup.conf" for configuration details. NOTE:
This plugin writes to disk and can slightly
impact performance. If you are running lots of
short running jobs (less than a couple of
seconds) this plugin slows down performance
slightly. It should probably be avoided in an
HTC environment.
task/none for systems requiring no special handling of user
tasks. Lacks support for the --cpu_bind and/or
--mem_bind srun options. The default value is
"task/none".
TaskPluginParam
Optional parameters for the task plugin. Multiple options
should be comma separated. If None, Boards, Sockets, Cores,
Threads, and/or Verbose are specified, they will override the
--cpu_bind option specified by the user in the srun command.
None, Boards, Sockets, Cores and Threads are mutually exclusive
and since they decrease scheduling flexibility are not generally
recommended (select no more than one of them). Cpusets and
Sched are mutually exclusive (select only one of them). All
TaskPluginParam options are supported on FreeBSD except Cpusets.
The Sched option uses cpuset_setaffinity() on FreeBSD, not
sched_setaffinity().
Boards Bind tasks to boards by default. Overrides automatic
binding.
Cores Bind tasks to cores by default. Overrides automatic
binding.
Cpusets Use cpusets to perform task affinity functions. By
default, Sched task binding is performed.
None Perform no task binding by default. Overrides
automatic binding.
Sched Use sched_setaffinity (if available) to bind tasks to
processors.
Sockets Bind to sockets by default. Overrides automatic
binding.
Threads Bind to threads by default. Overrides automatic
binding.
Verbose Verbosely report binding before tasks run. Overrides
user options.
Autobind Set a default binding in the event that "auto binding"
doesn't find a match. Set to Threads, Cores or
Sockets (E.g. TaskPluginParam=autobind=threads).
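For example, to bind tasks to cores by default and report the
binding as tasks start (illustrative only):
TaskPluginParam=Cores,Verbose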
TaskProlog
Fully qualified pathname of a program to be executed as the
Slurm job's owner prior to initiation of each task. Besides the
normal environment variables, this has SLURM_TASK_PID available
to identify the process ID of the task being started. Standard
output from this program can be used to control the environment
variables and output for the user program.
export NAME=value Will set environment variables for the task
being spawned. Everything after the equal
sign to the end of the line will be used as
the value for the environment variable.
Exporting of functions is not currently
supported.
print ... Will cause that line (without the leading
"print ") to be printed to the job's
standard output.
unset NAME Will clear environment variables for the
task being spawned.
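For example, a minimal TaskProlog script using these directives
might look like the following sketch (the variable names and
messages are illustrative only, not part of Slurm):
#!/bin/sh
# Export a variable into the environment of the task being spawned.
echo "export MY_SCRATCH_DIR=/tmp/$SLURM_TASK_PID"
# Write a message to the job's standard output.
echo "print task prolog running for task PID $SLURM_TASK_PID"
# Remove a variable from the task's environment.
echo "unset UNWANTED_VARIABLE"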
The order of task prolog/epilog execution is as follows:
1. pre_launch_priv() Function in TaskPlugin
2. pre_launch()  Function in TaskPlugin
3. TaskProlog    System-wide per task program defined in
                 slurm.conf
4. user prolog   Job step specific task program defined using
                 srun's --task-prolog option or
                 SLURM_TASK_PROLOG environment variable
5. Execute the job step's task
6. user epilog   Job step specific task program defined using
                 srun's --task-epilog option or
                 SLURM_TASK_EPILOG environment variable
7. TaskEpilog    System-wide per task program defined in
                 slurm.conf
8. post_term()   Function in TaskPlugin
TCPTimeout
Time permitted for TCP connection to be established. Default
value is 2 seconds.
TmpFS Fully qualified pathname of the file system available to user
jobs for temporary storage. This parameter is used in
establishing a node's TmpDisk space. The default value is
"/tmp".
TopologyParam
Comma separated options identifying network topology options.
Dragonfly Optimize allocation for Dragonfly network. Valid
when TopologyPlugin=topology/tree.
NoCtldInAddrAny
Used to directly bind to the address that the
node running the slurmctld resolves to instead of
binding messages to any address on the node,
which is the default.
NoInAddrAny Used to directly bind to the address that the
node resolves to instead of binding messages to
any address on the node, which is the default.
This option is for all daemons/clients except for
the slurmctld.
TopoOptional Only optimize allocation for network topology if
the job includes a switch option. Since
optimizing resource allocation for topology
involves much higher system overhead, this option
can be used to impose the extra overhead only on
jobs which can take advantage of it. If most job
allocations are not optimized for network
topology, they make fragment resources to the
point that topology optimization for other jobs
will be difficult to achieve.
TopologyPlugin
Identifies the plugin to be used for determining the network
topology and optimizing job allocations to minimize network
contention. See NETWORK TOPOLOGY below for details. Additional
plugins may be provided in the future which gather topology
information directly from the network. Acceptable values
include:
topology/3d_torus best-fit logic over three-dimensional
topology
topology/node_rank orders nodes based upon information in a
node_rank field in the node record as
generated by a select plugin. Slurm
performs a best-fit algorithm over those
ordered nodes
topology/none default for other systems, best-fit logic
over one-dimensional topology
topology/tree used for a hierarchical network as
described in a topology.conf file
TrackWCKey
Boolean yes or no. Used to enable display and tracking of the
Workload Characterization Key. Must be set to track wckey usage
correctly. NOTE: You must also set TrackWCKey in your
slurmdbd.conf file to create historical usage reports.
TreeWidth
Slurmd daemons use a virtual tree network for communications.
TreeWidth specifies the width of the tree (i.e. the fanout). On
architectures with a front end node running the slurmd daemon,
the value must always be equal to or greater than the number of
front end nodes which eliminates the need for message forwarding
between the slurmd daemons. On other architectures the default
value is 50, meaning each slurmd daemon can communicate with up
to 50 other slurmd daemons and over 2500 nodes can be contacted
with two message hops. The default value will work well for
most clusters. Optimal system performance can typically be
achieved if TreeWidth is set to the square root of the number of
nodes in the cluster for systems having no more than 2500 nodes
or the cube root for larger systems. The value may not exceed
65533.
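For example, following the square root guideline above, a cluster
of roughly 900 nodes might set (illustrative only):
TreeWidth=30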
UnkillableStepProgram
If the processes in a job step are determined to be unkillable
for a period of time specified by the UnkillableStepTimeout
variable, the program specified by UnkillableStepProgram will be
executed. This program can be used to take special actions to
clean up the unkillable processes and/or notify computer
administrators. The program will be run as SlurmdUser (usually
"root") on the compute node. By default no program is run.
UnkillableStepTimeout
The length of time, in seconds, that Slurm will wait before
deciding that processes in a job step are unkillable (after they
have been signaled with SIGKILL) and execute
UnkillableStepProgram as described above. The default timeout
value is 60 seconds.
UsePAM If set to 1, PAM (Pluggable Authentication Modules for Linux)
will be enabled. PAM is used to establish the upper bounds for
resource limits. With PAM support enabled, local system
administrators can dynamically configure system resource limits.
Changing the upper bound of a resource limit will not alter the
limits of running jobs, only jobs started after a change has
been made will pick up the new limits. The default value is 0
(not to enable PAM support). Remember that PAM also needs to be
configured to support Slurm as a service. For sites using PAM's
directory based configuration option, a configuration file named
slurm should be created. The module-type, control-flags, and
module-path names that should be included in the file are:
auth required pam_localuser.so
auth required pam_shells.so
account required pam_unix.so
account required pam_access.so
session required pam_unix.so
For sites configuring PAM with a general configuration file, the
appropriate lines (see above), where slurm is the service-name,
should be added.
VSizeFactor
Memory specifications in job requests apply to real memory size
(also known as resident set size). It is possible to enforce
virtual memory limits for both jobs and job steps by limiting
their virtual memory to some percentage of their real memory
allocation. The VSizeFactor parameter specifies the job's or job
step's virtual memory limit as a percentage of its real memory
limit. For example, if a job's real memory limit is 500MB and
VSizeFactor is set to 101 then the job will be killed if its
real memory exceeds 500MB or its virtual memory exceeds 505MB
(101 percent of the real memory limit). The default value is 0,
which disables enforcement of virtual memory limits. The value
may not exceed 65533 percent.
WaitTime
Specifies how many seconds the srun command should by default
wait after the first task terminates before terminating all
remaining tasks. The "--wait" option on the srun command line
overrides this value. The default value is 0, which disables
this feature. May not exceed 65533 seconds.
The configuration of nodes (or machines) to be managed by Slurm is also
specified in /etc/slurm.conf. Changes in node configuration (e.g.
adding nodes, changing their processor count, etc.) require restarting
both the slurmctld daemon and the slurmd daemons. All slurmd daemons
must know each node in the system to forward messages in support of
hierarchical communications. Only the NodeName must be supplied in the
configuration file. All other node configuration information is
optional. It is advisable to establish baseline node configurations,
especially if the cluster is heterogeneous. Nodes which register to
the system with less than the configured resources (e.g. too little
memory), will be placed in the "DOWN" state to avoid scheduling jobs on
them. Establishing baseline configurations will also speed Slurm's
scheduling process by permitting it to compare job requirements against
these (relatively few) configuration parameters and possibly avoid
having to check job requirements against every individual node's
configuration. The resources checked at node registration time are:
CPUs, RealMemory and TmpDisk. While baseline values for each of these
can be established in the configuration file, the actual values upon
node registration are recorded and these actual values may be used for
scheduling purposes (depending upon the value of FastSchedule in the
configuration file).
Default values can be specified with a record in which NodeName is
"DEFAULT". The default entry values will apply only to lines following
it in the configuration file and the default values can be reset
multiple times in the configuration file with multiple entries where
"NodeName=DEFAULT". Each line where NodeName is "DEFAULT" will replace
or add to previous default values and not reinitialize the default
values. The "NodeName=" specification must be placed on every line
describing the configuration of nodes. A single node name can not
appear as a NodeName value in more than one line (duplicate node name
records will be ignored). In fact, it is generally possible and
desirable to define the configurations of all nodes in only a few
lines. This convention permits significant optimization in the
scheduling of larger clusters. In order to support the concept of jobs
requiring consecutive nodes on some architectures, node specifications
should be placed in this file in consecutive order. No single node name
may be listed more than once in the configuration file. Use
"DownNodes=" to record the state of nodes which are temporarily in a
DOWN, DRAIN or FAILING state without altering permanent configuration
information. A job step's tasks are allocated to nodes in the order the
nodes appear in the configuration file. There is presently no
capability within Slurm to arbitrarily order a job step's tasks.
Multiple node names may be comma separated (e.g. "alpha,beta,gamma")
and/or a simple node range expression may optionally be used to specify
numeric ranges of nodes to avoid building a configuration file with
large numbers of entries. The node range expression can contain one
pair of square brackets with a sequence of comma separated numbers
and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or
"lx[15,18,32-33]"). Note that the numeric ranges can include one or
more leading zeros to indicate the numeric portion has a fixed number
of digits (e.g. "linux[0000-1023]"). Up to two numeric ranges can be
included in the expression (e.g. "rack[0-63]_blade[0-41]"). If one or
more numeric expressions are included, one of them must be at the end
of the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can
always be used in a comma separated list.
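For example, the following single line (all values illustrative
only) defines 64 nodes using a node range expression:
NodeName=linux[0-63] CPUs=16 RealMemory=32000 State=UNKNOWN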
On BlueGene systems only, the square brackets should contain pairs of
three digit numbers separated by a "x". These numbers indicate the
boundaries of a rectangular prism (e.g. "bgl[000x144,400x544]"). See
BlueGene documentation for more details. The node configuration
specifies the following information:
NodeName
Name that Slurm uses to refer to a node (or base partition for
BlueGene systems). Typically this would be the string that
"/bin/hostname -s" returns. It may also be the fully qualified
domain name as returned by "/bin/hostname -f" (e.g.
"foo1.bar.com"), or any valid domain name associated with the
host through the host database (/etc/hosts) or DNS, depending on
the resolver settings. Note that if the short form of the
hostname is not used, it may prevent use of hostlist expressions
(the numeric portion in brackets must be at the end of the
string). Only short hostname forms are compatible with the
switch/nrt plugin at this time. It may also be an arbitrary
string if NodeHostname is specified. If the NodeName is
"DEFAULT", the values specified with that record will apply to
subsequent node specifications unless explicitly set to other
values in that node record or replaced with a different set of
default values. Each line where NodeName is "DEFAULT" will
replace or add to previous default values and not reinitialize
the default values. For architectures in which the node order
is significant, nodes will be considered consecutive in the
order defined. For example, if the configuration for
"NodeName=charlie" immediately follows the configuration for
"NodeName=baker" they will be considered adjacent in the
computer.
NodeHostname
Typically this would be the string that "/bin/hostname -s"
returns. It may also be the fully qualified domain name as
returned by "/bin/hostname -f" (e.g. "foo1.bar.com"), or any
valid domain name associated with the host through the host
database (/etc/hosts) or DNS, depending on the resolver
settings. Note that if the short form of the hostname is not
used, it may prevent use of hostlist expressions (the numeric
portion in brackets must be at the end of the string). Only
short hostname forms are compatible with the switch/nrt plugin
at this time. A node range expression can be used to specify a
set of nodes. If an expression is used, the number of nodes
identified by NodeHostname on a line in the configuration file
must be identical to the number of nodes identified by NodeName.
By default, the NodeHostname will be identical in value to
NodeName.
NodeAddr
Name that a node should be referred to in establishing a
communications path. This name will be used as an argument to
the gethostbyname() function for identification. If a node
range expression is used to designate multiple nodes, they must
exactly match the entries in the NodeName (e.g.
"NodeName=lx[0-7] NodeAddr=elx[0-7]"). NodeAddr may also
contain IP addresses. By default, the NodeAddr will be
identical in value to NodeHostname.
Boards Number of Baseboards in nodes with a baseboard controller. Note
that when Boards is specified, SocketsPerBoard, CoresPerSocket,
and ThreadsPerCore should be specified. Boards and CPUs are
mutually exclusive. The default value is 1.
CoreSpecCount
Number of cores on which Slurm compute node daemons (slurmd,
slurmstepd) will be confined. These cores will not be available
for allocation to user jobs. Isolation of the Slurm daemons
from user jobs may improve performance. If this option and
CPUSpecList are both designated for a node, an error is
generated. For information on the algorithm used by Slurm to
select the cores refer to the core specialization documentation
( http://slurm.schedmd.com/core_spec.html ). This option has no
effect unless cgroup job confinement is also configured
(TaskPlugin=task/cgroup with ConstrainCores=yes in cgroup.conf).
CoresPerSocket
Number of cores in a single physical processor socket (e.g.
"2"). The CoresPerSocket value describes physical cores, not
the logical number of processors per socket. NOTE: If you have
multi-core processors, you will likely need to specify this
parameter in order to optimize scheduling. The default value is
1.
CPUs Number of logical processors on the node (e.g. "2"). CPUs and
Boards are mutually exclusive. It can be set to the total number
of sockets, cores or threads. This can be useful when you want
to schedule only the cores on a hyper-threaded node. If CPUs is
omitted, it will be set equal to the product of Sockets,
CoresPerSocket, and ThreadsPerCore. The default value is 1.
CPUSpecList
A comma delimited list of Slurm abstract CPU IDs on which Slurm
compute node daemons (slurmd, slurmstepd) will be confined. The
list will be expanded to include all other CPUs, if any, on the
same cores. These cores will not be available for allocation to
user jobs. Isolation of the Slurm daemons from user jobs may
improve performance. If this option and CoreSpecCount are both
designated for a node, an error is generated. This option has
no effect unless cgroup job confinement is also configured
(TaskPlugin=task/cgroup with ConstrainCores=yes in cgroup.conf).
Feature
A comma delimited list of arbitrary strings indicative of some
characteristic associated with the node. There is no value
associated with a feature at this time; a node either has a
feature or it does not. If desired a feature may contain a
numeric component indicating, for example, processor speed. By
default a node has no features. Also see Gres.
Gres A comma delimited list of generic resources specifications for a
node. The format is:
"<name>[:<type>][:no_consume]:<number>[K|M|G]". The first field
is the resource name, which matches the GresType configuration
parameter name. The optional type field might be used to
identify a model of that generic resource. A generic resource
can also be specified as non-consumable (i.e. multiple jobs can
use the same generic resource) with the optional field
":no_consume". The final field must specify a generic resources
count. A suffix of "K", "M", "G", "T" or "P" may be used to
multiply the number by 1024, 1048576, 1073741824, etc.
respectively.
(e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_consume:4G").
By default a node has no generic resources and its maximum count
is that of an unsigned 64bit integer. Also see Feature.
MemSpecLimit
Limit on combined real memory allocation for compute node
daemons (slurmd, slurmstepd), in megabytes. This memory is not
available to job allocations. The daemons won't be killed when
they exhaust the memory allocation (i.e. the OOM Killer is
disabled for the daemon's memory cgroup). This option has no
effect unless cgroup job confinement is also configured
(TaskPlugin=task/cgroup with ConstrainRAMSpace=yes in
cgroup.conf).
Port The port number that the Slurm compute node daemon, slurmd,
listens to for work on this particular node. By default there is
a single port number for all slurmd daemons on all compute nodes
as defined by the SlurmdPort configuration parameter. Use of
this option is not generally recommended except for development
or testing purposes. If multiple slurmd daemons execute on a
node this can specify a range of ports.
Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
automatically try to interact with anything opened on ports
8192-60000. Configure Port to use a port outside of the
configured SrunPortRange and RSIP's port range.
Procs See CPUs.
RealMemory
Size of real memory on the node in MegaBytes (e.g. "2048"). The
default value is 1.
Reason Identifies the reason for a node being in state "DOWN",
"DRAINED" "DRAINING", "FAIL" or "FAILING". Use quotes to
enclose a reason having more than one word.
Sockets
Number of physical processor sockets/chips on the node (e.g.
"2"). If Sockets is omitted, it will be inferred from CPUs,
CoresPerSocket, and ThreadsPerCore. NOTE: If you have
multi-core processors, you will likely need to specify these
parameters. Sockets and SocketsPerBoard are mutually exclusive.
If Sockets is specified when Boards is also used, Sockets is
interpreted as SocketsPerBoard rather than total sockets. The
default value is 1.
SocketsPerBoard
Number of physical processor sockets/chips on a baseboard.
Sockets and SocketsPerBoard are mutually exclusive. The default
value is 1.
State State of the node with respect to the initiation of user jobs.
Acceptable values are "CLOUD", "DOWN", "DRAIN", "FAIL",
"FAILING", "FUTURE" and "UNKNOWN". Node states of "BUSY" and
"IDLE" should not be specified in the node configuration, but
set the node state to "UNKNOWN" instead. Setting the node state
to "UNKNOWN" will result in the node state being set to "BUSY",
"IDLE" or other appropriate state based upon recovered system
state information. The default value is "UNKNOWN". Also see
the DownNodes parameter below.
CLOUD Indicates the node exists in the cloud. Its initial
state will be treated as powered down. The node will
be available for use after its state is recovered
from Slurm's state save file or the slurmd daemon
starts on the compute node.
DOWN Indicates the node failed and is unavailable to be
allocated work.
DRAIN Indicates the node is unavailable to be allocated
work.
FAIL Indicates the node is expected to fail soon, has no
jobs allocated to it, and will not be allocated to any
new jobs.
FAILING Indicates the node is expected to fail soon, has one
or more jobs allocated to it, but will not be
allocated to any new jobs.
FUTURE Indicates the node is defined for future use and need
not exist when the Slurm daemons are started. These
nodes can be made available for use simply by updating
the node state using the scontrol command rather than
restarting the slurmctld daemon. After these nodes are
made available, change their State in the slurm.conf
file. Until these nodes are made available, they will
not be seen using any Slurm commands, nor will any
attempt be made to contact them.
UNKNOWN Indicates the node's state is undefined (BUSY or
IDLE), but will be established when the slurmd daemon
on that node registers. The default value is
"UNKNOWN".
ThreadsPerCore
Number of logical threads in a single physical core (e.g. "2").
Note that Slurm can allocate resources to jobs down to the
resolution of a core. If your system is configured with more
than one thread per core, execution of a different job on each
thread is not supported unless you configure
SelectTypeParameters=CR_CPU plus CPUs; do not configure Sockets,
CoresPerSocket or ThreadsPerCore. A job can execute one task
per thread from within one job step or execute a distinct job
step on each of the threads. Note also if you are running with
more than 1 thread per core and running the select/cons_res
plugin you will want to set the SelectTypeParameters variable to
something other than CR_CPU to avoid unexpected results. The
default value is 1.
TmpDisk
Total size of temporary disk storage in TmpFS in MegaBytes (e.g.
"16384"). TmpFS (for "Temporary File System") identifies the
location which jobs should use for temporary storage. Note this
does not indicate the amount of free space available to the user
on the node, only the total file system size. The system
administrator should ensure this file system is purged as
needed so that user jobs have access to most of this space. The
Prolog and/or Epilog programs (specified in the configuration
file) might be used to ensure the file system is kept clean.
The default value is 0.
Weight The priority of the node for scheduling purposes. All things
being equal, jobs will be allocated the nodes with the lowest
weight which satisfies their requirements. For example, a
heterogeneous collection of nodes might be placed into a single
partition for greater system utilization, responsiveness and
capability. It would be preferable to allocate smaller memory
nodes rather than larger memory nodes if either will satisfy a
job's requirements. The units of weight are arbitrary, but
larger weights should be assigned to nodes with more processors,
memory, disk space, higher processor speed, etc. Note that if a
job allocation request can not be satisfied using the nodes with
the lowest weight, the set of nodes with the next lowest weight
is added to the set of nodes under consideration for use (repeat
as needed for higher weight values). If you absolutely want to
minimize the number of higher weight nodes allocated to a job
(at a cost of higher scheduling overhead), give each node a
distinct Weight value and they will be added to the pool of
nodes being considered for scheduling individually. The default
value is 1.
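Combining several of the node parameters above, an illustrative
node definition (all values hypothetical) might be:
NodeName=node[001-032] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000 TmpDisk=100000 Weight=10 Feature=ib State=UNKNOWN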
The "DownNodes=" configuration permits you to mark certain nodes as in
a DOWN, DRAIN, FAIL, or FAILING state without altering the permanent
configuration information listed under a "NodeName=" specification.
DownNodes
Any node name, or list of node names, from the "NodeName="
specifications.
Reason Identifies the reason for a node being in state "DOWN", "DRAIN",
"FAIL" or "FAILING. Use quotes to enclose a reason having more
than one word.
State State of the node with respect to the initiation of user jobs.
Acceptable values are "DOWN", "DRAIN", "FAIL", "FAILING" and
"UNKNOWN". Node states of "BUSY" and "IDLE" should not be
specified in the node configuration, but set the node state to
"UNKNOWN" instead. Setting the node state to "UNKNOWN" will
result in the node state being set to "BUSY", "IDLE" or other
appropriate state based upon recovered system state information.
The default value is "UNKNOWN".
DOWN Indicates the node failed and is unavailable to be
allocated work.
DRAIN Indicates the node is unavailable to be allocated
work.
FAIL Indicates the node is expected to fail soon, has no
jobs allocated to it, and will not be allocated to any
new jobs.
FAILING Indicates the node is expected to fail soon, has one
or more jobs allocated to it, but will not be
allocated to any new jobs.
UNKNOWN Indicates the node's state is undefined (BUSY or
IDLE), but will be established when the slurmd daemon
on that node registers. The default value is
"UNKNOWN".
On computers where frontend nodes are used to execute batch scripts
rather than compute nodes (BlueGene or Cray systems), one may configure
one or more frontend nodes using the configuration parameters defined
below. These options are very similar to those used in configuring
compute nodes. These options may only be used on systems configured and
built with the appropriate parameters (--have-front-end,
--enable-bluegene-emulation) or a system determined to have the
appropriate architecture by the configure script (BlueGene or Cray
systems). The front end configuration specifies the following
information:
AllowGroups
Comma separated list of group names which may execute jobs on
this front end node. By default, all groups may use this front
end node. If at least one group associated with the user
attempting to execute the job is in AllowGroups, he will be
permitted to use this front end node. May not be used with the
DenyGroups option.
AllowUsers
Comma separated list of user names which may execute jobs on
this front end node. By default, all users may use this front
end node. May not be used with the DenyUsers option.
DenyGroups
Comma separated list of group names which are prevented from
executing jobs on this front end node. May not be used with the
AllowGroups option.
DenyUsers
Comma separated list of user names which are prevented from
executing jobs on this front end node. May not be used with the
AllowUsers option.
FrontendName
Name that Slurm uses to refer to a frontend node. Typically
this would be the string that "/bin/hostname -s" returns. It
may also be the fully qualified domain name as returned by
"/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid domain
name associated with the host through the host database
(/etc/hosts) or DNS, depending on the resolver settings. Note
that if the short form of the hostname is not used, it may
prevent use of hostlist expressions (the numeric portion in
brackets must be at the end of the string). If the FrontendName
is "DEFAULT", the values specified with that record will apply
to subsequent node specifications unless explicitly set to other
values in that frontend node record or replaced with a different
set of default values. Each line where FrontendName is
"DEFAULT" will replace or add to previous default values and not
a reinitialize the default values. Note that since the naming
of front end nodes would typically not follow that of the
compute nodes (e.g. lacking X, Y and Z coordinates found in the
compute node naming scheme), each front end node name should be
listed separately and without a hostlist expression (i.e.
"frontend00,frontend01" rather than "frontend[00-01]").
FrontendAddr
Name that a frontend node should be referred to in establishing
a communications path. This name will be used as an argument to
the gethostbyname() function for identification. As with
FrontendName, list the individual node addresses rather than
using a hostlist expression. The number of FrontendAddr records
per line must equal the number of FrontendName records per line
(i.e. you can't map two node names to one address). FrontendAddr
may also contain IP addresses. By default, the FrontendAddr
will be identical in value to FrontendName.
Port The port number that the Slurm compute node daemon, slurmd,
listens to for work on this particular frontend node. By default
there is a single port number for all slurmd daemons on all
frontend nodes as defined by the SlurmdPort configuration
parameter. Use of this option is not generally recommended
except for development or testing purposes.
Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
automatically try to interact with anything opened on ports
8192-60000. Configure Port to use a port outside of the
configured SrunPortRange and RSIP's port range.
Reason Identifies the reason for a frontend node being in state "DOWN",
"DRAINED" "DRAINING", "FAIL" or "FAILING". Use quotes to
enclose a reason having more than one word.
State State of the frontend node with respect to the initiation of
user jobs. Acceptable values are "DOWN", "DRAIN", "FAIL",
"FAILING" and "UNKNOWN". "DOWN" indicates the frontend node has
failed and is unavailable to be allocated work. "DRAIN"
indicates the frontend node is unavailable to be allocated work.
"FAIL" indicates the frontend node is expected to fail soon, has
no jobs allocated to it, and will not be allocated to any new
jobs. "FAILING" indicates the frontend node is expected to fail
soon, has one or more jobs allocated to it, but will not be
allocated to any new jobs. "UNKNOWN" indicates the frontend
node's state is undefined (BUSY or IDLE), but will be
established when the slurmd daemon on that node registers. The
default value is "UNKNOWN". Also see the DownNodes parameter
below.
For example: "FrontendName=frontend[00-03]
FrontendAddr=efrontend[00-03] State=UNKNOWN" is used to define
four front end nodes for running slurmd daemons.
The partition configuration permits you to establish different job
limits or access controls for various groups (or partitions) of nodes.
Nodes may be in more than one partition, making partitions serve as
general purpose queues. For example one may put the same set of nodes
into two different partitions, each with different constraints (time
limit, job sizes, groups allowed to use the partition, etc.). Jobs are
allocated resources within a single partition. Default values can be
specified with a record in which PartitionName is "DEFAULT". The
default entry values will apply only to lines following it in the
configuration file and the default values can be reset multiple times
in the configuration file with multiple entries where
"PartitionName=DEFAULT". The "PartitionName=" specification must be
placed on every line describing the configuration of partitions. Each
line where PartitionName is "DEFAULT" will replace or add to previous
default values and not reinitialize the default values. A single
partition name can not appear as a PartitionName value in more than one
line (duplicate partition name records will be ignored). If a
partition that is in use is deleted from the configuration and slurm is
restarted or reconfigured (scontrol reconfigure), jobs using the
partition are canceled. NOTE: Put all parameters for each partition on
a single line. Each line of partition configuration information should
represent a different partition. The partition configuration file
contains the following information:
AllocNodes
Comma separated list of nodes from which users can submit jobs
in the partition. Node names may be specified using the node
range expression syntax described above. The default value is
"ALL".
AllowAccounts
Comma separated list of accounts which may execute jobs in the
partition. The default value is "ALL". NOTE: If AllowAccounts
is used then DenyAccounts will not be enforced. Also refer to
DenyAccounts.
AllowGroups
Comma separated list of group names which may execute jobs in
the partition. If at least one group associated with the user
attempting to execute the job is in AllowGroups, he will be
permitted to use this partition. Jobs executed as user root can
use any partition without regard to the value of AllowGroups.
If user root attempts to execute a job as another user (e.g.
using srun's --uid option), this other user must be in one of
groups identified by AllowGroups for the job to successfully
execute. The default value is "ALL". NOTE: For performance
reasons, Slurm maintains a list of user IDs allowed to use each
partition and this is checked at job submission time. This list
of user IDs is updated when the slurmctld daemon is restarted,
reconfigured (e.g. "scontrol reconfig") or the partition's
AllowGroups value is reset, even if its value is unchanged (e.g.
"scontrol update PartitionName=name AllowGroups=group"). For a
user's access to a partition to change, both the user's group
membership must change and Slurm's internal user ID list must be
updated using one of the methods described above.
AllowQos
Comma separated list of Qos which may execute jobs in the
partition. Jobs executed as user root can use any partition
without regard to the value of AllowQos. The default value is
"ALL". NOTE: If AllowQos is used then DenyQos will not be
enforced. Also refer to DenyQos.
Alternate
Partition name of alternate partition to be used if the state of
this partition is "DRAIN" or "INACTIVE."
Default
If this keyword is set, jobs submitted without a partition
specification will utilize this partition. Possible values are
"YES" and "NO". The default value is "NO".
DefMemPerCPU
Default real memory size available per allocated CPU in
MegaBytes. Used to avoid over-subscribing memory and causing
paging. DefMemPerCPU would generally be used if individual
processors are allocated to jobs (SelectType=select/cons_res).
If not set, the DefMemPerCPU value for the entire cluster will
be used. Also see DefMemPerNode and MaxMemPerCPU. DefMemPerCPU
and DefMemPerNode are mutually exclusive. NOTE: Enforcement of
memory limits currently requires enabling of accounting, which
samples memory use on a periodic basis (data need not be stored,
just collected).
DefMemPerNode
Default real memory size available per allocated node in
MegaBytes. Used to avoid over-subscribing memory and causing
paging. DefMemPerNode would generally be used if whole nodes
are allocated to jobs (SelectType=select/linear) and resources
are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
If not set, the DefMemPerNode value for the entire cluster will
be used. Also see DefMemPerCPU and MaxMemPerNode. DefMemPerCPU
and DefMemPerNode are mutually exclusive. NOTE: Enforcement of
memory limits currently requires enabling of accounting, which
samples memory use on a periodic basis (data need not be stored,
just collected).
DenyAccounts
Comma separated list of accounts which may not execute jobs in
the partition. By default, no accounts are denied access. NOTE:
If AllowAccounts is used then DenyAccounts will not be enforced.
Also refer to AllowAccounts.
DenyQos
Comma separated list of Qos which may not execute jobs in the
partition. By default, no QOS are denied access. NOTE: If
AllowQos is used then DenyQos will not be enforced. Also refer
to AllowQos.
DefaultTime
Run time limit used for jobs that don't specify a value. If not
set then MaxTime will be used. Format is the same as for
MaxTime.
DisableRootJobs
If set to "YES" then user root will be prevented from running
any jobs on this partition. The default value will be the value
of DisableRootJobs set outside of a partition specification
(which is "NO", allowing user root to execute jobs).
ExclusiveUser
If set to "YES" then nodes will be exclusively allocated to
users. Multiple jobs may be run for the same user, but only one
user can be active at a time. This capability is also available
on a per-job basis by using the --exclusive=user option.
GraceTime
Specifies, in units of seconds, the preemption grace time to be
extended to a job which has been selected for preemption. The
default value is zero, no preemption grace time is allowed on
this partition. Once a job has been selected for preemption,
its end time is set to the current time plus GraceTime. The job
is immediately sent SIGCONT and SIGTERM signals in order to
provide notification of its imminent termination. This is
followed by the SIGCONT, SIGTERM and SIGKILL signal sequence
upon reaching its new end time. (Meaningful only for
PreemptMode=CANCEL)
Hidden Specifies if the partition and its jobs are to be hidden by
default. Hidden partitions will by default not be reported by
the Slurm APIs or commands. Possible values are "YES" and "NO".
The default value is "NO". Note that partitions that a user
lacks access to by virtue of the AllowGroups parameter will also
be hidden by default.
LLN Schedule resources to jobs on the least loaded nodes (based upon
the number of idle CPUs). This is generally only recommended for
an environment with serial jobs as idle resources will tend to
be highly fragmented, resulting in parallel jobs being
distributed across many nodes. Also see the SelectParameters
configuration parameter CR_LLN to use the least loaded nodes in
every partition.
MaxCPUsPerNode
Maximum number of CPUs on any node available to all jobs from
this partition. This can be especially useful to schedule GPUs.
For example a node can be associated with two Slurm partitions
(e.g. "cpu" and "gpu") and the partition/queue "cpu" could be
limited to only a subset of the node's CPUs, ensuring that one
or more CPUs would be available to jobs in the "gpu"
partition/queue.
MaxMemPerCPU
Maximum real memory size available per allocated CPU in
MegaBytes. Used to avoid over-subscribing memory and causing
paging. MaxMemPerCPU would generally be used if individual
processors are allocated to jobs (SelectType=select/cons_res).
If not set, the MaxMemPerCPU value for the entire cluster will
be used. Also see DefMemPerCPU and MaxMemPerNode. MaxMemPerCPU
and MaxMemPerNode are mutually exclusive. NOTE: Enforcement of
memory limits currently requires enabling of accounting, which
samples memory use on a periodic basis (data need not be stored,
just collected).
MaxMemPerNode
Maximum real memory size available per allocated node in
MegaBytes. Used to avoid over-subscribing memory and causing
paging. MaxMemPerNode would generally be used if whole nodes
are allocated to jobs (SelectType=select/linear) and resources
are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
If not set, the MaxMemPerNode value for the entire cluster will
be used. Also see DefMemPerNode and MaxMemPerCPU. MaxMemPerCPU
and MaxMemPerNode are mutually exclusive. NOTE: Enforcement of
memory limits currently requires enabling of accounting, which
samples memory use on a periodic basis (data need not be stored,
just collected).
MaxNodes
Maximum count of nodes which may be allocated to any single job.
For BlueGene systems this will be a c-nodes count and will be
converted to a midplane count with a reduction in resolution.
The default value is "UNLIMITED", which is represented
internally as -1. This limit does not apply to jobs executed by
SlurmUser or user root.
MaxTime
Maximum run time limit for jobs. Format is minutes,
minutes:seconds, hours:minutes:seconds, days-hours,
days-hours:minutes, days-hours:minutes:seconds or "UNLIMITED".
Time resolution is one minute and second values are rounded up
to the next minute. This limit does not apply to jobs executed
by SlurmUser or user root.
MinNodes
Minimum count of nodes which may be allocated to any single job.
For BlueGene systems this will be a c-nodes count and will be
converted to a midplane count with a reduction in resolution.
The default value is 1. This limit does not apply to jobs
executed by SlurmUser or user root.
Nodes Comma separated list of nodes (or base partitions for BlueGene
systems) which are associated with this partition. Node names
may be specified using the node range expression syntax
described above. A blank list of nodes (i.e. "Nodes= ") can be
used if one wants a partition to exist, but have no resources
(possibly on a temporary basis). A value of "ALL" is mapped to
all nodes configured in the cluster.
OverSubscribe
Controls the ability of the partition to execute more than one
job at a time on each resource (node, socket or core depending
upon the value of SelectTypeParameters). If resources are to be
over-subscribed, avoiding memory over-subscription is very
important. SelectTypeParameters should be configured to treat
memory as a consumable resource and the --mem option should be
used for job allocations. Sharing of resources is typically
useful only when using gang scheduling
(PreemptMode=suspend,gang). Possible values for OverSubscribe
are "EXCLUSIVE", "FORCE", "YES", and "NO". Note that a value of
"YES" or "FORCE" can negatively impact performance for systems
with many thousands of running jobs. The default value is "NO".
For more information see the following web pages:
http://slurm.schedmd.com/cons_res.html,
http://slurm.schedmd.com/cons_res_share.html,
http://slurm.schedmd.com/gang_scheduling.html, and
http://slurm.schedmd.com/preempt.html.
EXCLUSIVE Allocates entire nodes to jobs even with
select/cons_res configured. Jobs that run in
partitions with "OverSubscribe=EXCLUSIVE" will have
exclusive access to all allocated nodes.
FORCE Makes all resources in the partition available for
sharing without any means for users to disable it.
May be followed with a colon and maximum number of
jobs in running or suspended state. For example
"OverSubscribe=FORCE:4" enables each node, socket or
core to execute up to four jobs at once.
Recommended only for BlueGene systems configured
with small blocks or for systems running with gang
scheduling (PreemptMode=suspend,gang). NOTE:
PreemptType=QOS will permit one additional job to be
run on the partition if started due to job
preemption. For example, a configuration of
OverSubscribe=FORCE:1 will only permit one job per
resources normally, but a second job can be started
if done so through preemption based upon QOS. The
use of PreemptType=QOS and PreemptType=Suspend only
applies with SelectType=cons_res.
YES Makes all resources in the partition available for
sharing upon request by the job. Resources will
only be over-subscribed when explicitly requested by
the user using the "--share" option on job
submission. May be followed with a colon and
maximum number of jobs in running or suspended
state. For example "OverSubscribe=YES:4" enables
each node, socket or core to execute up to four jobs
at once. Recommended only for systems running with
gang scheduling (PreemptMode=suspend,gang).
NO Selected resources are allocated to a single job. No
resource will be allocated to more than one job.
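For example, a partition intended for use with gang scheduling
might permit up to two jobs per resource (a sketch only; the
cluster-level PreemptMode must also be configured as described
above):
PartitionName=shared Nodes=node[001-032] OverSubscribe=FORCE:2 State=UP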
PartitionName
Name by which the partition may be referenced (e.g.
"Interactive"). This name can be specified by users when
submitting jobs. If the PartitionName is "DEFAULT", the values
specified with that record will apply to subsequent partition
specifications unless explicitly set to other values in that
partition record or replaced with a different set of default
values. Each line where PartitionName is "DEFAULT" will replace
or add to previous default values and not reinitialize the
default values.
PreemptMode
Mechanism used to preempt jobs from this partition when
PreemptType=preempt/partition_prio is configured. This
partition specific PreemptMode configuration parameter will
override the PreemptMode configuration parameter set for the
cluster as a whole. The cluster-level PreemptMode must include
the GANG option if PreemptMode is configured to SUSPEND for any
partition. The cluster-level PreemptMode must not be OFF if
PreemptMode is enabled for any partition. See the description
of the cluster-level PreemptMode configuration parameter above
for further information.
PriorityJobFactor
Partition factor used by priority/multifactor plugin in
calculating job priority. The value may not exceed 65533. Also
see PriorityTier.
PriorityTier
Jobs submitted to a partition with a higher priority tier value
will be dispatched before pending jobs in partition with lower
priority tier value and, if possible, they will preempt running
jobs from partitions with lower priority tier values. Note that
a partition's priority tier takes precedence over a job's
priority. The value may not exceed 65533. Also see
PriorityJobFactor.
QOS Used to extend the limits available to a QOS on a partition.
Jobs will not be associated to this QOS outside of being
associated to the partition. They will still be associated to
their requested QOS. By default, no QOS is used. NOTE: If a
limit is set in both the Partition's QOS and the Job's QOS the
Partition QOS will be honored unless the Job's QOS has the
OverPartQOS flag set, in which case the Job's QOS will have priority.
ReqResv
Specifies users of this partition are required to designate a
reservation when submitting a job. This option can be useful in
restricting usage of a partition that may have higher priority
or additional resources to be allowed only within a reservation.
Possible values are "YES" and "NO". The default value is "NO".
RootOnly
Specifies if only user ID zero (i.e. user root) may allocate
resources in this partition. User root may allocate resources
for any other user, but the request must be initiated by user
root. This option can be useful for a partition to be managed
by some external entity (e.g. a higher-level job manager) and
prevents users from directly using those resources. Possible
values are "YES" and "NO". The default value is "NO".
SelectTypeParameters
Partition-specific resource allocation type. This option
replaces the global SelectTypeParameters value. Supported
values are CR_Core, CR_Core_Memory, CR_Socket and
CR_Socket_Memory. Use requires the system-wide
SelectTypeParameters value be set.
Shared The Shared configuration parameter has been replaced by the
OverSubscribe parameter described above.
State State of partition or availability for use. Possible values are
"UP", "DOWN", "DRAIN" and "INACTIVE". The default value is "UP".
See also the related "Alternate" keyword.
UP Designates that new jobs may be queued on the partition,
and that jobs may be allocated nodes and run from the
partition.
DOWN Designates that new jobs may be queued on the
partition, but queued jobs may not be allocated nodes
and run from the partition. Jobs already running on
the partition continue to run. The jobs must be
explicitly canceled to force their termination.
DRAIN Designates that no new jobs may be queued on the
partition (job submission requests will be denied with
an error message), but jobs already queued on the
partition may be allocated nodes and run. See also
the "Alternate" partition specification.
INACTIVE Designates that no new jobs may be queued on the
partition, and jobs already queued may not be
allocated nodes and run. See also the "Alternate"
partition specification.
TRESBillingWeights
TRESBillingWeights is used to define the billing weights of each
TRES type that will be used in calculating the usage of a job.
Billing weights are specified as a comma-separated list of <TRES
Type>=<TRES Billing Weight> pairs.
Any TRES Type is available for billing. Note that the base unit
for memory and burst buffers is megabytes.
By default the billing of TRES is calculated as the sum of all
TRES types multiplied by their corresponding billing weight.
The weighted amount of a resource can be adjusted by adding a
suffix of K,M,G,T or P after the billing weight. For example, a
memory weight of "mem=.25" on a job allocated 8GB will be billed
2048 (8192MB *.25) units. A memory weight of "mem=.25G" on the
same job will be billed 2 (8192MB * (.25/1024)) units.
When a job is allocated 1 CPU and 8 GB of memory on a partition
configured with
TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0", the
billable TRES will be: (1*1.0) + (8*0.25) + (0*2.0) = 3.0.
If PriorityFlags=MAX_TRES is configured, the billable TRES is
calculated as the MAX of individual TRES' on a node (e.g. cpus,
mem, gres) plus the sum of all global TRES' (e.g. licenses).
Using the same example above the billable TRES will be
MAX(1*1.0, 8*0.25) + (0*2.0) = 2.0.
If TRESBillingWeights is not defined then the job is billed
against the total number of allocated CPUs.
NOTE: TRESBillingWeights is only used when calculating fairshare
and doesn't affect job priority directly as it is currently not
used for the size of the job. If you want TRES' to play a role
in the job's priority then refer to the PriorityWeightTRES
option.
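The worked example above corresponds to a partition definition along these lines (partition and node names hypothetical):
PartitionName=normal Nodes=tux[0-127] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"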
There are a variety of prolog and epilog program options that execute
with various permissions and at various times. The four options most
likely to be used are: Prolog and Epilog (executed once on each compute
node for each job) plus PrologSlurmctld and EpilogSlurmctld (executed
once on the ControlMachine for each job).
NOTE: Standard output and error messages are normally not preserved.
Explicitly write output and error messages to an appropriate location
if you wish to preserve that information.
NOTE: By default the Prolog script is ONLY run on any individual node
when it first sees a job step from a new allocation; it does not run
the Prolog immediately when an allocation is granted. If no job steps
from an allocation are run on a node, it will never run the Prolog for
that allocation. This Prolog behaviour can be changed by the
PrologFlags parameter. The Epilog, on the other hand, always runs on
every node of an allocation when the allocation is released.
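For example, to have the Prolog run on every allocated node as soon as the allocation is granted, rather than at the first job step, a line such as the following could be added to slurm.conf (see the cluster-wide PrologFlags parameter for the supported flags):
PrologFlags=Alloc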
If the Epilog fails (returns a non-zero exit code), this will result in
the node being set to a DRAIN state. If the EpilogSlurmctld fails
(returns a non-zero exit code), this will only be logged. If the
Prolog fails (returns a non-zero exit code), this will result in the
node being set to a DRAIN state and the job being requeued in a held
state unless nohold_on_prolog_fail is configured in
SchedulerParameters. If the PrologSlurmctld fails (returns a non-zero
exit code), this will result in the job being requeued to execute on
another node if possible. Only batch jobs can be requeued.
Interactive jobs (salloc and srun) will be cancelled if the
PrologSlurmctld fails.
Information about the job is passed to the script using environment
variables. Unless otherwise specified, these environment variables are
available to all of the programs.
BASIL_RESERVATION_ID
Basil reservation ID. Available on Cray systems with ALPS only.
MPIRUN_PARTITION
BlueGene partition name. Available on BlueGene systems only.
SLURM_ARRAY_JOB_ID
If this job is part of a job array, this will be set to the job
ID. Otherwise it will not be set. To reference this specific
task of a job array, combine SLURM_ARRAY_JOB_ID with
SLURM_ARRAY_TASK_ID (e.g. "scontrol update
${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ..."). Available in
PrologSlurmctld and EpilogSlurmctld only.
SLURM_ARRAY_TASK_ID
If this job is part of a job array, this will be set to the task
ID. Otherwise it will not be set. To reference this specific
task of a job array, combine SLURM_ARRAY_JOB_ID with
SLURM_ARRAY_TASK_ID (e.g. "scontrol update
${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ..."). Available in
PrologSlurmctld and EpilogSlurmctld only.
SLURM_ARRAY_TASK_MAX
If this job is part of a job array, this will be set to the
maximum task ID. Otherwise it will not be set. Available in
PrologSlurmctld and EpilogSlurmctld only.
SLURM_ARRAY_TASK_MIN
If this job is part of a job array, this will be set to the
minimum task ID. Otherwise it will not be set. Available in
PrologSlurmctld and EpilogSlurmctld only.
SLURM_ARRAY_TASK_STEP
If this job is part of a job array, this will be set to the step
size of task IDs. Otherwise it will not be set. Available in
PrologSlurmctld and EpilogSlurmctld only.
SLURM_CLUSTER_NAME
Name of the cluster executing the job.
SLURM_JOB_ACCOUNT
Account name used for the job. Available in PrologSlurmctld and
EpilogSlurmctld only.
SLURM_JOB_CONSTRAINTS
Features required to run the job. Available in Prolog,
PrologSlurmctld and EpilogSlurmctld only.
SLURM_JOB_DERIVED_EC
The highest exit code of all of the job steps. Available in
EpilogSlurmctld only.
SLURM_JOB_EXIT_CODE
The exit code of the job script (or salloc). The value is the
status as returned by the wait() system call (see wait(2)).
Available in EpilogSlurmctld only.
SLURM_JOB_EXIT_CODE2
The exit code of the job script (or salloc). The value has the
format <exit>:<sig>. The first number is the exit code,
typically as set by the exit() function. The second number is
the signal that caused the process to terminate, if it was
terminated by a signal. Available in EpilogSlurmctld only.
SLURM_JOB_GID
Group ID of the job's owner. Available in PrologSlurmctld and
EpilogSlurmctld only.
SLURM_JOB_GPUS
GPU IDs allocated to the job (if any). Available in the Prolog
only.
SLURM_JOB_GROUP
Group name of the job's owner. Available in PrologSlurmctld and
EpilogSlurmctld only.
SLURM_JOB_ID
Job ID. CAUTION: If this job is the first task of a job array,
then Slurm commands using this job ID will refer to the entire
job array rather than this specific task of the job array.
SLURM_JOB_NAME
Name of the job. Available in PrologSlurmctld and
EpilogSlurmctld only.
SLURM_JOB_NODELIST
Nodes assigned to job. A Slurm hostlist expression. "scontrol
show hostnames" can be used to convert this to a list of
individual host names. Available in PrologSlurmctld and
EpilogSlurmctld only.
SLURM_JOB_PARTITION
Partition that job runs in. Available in Prolog,
PrologSlurmctld and EpilogSlurmctld only.
SLURM_JOB_UID
User ID of the job's owner.
SLURM_JOB_USER
User name of the job's owner.
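As a brief illustration of how these variables might be consumed, here is a minimal EpilogSlurmctld sketch that simply logs a few of them; the log path is hypothetical and the script must meet the permission requirements described in FILE AND DIRECTORY PERMISSIONS below:
#!/bin/bash
# Minimal EpilogSlurmctld sketch: record job completion details.
# Runs on the ControlMachine as SlurmUser once per job allocation.
LOG=/var/log/slurm/job_completions.log   # hypothetical location
printf '%s job=%s user=%s account=%s partition=%s nodes=%s exit=%s\n' \
    "$(date '+%F %T')" "$SLURM_JOB_ID" "$SLURM_JOB_USER" \
    "$SLURM_JOB_ACCOUNT" "$SLURM_JOB_PARTITION" \
    "$SLURM_JOB_NODELIST" "$SLURM_JOB_EXIT_CODE2" >> "$LOG"
exit 0    # a non-zero exit here would only be logged (see above)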
Slurm is able to optimize job allocations to minimize network contention. Special Slurm logic is used to optimize allocations on systems with a three-dimensional interconnect (BlueGene, etc.) and information about configuring those systems is available here: <http://slurm.schedmd.com/>. For a hierarchical network, Slurm needs to have detailed information about how nodes are configured on the network switches. Given network topology information, Slurm allocates all of a job's resources onto a single leaf of the network (if possible) using a best-fit algorithm. Otherwise it will allocate a job's resources onto multiple leaf switches so as to minimize the use of higher-level switches. The TopologyPlugin parameter controls which plugin is used to collect network topology information. The only values presently supported are "topology/3d_torus" (default for IBM BlueGene and Cray XT/XE systems, performs best-fit logic over three-dimensional topology), "topology/none" (default for other systems, best-fit logic over one-dimensional topology), and "topology/tree" (determine the network topology based upon information contained in a topology.conf file, see "man topology.conf" for more information). Future plugins may gather topology information directly from the network. The topology information is optional. If not provided, Slurm will perform a best-fit algorithm assuming the nodes are in a one-dimensional array as configured and the communications cost is related to the node distance in this array.
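When TopologyPlugin=topology/tree is used, the switch hierarchy is described in topology.conf; a minimal sketch of a two-level tree (switch and node names hypothetical) might look like:
# topology.conf
SwitchName=leaf0 Nodes=tux[0-15]
SwitchName=leaf1 Nodes=tux[16-31]
SwitchName=root Switches=leaf[0-1]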
If the cluster's computers used for the primary or backup controller will be out of service for an extended period of time, it may be desirable to relocate them. In order to do so, follow this procedure:
1. Stop the Slurm daemons
2. Modify the slurm.conf file appropriately
3. Distribute the updated slurm.conf file to all nodes
4. Restart the Slurm daemons
There should be no loss of any running or pending jobs. Ensure that any nodes added to the cluster have the current slurm.conf file installed. CAUTION: If two nodes are simultaneously configured as the primary controller (two nodes on which ControlMachine specify the local host and the slurmctld daemon is executing on each), system behavior will be destructive. If a compute node has an incorrect ControlMachine or BackupController parameter, that node may be rendered unusable, but no other harm will result.
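A sketch of that procedure, assuming systemd-managed daemons and that pdsh/pdcp are available with a hypothetical hosts.txt listing every node, might look like:
systemctl stop slurmctld                            # 1. on the controller(s)
pdsh -w ^hosts.txt systemctl stop slurmd            # 1. on the compute nodes
vi /etc/slurm.conf                                  # 2. update ControlMachine, ControlAddr, etc.
pdcp -w ^hosts.txt /etc/slurm.conf /etc/slurm.conf  # 3. distribute the file
pdsh -w ^hosts.txt systemctl start slurmd           # 4. restart the daemons
systemctl start slurmctld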
#
# Sample /etc/slurm.conf for dev[0-25].llnl.gov
# Author: John Doe
# Date: 11/06/2001
#
ControlMachine=dev0
ControlAddr=edev0
BackupController=dev1
BackupAddr=edev1
#
AuthType=auth/munge
Epilog=/usr/local/slurm/epilog
Prolog=/usr/local/slurm/prolog
FastSchedule=1
FirstJobId=65536
InactiveLimit=120
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp
KillWait=30
MaxJobCount=10000
MinJobAge=3600
PluginDir=/usr/local/lib:/usr/local/slurm/lib
ReturnToService=0
SchedulerType=sched/backfill
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldPort=7002
SlurmdPort=7003
SlurmdSpoolDir=/usr/local/slurm/slurmd.spool
StateSaveLocation=/usr/local/slurm/slurm.state
SwitchType=switch/none
TmpFS=/tmp
WaitTime=30
JobCredentialPrivateKey=/usr/local/slurm/private.key
JobCredentialPublicCertificate=/usr/local/slurm/public.cert
#
# Node Configurations
#
NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000
NodeName=DEFAULT State=UNKNOWN
NodeName=dev[0-25] NodeAddr=edev[0-25] Weight=16
# Update records for specific DOWN nodes
DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25"
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
PartitionName=debug Nodes=dev[0-8,18-25] Default=YES
PartitionName=batch Nodes=dev[9-17] MinNodes=4
PartitionName=long Nodes=dev[9-17] MaxTime=120 AllowGroups=admin
The "include" key word can be used with modifiers within the specified pathname. These modifiers would be replaced with cluster name or other information depending on which modifier is specified. If the included file is not an absolute path name (i.e. it does not start with a slash), it will searched for in the same directory as the slurm.conf file. %c Cluster name specified in the slurm.conf will be used. EXAMPLE ClusterName=linux include /home/slurm/etc/%c_config # Above line interpreted as # "include /home/slurm/etc/linux_config"
There are three classes of files: Files used by slurmctld must be
accessible by user SlurmUser and accessible by the primary and backup
control machines. Files used by slurmd must be accessible by user root
and accessible from every compute node. A few files need to be
accessible by normal users on all login and compute nodes. While many
files and directories are listed below, most of them will not be used
with most configurations.
AccountingStorageLoc
If this specifies a file, it must be writable by user SlurmUser.
The file must be accessible by the primary and backup control
machines. It is recommended that the file be readable by all
users from login and compute nodes.
Epilog Must be executable by user root. It is recommended that the
file be readable by all users. The file must exist on every
compute node.
EpilogSlurmctld
Must be executable by user SlurmUser. It is recommended that
the file be readable by all users. The file must be accessible
by the primary and backup control machines.
HealthCheckProgram
Must be executable by user root. It is recommended that the
file be readable by all users. The file must exist on every
compute node.
JobCheckpointDir
Must be writable by user SlurmUser and no other users. The file
must be accessible by the primary and backup control machines.
JobCompLoc
If this specifies a file, it must be writable by user SlurmUser.
The file must be accessible by the primary and backup control
machines.
JobCredentialPrivateKey
Must be readable only by user SlurmUser and writable by no other
users. The file must be accessible by the primary and backup
control machines.
JobCredentialPublicCertificate
Readable to all users on all nodes. Must not be writable by
regular users.
MailProg
Must be executable by user SlurmUser. Must not be writable by
regular users. The file must be accessible by the primary and
backup control machines.
Prolog Must be executable by user root. It is recommended that the
file be readable by all users. The file must exist on every
compute node.
PrologSlurmctld
Must be executable by user SlurmUser. It is recommended that
the file be readable by all users. The file must be accessible
by the primary and backup control machines.
ResumeProgram
Must be executable by user SlurmUser. The file must be
accessible by the primary and backup control machines.
SallocDefaultCommand
Must be executable by all users. The file must exist on every
login and compute node.
slurm.conf
Readable to all users on all nodes. Must not be writable by
regular users.
SlurmctldLogFile
Must be writable by user SlurmUser. The file must be accessible
by the primary and backup control machines.
SlurmctldPidFile
Must be writable by user root. Preferably writable and
removable by SlurmUser. The file must be accessible by the
primary and backup control machines.
SlurmdLogFile
Must be writable by user root. A distinct file must exist on
each compute node.
SlurmdPidFile
Must be writable by user root. A distinct file must exist on
each compute node.
SlurmdSpoolDir
Must be writable by user root. A distinct file must exist on
each compute node.
SrunEpilog
Must be executable by all users. The file must exist on every
login and compute node.
SrunProlog
Must be executable by all users. The file must exist on every
login and compute node.
StateSaveLocation
Must be writable by user SlurmUser. The file must be accessible
by the primary and backup control machines.
SuspendProgram
Must be executable by user SlurmUser. The file must be
accessible by the primary and backup control machines.
TaskEpilog
Must be executable by all users. The file must exist on every
compute node.
TaskProlog
Must be executable by all users. The file must exist on every
compute node.
UnkillableStepProgram
Must be executable by user SlurmUser. The file must be
accessible by the primary and backup control machines.
Note that while Slurm daemons create log files and other files as
needed, they treat the lack of parent directories as a fatal error.
This prevents the daemons from running if critical file systems are not
mounted and will minimize the risk of cold-starting (starting without
preserving jobs).
Log files and job accounting files may need to be created/owned by the
"SlurmUser" uid to be successfully accessed. Use the "chown" and
"chmod" commands to set the ownership and permissions appropriately.
See the section FILE AND DIRECTORY PERMISSIONS for information about
the various files and directories used by Slurm.
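For example, assuming SlurmUser=slurm and the log locations used in the sample configuration above, ownership and permissions might be set as follows:
chown slurm /var/log/slurm/slurmctld.log /var/log/slurm/jobcomp
chmod 640 /var/log/slurm/slurmctld.log /var/log/slurm/jobcomp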
It is recommended that the logrotate utility be used to ensure that
various log files do not become too large. This also applies to text
files used for accounting, process tracking, and the slurmdbd log if
they are used.
Here is a sample logrotate configuration. Make appropriate site
modifications and save as /etc/logrotate.d/slurm on all nodes. See the
logrotate man page for more details.
##
# Slurm Logrotate Configuration
##
/var/log/slurm/*log {
compress
missingok
nocopytruncate
nocreate
nodelaycompress
nomail
notifempty
noolddir
rotate 5
sharedscripts
size=5M
create 640 slurm root
postrotate
/etc/init.d/slurm reconfig
endscript
}
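The resulting configuration can be checked without rotating anything by running logrotate in debug mode, for example:
logrotate -d /etc/logrotate.d/slurm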
Copyright (C) 2002-2007 The Regents of the University of California. Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER). Copyright (C) 2008-2010 Lawrence Livermore National Security. Copyright (C) 2010-2016 SchedMD LLC. This file is part of Slurm, a resource management program. For details, see <http://slurm.schedmd.com/>. Slurm is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
/etc/slurm.conf
bluegene.conf(5), cgroup.conf(5), gethostbyname (3), getrlimit (2), gres.conf(5), group (5), hostname (1), scontrol(1), slurmctld(8), slurmd(8), slurmdbd(8), slurmdbd.conf(5), srun(1), spank(8), syslog (2), topology.conf(5), wiki.conf(5)