opensm(8)

NAME

   opensm - InfiniBand subnet manager and administration (SM/SA)

SYNOPSIS

   opensm  [--version]  [-F  |  --config  <file_name>]  [-c(reate-config)
   <file_name>]  [-g(uid)  <GUID  in  hex>]  [-l(mc)  <LMC>]  [-p(riority)
   <PRIORITY>] [--smkey <SM_Key>] [--sm_sl <SL number>] [-r(eassign_lids)]
   [-R   <engine   name(s)>   |   --routing_engine    <engine    name(s)>]
   [--do_mesh_analysis] [--lash_start_vl <vl number>] [-A | --ucast_cache]
   [-z | --connect_roots] [-M <file name> | --lid_matrix_file <file name>]
   [-U  <file  name>  |  --lfts_file  <file name>] [-S | --sadb_file <file
   name>] [-a | --root_guid_file <path  to  file>]  [-u  |  --cn_guid_file
   <path  to file>] [-G | --io_guid_file <path to file>] [--port-shifting]
   [--scatter-ports <random seed>] [-H | --max_reverse_hops  <max  reverse
   hops  allowed>]  [-X  | --guid_routing_order_file <path to file>] [-m |
   --ids_guid_file  <path  to  file>]  [-o(nce)]   [-s(weep)   <interval>]
   [-t(imeout)  <milliseconds>]  [--retries <number>] [--maxsmps <number>]
   [--console [off | local | socket | loopback]]  [--console-port  <port>]
   [-i    |    --ignore_guids    <equalize-ignore-guids-file>]    [-w    |
   --hop_weights_file <path to file>]  [-O  |  --port_search_ordering_file
   <path  to  file>]  [-O | --dimn_ports_file <path to file>] (DEPRECATED)
   [-f <log file path> | --log_file <log file path> ]  [-L  |  --log_limit
   <size in MB>] [-e(rase_log_file)] [-P(config) <partition config file> ]
   [-N | --no_part_enforce] (DEPRECATED) [-Z | --part_enforce [both | in |
   out   |   off]]   [-W   |   --allow_both_pkeys]   [-Q  |  --qos  [-Y  |
   --qos_policy_file <file name>]] [--congestion_control] [--cc_key <key>]
   [-y | --stay_on_fatal] [-B | --daemon] [-J | --pidfile <file_name>] [-I
   |   --inactive]    [--perfmgr]    [--perfmgr_sweep_time_s    <seconds>]
   [--prefix_routes_file        <path>]       [--consolidate_ipv6_snm_req]
   [--log_prefix  <prefix   text>]   [--torus_config   <path   to   file>]
   [-v(erbose)] [-V] [-D <flags>] [-d(ebug) <number>] [-h(elp)] [-?]

DESCRIPTION

   opensm  is  an  InfiniBand compliant Subnet Manager and Administration,
   and runs on top of OpenIB.

   opensm provides an implementation of an InfiniBand Subnet  Manager  and
   Administration.  Such a software entity must run in order to initialize
   the InfiniBand hardware (at least one per InfiniBand subnet).

   opensm also contains an experimental version of a performance manager.

   opensm defaults were designed to meet the common case usage on clusters
   with up to a few hundred nodes. Thus, in this default mode, opensm will
   scan the IB fabric, initialize it, and sweep occasionally for changes.

   opensm attaches to  a  specific  IB  port  on  the  local  machine  and
   configures  only  the fabric connected to it. (If the local machine has
   other IB ports, opensm will ignore the fabrics connected to those other
   ports).  If  no  port  is  specified,  it  will select the first "best"
   available port.

   opensm can present the available ports and prompt for a port number  to
   attach to.

   By  default,  the  run  is  logged  to two files: /var/log/messages and
   /var/log/opensm.log.  The first file will register only  general  major
   events, whereas the second will include details of reported errors. All
   errors reported in this second file should be treated as indicators  of
   IB  fabric  health issues.  (Note that when a fatal and non-recoverable
   error occurs, opensm will exit.)  Both log  files  should  include  the
   message "SUBNET UP" if opensm was able to set up the subnet correctly.
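
   As a rough illustration of the check described above, the following
   Python sketch scans log text for the "SUBNET UP" marker.  The function
   name and the sample log lines are hypothetical; only the marker string
   comes from this page.

```python
def subnet_is_up(log_text: str) -> bool:
    """Return True if any line of the log text contains the "SUBNET UP" marker."""
    return any("SUBNET UP" in line for line in log_text.splitlines())


# Hypothetical log excerpt; real opensm log lines have a different layout.
sample = "entering discovery\nSUBNET UP\n"
print(subnet_is_up(sample))
```

   To check a real run, pass in the contents of /var/log/opensm.log (or
   of the file given with -f).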

OPTIONS

   --version
          Prints OpenSM version and exits.

   -F, --config <config file>
          The name of the OpenSM config file.  When  not  specified,
          /etc/opensm/opensm.conf will be used (if it exists).

   -c, --create-config <file name>
          OpenSM will dump its configuration to  the  specified  file  and
          exit.  This is a way to generate an OpenSM configuration file
          template.

   -g, --guid <GUID in hex>
          This option specifies the  local  port  GUID  value  with  which
          OpenSM  should  bind.   OpenSM may be bound to 1 port at a time.
          If GUID given is 0, OpenSM displays  a  list  of  possible  port
          GUIDs and waits for user input.  Without -g, OpenSM tries to use
          the default port.

   -l, --lmc <LMC value>
          This option specifies the subnet's LMC  value.   The  number  of
          LIDs  assigned  to each port is 2^LMC.  The LMC value must be in
          the range 0-7.  LMC values >  0  allow  multiple  paths  between
          ports.   LMC  values  >  0  should  only  be  used if the subnet
          topology actually provides multiple paths  between  ports,  i.e.
          multiple  interconnects  between  switches.   Without -l, OpenSM
          defaults to LMC = 0, which  allows  one  path  between  any  two
          ports.
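
          The 2^LMC relationship above can be sketched in Python as
          follows.  The function name is hypothetical; the range check
          mirrors the 0-7 limit stated above.

```python
def lids_per_port(lmc: int) -> int:
    """Number of LIDs assigned to each port for a given LMC (2^LMC)."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 1 << lmc


for lmc in (0, 1, 3, 7):
    print(lmc, lids_per_port(lmc))
```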

   -p, --priority <Priority value>
          This option specifies the SM's priority.  This will  affect  the
          handover cases, where master is chosen  by  priority  and  GUID.
          Range goes from 0 (default and lowest priority) to 15 (highest).
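
          A minimal sketch of master selection by priority and GUID
          follows.  This page does not state the direction of the GUID
          tie-break; the sketch assumes that the higher priority wins
          and that the lower GUID wins a tie.  The SM type and the
          function name are hypothetical.

```python
from typing import NamedTuple


class SM(NamedTuple):
    guid: int      # port GUID of the SM
    priority: int  # 0 (lowest) to 15 (highest)


def elect_master(sms):
    """Pick the SM with the highest priority; ties go to the lowest GUID (assumed)."""
    for sm in sms:
        if not 0 <= sm.priority <= 15:
            raise ValueError("priority must be in the range 0-15")
    # Sort key: highest priority first, then lowest GUID.
    return min(sms, key=lambda sm: (-sm.priority, sm.guid))


candidates = [SM(guid=0x10, priority=0), SM(guid=0x20, priority=5)]
print(hex(elect_master(candidates).guid))
```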

   --smkey <SM_Key value>
          This option specifies the SM's SM_Key (64 bits).  This will
          affect SM authentication.  Note that OpenSM version 3.2.1 and
          below used the default value '1' in host byte order.  This is
          fixed now, but you may need this option to interoperate with an
          old OpenSM running on a little endian machine.

   --sm_sl <SL number>
          This option sets the SL to use for communication with the SM/SA.
          Defaults to 0.

   -r, --reassign_lids
          This option causes OpenSM to reassign LIDs  to  all  end  nodes.
          Specifying  -r  on  a running subnet may disrupt subnet traffic.
          Without -r, OpenSM attempts to preserve existing LID assignments,
          resolving any multiple use of the same LID.

   -R, --routing_engine <Routing engine names>
          This  option chooses routing engine(s) to use instead of Min Hop
          algorithm (default).  Multiple routing engines can be  specified
          separated  by  commas  so  that  specific  ordering  of  routing
          algorithms will be tried if earlier routing  engines  fail.   If
          all  configured routing engines fail, OpenSM will always attempt
          to route with Min Hop unless 'no_fallback' is  included  in  the
          list of routing engines.  Supported engines: minhop, updn, dnup,
          file, ftree, lash, dor, torus-2QoS, dfsssp, sssp.
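
          The comma-separated value for -R can be assembled as in the
          sketch below.  The helper name and its keyword argument are
          hypothetical; the engine names and the 'no_fallback' keyword
          are the ones listed above.

```python
SUPPORTED_ENGINES = {
    "minhop", "updn", "dnup", "file", "ftree",
    "lash", "dor", "torus-2QoS", "dfsssp", "sssp",
}


def routing_engine_arg(engines, allow_minhop_fallback=True):
    """Build the comma-separated value for -R / --routing_engine."""
    unknown = [e for e in engines if e not in SUPPORTED_ENGINES]
    if unknown:
        raise ValueError(f"unsupported routing engine(s): {unknown}")
    parts = list(engines)  # engines are tried in the order given
    if not allow_minhop_fallback:
        parts.append("no_fallback")
    return ",".join(parts)


print(routing_engine_arg(["ftree", "updn"]))
print(routing_engine_arg(["ftree", "updn"], allow_minhop_fallback=False))
```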

   --do_mesh_analysis
          This option enables additional analysis  for  the  lash  routing
          engine  to  precondition  switch  port  assignments  in  regular
          cartesian meshes which may reduce the number of SLs required  to
          give a deadlock free routing.

   --lash_start_vl <vl number>
          This  option  sets  the  starting VL to use for the lash routing
          algorithm.  Defaults to 0.

   -A, --ucast_cache
          This option enables unicast routing cache and  prevents  routing
          recalculation  (which  is  a heavy task in a large cluster) when
          there was no topology change detected during the heavy sweep, or
          when   the   topology   change  does  not  require  new  routing
          calculation, e.g. when one or more CAs/RTRs/leaf switches go
          down, or one or more of these nodes come back after being
          down.  A very common case that is handled by the unicast routing
          cache  is  host  reboot,  which  otherwise  would cause two full
          routing recalculations: one when the host  goes  down,  and  the
          other when the host comes back online.

   -z, --connect_roots
          This option forces the up/down and fat-tree routing engines to
          provide connectivity between root switches, which makes them
          fully IBA compliant.  In many cases this can violate the "pure"
          deadlock-free algorithm, so use it carefully.

   -M, --lid_matrix_file <file name>
          This option specifies the name of the lid matrix dump file  from
          where switch lid matrices (min hops tables) will be loaded.

   -U, --lfts_file <file name>
          This  option  specifies  the  name  of  the LFTs file from where
          switch forwarding  tables  will  be  loaded  when  using  "file"
          routing engine.

   -S, --sadb_file <file name>
          This option specifies the name of the SA DB dump file from where
          SA database will be loaded.

   -a, --root_guid_file <file name>
          Set the root nodes for the Up/Down or Fat-Tree routing algorithm
          to the guids provided in the given file (one to a line).

   -u, --cn_guid_file <file name>
          Set  the  compute  nodes for the Fat-Tree or DFSSSP/SSSP routing
          algorithms to the port GUIDs provided in the given file (one  to
          a line).

   -G, --io_guid_file <file name>
          Set  the  I/O  nodes  for  the  Fat-Tree  or DFSSSP/SSSP routing
          algorithms to the port GUIDs provided in the given file (one  to
          a line).
          In the case of Fat-Tree routing:
          I/O nodes are non-CN nodes allowed to use up to max_reverse_hops
          switches the wrong way around to improve connectivity.
          In the case of (DF)SSSP routing:
          Providing guids of compute and/or I/O  nodes  will  ensure  that
          paths  towards  those  nodes  are  as much separated as possible
          within their node category, i.e., I/O traffic will not share the
          same link if multiple links are available.

   --port-shifting
          This  option  enables  a  feature called port shifting.  In some
          fabrics,  particularly  cluster  environments,  routes  commonly
          align  and  congest  with  other  routes  due to algorithmically
          unchanging traffic patterns.  This routing option  will  "shift"
          routing around in an attempt to alleviate this problem.

   --scatter-ports <random seed>
          This  option  is  used  to  randomize  port selection in routing
          rather  than  using  a  round-robin  algorithm  (which  is   the
          default).  Value  supplied with option is used as a random seed.
          If value is 0, which is the default, the scatter ports option is
          disabled.

   -H, --max_reverse_hops <max reverse hops allowed>
          Set the maximum number of reverse hops an I/O node is allowed to
          make. A reverse hop is the use of a switch the wrong way around.

   -m, --ids_guid_file <file name>
          Name of the map file with the set of IDs that the Up/Down
          routing algorithm will use instead of node GUIDs (format: <guid>
          <id> per line).

   -X, --guid_routing_order_file <file name>
          Set the order port guids will  be  routed  for  the  MinHop  and
          Up/Down  routing  algorithms  to the guids provided in the given
          file (one to a line).

   -o, --once
          This option causes OpenSM to configure  the  subnet  once,  then
          exit.  Ports remain in the ACTIVE state.

   -s, --sweep <interval value>
          This  option  specifies  the  number  of  seconds between subnet
          sweeps.  Specifying -s 0 disables sweeping.  Without -s,  OpenSM
          defaults to a sweep interval of 10 seconds.

   -t, --timeout <value>
          This   option  specifies  the  time  in  milliseconds  used  for
          transaction timeouts.  Timeout values should be  >  0.   Without
          -t, OpenSM defaults to a timeout value of 200 milliseconds.

   --retries <number>
          This   option   specifies   the   number  of  retries  used  for
          transactions.  Without --retries, OpenSM defaults to  3  retries
          for transactions.

   --maxsmps <number>
          This option specifies the number of VL15 SMP MADs allowed on the
          wire at any one time.  Specifying --maxsmps 0  allows  unlimited
          outstanding  SMPs.   Without  --maxsmps,  OpenSM  defaults  to a
          maximum of 4 outstanding SMPs.

   --console [off | local | loopback | socket]
          This option brings up the OpenSM console (default  off).   Note,
          loopback  and  socket  open  a  socket which can be connected to
          WITHOUT CREDENTIALS.  Loopback is safer if  access  to  your  SM
          host  is  controlled.  tcp_wrappers (hosts.[allow|deny]) is used
          with loopback and socket.  loopback  and  socket  will  only  be
          available  if  OpenSM  was  built with --enable-console-loopback
          (default   yes)   and   --enable-console-socket   (default   no)
          respectively.

   --console-port <port>
          Specify an alternate telnet port for the socket console (default
          10000).  Note that this option only appears if OpenSM was  built
          with --enable-console-socket.

   -i, --ignore_guids <equalize-ignore-guids-file>
          This option provides the means to define a set of ports (by node
          guid and port number) that will be  ignored  by  the  link  load
          equalization algorithm.

   -w, --hop_weights_file <path to file>
          This  option  provides weighting factors per port representing a
          hop cost in computing the lid  matrix.   The  file  consists  of
          lines  containing  a switch port GUID (specified as a 64 bit hex
          number, with leading 0x),  output  port  number,  and  weighting
          factor.  Any port not listed in the file defaults to a weighting
          factor of 1.  Lines  starting  with  #  are  comments.   Weights
          affect  only  the  output  route  from  the port, so many useful
          configurations will require weights to be specified in pairs.
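
          A hedged sketch of reading such a file follows.  It assumes
          whitespace-separated fields and decimal port and weight
          values, which this page does not specify.  The function names
          and the sample GUID are hypothetical.

```python
def parse_hop_weights(text):
    """Parse hop-weights text into {(guid, port): weight}.

    Assumes whitespace-separated fields, which the man page does not specify.
    """
    weights = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines carry no data
        guid_s, port_s, weight_s = line.split()
        if not guid_s.lower().startswith("0x"):
            raise ValueError(f"GUID must be hex with leading 0x: {guid_s}")
        weights[(int(guid_s, 16), int(port_s))] = int(weight_s)
    return weights


def hop_weight(weights, guid, port):
    """Ports not listed in the file default to a weighting factor of 1."""
    return weights.get((guid, port), 1)


sample = """\
# switch port GUID   output port   weight
0x0002c90200001234 1 4
0x0002c90200001234 2 2
"""
weights = parse_hop_weights(sample)
print(hop_weight(weights, 0x0002C90200001234, 1))
print(hop_weight(weights, 0x0002C90200001234, 7))
```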

   -O, --port_search_ordering_file <path to file>
          This option tweaks the routing. It is suitable for two cases:

          1. With the DOR routing algorithm.  This option provides a
          mapping between hypercube dimensions and ports on a per switch
          basis for the DOR routing engine.  The file consists of lines
          containing a switch node GUID (specified as a 64 bit hex number,
          with leading 0x) followed by a list of non-zero port numbers,
          separated by spaces, one switch per line.  The order of the
          port numbers is in one to one correspondence to the dimensions.
          Ports not listed on a line are assigned to the remaining
          dimensions, in port order.  Anything after a # is a comment.

          2. With a general routing algorithm.  This option provides the
          order in which ports are chosen for routing from each switch,
          rather than searching for an appropriate port from port 1 to N.
          The file format is the same as in case 1.
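
          The file format described above can be read with a sketch
          like the following.  The function name and the sample GUIDs
          are hypothetical, and port numbers are assumed to be decimal.

```python
def parse_port_ordering(text):
    """Parse lines of '<switch node GUID> <port> <port> ...' into {guid: [ports]}."""
    ordering = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # anything after '#' is a comment
        if not line:
            continue
        guid_s, *port_fields = line.split()
        ports = [int(p) for p in port_fields]
        if any(p == 0 for p in ports):
            raise ValueError("port numbers must be non-zero")
        ordering[int(guid_s, 16)] = ports
    return ordering


sample = """\
0x0002c90200000001 3 1 2   # first switch
0x0002c90200000002 2 4
"""
for guid, ports in parse_port_ordering(sample).items():
    print(hex(guid), ports)
```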

   -O, --dimn_ports_file <path to file> (DEPRECATED)
          This      is      a      deprecated     flag.     Please     use
          --port_search_ordering_file instead.   This  option  provides  a
          mapping  between  hypercube dimensions and ports on a per switch
          basis for the DOR routing engine.  The file  consists  of  lines
          containing a switch node GUID (specified as a 64 bit hex number,
          with leading 0x) followed by a list of  non-zero  port  numbers,
          separated  by  spaces,  one  switch per line.  The order for the
          port numbers is in one to one correspondence to the  dimensions.
          Ports  not  listed  on  a  line  are  assigned  to the remaining
          dimensions, in port order.  Anything after a # is a comment.

   -x, --honor_guid2lid
          This option forces OpenSM to honor the guid2lid  file,  when  it
          comes   out   of  Standby  state,  if  such  file  exists  under
          OSM_CACHE_DIR, and is valid.  By default, this is FALSE.

   -f, --log_file <file name>
          This option defines the log to be the given file.   By  default,
          the  log  goes  to  /var/log/opensm.log.   For  the log to go to
          standard output use -f stdout.

   -L, --log_limit <size in MB>
          This option defines maximal log file size in MB. When  specified
          the log file will be truncated upon reaching this limit.

   -e, --erase_log_file
          This option will cause deletion of the log file (if it already
          exists). By default, the log file is accumulative.

   -P, --Pconfig <partition config file>
          This option defines the optional partition  configuration  file.
          The default name is /etc/opensm/partitions.conf.

   --prefix_routes_file <file name>
          Prefix routes control how the SA responds to path record queries
          for off-subnet DGIDs.  By default, the SA  fails  such  queries.
          The  PREFIX  ROUTES  section  below  describes the format of the
          configuration     file.       The      default      path      is
          /etc/opensm/prefix-routes.conf.

   -Q, --qos
          This option enables QoS setup. It is disabled by default.

   -Y, --qos_policy_file <file name>
          This  option  defines  the optional QoS policy file. The default
          name         is         /etc/opensm/qos-policy.conf.         See
          QoS_management_in_OpenSM.txt  in opensm doc for more information
          on configuring QoS policy via this file.

   --congestion_control
          (EXPERIMENTAL)   This   option   enables   congestion    control
          configuration.   It is disabled by default.  See config file for
          congestion control configuration options.

   --cc_key <key>
          (EXPERIMENTAL)  This  option  configures  the  CCkey to use when
          configuring congestion control.  Note that this option does  not
          configure a new CCkey into switches and CAs.  Defaults to 0.

   -N, --no_part_enforce (DEPRECATED)
          This  is  a  deprecated flag. Please use --part_enforce instead.
          This option disables partition enforcement  on  switch  external
          ports.

   -Z, --part_enforce [both | in | out | off]
          This  option  indicates  the  partition  enforcement  type  (for
          switches).  Enforcement type can be inbound only (in),  outbound
          only (out), both or disabled (off). Default is both.

   -W, --allow_both_pkeys
          This  option  indicates whether both full and limited membership
          on the same  partition  can  be  configured  in  the  PKeyTable.
          Default is not to allow both pkeys.

   -y, --stay_on_fatal
          This  option  will  cause SM not to exit on fatal initialization
          issues: if SM discovers duplicated guids or a 12x link with lane
          reversal  badly  configured.   By  default,  the SM will exit on
          these errors.

   -B, --daemon
          Run in daemon mode - OpenSM will run in the background.

   -J, --pidfile <file_name>
          Makes the SM write its  own  PID  to  the  specified  file  when
          started in daemon mode.

   -I, --inactive
          Start SM in inactive rather than init SM state.  This option can
          be used  in  conjunction  with  the  perfmgr  so  as  to  run  a
          standalone  performance manager without SM/SA.  However, this is
          NOT currently implemented in the performance manager.

   --perfmgr
          Enable the perfmgr.  Only takes effect if  --enable-perfmgr  was
          specified  at configure time.  See performance-manager-HOWTO.txt
          in opensm doc for more information on running perfmgr.

   --perfmgr_sweep_time_s <seconds>
          Specify the sweep time for the performance  manager  in  seconds
          (default is 180 seconds).  Only takes effect if --enable-perfmgr
          was specified at configure time.

   --consolidate_ipv6_snm_req
          Use shared MLID for IPv6 Solicited  Node  Multicast  groups  per
          MGID scope and P_Key.

   --log_prefix <prefix text>
          This  option  specifies  the  prefix to the syslog messages from
          OpenSM.  A suitable prefix can be used to identify the IB subnet
          in syslog messages when two or more instances of OpenSM run in a
          single node to manage multiple fabrics. For example, in a  dual-
          fabric  (or  dual-rail)  IB  cluster,  the  prefix for the first
          fabric could be "mpi" and the other fabric could be "storage".

   --torus_config <path to torus-2QoS config file>
          This option defines the file name for  the  extra  configuration
          information  needed  for  the  torus-2QoS  routing engine.   The
          default name is /etc/opensm/torus-2QoS.conf

   -v, --verbose
          This option increases the log verbosity level.   The  -v  option
          may   be  specified  multiple  times  to  further  increase  the
          verbosity level.  See the -D option for more  information  about
          log verbosity.

   -V     This  option  sets  the  maximum  verbosity level and forces log
          flushing.  The -V option is equivalent to -D 0xFF -d  2.   See
          the -D option for more information about log verbosity.

   -D <value>
          This  option  sets  the log verbosity level.  A flags field must
          follow  the  -D  option.   A  bit   set/clear   in   the   flags
          enables/disables a specific log level as follows:

           BIT    LOG LEVEL ENABLED
           ----   -----------------
           0x01 - ERROR (error messages)
           0x02 - INFO (basic messages, low volume)
           0x04 - VERBOSE (interesting stuff, moderate volume)
           0x08 - DEBUG (diagnostic, high volume)
           0x10 - FUNCS (function entry/exit, very high volume)
           0x20 - FRAMES (dumps all SMP and GMP frames)
           0x40 - ROUTING (dump FDB routing information)
           0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM
                  logging)

          Without -D, OpenSM defaults to ERROR + INFO  (0x3).   Specifying
          -D  0  disables  all  messages.   Specifying -D 0xFF enables all
          messages (see -V).  High verbosity levels may require increasing
          the transaction timeout with the -t option.
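
          The flags field is a plain bitmask, so values can be composed
          and decoded as in the sketch below.  The helper names are
          hypothetical; the bit values and level names are taken from
          the table above.

```python
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
    0x80: "SYS",
}


def decode_log_flags(flags):
    """List the log levels enabled by a -D flags value."""
    return [name for bit, name in LOG_LEVELS.items() if flags & bit]


def encode_log_flags(*names):
    """Combine level names into a -D flags value."""
    by_name = {name: bit for bit, name in LOG_LEVELS.items()}
    flags = 0
    for name in names:
        flags |= by_name[name]
    return flags


print(decode_log_flags(0x03))  # the default: ERROR and INFO
print(hex(encode_log_flags("ERROR", "INFO", "DEBUG")))
```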

   -d, --debug <value>
          This  option  specifies  a  debug option.  These options are not
          normally needed.  The number  following  -d  selects  the  debug
          option to enable as follows:

           OPT   Description
           ---    -----------------
           -d0  - Ignore other SM nodes
           -d1  - Force single threaded dispatching
           -d2  - Force log flushing after each log message
           -d3  - Disable multicast support

   -h, --help
          Display this usage info then exit.

   -?     Display this usage info then exit.

ENVIRONMENT VARIABLES

   The following environment variables control opensm behavior:

   OSM_TMP_DIR  -  controls  the  directory  in  which the temporary files
   generated by opensm are created. These  files  are:  opensm-subnet.lst,
   opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.

   OSM_CACHE_DIR  -  opensm  stores  certain  data  to  the disk such that
   subsequent  runs  are  consistent.  The  default  directory   used   is
   /var/cache/opensm.  The following files are included in it:

    guid2lid  - stores the LID range assigned to each GUID
    guid2mkey - stores the MKey previously assigned to each GUID
    neighbors - stores a map of the GUIDs at either end of each link
                in the fabric

NOTES

   When  opensm receives a HUP signal, it starts a new heavy sweep as if a
   trap was received or a topology change was found.

   Also, SIGUSR1 can be used to trigger a  reopen  of  /var/log/opensm.log
   for logrotate purposes.
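
   The two signals can be sent from a script, for example from a
   logrotate hook.  The sketch below assumes that opensm was started
   with -B and -J and that the pidfile holds the PID as a decimal
   number; the helper names and the pidfile path in the usage comment
   are hypothetical.

```python
import os
import signal


def signal_opensm(pidfile, sig, kill=os.kill):
    """Read the PID written by -J/--pidfile and send it the given signal."""
    with open(pidfile) as f:
        pid = int(f.read().strip())
    kill(pid, sig)  # 'kill' is injectable so the helper can be tested safely
    return pid


def request_heavy_sweep(pidfile, kill=os.kill):
    """SIGHUP makes opensm start a new heavy sweep."""
    return signal_opensm(pidfile, signal.SIGHUP, kill)


def request_log_reopen(pidfile, kill=os.kill):
    """SIGUSR1 makes opensm reopen its log file."""
    return signal_opensm(pidfile, signal.SIGUSR1, kill)


# Usage (hypothetical path):
# request_log_reopen("/var/run/opensm.pid")
```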

PARTITION CONFIGURATION

   The   default   name   of   OpenSM  partitions  configuration  file  is
   /etc/opensm/partitions.conf. The default may be changed  by  using  the
   --Pconfig (-P) option with OpenSM.

   The  default  partition  will be created by OpenSM unconditionally even
   when partition configuration file does not exist or cannot be accessed.

   The default partition has P_Key value 0x7fff. OpenSM's port will always
   have  full  membership  in  default partition. All other end ports will
   have full membership if the partition configuration file is  not  found
   or cannot be accessed, or limited membership if the file exists and can
   be accessed but there is no rule for the Default partition.

   Effectively, this amounts to the same as if one of the following  rules
   appeared in the partition configuration file.

   In the case of no rule for the Default partition:

   Default=0x7fff : ALL=limited, SELF=full ;

   In  the  case  of  no  partition  configuration  file or file cannot be
   accessed:

   Default=0x7fff : ALL=full ;

   File Format

   Comments:

   Any line content following a # character is a comment and is ignored
   by the parser.

   General file format:

   <Partition Definition>:[<newline>]<Partition Properties>;

        Partition Definition:
          [PartitionName][=PKey][,indx0][,ipoib_bc_flags][,defmember=full|limited|both]

           PartitionName  - string, will be used with logging. When
                            omitted, empty string will be used.
           PKey           - P_Key value for this partition. Only low 15
                            bits will be used. When omitted will be
                            autogenerated.
           indx0          - indicates that this pkey should be inserted in
                            block 0 index 0.
           ipoib_bc_flags - used to indicate/specify IPoIB capability of
                            this partition.

           defmember=full|limited|both - specifies default membership for
                            port guid list. Default is limited.

        ipoib_bc_flags:
           ipoib_flag|[mgroup_flag]*

           ipoib_flag:
               ipoib  - indicates that this partition may be used for
                        IPoIB, as a result the IPoIB broadcast group will
                        be created with the mgroup_flag flags given,
                        if any.

        Partition Properties:
          [<Port list>|<MCast Group>]* | <Port list>

        Port list:
           <Port Specifier>[,<Port Specifier>]

        Port Specifier:
           <PortGUID>[=[full|limited|both]]

           PortGUID         - GUID of partition member EndPort.
                              Hexadecimal numbers should start from
                              0x, decimal numbers are accepted too.
           full, limited,   - indicates full and/or limited membership for
           both               this port.  When omitted (or unrecognized)
                              limited membership is assumed.  Both
                              indicates both full and limited membership
                              for this port.

        MCast Group:
           mgid=gid[,mgroup_flag]*<newline>

                            - gid specified is verified to be a Multicast
                              address.  IP groups are verified to match
                              the rate and mtu of the broadcast group.
                              The P_Key bits of the mgid for IP groups are
                              verified to either match the P_Key specified
                              by "Partition Definition" or if they are
                              0x0000 the P_Key will be copied into those
                              bits.

        mgroup_flag:
           rate=<val>  - specifies rate for this MC group
                         (default is 3 (10 Gbps))
           mtu=<val>   - specifies MTU for this MC group
                         (default is 4 (2048))
           sl=<val>    - specifies SL for this MC group
                         (default is 0)
           scope=<val> - specifies scope for this MC group
                         (default is 2 (link local)).  Multiple scope
                         settings are permitted for a partition.
                         NOTE: This overwrites the scope nibble of the
                               specified mgid.  Furthermore specifying
                               multiple scope settings will result in
                               multiple MC groups being created.
           Q_Key=<val>     - specifies the Q_Key for this MC group
                             (default: 0x0b1b for IP groups, 0 for other
                              groups)
                             WARNING: changing this for the broadcast
                                      group may break IPoIB on client
                                      nodes!!
           TClass=<val>    - specifies tclass for this MC group
                             (default is 0)
           FlowLabel=<val> - specifies FlowLabel for this MC group
                             (default is 0)

   Note that values for rate, mtu, and  scope,  for  both  partitions  and
   multicast   groups,   should  be  specified  as  defined  in  the  IBTA
   specification (for example, mtu=4 for 2048).
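
   As an illustration of the encoded values, the sketch below maps an
   MTU in bytes to its mtu= flag.  Only the pair 4 = 2048 is stated on
   this page; the other codes are filled in from the IBTA encoding and
   should be checked against the specification.  The function name is
   hypothetical.

```python
# MTU codes assumed from the IBTA encoding; only 4 -> 2048 is stated on this page.
MTU_CODES = {256: 1, 512: 2, 1024: 3, 2048: 4, 4096: 5}


def mtu_flag(mtu_bytes):
    """Return the 'mtu=<val>' mgroup_flag for an MTU given in bytes."""
    try:
        return f"mtu={MTU_CODES[mtu_bytes]}"
    except KeyError:
        raise ValueError(f"no MTU code for {mtu_bytes} bytes") from None


print(mtu_flag(2048))
```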

   There are several useful keywords for PortGUID definition:

    - 'ALL' means all end ports in this subnet.
    - 'ALL_CAS' means all Channel Adapter end ports in this subnet.
    - 'ALL_SWITCHES' means all Switch end ports in this subnet.
    - 'ALL_ROUTERS' means all Router end ports in this subnet.
    - 'SELF' means subnet manager's port.

   Empty list means no ports in this partition.

   Notes:

   White space is permitted between delimiters ('=', ',',':',';').

   PartitionName does not need to be unique, PKey does need to be  unique.
   If  PKey is repeated then those partition configurations will be merged
   and first PartitionName will be used (see also next note).

   It is possible to  split  partition  configuration  in  more  than  one
   definition,  but  then  PKey  should be explicitly specified (otherwise
   different PKey values will be generated for those definitions).

   Examples:

    Default=0x7fff : ALL, SELF=full ;
    Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;

    NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;

    YetAnotherOne = 0x300 : SELF=full ;
    YetAnotherOne = 0x300 : ALL=limited ;

    ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
    # 0x123453, 0x123454 will be limited
    ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
    # 0x123456, 0x123457 will be limited
    ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
    ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
    ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;

    # multicast groups added to default
    Default=0x7fff,ipoib:
           mgid=ff12:401b::0707,sl=1 # random IPv4 group
           mgid=ff12:601b::16    # MLDv2-capable routers
           mgid=ff12:401b::16    # IGMP
           mgid=ff12:601b::2     # All routers
           mgid=ff12::1,sl=1,Q_Key=0xDEADBEEF,rate=3,mtu=2 # random group
           ALL=full;

   Note:

   The following rule is equivalent to how OpenSM used to run prior to the
   partition manager:

    Default=0x7fff,ipoib:ALL=full;
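
   The merge rule from the notes above (definitions that repeat a PKey are
   merged, and the first PartitionName wins) can be illustrated with a
   minimal Python sketch. This is not OpenSM's parser: the function names
   are hypothetical, multicast group (mgid) entries are not handled, an
   explicit PKey is required, and letting a later definition override the
   membership of a port that is listed twice is an assumption of the
   sketch. The default membership of "limited" follows the comments in
   the ShareIO example.

```python
def parse_definition(line):
    """Parse 'Name=PKey[,flags] : port[=membership], ... ;' (no mgid support)."""
    head, _, body = line.strip().rstrip(";").partition(":")
    fields = [field.strip() for field in head.split(",")]
    name, _, pkey = (part.strip() for part in fields[0].partition("="))
    default = "limited"  # used when neither defmember nor '=value' is given
    for flag in fields[1:]:
        key, _, value = flag.partition("=")
        if key.strip() == "defmember":
            default = value.strip()
    members = {}
    for item in body.split(","):
        item = item.strip()
        if item:
            port, _, membership = item.partition("=")
            members[port.strip()] = membership.strip() or default
    return name, int(pkey, 16), members


def merge_partitions(lines):
    """Merge definitions that share a PKey; the first name is kept."""
    merged = {}
    for line in lines:
        name, pkey, members = parse_definition(line)
        entry = merged.setdefault(pkey, {"name": name, "members": {}})
        entry["members"].update(members)  # assumption: later entry wins
    return merged
```

   Feeding the two YetAnotherOne lines from the examples into
   merge_partitions() yields a single partition 0x300 in which SELF is a
   full member and ALL is limited.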

QOS CONFIGURATION

   There is a set of QoS related low-level configuration parameters.  All
   of these parameter names are prefixed by the "qos_" string. Here is a
   full list of these parameters:

    qos_max_vls    - The maximum number of VLs that will be on the subnet
    qos_high_limit - The limit of High Priority component of VL
                     Arbitration table (IBA 7.6.9)
    qos_vlarb_low  - Low priority VL Arbitration table (IBA 7.6.9)
                     template
    qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
                     template
                     Both VL arbitration templates are pairs of
                     VL and weight
    qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
                     a list of VLs corresponding to SLs 0-15 (Note
                     that VL15 used here means drop this SL)

   Typical default values (hard-coded in OpenSM initialization) are:

    qos_max_vls 15
    qos_high_limit 0
    qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
    qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
    qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

   The syntax is compatible with rest of OpenSM configuration options  and
   values may be stored in OpenSM config file (cached options file).
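
   To make the value formats concrete, the following Python sketch
   decodes a VL arbitration template into (VL, weight) pairs and an SL2VL
   template into a list indexed by SL, and reports which SLs would be
   dropped because they map to VL15. The helper names are hypothetical
   and this is only an illustration of the formats described above, not
   code taken from OpenSM.

```python
def parse_vlarb(value):
    """'0:4,1:0' -> [(0, 4), (1, 0)], i.e. (VL, weight) pairs."""
    pairs = []
    for item in value.split(","):
        if item.strip():
            vl, weight = item.split(":")
            pairs.append((int(vl), int(weight)))
    return pairs


def parse_sl2vl(value):
    """'0,1,2' -> [0, 1, 2]; the list index is the SL, the entry is the VL."""
    return [int(item) for item in value.split(",") if item.strip()]


def dropped_sls(sl2vl):
    """SLs mapped to VL15, which the template uses to mean 'drop this SL'."""
    return [sl for sl, vl in enumerate(sl2vl) if vl == 15]
```

   With the default qos_sl2vl value shown above, parse_sl2vl() returns 16
   entries and dropped_sls() returns an empty list.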

   In  addition  to  the  above,  we may define separate QoS configuration
   parameters sets for various target  types.  As  targets,  we  currently
   support CAs, routers, switch external ports, and switch's enhanced port
   0.  The  names  of  such  specialized  parameters   are   prefixed   by
   "qos_<type>_"  string.  Here  is a full list of the currently supported
   sets:

    qos_ca_  - QoS configuration parameters set for CAs.
    qos_rtr_ - parameters set for routers.
    qos_sw0_ - parameters set for switches' port 0.
    qos_swe_ - parameters set for switches' external ports.

   Examples:
    qos_sw0_max_vls=2
    qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
    qos_swe_high_limit=0

PREFIX ROUTES

   Prefix routes control how the SA responds to path  record  queries  for
   off-subnet  DGIDs.   By  default, the SA fails such queries.  Note that
   IBA does not specify how the SA should obtain  off-subnet  path  record
   information.   The  prefix  routes configuration is meant as a stop-gap
   until the specification is completed.

   Each line in the configuration file is a 64-bit prefix  followed  by  a
   64-bit  GUID,  separated by white space.  The GUID specifies the router
   port on the local subnet that will handle the prefix.  Blank lines  are
   ignored,  as is anything between a # character and the end of the line.
   The prefix and GUID are both  in  hex,  the  leading  0x  is  optional.
   Either,  or  both, can be wild-carded by specifying an asterisk instead
   of an explicit prefix or GUID.

   When responding to a path record query for an off-subnet  DGID,  opensm
   searches  for  the  first  prefix  match  in  the  configuration  file.
   Therefore, the  order  of  the  lines  in  the  configuration  file  is
   important:  a  wild-carded prefix at the beginning of the configuration
   file renders all subsequent lines useless.  If there is no match,  then
   opensm  fails  the  query.   It  is  legal  to  repeat  prefixes in the
   configuration file, opensm will return the path to the first  available
   matching  router.   A  configuration file with a single line where both
   prefix and  GUID  are  wild-carded  means  that  a  path  record  query
   specifying  any  off-subnet  DGID  should  return  a  path to the first
   available router.  This configuration yields the same behavior formerly
   achieved   by   compiling  opensm  with  -DROUTER_EXP  which  has  been
   obsoleted.
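
   The file format and the first-match rule can be sketched in Python as
   follows. The function names are hypothetical, None stands for the "*"
   wildcard, and the sketch deliberately ignores router availability,
   which opensm also takes into account when prefixes are repeated.

```python
def parse_field(token):
    """Hex value with optional 0x prefix, or None for the '*' wildcard."""
    return None if token == "*" else int(token, 16)


def parse_prefix_routes(text):
    """Return (prefix, guid) tuples in file order."""
    routes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        prefix, guid = line.split()
        routes.append((parse_field(prefix), parse_field(guid)))
    return routes


def first_match(routes, dgid_prefix):
    """Return the first route whose prefix matches, or None to fail the query."""
    for prefix, guid in routes:
        if prefix is None or prefix == dgid_prefix:
            return prefix, guid
    return None
```

   Because the search stops at the first match, a "* *" line placed
   first would shadow every line after it, which is why the order of the
   lines matters.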

MKEY CONFIGURATION

   OpenSM supports configuring a single  management  key  (MKey)  for  use
   across the subnet.

   The following configuration options are available:

    m_key                  - the 64-bit MKey to be used on the subnet
                             (IBA 14.2.4)
    m_key_protection_level - the numeric value of the MKey ProtectBits
                             (IBA 14.2.4.1)
    m_key_lease_period     - the number of seconds a CA will wait for a
                             response from the SM before resetting the
                             protection level to 0 (IBA 14.2.4.2).

   OpenSM  will  configure  all  ports  with  the MKey specified by m_key,
   defaulting to a value of 0. A m_key value of 0 disables MKey protection
   on  the subnet.  Switches and HCAs with a non-zero MKey will not accept
   requests to change their configuration unless the request includes  the
   proper MKey.

   MKey Protection Levels

   MKey  protection  levels  modify  how  switches and CAs respond to SMPs
   lacking a valid MKey.  OpenSM will configure each port's ProtectBits to
   support  the level defined by the m_key_protection_level parameter.  If
   no parameter is specified, OpenSM defaults to operating  at  protection
   level 0.

   There are currently 4 protection levels defined by the IBA:

    0 - Queries return valid data, including MKey.  Configuration changes
        are not allowed unless the request contains a valid MKey.
    1 - Like level 0, but the MKey is set to 0 (0x00000000) in queries,
        unless the request contains a valid MKey.
    2 - Neither queries nor configuration changes are allowed, unless the
        request contains a valid MKey.
    3 - Identical to 2.  Maintained for backwards compatibility.
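
   The four levels can be summarized as a small decision function. The
   Python sketch below models how a port would answer an SMP according to
   the list above; it is an illustration only, and the function name and
   return convention are hypothetical.

```python
def smp_response(level, is_set, has_valid_mkey, mkey):
    """Return (request_allowed, mkey_value_reported).

    level          -- protection level 0..3
    is_set         -- True for a configuration change, False for a query
    has_valid_mkey -- True if the request carries the proper MKey
    mkey           -- the MKey configured on the port (0 disables protection)
    """
    if mkey == 0 or has_valid_mkey:
        return True, mkey       # no protection, or the requester is trusted
    if is_set:
        return False, None      # changes always need a valid MKey
    if level == 0:
        return True, mkey       # queries succeed and expose the MKey
    if level == 1:
        return True, 0          # queries succeed, MKey reported as 0
    return False, None          # levels 2 and 3: queries are refused too
```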

   MKey Lease Period

   InfiniBand  supports  a  MKey lease timeout, which is intended to allow
   administrators or a new SM to recover/reset lost MKeys on a fabric.

   If MKeys are enabled on the subnet  and  a  switch  or  CA  receives  a
   request  that  requires a valid MKey but does not contain one, it warns
   the SM by sending a trap (Bad M_Key, Trap  256).   If  the  MKey  lease
   period  is  non-zero,  it  also  starts  a countdown timer for the time
   specified by the lease period.  If a SM (or other agent) responds  with
   the  correct  MKey,  the  timer is stopped and reset.  Should the timer
   reach zero, the switch or CA will reset its MKey protection level to 0,
   exposing the MKey and allowing recovery.

   OpenSM will initialize all ports to use an MKey lease period of the
   number of seconds specified in the config file.  If no
   m_key_lease_period is specified, a default of 0 will be used.

   OpenSM  normally quickly responds to all Bad_M_Key traps, resetting the
   lease timers.  Additionally, OpenSM's subnet sweeps  will  also  cancel
   any  running  timers.   For  maximum  protection  against accidentally-
   exposed MKeys, the MKey lease time should be a  few  multiples  of  the
   subnet  sweep  time.   If  OpenSM  detects  at  startup that your sweep
   interval is greater than your MKey lease  period,  it  will  reset  the
   lease  period  to  be  greater  than the sweep interval.  Similarly, if
   sweeping is disabled at startup, it will be re-enabled with an interval
   less than the Mkey lease period.

   If  OpenSM  is  required  to  recover  a subnet for which it is missing
   mkeys, it must do so one switch level at a time.  As  such,  the  total
   time  to  recover  the  subnet  may be as long as the mkey lease period
   multiplied by the  maximum  number  of  hops  between  the  SM  and  an
   endpoint, plus one.
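
   As a worked example of this bound, the Python sketch below reads
   "plus one" as applying to the hop count, that is, lease period times
   (hops + 1). That reading is an assumption, as is the function name;
   the sentence above can also be read as adding one second to the
   product.

```python
def worst_case_recovery_seconds(lease_period_s, max_hops):
    """Upper bound on recovery time, assuming one lease period per level."""
    return lease_period_s * (max_hops + 1)
```

   Under this reading, a 60 second lease period on a fabric with at most
   3 hops between the SM and any endpoint gives a bound of 240 seconds.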

   MKey Effects on Diagnostic Utilities

   Setting a MKey may have a detrimental effect on diagnostic software run
   on the subnet, unless your diagnostic  software  is  able  to  retrieve
   MKeys from the SA or can be explicitly configured with the proper MKey.
   This is particularly true at protection level 2, where CAs will  ignore
   queries for management information that do not contain the proper MKey.

ROUTING

   OpenSM now offers nine routing engines:

   1.   Min  Hop  Algorithm - based on the minimum hops to each node where
   the path length is optimized.

   2.  UPDN Unicast routing algorithm - also based on the minimum hops  to
   each  node,  but  it  is  constrained  to ranking rules. This algorithm
   should be chosen if the subnet is not a pure Fat Tree, and deadlock may
   occur due to a loop in the subnet.

   3.  DNUP Unicast routing algorithm - similar to UPDN but allows routing
   in fabrics which have some CA nodes attached closer to the  roots  than
   some switch nodes.

   4.   Fat  Tree  Unicast  routing  algorithm  - this algorithm optimizes
   routing for congestion-free "shift" communication pattern.   It  should
   be  chosen  if a subnet is a symmetrical or almost symmetrical fat-tree
   of various types, not just K-ary-N-Trees:  non-constant  K,  not  fully
   staffed,  any  Constant  Bisectional Bandwidth (CBB) ratio.  Similar to
   UPDN, Fat Tree routing is constrained to ranking rules.

   5. LASH unicast routing algorithm - uses Infiniband virtual layers (SL)
   to  provide deadlock-free shortest-path routing while also distributing
   the  paths  between  layers.  LASH  is  an  alternative   deadlock-free
   topology-agnostic  routing  algorithm to the non-minimal UPDN algorithm
   avoiding the use of a potentially congested root node.

   6. DOR Unicast routing algorithm - based on the Min Hop algorithm,  but
   avoids  port  equalization  except for redundant links between the same
   two switches.  This provides deadlock free routes for  hypercubes  when
   the  fabric  is  cabled  as a hypercube and for meshes when cabled as a
   mesh (see details below).

   7. Torus-2QoS unicast routing algorithm - a DOR-based routing algorithm
   specialized  for 2D/3D torus topologies.  Torus-2QoS provides deadlock-
   free routing while supporting two quality of service (QoS) levels.   In
   addition  it  is able to route around multiple failed fabric links or a
   single failed fabric switch without introducing deadlocks, and  without
   changing path SL values granted before the failure.

   8.  DFSSSP  unicast  routing algorithm - a deadlock-free single-source-
   shortest-path routing, which uses the SSSP algorithm (see algorithm 9.)
   as  the  base  to optimize link utilization and uses Infiniband virtual
   lanes (SL) to provide deadlock-freedom.

   9. SSSP  unicast  routing  algorithm  -  a  single-source-shortest-path
   routing  algorithm,  which  globally  balances the number of routes per
   link to optimize  link  utilization.  This  routing  algorithm  has  no
   restrictions in terms of the underlying topology.

   OpenSM  also supports a file method which can load routes from a table.
   See Modular Routing Engine for more information on this.

   The basic routing algorithm is comprised of two stages:

   1. MinHop matrix calculation
      How many hops are required to get from each port to each LID ?
      The algorithm to fill these tables is different if you run  standard
   (min hop) or Up/Down.
      For standard routing, a "relaxation" algorithm is used to propagate
   min hop from every destination LID through neighbor switches.
      For Up/Down routing, a BFS from every target is used. The BFS tracks
   link direction (up or down) and avoids steps that would go up after
   a down step was used.

   2. Once MinHop matrices exist, each switch  is  visited  and  for  each
   target  LID a decision is made as to what port should be used to get to
   that LID.
      This step is common to standard and Up/Down routing. Each port has a
   counter counting the number of target LIDs going through it.
      When there are multiple alternative ports with same MinHop to a LID,
   the one with fewer previously assigned LIDs is selected.
      If LMC > 0, more checks are added: Within each group of LIDs
   assigned to same target port,
      a. use only ports which have same MinHop
      b. first prefer the ones that go to different systemImageGuid (than
   the previous LID of the same LMC group)
      c. if none - prefer those which go through another NodeGuid
      d. fall back to the number of paths method (if all go to same node).
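
   The two stages can be sketched in Python as follows. This is a
   simplified illustration, not OpenSM's implementation: hop counts are
   computed with a plain BFS rather than the relaxation algorithm, LMC is
   assumed to be 0, the fabric is assumed to be connected, and all names
   and the input layout (links maps each switch to {port: neighbor
   switch}, lids maps each LID to the switch it is reached through) are
   hypothetical.

```python
from collections import deque


def min_hops(links, dest):
    """Stage 1: hop count from every switch to the switch 'dest'."""
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        current = queue.popleft()
        for neighbor in links[current].values():
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def assign_ports(links, lids):
    """Stage 2: per switch and LID, pick the least-loaded min-hop port."""
    lft = {switch: {} for switch in links}
    load = {switch: {port: 0 for port in ports} for switch, ports in links.items()}
    for lid, dest in sorted(lids.items()):
        dist = min_hops(links, dest)
        for switch, ports in links.items():
            if switch == dest:
                continue  # the target hangs off this switch directly
            best = min(dist[neighbor] for neighbor in ports.values())
            candidates = [p for p, n in sorted(ports.items()) if dist[n] == best]
            port = min(candidates, key=lambda p: load[switch][p])
            lft[switch][lid] = port
            load[switch][port] += 1  # per-port counter of routed LIDs
    return lft
```

   With two switches joined by two parallel links and two LIDs behind the
   second switch, the counter makes the first switch send one LID over
   each link instead of both over the same one.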

   Effect of Topology Changes

   OpenSM will preserve existing routing in any case  where  there  is  no
   change in the fabric switches unless the -r (--reassign_lids) option is
   specified.

   -r
   --reassign_lids
             This option causes OpenSM to reassign LIDs to all
             end nodes. Specifying -r on a running subnet
             may disrupt subnet traffic.
             Without -r, OpenSM attempts to preserve existing
             LID assignments resolving multiple use of same LID.

   If a link is added or removed, OpenSM does not recalculate  the  routes
   that  do  not  have  to change. A route has to change if the port is no
   longer UP or no longer the MinHop. When routing changes are  performed,
   the same algorithm for balancing the routes is invoked.

   In the case of using the file based routing, any topology changes are
   currently ignored. The 'file' routing engine just loads the LFTs from
   the file specified, with no reaction to real topology. Obviously, this
   will not be able to recheck LIDs (by GUID) for disconnected nodes, and
   LFTs for non-existent switches will be skipped. Multicast is not
   affected by the 'file' routing engine (this uses min hop tables).

   Min Hop Algorithm

   The Min Hop algorithm is invoked by default if no routing algorithm  is
   specified.  It can also be invoked by specifying '-R minhop'.

   The  Min  Hop algorithm is divided into two stages: computation of min-
   hop tables on  every  switch  and  LFT  output  port  assignment.  Link
   subscription  is  also  equalized with the ability to override based on
   port GUID. The latter is supplied by:

   -i <equalize-ignore-guids-file>
   --ignore_guids <equalize-ignore-guids-file>
             This option provides the means to define a set of ports
             (by guid) that will be ignored by the link load
             equalization algorithm. Note that only endports (CA,
             switch port 0, and router ports) and not switch external
             ports are supported.

   LMC-aware routing is done on a (remote) system or switch basis.

   Purpose of UPDN Algorithm

   The UPDN algorithm is designed to prevent deadlocks from  occurring  in
   loops  of  the subnet. A loop-deadlock is a situation in which it is no
   longer possible to send data between any two  hosts  connected  through
   the  loop.  As  such,  the UPDN routing algorithm should be used if the
   subnet is not a pure Fat Tree, and one of its loops  may  experience  a
   deadlock (due, for example, to high pressure).

   The UPDN algorithm is based on the following main stages:

   1.  Auto-detect root nodes - based on the CA hop length from any switch
   in the subnet, a statistical histogram is built for  each  switch  (hop
   num  vs  number  of  occurrences). If the histogram reflects a specific
   column (higher than others) for a certain node, then it is marked as  a
   root node. Since the algorithm is statistical, it may not find any root
   nodes. The list of the root nodes found by this  auto-detect  stage  is
   used by the ranking process stage.

       Note 1: The user can override the node list manually.
       Note 2: If this stage cannot find any root nodes, and the user did
               not specify a guid list file, OpenSM defaults back to the
               Min Hop routing algorithm.

   2.   Ranking  process  -  All  root switch nodes (found in stage 1) are
   assigned a rank of 0. Using the BFS algorithm, the rest of  the  switch
   nodes  in the subnet are ranked incrementally. This ranking aids in the
   process of enforcing rules that ensure loop-free paths.

   3.  Min Hop Table setting - after ranking is done, a BFS  algorithm  is
   run  from  each  (CA  or  switch)  node  in  the subnet. During the BFS
   process, the FDB table of each switch node traversed by BFS is updated,
   in  reference to the starting node, based on the ranking rules and guid
   values.

   At the end of the process, the  updated  FDB  tables  ensure  loop-free
   paths through the subnet.
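
   The ranking stage and the rule it enforces (no "up" step after a
   "down" step) can be sketched in Python as follows. The names are
   hypothetical, root detection is skipped (roots are passed in), and
   breaking ties between equally ranked switches by comparing their
   identifiers is an assumption that loosely mirrors the use of guid
   values mentioned above.

```python
from collections import deque


def rank_switches(links, roots):
    """BFS ranking: roots get rank 0, every other switch its BFS depth."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for neighbor in links[current]:
            if neighbor not in rank:
                rank[neighbor] = rank[current] + 1
                queue.append(neighbor)
    return rank


def is_up_down_legal(path, rank):
    """A path is legal if it never goes up after it has gone down."""
    gone_down = False
    for here, there in zip(path, path[1:]):
        going_up = (rank[there], there) < (rank[here], here)
        if going_up and gone_down:
            return False
        if not going_up:
            gone_down = True
    return True
```

   In a fabric with two roots R1, R2 and two leaves L1, L2, the path
   L1, R1, L2 is legal, while L1, R1, L2, R2 is rejected because it
   climbs to R2 after descending to L2.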

   Note:  Up/Down routing does not allow LID routing communication between
   switches that are located inside spine "switch systems".  The reason is
   that  there  is  no way to allow a LID route between them that does not
   break the Up/Down rule.  One ramification of this is  that  you  cannot
   run SM on switches other than the leaf switches of the fabric.

   UPDN Algorithm Usage

   Activation through OpenSM

   Use the '-R updn' option (instead of the old '-u') to activate the UPDN
   algorithm.  Use '-a <root_guid_file>' to add a UPDN guid file that
   contains the root nodes for ranking.  If the '-a' option is not used,
   OpenSM uses its auto-detect root nodes algorithm.

   Notes on the guid list file:

   1.   A valid guid file specifies one guid in each line. Lines  with  an
   invalid format will be discarded.
   2.   The user should specify the root switch guids. However, it is also
   possible to specify CA guids; OpenSM will use the guid  of  the  switch
   (if it exists) that connects the CA to the subnet as a root node.

   Purpose of DNUP Algorithm

   The  DNUP  algorithm  is  designed  to serve a similar purpose to UPDN.
   However it is intended to work in network topologies which are unsuited
   to  UPDN  due to nodes being connected closer to the roots than some of
   the switches.  An example would be a fabric which  contains  nodes  and
   uplinks connected to the same switch. The operation of DNUP is the same
   as UPDN with the exception of the ranking process.  In DNUP all switch
   nodes are ranked based solely on their distance from CA nodes: all
   switch nodes directly connected to at least one CA are assigned a value
   of 1, and all other switch nodes are assigned a value of one more than
   the minimum rank of all neighbor switch nodes.
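
   The DNUP ranking rule amounts to a breadth-first search that starts
   from every switch with an attached CA at once. A minimal Python
   sketch, with hypothetical names and a connected fabric assumed:

```python
from collections import deque


def dnup_ranks(switch_links, switches_with_ca):
    """Rank 1 for switches with a CA, else 1 + minimum neighbor rank."""
    rank = {switch: 1 for switch in switches_with_ca}
    queue = deque(sorted(switches_with_ca))
    while queue:
        current = queue.popleft()
        for neighbor in switch_links[current]:
            if neighbor not in rank:
                rank[neighbor] = rank[current] + 1
                queue.append(neighbor)
    return rank
```

   For a chain of switches A, B, C, D where only A and D have CAs
   attached, A and D get rank 1 and B and C get rank 2.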

   Fat-tree Routing Algorithm

   The fat-tree algorithm optimizes routing for the "shift" communication
   pattern.  It should be chosen if a subnet is a symmetrical or almost
   symmetrical fat-tree of various types.  It supports not just
   K-ary-N-Trees: it also handles non-constant K, cases where not all
   leafs (CAs) are present, and any CBB ratio.  As in UPDN, fat-tree also
   prevents credit-loop-deadlocks.

   If  the  root  guid  file  is  not provided ('-a' or '--root_guid_file'
   options), the topology has to be pure fat-tree that complies  with  the
   following rules:
     - Tree rank should be between two and eight (inclusively)
     - Switches of the same rank should have the same number
       of UP-going port groups*, unless they are root switches,
       in which case they shouldn't have UP-going ports at all.
     - Switches of the same rank should have the same number
       of DOWN-going port groups, unless they are leaf switches.
     - Switches of the same rank should have the same number
       of ports in each UP-going port group.
     - Switches of the same rank should have the same number
       of ports in each DOWN-going port group.
     - All the CAs have to be at the same tree level (rank).

   If the root guid file is provided, the topology doesn't have to be pure
   fat-tree, and it should only comply with the following rules:
     - Tree rank should be between two and eight (inclusively)
     - All the Compute Nodes** have to be at the same tree level (rank).
       Note that non-compute node CAs are allowed here to be at different
       tree ranks.

   * ports that are connected to the same remote switch are referenced  as
   port group.

   **   list   of  compute  nodes  (CNs)  can  be  specified  by  -u  or
   --cn_guid_file OpenSM options.

   Topologies that do not comply cause a  fallback  to  min  hop  routing.
   Note that this can also occur on link failures which cause the topology
   to no longer be "pure" fat-tree.

   Note that although the fat-tree algorithm supports trees with a
   non-integer CBB ratio, the routing will not be as balanced as in the
   case of an integer CBB ratio.  In addition to this, although the
   algorithm allows leaf switches to have any number of CAs, the closer
   the tree is to being fully populated, the more effective the "shift"
   communication pattern will be.  In general, even if the root list is
   provided, the closer the topology is to a pure and symmetrical
   fat-tree, the more optimal the routing will be.

   The algorithm also dumps a compute node ordering file
   (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log
   resides. This ordering file provides the CN order that may be used to
   create an efficient communication pattern that matches the routing
   tables.

   Routing between non-CN nodes

   The use of the cn_guid_file option allows non-CN nodes to be located on
   different levels in the fat tree.  In such a case, it is not guaranteed
   that the Fat Tree algorithm will route between two non-CN nodes.  To
   solve this problem, a list of non-CN nodes can be specified by the -G
   or --io_guid_file option.  These nodes will be allowed to use switches
   the wrong way round a specific number of times (specified by -H or
   --max_reverse_hops).  With the proper max_reverse_hops and
   io_guid_file values, you can ensure full connectivity in the Fat Tree.

   Please  note  that  using  max_reverse_hops creates routes that use the
   switch in a counter-stream way.  This option should never  be  used  to
   connect nodes with high bandwidth traffic between them ! It should only
   be used to allow connectivity for HA purposes or similar.  Also  having
   routes the other way around can in theory cause credit loops.

   Use these options with extreme care !

   Activation through OpenSM

   Use the '-R ftree' option to activate the fat-tree algorithm.  Use '-a
   <root_guid_file>' to provide root nodes for ranking. If the '-a' option
   is not used, the routing algorithm will detect roots automatically.  Use
   '-u <root_cn_file>' to provide the list of compute nodes. If the '-u'
   option is not used, all the CAs are considered as compute nodes.

   Note:  LMC  >  0  is  not  supported  by  fat-tree  routing. If this is
   specified, the default routing algorithm is invoked instead.

   LASH Routing Algorithm

   LASH is  an  acronym  for  LAyered  SHortest  Path  Routing.  It  is  a
   deterministic  shortest  path  routing  algorithm that enables topology
   agnostic deadlock-free routing within communication networks.

   When computing the routing function, LASH analyzes the network topology
   for   the   shortest-path   routes  between  all  pairs  of  sources  /
   destinations and groups these paths into virtual layers in such  a  way
   as to avoid deadlock.

   Note that LASH analyzes routes and ensures deadlock freedom between
   switch pairs. The link between an HCA and a switch does not need
   virtual layers, as deadlock will not arise between switch and HCA.

   In more detail, the algorithm works as follows:

   1)  LASH  determines  the  shortest-path  between all pairs of source /
   destination switches. Note, LASH ensures the same SL is  used  for  all
   SRC/DST  - DST/SRC pairs and there is no guarantee that the return path
   for a given DST/SRC will be the reverse of the route SRC/DST.

   2) LASH then begins an SL assignment process where a route is  assigned
   to  a  layer (SL) if the addition of that route does not cause deadlock
   within that layer. This is achieved  by  maintaining  and  analysing  a
   channel dependency graph for each layer. Once the potential addition of
   a path could lead to deadlock, LASH opens a new layer and continues the
   process.

   3)  Once  this  stage  has been completed, it is highly likely that the
   first layers processed will contain more paths than  the  latter  ones.
   To better balance the use of layers, LASH moves paths from one layer to
   another so that the number of paths in each layer averages out.
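
   Step 2 above can be sketched in Python as follows. Each route is a
   list of switches, a channel is a directed link (u, v), and a route
   adds a dependency between every pair of consecutive channels it uses.
   A route is placed in the first layer whose channel dependency graph
   stays acyclic. The names are hypothetical, routes are assumed not to
   revisit a switch, and the balancing of step 3 is not included.

```python
def has_cycle(graph):
    """Depth-first search for a cycle in {node: set(successors)}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for successor in graph.get(node, ()):
            state = color.get(successor, WHITE)
            if state == GRAY or (state == WHITE and visit(successor)):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


def dependencies(route):
    """Pairs of consecutive channels used by a route."""
    channels = list(zip(route, route[1:]))
    return list(zip(channels, channels[1:]))


def assign_layers(routes):
    """Return the layer index chosen for each route, in input order."""
    layers, assignment = [], []
    for route in routes:
        for index in range(len(layers) + 1):
            if index == len(layers):
                layers.append({})  # open a new layer
            trial = {ch: set(nxt) for ch, nxt in layers[index].items()}
            for first, second in dependencies(route):
                trial.setdefault(first, set()).add(second)
            if not has_cycle(trial):
                layers[index] = trial
                assignment.append(index)
                break
    return assignment
```

   On a ring of four switches, the clockwise two-hop routes 0-1-2, 1-2-3
   and 2-3-0 fit into layer 0, while 3-0-1 would close a dependency
   cycle and is therefore placed in layer 1.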

   Note, the implementation of LASH in  opensm  attempts  to  use  as  few
   layers  as  possible. This number can be less than the number of actual
   layers available.

   In general LASH is a very flexible  algorithm.  It  can,  for  example,
   reduce to Dimension Order Routing in certain topologies, it is topology
   agnostic and fares well in the face of faults.

   It has been shown that for both regular and irregular topologies,  LASH
   outperforms  Up/Down.  The reason for this is that LASH distributes the
   traffic more evenly through a network, avoiding the  bottleneck  issues
   related to a root node and always routes shortest-path.

   The algorithm was developed by Simula Research Laboratory.

   Use the '-R lash -Q' option to activate the LASH algorithm.

   Note:  QoS support has to be turned on in order that SL/VL mappings are
   used.

   Note: LMC > 0 is  not  supported  by  the  LASH  routing.  If  this  is
   specified, the default routing algorithm is invoked instead.

   For  open  regular  cartesian  meshes  the  DOR  algorithm is the ideal
   routing algorithm. For toroidal meshes on  the  other  hand  there  are
   routing loops that can cause deadlocks. LASH can be used to route these
   cases. The performance of LASH can be improved by  preconditioning  the
   mesh  in  cases  where there are multiple links connecting switches and
   also in cases where the switches are not cabled consistently. An option
   exists   for  LASH  to  do  this.  To  invoke  this  use  '-R  lash  -Q
   --do_mesh_analysis'. This will add an additional  phase  that  analyses
   the  mesh  to  try to determine the dimension and size of a mesh. If it
   determines that the mesh looks like an open or closed cartesian mesh it
   reorders  the  ports  in  dimension  order  before the rest of the LASH
   algorithm runs.

   DOR Routing Algorithm

   The Dimension Order Routing algorithm is based on the Min Hop algorithm
   and  so  uses  shortest paths.  Instead of spreading traffic out across
   different paths with the same shortest distance, it chooses  among  the
   available shortest paths based on an ordering of dimensions.  Each port
   must be consistently cabled to represent a  hypercube  dimension  or  a
   mesh  dimension.   Alternatively, the -O option can be used to assign a
   custom mapping between the ports on a given switch, and the  associated
   dimension.   Paths  are grown from a destination back to a source using
   the lowest dimension (port) of available  paths  at  each  step.   This
   provides  the  ordering  necessary  to  avoid deadlock.  When there are
   multiple links between any two switches, they still represent only  one
   dimension  and traffic is balanced across them unless port equalization
   is turned off.  In the case of hypercubes, the same port must  be  used
   throughout the fabric to represent the hypercube dimension and match on
   both ends of the cable,  or  the  -O  option  used  to  accomplish  the
   alignment.   In  the  case of meshes, the dimension should consistently
   use the same pair of ports, one port on one end of the cable,  and  the
   other  port  on  the other end, continuing along the mesh dimension, or
   the -O option used as an override.

   Use '-R dor' option to activate the DOR algorithm.

   DFSSSP and SSSP Routing Algorithm

   The (Deadlock-Free) Single-Source-Shortest-Path routing algorithm is
   designed to optimize link utilization through global balancing of
   routes, while supporting arbitrary topologies.  The DFSSSP routing
   algorithm uses Infiniband virtual lanes (SL) to provide
   deadlock-freedom.

   The DFSSSP algorithm consists of five major steps:
   1)  It  discovers  the  subnet  and  models  the  subnet  as a directed
   multigraph in which each node represents a node of the physical network
   and each edge represents one direction of the full-duplex links used to
   connect the nodes.
   2) A loop, which iterates over all CA and switches of the subnet,  will
   perform  three  steps to generate the linear forwarding tables for each
   switch:
   2.1) use Dijkstra's algorithm to find the shortest path from all  nodes
   to the current selected destination;
   2.2)  update  the  edge  weights  in  the graph, i.e. add the number of
   routes, which use a link to reach the destination, to the link/edge;
   2.3) update the LFT of each switch with the  outgoing  port  which  was
   used in the current step to route the traffic to the destination node.
   3)  After the number of available virtual lanes or layers in the subnet
   is detected and a channel dependency  graph  is  initialized  for  each
   layer,  the  algorithm  will put each possible route of the subnet into
   the first layer.
   4) A loop  iterates  over  all  channel  dependency  graphs  (CDG)  and
   performs the following substeps:
   4.1) search for a cycle in the current CDG;
   4.2)  when  a  cycle is found, i.e. a possible deadlock is present, one
   edge is selected and all routes, which induced this edge, are moved  to
   the "next higher" virtual layer (CDG[i+1]);
   4.3)  the  cycle  search  is  continued until all cycles are broken and
   routes are moved "up".
   5) When the number of needed layers does not exceed the number of
   available SL/VL to remove all cycles in all CDGs, the routing is
   deadlock-free and a relation table is generated, which contains the
   assignment of routes from source to destination to an SL.
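
   Steps 2.1 to 2.3 can be sketched in Python as follows. The sketch
   runs Dijkstra's algorithm towards each destination, records the next
   hop per node, and then adds one to the weight of every directed link
   for each route that crosses it, so later destinations avoid links
   that are already busy. The names, the initial weight of 1 and the
   simple graph layout (links maps each node to its neighbors) are
   assumptions of this sketch; the layer assignment of steps 3 to 5 is
   not included.

```python
import heapq


def route_all(links, destinations):
    """Return {destination: {node: next_hop}} with balanced link usage."""
    weight = {(u, v): 1 for u in links for v in links[u]}
    tables = {}
    for dest in destinations:
        # 2.1: shortest paths from all nodes to dest (search starts at dest)
        dist, next_hop = {dest: 0}, {}
        heap = [(0, dest)]
        while heap:
            d, v = heapq.heappop(heap)
            if d > dist[v]:
                continue  # stale heap entry
            for u in links[v]:
                candidate = d + weight[(u, v)]
                if candidate < dist.get(u, float("inf")):
                    dist[u] = candidate
                    next_hop[u] = v
                    heapq.heappush(heap, (candidate, u))
        # 2.2: charge every link with the number of routes that use it
        for source in next_hop:
            node = source
            while node != dest:
                weight[(node, next_hop[node])] += 1
                node = next_hop[node]
        # 2.3: the next-hop table plays the role of the LFT entries for dest
        tables[dest] = next_hop
    return tables
```

   In a fabric where switch S reaches switch T through either A or B,
   and T serves the end nodes T1 and T2, routing T1 first sends S
   through A; the raised weights then make S use B for T2.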

   Note on SSSP:
   This  algorithm  does  not  perform  the  steps  3)-5)  and  can not be
   considered to be deadlock-free for all topologies. But on the one hand,
   you can choose this algorithm for really large networks (5,000+ CAs and
   deadlock-free by design) to reduce the runtime of the algorithm. On the
   other hand, you might use the SSSP routing algorithm as an alternative,
   when all deadlock-free routing algorithms fail to route the network for
   whatever  reason.   In  the  last case, SSSP was designed to deliver an
   equal or higher bandwidth due to better congestion avoidance  than  the
   Min Hop routing algorithm.

   Notes for usage:
   a) running DFSSSP: '-R dfsssp -Q'
   a.1)  QoS  has  to  be  configured  to  equally  spread the load on the
   available SL or virtual lanes
   a.2) applications must perform a path record query to get the path SL
   for each route, which the application will use to transmit packets
   b) running SSSP:   '-R sssp'
   c) both algorithms support LMC > 0

   Hints for optimizing I/O traffic:
   Having more nodes (I/O and compute) connected to a switch than incoming
   links can result in a 'bad' routing of  the  I/O  traffic  as  long  as
   (DF)SSSP  routing is not aware of the dedicated I/O nodes, i.e., in the
   following network configuration CN1-CN3 might send all I/O traffic  via
   Link2 to IO1,IO2:

        CN1         Link1        IO1
           \       /----\       /
     CN2 -- Switch1      Switch2 -- CN4
           /       \----/       \
        CN3         Link2        IO2

   To  prevent  this from happening (DF)SSSP can use both the compute node
   guid  file  and  the  I/O  guid  file  specified   by   the   -u   or
   --cn_guid_file  and  -G or --io_guid_file options (similar to the
   Fat-Tree routing).  This ensures that traffic towards compute nodes and
   I/O  nodes  is balanced separately and therefore distributed as much as
   possible across the available links. Port GUIDs, as listed  by  ibstat,
   must be specified (not Node GUIDs).
   The priority for the optimization is as follows:
     compute nodes -> I/O nodes -> other nodes
   Possible use case scenarios:
   a) neither -u nor -G are specified: all nodes are treated as other
   nodes and therefore balanced equally;
   b) -G is specified:  traffic  towards  I/O  nodes  will  be  balanced
   optimally;
   c)  the  system  has three node types, such as login/admin, compute and
   I/O, but the balancing focus should be I/O, then one has  to  use  -u
   and  -G  with I/O guids listed in cn_guid_file and compute node guids
   listed in io_guid_file;
   d) ...
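
   As a hedged illustration of use case c) above, the following Python
   sketch writes the two GUID files and assembles an opensm command line.
   It assumes one port GUID per line in each file. The helper names, file
   names and GUID values are hypothetical placeholders and are not part of
   OpenSM.

```python
import os
import tempfile


def write_guid_file(path, port_guids):
    """Write port GUIDs to a file, one per line (assumed file format)."""
    with open(path, "w") as f:
        for guid in port_guids:
            f.write(f"0x{guid:016x}\n")


def dfsssp_args(cn_guid_file=None, io_guid_file=None):
    """Build an argv list for DFSSSP using only options named in this page."""
    args = ["opensm", "-R", "dfsssp", "-Q"]
    if cn_guid_file:
        args += ["-u", cn_guid_file]
    if io_guid_file:
        args += ["-G", io_guid_file]
    return args


# Placeholder port GUIDs; real values are the Port GUIDs shown by ibstat.
compute_guids = [0x0002C90300000001, 0x0002C90300000002]
io_guids = [0x0002C903000000A1]

workdir = tempfile.mkdtemp()
cn_file = os.path.join(workdir, "cn_guids.txt")
io_file = os.path.join(workdir, "io_guids.txt")

# Use case c): I/O is the balancing focus, so the I/O GUIDs go into the
# file passed with -u and the compute GUIDs into the file passed with -G.
write_guid_file(cn_file, io_guids)
write_guid_file(io_file, compute_guids)

command = dfsssp_args(cn_guid_file=cn_file, io_guid_file=io_file)
print(" ".join(command))
```

   The sketch only prints the command; whether to run it directly or via
   an init script is left to the administrator.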

   Torus-2QoS Routing Algorithm

   Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus
   fabrics; see torus-2QoS(8) for full documentation.

   Use  '-R  torus-2QoS  -Q' or '-R torus-2QoS,no_fallback -Q' to activate
   the torus-2QoS algorithm.

   Routing References

   To learn more about deadlock-free routing, see  the  article  "Deadlock
   Free  Message  Routing  in  Multiprocessor Interconnection Networks" by
   William J Dally and Charles L Seitz (1985).

   To learn more about the up/down algorithm, see the  article  "Effective
   Strategy  to Compute Forwarding Tables for InfiniBand Networks" by Jose
   Carlos Sancho, Antonio  Robles,  and  Jose  Duato  at  the  Universidad
   Politecnica de Valencia.

   To learn more about LASH and the flexibility behind it, the requirement
   for layers, and performance comparisons to other algorithms, see the
   following articles:

   "Layered Routing in Irregular Networks", Lysne et al., IEEE
   Transactions on Parallel and Distributed Systems, Vol. 16, No. 12,
   December 2005.

   "Routing for the ASI Fabric Manager", Solheim et al., IEEE
   Communications Magazine, Vol. 44, No. 7, July 2006.

   "Layered Shortest Path (LASH) Routing in Irregular System Area
   Networks", Skeie et al., IEEE Computer Society Communication
   Architecture for Clusters 2002.

   To learn more about the DFSSSP and SSSP routing algorithms, see the
   articles:
   J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing  for
   Arbitrary  Topologies,  In  Proceedings  of the 25th IEEE International
   Parallel & Distributed Processing Symposium (IPDPS 2011)
   T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-
   Scale  InfiniBand  Networks,  In  17th  Annual  IEEE  Symposium on High
   Performance Interconnects (HOTI 2009)

   Modular Routing Engine

   The modular routing engine structure makes it easy to plug in new
   routing modules.

   Currently, only unicast callbacks are supported. Multicast can be added
   later.

   One existing routing module is up-down ("updn"), which may be activated
   with the '-R updn' option (instead of the old '-u').

   General usage is: $ opensm -R 'module-name'

   There is also a trivial routing module which is able to load LFTs from
   a file.

   Main features:

    - this will load switch LFTs and/or LID matrices (min hops tables)
    - this will load switch LFTs according to the path entries introduced
      in the file
    - no additional checks will be performed (such as "is port connected",
      etc.)
    - if fabric LIDs have changed, this will try to reconstruct LFTs
      correctly, provided endport GUIDs are present in the file
      (to disable this, GUIDs may be removed from the file or zeroed)

   The file format is compatible with the output of the 'ibroute' utility;
   for the whole fabric, it can be generated with the dump_lfts.sh script.

   To activate file based routing module, use:

     opensm -R file -U /path/to/lfts_file

   If the lfts_file is not found or contains errors, the default routing
   algorithm is used.

   The ability to dump switch lid matrices (aka min hops tables) to a file
   and to load them later is also supported.

   The usage is similar to loading unicast forwarding tables from an lfts
   file (introduced by the 'file' routing engine), but the lid matrix file
   name is specified with the -M or --lid_matrix_file option. For
   example:

     opensm -R file -M ./opensm-lid-matrix.dump

   The dump file is named opensm-lid-matrix.dump and is generated in the
   standard opensm dump directory (/var/log by default) when the
   OSM_LOG_ROUTING logging flag is set.

   When the 'file' routing engine is activated but the lfts file is not
   specified or cannot be opened, the default lid matrix algorithm will be
   used.

   There is also a switch forwarding tables dumper which generates a file
   compatible with dump_lfts.sh output. This file can be used as input for
   forwarding table loading by the 'file' routing engine. Either or both
   of the -U and -M options can be specified together with -R file.
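
   Because a missing or unreadable file leads to the default algorithm
   being used rather than to a failure, it can help to check the files
   before starting opensm. The following Python sketch does this and
   assembles the command line from the options described above. The
   helper name and the temporary file are hypothetical and not part of
   OpenSM.

```python
import os
import tempfile


def file_engine_args(lfts_file=None, lid_matrix_file=None):
    """Build an argv list for the 'file' routing engine.

    Raises if a named file is not readable, since opensm would otherwise
    fall back to its default algorithm instead of failing.
    """
    args = ["opensm", "-R", "file"]
    for option, path in (("-U", lfts_file), ("-M", lid_matrix_file)):
        if path is None:
            continue
        if not os.access(path, os.R_OK):
            raise FileNotFoundError(f"cannot read {path!r} (option {option})")
        args += [option, path]
    return args


# Stand-in for a previously dumped lid matrix file.
with tempfile.NamedTemporaryFile(suffix=".dump", delete=False) as tmp:
    matrix_path = tmp.name

print(" ".join(file_engine_args(lid_matrix_file=matrix_path)))
```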

PER MODULE LOGGING CONFIGURATION

   To  enable per module logging, configure per_module_logging_file to the
   per module logging config file name in  the  opensm  options  file.  To
   disable, configure per_module_logging_file to (null) there.

   The per module logging config file format is a set of lines with module
   name and logging level as follows:

    <module name><separator><logging level>

    <module name> is the file name including .c
    <separator> is either = , space, or tab
    <logging level> uses the same levels as the coarse/overall
    logging, as follows:

    BIT    LOG LEVEL ENABLED
    ----   -----------------
    0x01 - ERROR (error messages)
    0x02 - INFO (basic messages, low volume)
    0x04 - VERBOSE (interesting stuff, moderate volume)
    0x08 - DEBUG (diagnostic, high volume)
    0x10 - FUNCS (function entry/exit, very high volume)
    0x20 - FRAMES (dumps all SMP and GMP frames)
    0x40 - ROUTING (dump FDB routing information)
    0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM logging)
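
   As a rough illustration of the format above, the following Python
   sketch splits one config line into module name and level and lists the
   log levels that the bit mask enables. The module name in the example
   line is only illustrative, and accepting hexadecimal or decimal level
   values is an assumption of this sketch, not a statement about the
   OpenSM parser.

```python
import re

# Bit values copied from the table above.
LOG_FLAGS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
    0x80: "SYS",
}


def parse_line(line):
    """Split '<module name><separator><logging level>' into (name, level).

    The separator is '=', space, or tab. Accepting hex or decimal levels
    via int(..., 0) is an assumption of this sketch.
    """
    module, level = re.split(r"[= \t]+", line.strip(), maxsplit=1)
    return module, int(level, 0)


def enabled_levels(level):
    """Return the names of all log levels whose bit is set in level."""
    return [name for bit, name in LOG_FLAGS.items() if level & bit]


module, level = parse_line("osm_ucast_mgr.c=0x43")
print(module, enabled_levels(level))
```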

FILES

   /etc/opensm/opensm.conf
          default OpenSM config file.

   /etc/opensm/ib-node-name-map
          default  node  name  map  file.   See  ibnetdiscover  for   more
          information on format.

   /etc/opensm/partitions.conf
          default partition config file

   /etc/opensm/qos-policy.conf
          default QOS policy config file

   /etc/opensm/prefix-routes.conf
          default prefix routes file

   /etc/opensm/per-module-logging.conf
          default per module logging config file

   /etc/opensm/torus-2QoS.conf
          default torus-2QoS config file

AUTHORS

   Hal Rosenstock

   Sasha Khapyorsky

   Eitan Zahavi

   Yevgeny Kliteynik

   Thomas Sodring

   Ira Weiny

   Dale Purdy

SEE ALSO

   torus-2QoS(8), torus-2QoS.conf(5).


