- Execution of Programs
- Runtime Environment
- Job Accounting
- PBS (Portable Batch System)
- Sample Batch Scripts
Execution of Programs
Notice that "." (dot) representing the current working directory is not added to your default search path (PATH). In order to run executables located in the current working directory add "./" in front of the executable name
$ ./myprog
or alternatively specify the absolute path to the executable.
OpenMP Applications
Before running multi-threaded code, set the number of threads using the OMP_NUM_THREADS environment variable. E.g.
$ export OMP_NUM_THREADS=4
$ ./myprog
See the sample OpenMP Job script below for running batch jobs.
MPI Applications
Use the mpirun command to start MPI programs. E.g. running 8 MPI instances of myprog:
$ mpirun -np 8 ./myprog
For a complete specification of the option list, see the man mpirun page.
When running in the batch system, the mpiexec command provided with PBS Pro is a wrapper script that assembles the correct host list and corresponding mpirun command before executing the assembled mpirun command. The mpiexec_mpt command that comes with SGI MPT is an alternative to mpiexec. Unlike the mpiexec command, mpiexec_mpt supports all MPT mpirun global options. See the sample batch MPI Job script below. For more information see the man mpiexec_mpt page.
Hybrid MPI/OpenMP Applications
The omplace command causes the successive threads in a hybrid MPI/OpenMP job to be placed on unique CPUs. For example, running a 2-process MPI job with 3 threads per process
$ mpirun -np 2 omplace -nt 3 ./myprog
the threads would be placed as follows:
rank 0 thread 0 on CPU 0
rank 0 thread 1 on CPU 1
rank 0 thread 2 on CPU 2
rank 1 thread 0 on CPU 3
rank 1 thread 1 on CPU 4
rank 1 thread 2 on CPU 5
See sample Hybrid MPI/OpenMP Job script below for running a batch job. For more information see the man omplace page.
Runtime Environment
Dynamic Libraries
If you build an application with a particular compiler and/or with libraries that are linked dynamically at run time, make sure the same compiler and library modules are loaded when you run the application.
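For example (a minimal sketch; the module names are illustrative and must match whatever was loaded at build time), an application built with the Intel compiler and SGI MPT modules loaded would be run with the same modules loaded in the job script:
module load intelcomp mpt
./myprog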
Binary Data (Endianness)
Vilje will by default write and read Fortran sequential unformatted files in little-endian format. The Intel Fortran compiler can write and read big-endian files using a little-endian-to-big-endian conversion feature. To use big-endian unformatted files (e.g. created on Njord) for I/O on Vilje, set the F_UFMTENDIAN environment variable to the numbers of the units to be used for conversion. For example, to do big-endian-to-little-endian conversion on files with unit numbers 10 and 20:
$ export F_UFMTENDIAN=10,20
See the Intel® Fortran Compiler User and Reference Guides for more information.
Hyper-threading
Hyper-threading is enabled by default on the compute nodes, i.e. for each physical processor core the operating system sees two virtual processors. This means that each compute node presents 32 virtual processors.
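A quick way to verify this (assuming a shell on a compute node, e.g. in an interactive job) is to count the processors reported by the operating system:
$ grep -c "^processor" /proc/cpuinfo
32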
MPI Run-time Environment
MPI keeps track of certain resource utilization statistics. These can be used to determine potential performance problems caused by lack of MPI message buffers and other MPI internal resources. To turn on the display of MPI internal statistics, use the -stats option on the mpiexec_mpt (or mpirun) command, or set the MPI_STATS variable:
$ export MPI_STATS=1
MPI internal statistics are always gathered, so displaying them does not cause significant additional overhead. By default the data is sent to stderr. This can be changed by specifying the MPI_STATS_FILE variable. The file is written to by all processes; each line is prefixed by the host number and global rank of the process.
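For example, to gather statistics and write them to a file (the file name is just an illustration):
$ export MPI_STATS=1
$ export MPI_STATS_FILE=mpi_stats.out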
MPI Buffer Resources
By default, the SGI MPI implementation buffers messages whose lengths exceed 64 bytes into a shared memory region to allow for exchange of data between MPI processes. Because there is a finite number of these shared memory buffers, this can be a constraint on the overall application performance for certain communication patterns. If the MPI statistics file includes lines with high numbers of
...retries allocating mpi PER_PROC buffers...
and/or
...retries allocating mpi PER_HOST buffers...
increase the number of these buffers using the MPI_BUFS_PER_PROC (defaults to 32) and/or MPI_BUFS_PER_HOST (defaults to 96) variables, respectively. E.g.
$ export MPI_BUFS_PER_PROC=512
Keep in mind that increasing the number of buffers does consume more memory.
Single Copy Optimization
For message transfers between MPI processes it is possible, under certain conditions, to avoid the need to buffer messages. This single copy technique, using memory mapping, may result in better performance since it improves MPI’s bandwidth. Memory mapping within SGI MPT is disabled by default on Vilje. To enable it, specify:
$ export MPI_MEMMAP_OFF=0
The size of messages for which MPI attempts to use the single copy method is controlled by the MPI_BUFFER_MAX variable. In general, a value of 2000 or higher is beneficial for many applications. E.g.
$ export MPI_BUFFER_MAX=2048
i.e. MPI should try to avoid buffering for messages larger than 2048 bytes. Highly synchronized applications that perform large message transfers can benefit from the single-copy method. However, single copy transfers may introduce additional synchronization points, which can reduce application performance in some cases.
Process Placement
If you are running fewer than 16 MPI tasks per node, e.g. when running an MPI/OpenMP hybrid code or when the available memory per node is insufficient for 16 MPI tasks, you should distribute the processes evenly between the two sockets using the MPI_DSM_CPULIST environment variable. Specify in the job script:
$ export MPI_DSM_CPULIST=0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15:allhosts
The list specified above will work for any number of MPI tasks (up to 16) specified per node. This will place MPI ranks 0,2,4,...14 on socket 1 (with cpuids 0,1,...7) and MPI ranks 1,3,5,...15 on socket 2 (with cpuids 8,9,...15). Specifying allhosts indicates that the cpu list pattern applies to all compute nodes allocated for the job.
Setting the MPI_DSM_VERBOSE variable will direct MPI to print the host placement information to the standard error file:
$ export MPI_DSM_VERBOSE=1
Note that there is no need to specify the MPI_DSM_CPULIST variable when running 16 MPI tasks on each node.
For a full list of the MPI environment settings, see the 'ENVIRONMENT VARIABLES' section in the man mpi page.
Job Accounting
Jobs are accounted for the wall clock time used to finish the job, multiplied by 16 (the number of physical cores per node) times the number of nodes reserved.
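For example, a job that reserves 10 nodes and finishes after 2 hours of wall clock time is charged 2 x 16 x 10 = 320 core hours.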
For an overview of the CPU time charged to the account(s) you are a member of, use the cost command. Add --help for a description of the available options:
$ cost --help
Report processor core hour usage for a user or project.

options:
  -h, --help            Show this help message and exit
  -p ACCOUNT, --project=ACCOUNT
                        Account/project for which to report accounting data. If no
                        argument is given, all projects with user $USER are listed.
  -P PERIOD, --period=PERIOD
                        Show information for specified NOTUR period in format YYYY.P,
                        e.g. 2010.1. NB. If used with -p, will show only information
                        regarding users currently accessible projects.
  -s                    Report in seconds (default: hours).

Without options, the command shows the core hours used by the user for all projects that the user account is connected to. With option -p, the command shows, for the specified project, the total number of core hours used, the total number of core hours reserved by unfinished jobs, the quota with prioritized core hours and the quota with unprioritized core hours.
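For example, to report the usage for a single project (the project name below is just a placeholder):
$ cost -p nn1234k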
PBS (Portable Batch System)
Job Scheduling Policy
Jobs are ordered by submission time and tentatively scheduled in the sorted order, but with backfilling enabled. The scheduler will traverse the list of jobs in order to backfill, i.e. use available resources as long as the estimated start times of earlier jobs in the list are not pushed into the future.
PBS Commands
Command | Description
---|---
qsub | Submit a job
qdel | Delete a job
qstat | Request the status of jobs
For example, submit a job using a job script myjob.pbs:
$ qsub myjob.pbs
1234567.service2
Check the status of my job:
$ qstat -f 1234567.service2
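A queued or running job can be deleted with qdel, using the job identifier returned by qsub:
$ qdel 1234567.service2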
Job Submission Options
Below is a list of some of the options for the qsub command. For a complete list of all options see the 'man qsub' page.
Option | Description
---|---
-N <name> | Specify a name for the job
-A <account> | Specify the account name
-o <path> | Specify a name for the stdout file
-e <path> | Specify a name for the stderr file
-m [a|b|e] | Specify email notification when the job starts (b), ends (e), or if it aborts (a)
-M <email> | Specify the email address to send notification
-I | Job is to be run interactively
-X | Enable X11 forwarding
-l resource_list | Specify the set of resources requested for the job (see Consumable Resources below)
Consumable Resources
A chunk is the smallest set of resources that will be allocated to a job. Since jobs are accounted for entire nodes, i.e. 16 (physical cores) times the number of nodes, the chunk size should be equal to one node. This means that multiple jobs should not be run on the same node. Requesting resources at node level is done using the "select" specification statement, followed by the number of chunks and a set of resources requested for each chunk, separated by colons. The available resources are listed in the table below.
Keyword | Description
---|---
walltime | The wall clock time limit of the job, format hh:mm:ss
ncpus | Number of cpus in the resource chunk requested
mpiprocs | Number of MPI processes for each chunk
ompthreads | Number of threads per process in a chunk
mem | Memory requested for a chunk
Since hyper-threading is enabled by default on the compute nodes, the number of cpus seen by the operating system is 32 per node. Therefore, when allocating nodes to a job, always specify "ncpus=32".
E.g. requesting 10 nodes with 16 mpi processes per node:
#PBS -l select=10:ncpus=32:mpiprocs=16
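A memory request can be added to each chunk with the mem keyword; the amount below is purely illustrative and must be adapted to the memory actually available per node:
#PBS -l select=10:ncpus=32:mpiprocs=16:mem=16gb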
PBS supports OpenMP applications by setting the OMP_NUM_THREADS variable automatically based on the resource request of a job. If ompthreads is not used, OMP_NUM_THREADS is set to the value of ncpus. E.g. running 16 threads on one node:
#PBS -l select=1:ncpus=32:ompthreads=16
The walltime resource is specified at job-wide level, e.g. asking for 24 hours:
#PBS -l walltime=24:00:00
Users must specify the wall clock limit for a job. Failing to do so will result in an error:
qsub: Job has no walltime requested
All jobs on the system will be run in the workq queue. This is the default queue.
Job Arrays
Running a number of jobs based on the same job script can be done using the job array feature. A job array represents a collection of subjobs, each with a unique index number. To submit a job array, use the -J <range> option to qsub, e.g. in a job script:
#PBS -J 0-1
This example will result in 2 subjobs with indices 0 and 1. The PBS_ARRAY_INDEX environment variable gives the subjob index and can be used to run subjobs in individual work directories, load different data files, or any other operation that requires a unique index. E.g. running 2 instances of a program on different data sets
data0.inp
data1.inp
the files can be operated on by specifying
data$PBS_ARRAY_INDEX.inp
in the jobscript.
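For example (a sketch; myprog and reading the data set from standard input are assumptions about the application):
./myprog < data$PBS_ARRAY_INDEX.inp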
When running the qstat command, the job array will be listed with job state B:
$ qstat
69138[].service2    array_job   user          0   B   workq
Adding the -t option will show all subjobs:
$ qstat -t
69138[].service2    array_job   user          0   B   workq
69138[0].service2   array_job   user   00:21:34   R   workq
69138[1].service2   array_job   user   00:21:34   R   workq
Job arrays can be deleted by specifying the job array identifier:
$ qdel 69138[].service2
or for individual subjobs, e.g.:
$ qdel 69138[1].service2
See sample Job Array Script for a batch job running 4 instances of an MPI executable.
Interactive Batch Job
To run an interactive batch job, add the -I option to qsub. When the job is scheduled, the standard input, output and error are sent to the terminal session in which qsub is running. If you need to use an X11 display from within your job, add the -X option. E.g.:
$ qsub -I -X -A <my account> -l select=1:ncpus=32 -l walltime=02:00:00
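Once the session starts, commands are typed as in a normal shell. A minimal sketch, assuming an MPI program built with the intelcomp and mpt modules:
$ module load intelcomp mpt
$ cd $PBS_O_WORKDIR
$ mpiexec_mpt -n 16 ./myprog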
Further Information
- See the PBS Pro User Guide (PDF)
Sample Batch Scripts
MPI Job
Sample job running 1024 MPI processes
#!/bin/bash
#PBS -N my_mpi_job
#PBS -A nn1234k
#PBS -l walltime=24:00:00
#PBS -l select=64:ncpus=32:mpiprocs=16

# Tips to improving performance
#
# 1. Adjusting MPI_BUFS_PER_PROC and MPI_BUFS_PER_HOST.
#
# Use the "-stats" option to mpiexec_mpt to get additional information in the
# output file. Included in that information is the number of retries for
# allocating MPI buffers. After you have executed your program with the "-stats"
# option, you can see this by typing something similar to:
#
# $ cat my_mpi_job.o55809 | grep retries | grep -v " 0 retries"
#
# You can then increase the values of MPI_BUFS_PER_PROC (default 32) and
# MPI_BUFS_PER_HOST (default 96) until the number of retries is sufficiently
# low, e.g. by uncommenting these lines:
#
# export MPI_BUFS_PER_PROC=256
# export MPI_BUFS_PER_HOST=1024
#
# See "man mpi" for more information.
#
#
# 2. Adjusting MPI_BUFFER_MAX
#
# For some codes it gives a significant increase in performance to specify a
# value for MPI_BUFFER_MAX. According to "man mpi" this value "Specifies a
# minimum message size, in bytes, for which the message will be considered a
# candidate for single-copy transfer." The value of MPI_BUFFER_MAX varies from
# program to program, but typical values are between 2048 and 32768. You can
# therefore test if this improves the performance of your program by executing
# it like this:
#
# export MPI_BUFFER_MAX=2048
# time -p mpiexec_mpt ./myprog
#
# See "man mpi" for more information.

module load intelcomp
module load mpt

cd $PBS_O_WORKDIR

mpiexec_mpt ./myprog
OpenMP Job
Running 16 threads on one node
#!/bin/bash
#PBS -N my_openmp_job
#PBS -A nn1234k
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=32:ompthreads=16

module load intelcomp

cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=16   # Optional
./myprog
Hybrid MPI/OpenMP Job
Sample job running 128 MPI processes, two on each node, with 8 OpenMP threads per MPI process
#!/bin/bash
#PBS -N my_hybrid_job
#PBS -A nn1234k
#PBS -l walltime=24:00:00
#PBS -l select=64:ncpus=32:mpiprocs=2:ompthreads=8

module load intelcomp
module load mpt

cd $PBS_O_WORKDIR

mpiexec_mpt -n 128 omplace -nt 8 ./myprog
Job Array Script
#!/bin/bash
#PBS -N array_job
#PBS -A nn1234k
#PBS -J 0-3    # Run 4 subjobs
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=32:mpiprocs=16

module load intelcomp
module load mpt

cd $PBS_O_WORKDIR

# Create a work directory for each subjob:
w=/work/$PBS_O_LOGNAME/$PBS_JOBNAME$PBS_ARRAY_INDEX
if [ ! -d $w ]; then mkdir -p $w; fi

# Copy individual input files to work directories:
cp data$PBS_ARRAY_INDEX.inp $w

cd $w

mpiexec_mpt $PBS_O_HOME/myprog
Application Specific Sample Jobscripts
Abaqus Job
See the Abaqus page.
ADF Job
See the ADF page.
Ansys Mechanical Job
#!/bin/bash
###################################################
#
#  Ansys Mechanical Job
#
###################################################
#
#PBS -N static
#PBS -A nn1234k
#PBS -l select=2:ncpus=32:mpiprocs=16
#PBS -l walltime=24:00:00
#
module load ansys/15.0

case=$PBS_JOBNAME

cd $PBS_O_WORKDIR

# Create (if necessary) the working directory
w=/work/$PBS_O_LOGNAME/ansys/$case
if [ ! -d $w ]; then mkdir -p $w; fi

# Copy inputfile and move to working directory
cp $case.inp $w
cd $w

machines=`uniq -c ${PBS_NODEFILE} | awk '{print $2 ":" $1}' | paste -s -d ':'`

export MPI_WORKDIR=$w

ansys150 -j $case -b -dis -usessh -machines $machines -i $case.inp -o $case.out
CFX Job
#!/bin/bash
###################################################
#
#  Running CFX in distributed parallel mode
#
###################################################
#
#PBS -N Benchmark
#PBS -A nn1234k
#PBS -l select=2:ncpus=32:mpiprocs=16
#PBS -l Fluent=32
#PBS -l walltime=24:00:00
#
module load cfx

case=$PBS_JOBNAME

cd $PBS_O_WORKDIR

# Create (if necessary) the working directory
w=/work/$PBS_O_LOGNAME/cfx/$case
if [ ! -d $w ]; then mkdir -p $w; fi

# Copy inputfile and move to working directory
cp $case.def $w
cd $w

nodes=`cat $PBS_NODEFILE`
nodes=`echo $nodes | sed -e 's/ /,/g'`

export CFX5RSH=ssh

cfx5solve -batch -double -def $case.def \
  -start-method 'Platform MPI Distributed Parallel' -par-dist $nodes
COMSOL Job
See the COMSOL page.
CP2K Job
See the CP2K page.
Fluent Job
#!/bin/bash
###################################################
#
#  Fluent job using the 2D double precision solver
#
###################################################
#
#PBS -N cavity
#PBS -A nn1234k
#PBS -l select=2:ncpus=32:mpiprocs=16
#PBS -l walltime=24:00:00
#PBS -l Fluent=16
###################################################
# "Fluent=16" ensures that there are 16 Fluent
# licenses available when the job starts.
# Licenses = <select> x <mpiprocs> - 16
###################################################
#
module load fluent

case=$PBS_JOBNAME

cd $PBS_O_WORKDIR

# Create (if necessary) the working directory
w=/work/$PBS_O_LOGNAME/fluent/$case
if [ ! -d $w ]; then mkdir -p $w; fi

# Copy inputfiles and move to working directory
cp $case.cas $w
cp $case.dat $w
cp fluent.jou $w
cd $w

procs=`cat $PBS_NODEFILE | wc -l`

fluent 2ddp -i fluent.jou -p -t$procs -g -ssh -cnf=$PBS_NODEFILE
NAMD Job
See the NAMD page.
OpenFOAM Job
See the OpenFOAM page.
StarCCM+ Job
See the STAR-CCM+ page.
VASP Job
See the VASP page.
VisIt Job
Compiling a VisIt example on Vilje
$ mkdir visit-test
$ cd visit-test
$ module load gcc/4.7.1 mpt/2.06 visit/2.6.3
$ module load intelcomp/13.0.1 cmake/2.8.11.2
$ cp -r $VISITHOME/DataManualExamples/Simulations/contrib/pjacobi .
$ cd pjacobi/F90
$ ln -s $VISITHOME/DataManualExamples/Simulations/simulationexamplev2.f .
$ ccmake -DCMAKE_Fortran_COMPILER=mpif90 .
In the ccmake GUI do the following:
Press [c] to configure
Press [c] to configure
Press [e] to exit help
Press [g] to generate and exit
Now compile the code with:
$ make
Running the simulation in batch
Create a file run.pbs that looks like this:
#!/bin/bash
#PBS -N visit
#PBS -A ACCOUNT_NAME
#PBS -l select=1:ncpus=32:mpiprocs=16
#PBS -l walltime=00:30:00
#PBS -m abe
#PBS -j oe

module load gcc/4.7.1 mpt/2.06 visit/2.6.3
module load intelcomp/13.0.1

cd $PBS_O_WORKDIR

mpiexec_mpt pjacobi_visit
Submit the job:
$ chmod +x run.pbs
$ qsub run.pbs
Visualize the simulation
Wait until your job is running and then type:
$ visit
Use VisIt to visualize the data from the simulation:
Click "File" -> "Open File" Double-click ".." until you reach your home directory Double-click ".visit" Double-click "simulations" Select the file *.pjacobi.sim2 at the bottom of the list Click "OK" In left frame "Plots": Add -> Pseudocolor -> temperature Click "Draw" Click "File" -> "Simulations..." Click "run"
Shut down the simulation running in batch:
Click "halt" Click "Dismiss" Click "File" -> "Compute engines..." Click "Disconnect" Click "OK" Click "Dismiss"
Shut down VisIt:
Click "File" -> "Exit"
ParaView job for visualising OpenFOAM data
Create the job-script
Create a data directory and move your OpenFOAM data to it:
$ mkdir paraview-test
$ cd paraview-test
$ mkdir data
$ mv OPENFOAM_SIMULATION_DATA data
Create the job-script run.pbs:
#!/bin/bash
#PBS -N ParaView
#PBS -l walltime=02:00:00
#PBS -m abe
#PBS -l select=1:ncpus=32:mpiprocs=16
#PBS -A account_name
# *** HOST_IP must be set before submitting the job ***
#PBS -v HOST_IP
#PBS -q test

module load gcc/4.9.1 mpt/2.13 paraview/5.0.0-batch

workdir=/work/$USER/$PBS_JOBID
mkdir -p $workdir
cp -r $PBS_O_WORKDIR/data $workdir
cd $workdir/data

touch data.foam

mpiexec_mpt pvserver --use-offscreen-rendering --reverse-connection \
  --server-port=75000 --client-host=${HOST_IP}
Make the run script executable:
$ chmod +x run.pbs
Submit the job:
$ export HOST_IP=`hostname -i`
$ qsub run.pbs
Start the client:
$ module load gcc/4.9.1 paraview/5.0.0-gui
$ paraview
- Click File -> Connect -> Add Server
- Choose Client / Server (reverse connection)
- Set the name
- Set Port = 75000
- Configure
- Choose your new server
- Click Connect
- Click OK
- Click File -> Open
- Choose data.foam
- Click OK
- Click Apply