- Execution of Programs
- Runtime Environment
- Job Accounting
- PBS (Portable Batch System)
- Sample Batch Scripts
Execution of Programs
Notice that "." (dot) representing the current working directory is not added to your default search path (PATH). In order to run executables located in the current working directory add "./" in front of the executable name
$ ./myprog
or alternatively specify the absolute path to the executable.
OpenMP Applications
Before running multi-threaded code, set the number of threads using the OMP_NUM_THREADS environment variable. E.g.
$ export OMP_NUM_THREADS=4
$ ./myprog
See the sample OpenMP Job script below for running batch jobs.
MPI Applications
Use the mpirun command to start MPI programs. E.g. running 8 MPI instances of myprog:
$ mpirun -np 8 ./myprog
For a complete specification of the option list, see the man mpirun page.
When running in the batch system, the mpiexec command provided with PBS Pro is a wrapper script that assembles the correct host list and corresponding mpirun command before executing the assembled mpirun command. The mpiexec_mpt command that comes with SGI MPT is an alternative to mpiexec. Unlike the mpiexec command, mpiexec_mpt supports all MPT mpirun global options. See the sample batch MPI Job script below. For more information see the man mpiexec_mpt page.
Hybrid MPI/OpenMP Applications
The omplace command causes the successive threads in a hybrid MPI/OpenMP job to be placed on unique CPUs. For example, running a 2-process MPI job with 3 threads per process
$ mpirun -np 2 omplace -nt 3 ./myprog
the threads would be placed as follows:
rank 0 thread 0 on CPU 0
rank 0 thread 1 on CPU 1
rank 0 thread 2 on CPU 2
rank 1 thread 0 on CPU 3
rank 1 thread 1 on CPU 4
rank 1 thread 2 on CPU 5
See sample Hybrid MPI/OpenMP Job script below for running a batch job. For more information see the man omplace page.
Runtime Environment
Dynamic Libraries
If you build an application with a particular compiler and/or with libraries that are linked dynamically at run time, make sure the same compiler and library modules are loaded when you run the application.
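For example (a minimal sketch; the module names are illustrative and must match whatever was loaded at build time), an application built with the Intel compiler and SGI MPT modules loaded would be run with the same modules loaded in the job script:
module load intelcomp mpt
./myprog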
Binary Data (Endianness)
Vilje will by default write and read Fortran sequential unformatted files in little-endian format. The Intel Fortran compiler can write and read big-endian files using a little-endian-to-big-endian conversion feature. To use big-endian unformatted files (e.g. created on Njord) for I/O on Vilje, set the F_UFMTENDIAN environment variable to the numbers of the units to be used for conversion. For example, to do big-endian-to-little-endian conversion on files with unit numbers 10 and 20:
$ export F_UFMTENDIAN=10,20
See the Intel® Fortran Compiler User and Reference Guides for more information.
Hyper-threading
Hyper-threading is enabled by default on the compute nodes, i.e. for each physical processor core the operating system sees two virtual processors. This means that each compute node presents 32 virtual processors.
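A quick way to verify this (assuming a shell on a compute node, e.g. in an interactive job) is to count the processors reported by the operating system:
$ grep -c "^processor" /proc/cpuinfo
32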
MPI Run-time Environment
MPI keeps track of certain resource utilization statistics. These can be used to determine potential performance problems caused by lack of MPI message buffers and other MPI internal resources. To turn on the display of MPI internal statistics, use the -stats option on the mpiexec_mpt (or mpirun) command, or set the MPI_STATS variable:
$ export MPI_STATS=1
MPI internal statistics are always gathered, so displaying them does not cause significant additional overhead. By default the data is sent to stderr. This can be changed by specifying the MPI_STATS_FILE variable. The file is written to by all processes; each line is prefixed by the host number and global rank of the process.
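For example, to gather statistics and write them to a file (the file name is just an illustration):
$ export MPI_STATS=1
$ export MPI_STATS_FILE=mpi_stats.out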
MPI Buffer Resources
By default, the SGI MPI implementation buffers messages whose lengths exceed 64 bytes into a shared memory region to allow for exchange of data between MPI processes. Because there is a finite number of these shared memory buffers, this can be a constraint on the overall application performance for certain communication patterns. If the MPI statistics file includes lines with high numbers of
...retries allocating mpi PER_PROC buffers...
and/or
...retries allocating mpi PER_HOST buffers...
increase the number of these buffers using the MPI_BUFS_PER_PROC (defaults to 32) and/or MPI_BUFS_PER_HOST (defaults to 96) variables, respectively. E.g.
$ export MPI_BUFS_PER_PROC=512
Keep in mind that increasing the number of buffers does consume more memory.
Single Copy Optimization
For message transfers between MPI processes it is possible, under certain conditions, to avoid the need to buffer messages. This single copy technique, using memory mapping, may result in better performance since it improves MPI’s bandwidth. Memory mapping within SGI MPT is disabled by default on Vilje. To enable it, specify:
$ export MPI_MEMMAP_OFF=0
The size of messages for which MPI attempts to use the single copy method is controlled by the MPI_BUFFER_MAX variable. In general, a value of 2000 or higher is beneficial for many applications. E.g.
$ export MPI_BUFFER_MAX=2048
i.e. MPI should try to avoid buffering for messages larger than 2048 bytes. Highly synchronized applications that perform large message transfers can benefit from the single-copy method. However, single copy transfers may introduce additional synchronization points, which can reduce application performance in some cases.
Process Placement
If you are running fewer than 16 MPI tasks per node, e.g. when running an MPI/OpenMP hybrid code or when the available memory per node is insufficient for 16 MPI tasks, you should distribute the processes evenly between the two sockets using the MPI_DSM_CPULIST environment variable. Specify in the job script:
$ export MPI_DSM_CPULIST=0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15:allhosts
The list specified above will work for any number of MPI tasks (up to 16) specified per node. This will place MPI ranks 0,2,4,...14 on socket 1 (with cpuids 0,1,...7) and MPI ranks 1,3,5,...15 on socket 2 (with cpuids 8,9,...15). Specifying allhosts indicates that the cpu list pattern applies to all compute nodes allocated for the job.
Setting the MPI_DSM_VERBOSE variable will direct MPI to print the host placement information to the standard error file:
$ export MPI_DSM_VERBOSE=1
Note that there is no need to specify the MPI_DSM_CPULIST variable when running 16 MPI tasks on each node.
For a full list of the MPI environment settings, see the 'ENVIRONMENT VARIABLES' section in the man mpi page.
Job Accounting
Jobs are accounted for the wall clock time used to finish the job, multiplied by 16 (the number of physical cores per node) times the number of nodes reserved.
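For example, a job that reserves 10 nodes and finishes after 2 hours of wall clock time is charged 2 x 16 x 10 = 320 core hours.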
For an overview of the CPU time charged to the account(s) you are a member of, use the cost command. Add --help for a description of the available options:
$ cost --help
Report processor core hour usage for a user or project.

options:
  -h, --help            Show this help message and exit
  -p ACCOUNT, --project=ACCOUNT
                        Account/project for which to report accounting data. If no
                        argument is given, all projects with user $USER are listed.
  -P PERIOD, --period=PERIOD
                        Show information for specified NOTUR period in format YYYY.P,
                        e.g. 2010.1. NB. If used with -p, will show only information
                        regarding users currently accessible projects.
  -s                    Report in seconds (default: hours).

Without options, the command shows the core hours used by the user for all projects that the user account is connected to. With option -p, the command shows, for the specified project, the total number of core hours used, the total number of core hours reserved by unfinished jobs, the quota with prioritized core hours and the quota with unprioritized core hours.
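For example, to report the usage for a single project (the project name below is just a placeholder):
$ cost -p nn1234k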
PBS (Portable Batch System)
Job Scheduling Policy
Jobs are ordered by submission time and tentatively scheduled in the sorted order, but with backfilling enabled. The scheduler will traverse the list of jobs in order to backfill, i.e. use available resources as long as the estimated start times of earlier jobs in the list are not pushed into the future.
PBS Commands
Command | Description
---|---
qsub | Submit a job
qdel | Delete a job
qstat | Request the status of jobs
For example, submit a job using a job script myjob.pbs:
$ qsub myjob.pbs
1234567.service2
Check the status of my job:
$ qstat -f 1234567.service2
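A queued or running job can be deleted with qdel, using the job identifier returned by qsub:
$ qdel 1234567.service2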
Job Submission Options
Below is a list of some of the options for the qsub command. For a complete list of all options see the 'man qsub' page.
Option | Description
---|---
-N <name> | Specify a name for the job
-A <account> | Specify the account name
-o <path> | Specify a name for the stdout file
-e <path> | Specify a name for the stderr file
-m [a|b|e] | Specify email notification when the job starts (b), ends (e), or if it aborts (a)
-M <email> | Specify the email address to send notification
-I | Job is to be run interactively
-X | Enable X11 forwarding
-l resource_list | Specify the set of resources requested for the job (see Consumable Resources below)
Consumable Resources
A chunk is the smallest set of resources that will be allocated to a job. Since jobs are accounted for entire nodes, i.e. 16 (physical cores) times the number of nodes, the chunk size should be equal to one node. This means that multiple jobs should not be run on the same node. Requesting resources at node level is done using the "select" specification statement, followed by the number of chunks and a set of resources requested for each chunk, separated by colons. The available resources are listed in the table below.
Keyword | Description
---|---
walltime | The wall clock time limit of the job, format hh:mm:ss
ncpus | Number of cpus in the resource chunk requested
mpiprocs | Number of MPI processes for each chunk
ompthreads | Number of threads per process in a chunk
mem | Memory requested for a chunk
Since hyper-threading is enabled by default on the compute nodes, the number of cpus seen by the operating system is 32 per node. Therefore, when allocating nodes to a job, always specify "ncpus=32".
E.g. requesting 10 nodes with 16 mpi processes per node:
#PBS -l select=10:ncpus=32:mpiprocs=16
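A memory request can be added to each chunk with the mem keyword; the amount below is purely illustrative and must be adapted to the memory actually available per node:
#PBS -l select=10:ncpus=32:mpiprocs=16:mem=16gb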
PBS supports OpenMP applications by setting the OMP_NUM_THREADS variable automatically based on the resource request of a job. If ompthreads is not used, OMP_NUM_THREADS is set to the value of ncpus. E.g. running 16 threads on one node:
#PBS -l select=1:ncpus=32:ompthreads=16
The walltime resource is specified at job-wide level, e.g. asking for 24 hours:
#PBS -l walltime=24:00:00
Users must specify the wall clock limit for a job. Failing to do so will result in an error:
qsub: Job has no walltime requested
All jobs on the system will be run in the workq queue. This is the default queue.
Job Arrays
Running a number of jobs based on the same job script can be done using the job array feature. A job array represents a collection of subjobs, each with a unique index number. To submit a job array, use the -J <range> option to qsub, e.g. in a job script:
#PBS -J 0-1
This example will result in 2 subjobs with indices 0 and 1. The PBS_ARRAY_INDEX environment variable gives the subjob index and can be used to run subjobs in individual work directories, load different data files, or any other operation that requires a unique index. E.g. running 2 instances of a program on different data sets
data0.inp
data1.inp
the files can be operated on by specifying
data$PBS_ARRAY_INDEX.inp
in the jobscript.
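For example (a sketch; myprog and reading the data set from standard input are assumptions about the application):
./myprog < data$PBS_ARRAY_INDEX.inp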
When running the qstat command, the job array will be listed with job state B:
$ qstat
69138[].service2    array_job   user          0   B   workq
Adding the -t option will show all subjobs:
$ qstat -t
69138[].service2    array_job   user          0   B   workq
69138[0].service2   array_job   user   00:21:34   R   workq
69138[1].service2   array_job   user   00:21:34   R   workq
Job arrays can be deleted by specifying the job array identifier:
$ qdel 69138[].service2
or for individual subjobs, e.g.:
$ qdel 69138[1].service2
See sample Job Array Script for a batch job running 4 instances of an MPI executable.
Interactive Batch Job
To run an interactive batch job, add the -I option to qsub. When the job is scheduled, the standard input, output and error are sent to the terminal session in which qsub is running. If you need to use an X11 display from within your job, add the -X option. E.g.:
$ qsub -I -X -A <my account> -l select=1:ncpus=32 -l walltime=02:00:00
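Once the session starts, commands are typed as in a normal shell. A minimal sketch, assuming an MPI program built with the intelcomp and mpt modules:
$ module load intelcomp mpt
$ cd $PBS_O_WORKDIR
$ mpiexec_mpt -n 16 ./myprog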
Further Information
- See the PBS Pro User Guide (PDF)
Sample Batch Scripts
MPI Job
Sample job running 1024 MPI processes
#!/bin/bash
#PBS -N my_mpi_job
#PBS -A nn1234k
#PBS -l walltime=24:00:00
#PBS -l select=64:ncpus=32:mpiprocs=16

# Tips to improving performance
#
# 1. Adjusting MPI_BUFS_PER_PROC and MPI_BUFS_PER_HOST.
#
# Use the "-stats" option to mpiexec_mpt to get additional information in the
# output file. Included in that information is the number of retries for
# allocating MPI buffers. After you have executed your program with the "-stats"
# option, you can see this by typing something similar to:
#
# $ cat my_mpi_job.o55809 | grep retries | grep -v " 0 retries"
#
# You can then increase the values of MPI_BUFS_PER_PROC (default 32) and
# MPI_BUFS_PER_HOST (default 96) until the number of retries is sufficiently
# low, e.g. by uncommenting these lines:
#
# export MPI_BUFS_PER_PROC=256
# export MPI_BUFS_PER_HOST=1024
#
# See "man mpi" for more information.
#
#
# 2. Adjusting MPI_BUFFER_MAX
#
# For some codes it gives a significant increase in performance to specify a
# value for MPI_BUFFER_MAX. According to "man mpi" this value "Specifies a
# minimum message size, in bytes, for which the message will be considered a
# candidate for single-copy transfer." The value of MPI_BUFFER_MAX varies from
# program to program, but typical values are between 2048 and 32768. You can
# therefore test if this improves the performance of your program by executing
# it like this:
#
# export MPI_BUFFER_MAX=2048
# time -p mpiexec_mpt ./myprog
#
# See "man mpi" for more information.

module load intelcomp
module load mpt

cd $PBS_O_WORKDIR

mpiexec_mpt ./myprog
OpenMP Job
Running 16 threads on one node
#!/bin/bash
#PBS -N my_openmp_job
#PBS -A nn1234k
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=32:ompthreads=16

module load intelcomp

cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=16   # Optional
./myprog
Hybrid MPI/OpenMP Job
Sample job running 128 MPI processes, two on each node, with 8 OpenMP threads per MPI process
#!/bin/bash
#PBS -N my_hybrid_job
#PBS -A nn1234k
#PBS -l walltime=24:00:00
#PBS -l select=64:ncpus=32:mpiprocs=2:ompthreads=8

module load intelcomp
module load mpt

cd $PBS_O_WORKDIR

mpiexec_mpt -n 128 omplace -nt 8 ./myprog
Job Array Script
#!/bin/bash
#PBS -N array_job
#PBS -A nn1234k
#PBS -J 0-3    # Run 4 subjobs
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=32:mpiprocs=16

module load intelcomp
module load mpt

cd $PBS_O_WORKDIR

# Create a work directory for each subjob:
w=/work/$PBS_O_LOGNAME/$PBS_JOBNAME$PBS_ARRAY_INDEX
if [ ! -d $w ]; then mkdir -p $w; fi

# Copy individual input files to work directories:
cp data$PBS_ARRAY_INDEX.inp $w

cd $w

mpiexec_mpt $PBS_O_HOME/myprog
Application Specific Sample Jobscripts
Abaqus Job
See the Abaqus page.
ADF Job
See the ADF page.
Ansys Mechanical Job
#!/bin/bash
###################################################
#
#  Ansys Mechanical Job
#
###################################################
#
#PBS -N static
#PBS -A nn1234k
#PBS -l select=2:ncpus=32:mpiprocs=16
#PBS -l walltime=24:00:00
#
module load ansys/15.0

case=$PBS_JOBNAME

cd $PBS_O_WORKDIR

# Create (if necessary) the working directory
w=/work/$PBS_O_LOGNAME/ansys/$case
if [ ! -d $w ]; then mkdir -p $w; fi

# Copy inputfile and move to working directory
cp $case.inp $w
cd $w

machines=`uniq -c ${PBS_NODEFILE} | awk '{print $2 ":" $1}' | paste -s -d ':'`

export MPI_WORKDIR=$w

ansys150 -j $case -b -dis -usessh -machines $machines -i $case.inp -o $case.out
CFX Job
#!/bin/bash
###################################################
#
#  Running CFX in distributed parallel mode
#
###################################################
#
#PBS -N Benchmark
#PBS -A nn1234k
#PBS -l select=2:ncpus=32:mpiprocs=16
#PBS -l Fluent=32
#PBS -l walltime=24:00:00
#
module load cfx

case=$PBS_JOBNAME

cd $PBS_O_WORKDIR

# Create (if necessary) the working directory
w=/work/$PBS_O_LOGNAME/cfx/$case
if [ ! -d $w ]; then mkdir -p $w; fi

# Copy inputfile and move to working directory
cp $case.def $w
cd $w

nodes=`cat $PBS_NODEFILE`
nodes=`echo $nodes | sed -e 's/ /,/g'`

export CFX5RSH=ssh

cfx5solve -batch -double -def $case.def \
  -start-method 'Platform MPI Distributed Parallel' -par-dist $nodes
COMSOL Job
See the COMSOL page.
CP2K Job
See the CP2K page.
Fluent Job
#!/bin/bash
###################################################
#
#  Fluent job using the 2D double precision solver
#
###################################################
#
#PBS -N cavity
#PBS -A nn1234k
#PBS -l select=2:ncpus=32:mpiprocs=16
#PBS -l walltime=24:00:00
#PBS -l Fluent=16
###################################################
# "Fluent=16" ensures that there are 16 Fluent
# licenses available when the job starts.
# Licenses = <select> x <mpiprocs> - 16
###################################################
#
module load fluent

case=$PBS_JOBNAME

cd $PBS_O_WORKDIR

# Create (if necessary) the working directory
w=/work/$PBS_O_LOGNAME/fluent/$case
if [ ! -d $w ]; then mkdir -p $w; fi

# Copy inputfiles and move to working directory
cp $case.cas $w
cp $case.dat $w
cp fluent.jou $w
cd $w

procs=`cat $PBS_NODEFILE | wc -l`

fluent 2ddp -i fluent.jou -p -t$procs -g -ssh -cnf=$PBS_NODEFILE
NAMD Job
See the NAMD page.
OpenFOAM Job
See the OpenFOAM page.
StarCCM+ Job
See the STAR-CCM+ page.
VASP Job
See the VASP page.
VisIt Job
Compiling a VisIt example on Vilje
$ mkdir visit-test
$ cd visit-test
$ module load gcc/4.7.1 mpt/2.06 visit/2.6.3
$ module load intelcomp/13.0.1 cmake/2.8.11.2
$ cp -r $VISITHOME/DataManualExamples/Simulations/contrib/pjacobi .
$ cd pjacobi/F90
$ ln -s $VISITHOME/DataManualExamples/Simulations/simulationexamplev2.f .
$ ccmake -DCMAKE_Fortran_COMPILER=mpif90 .
In the ccmake GUI do the following:
Press [c] to configure
Press [c] to configure
Press [e] to exit help
Press [g] to generate and exit
Now compile the code with:
$ make
Running the simulation in batch
Create a file run.pbs that looks like this:
#!/bin/bash
#PBS -N visit
#PBS -A ACCOUNT_NAME
#PBS -l select=1:ncpus=32:mpiprocs=16
#PBS -l walltime=00:30:00
#PBS -m abe
#PBS -j oe

module load gcc/4.7.1 mpt/2.06 visit/2.6.3
module load intelcomp/13.0.1

cd $PBS_O_WORKDIR

mpiexec_mpt pjacobi_visit
Submit the job:
$ chmod +x run.pbs
$ qsub run.pbs
Visualize the simulation
Wait until your job is running and then type:
$ visit
Use VisIt to visualize the data from the simulation:
Click "File" -> "Open File" Double-click ".." until you reach your home directory Double-click ".visit" Double-click "simulations" Select the file *.pjacobi.sim2 at the bottom of the list Click "OK" In left frame "Plots": Add -> Pseudocolor -> temperature Click "Draw" Click "File" -> "Simulations..." Click "run"
Shut down the simulation running in batch:
Click "halt" Click "Dismiss" Click "File" -> "Compute engines..." Click "Disconnect" Click "OK" Click "Dismiss"
Shut down VisIt:
Click "File" -> "Exit"
ParaView job for visualising OpenFOAM data
Create the job-script
Create a data directory and move your OpenFOAM data to it:
$ mkdir paraview-test
$ cd paraview-test
$ mkdir data
$ mv OPENFOAM_SIMULATION_DATA data
Create the job-script run.pbs:
#!/bin/bash
#PBS -N ParaView
#PBS -l walltime=02:00:00
#PBS -m abe
#PBS -l select=1:ncpus=32:mpiprocs=16
#PBS -A account_name
# *** HOST_IP must be set before submitting the job ***
#PBS -v HOST_IP
#PBS -q test

module load gcc/4.9.1 mpt/2.13 paraview/5.0.0-batch

workdir=/work/$USER/$PBS_JOBID
mkdir -p $workdir
cp -r $PBS_O_WORKDIR/data $workdir
cd $workdir/data

touch data.foam

mpiexec_mpt pvserver --use-offscreen-rendering --reverse-connection \
  --server-port=75000 --client-host=${HOST_IP}
Make the run script executable:
$ chmod +x run.pbs
Submit the job:
$ export HOST_IP=`hostname -i`
$ qsub run.pbs
Start the client:
$ module load gcc/4.9.1 paraview/5.0.0-gui
$ paraview
- Click File -> Connect -> Add Server
- Choose Client / Server (reverse connection)
- Set the name
- Set Port = 75000
- Configure
- Choose your new server
- Click Connect
- Click OK
- Click File -> Open
- Choose data.foam
- Click OK
- Click Apply