- Execution of Programs
- Runtime Environment
- Job Accounting
- PBS (Portable Batch System)
- Sample Batch Scripts
Execution of Programs
Note that "." (dot), representing the current working directory, is not added to your default search path (PATH). To run executables located in the current working directory, add "./" in front of the executable name, or alternatively specify the absolute path to the executable.
Before running multi-threaded code, set the number of threads using the OMP_NUM_THREADS environment variable.
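A minimal sketch of setting the thread count before starting an OpenMP program. The value 8 and the executable name ./myprog are placeholders chosen for illustration.

```shell
# Use 8 OpenMP threads (illustrative value)
export OMP_NUM_THREADS=8

# ./myprog is a hypothetical multi-threaded executable in the current directory
if [ -x ./myprog ]; then
    ./myprog
fi
```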
See the sample OpenMP Job script for running batch jobs.
Use the mpirun command to start MPI programs, e.g. to run 8 MPI instances of myprog.
For a complete specification of the option list, see the man mpirun page.
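As an illustration, the sketch below starts 8 MPI instances with the -np option of mpirun. The executable ./myprog is a placeholder, and the guard that only prints the command when mpirun or the executable is missing is added here so the snippet can be tried anywhere.

```shell
NP=8   # number of MPI instances

# ./myprog is a hypothetical MPI executable in the current directory
if command -v mpirun >/dev/null 2>&1 && [ -x ./myprog ]; then
    mpirun -np "$NP" ./myprog
else
    echo "would run: mpirun -np $NP ./myprog"
fi
```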
When running in the batch system, the mpiexec command provided with PBS Pro is a wrapper script that assembles the correct host list and the corresponding mpirun command before executing the assembled mpirun command. The mpiexec_mpt command that comes with SGI MPT is an alternative to mpiexec. Unlike mpiexec, mpiexec_mpt supports all MPT mpirun global options. See the sample batch MPI Job script below. For more information see the man mpiexec_mpt page.
Hybrid MPI/OpenMP Applications
The omplace command causes the successive threads in a hybrid MPI/OpenMP job to be placed on unique CPUs. For example, running a 2-process MPI job with 3 threads per process
the threads would be placed as follows
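The original placement listing is not reproduced here. The sketch below only illustrates the idea of one unique CPU per thread, under the assumption that CPUs are handed out consecutively starting at CPU 0; the actual CPU numbers depend on the cpuset the job runs in. The command in the comment assumes omplace's -nt option and a placeholder executable.

```shell
# Assumed invocation: mpirun -np 2 omplace -nt 3 ./myprog
# Assumption: CPUs are assigned consecutively, starting at CPU 0.
nranks=2      # MPI processes
nthreads=3    # OpenMP threads per process

placement=""
rank=0
while [ "$rank" -lt "$nranks" ]; do
    thread=0
    while [ "$thread" -lt "$nthreads" ]; do
        cpu=$((rank * nthreads + thread))
        echo "rank $rank thread $thread on CPU $cpu"
        placement="$placement$cpu "
        thread=$((thread + 1))
    done
    rank=$((rank + 1))
done
```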
When you build an application with a particular compiler and/or libraries with dynamic linking of libraries at run time, make sure the same compiler and library modules are loaded when running the application.
Binary Data (Endianness)
Vilje will by default write and read Fortran sequential unformatted files in little-endian format. The Intel Fortran compiler can write and read big-endian files using a little-endian-to-big-endian conversion feature. To use big-endian unformatted files (e.g. created on Njord) for I/O on Vilje, set the F_UFMTENDIAN environment variable to the numbers of the units to be converted. For example, to do big-endian-to-little-endian conversion on files with unit numbers 10 and 20:
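One way to write this setting, using the "big:" unit-list form accepted by the Intel Fortran runtime, is sketched below. Run the Fortran program from the same shell so that it inherits the variable.

```shell
# Treat Fortran units 10 and 20 as big-endian; all other units stay little-endian
export F_UFMTENDIAN="big:10,20"
echo "F_UFMTENDIAN=$F_UFMTENDIAN"
```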
See the Intel® Fortran Compiler User and Reference Guides for more information.
Hyper-threading is enabled by default on compute nodes. That is, for each physical processor core the operating system sees two virtual processors, so each compute node presents 32 virtual processors.
MPI Run-time Environment
MPI keeps track of certain resource utilization statistics. These can be used to determine potential performance problems caused by lack of MPI message buffers and other MPI internal resources. To turn on the displaying of MPI internal statistics, use the -stats option on the mpirun command, or set the MPI_STATS variable:
MPI internal statistics are always being gathered, so displaying them does not cause significant additional overhead. By default the data is sent to stderr. This can be changed by specifying the MPI_STATS_FILE variable. This file is written to by all processes; each line is prefixed by the host number and global rank of the process.
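A small sketch of the environment-variable route. The value 1 and the file name mpi_stats.txt are assumptions made for illustration; see the man mpi page for the exact semantics.

```shell
# Ask MPT to display its internal statistics (the value 1 is an assumption)
export MPI_STATS=1

# Send the statistics to a file instead of stderr (hypothetical file name)
export MPI_STATS_FILE="mpi_stats.txt"

echo "MPI statistics will be written to $MPI_STATS_FILE"
```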
MPI Buffer Resources
By default, the SGI MPI implementation buffers messages whose lengths exceed 64 bytes into a shared memory region to allow for exchange of data between MPI processes. Because there is a finite number of these shared memory buffers, this can be a constraint on the overall application performance for certain communication patterns. If the MPI statistics file includes lines with high numbers of
increase the numbers of these buffers using the MPI_BUFS_PER_PROC (defaults to 32) and/or MPI_BUFS_PER_HOST (defaults to 96) variables respectively. E.g.
Keep in mind that increasing the number of buffers does consume more memory.
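For illustration, the sketch below raises both buffer counts above their defaults of 32 and 96. The values 256 and 512 are arbitrary examples, not tuned recommendations.

```shell
# Illustrative values only; the defaults are 32 (per process) and 96 (per host)
export MPI_BUFS_PER_PROC=256
export MPI_BUFS_PER_HOST=512

echo "MPI_BUFS_PER_PROC=$MPI_BUFS_PER_PROC MPI_BUFS_PER_HOST=$MPI_BUFS_PER_HOST"
```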
Single Copy Optimization
For message transfers between MPI processes it is possible under certain conditions to avoid the need to buffer messages. This single copy technique, using memory mapping, may result in better performance since it improves MPI’s bandwidth. Using memory mapping within SGI MPT is disabled by default on Vilje. To enable it, specify
The size of messages for which MPI attempts to use the single copy method is controlled by the MPI_BUFFER_MAX variable. In general, a value of 2000 or higher is beneficial for many applications. E.g.
i.e. MPI should try to avoid buffering for messages larger than 2048 bytes. Highly synchronized applications that perform large message transfers can benefit from the single-copy method. However, single copy transfers may introduce additional synchronization points, which can reduce application performance in some cases.
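The 2048-byte threshold mentioned above could be set as in this sketch. It only covers MPI_BUFFER_MAX; enabling memory mapping itself is a separate setting that is not shown.

```shell
# Attempt single copy transfers for messages larger than 2048 bytes
export MPI_BUFFER_MAX=2048
echo "MPI_BUFFER_MAX=$MPI_BUFFER_MAX"
```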
If you are running fewer than 16 MPI tasks per node, e.g. running an MPI/OpenMP hybrid code, or if the available memory per node is insufficient for running 16 MPI tasks, you should distribute the processes evenly between the two sockets using the MPI_DSM_CPULIST environment variable. Specify in the jobscript:
The list specified above will work for any number of MPI tasks (up to 16) specified per node. This will place MPI ranks 0,2,4,...14 on socket 1 (with cpuids 0,1,...7) and MPI ranks 1,3,5,...15 on socket 2 (with cpuids 8,9,...15). Specifying allhosts indicates that the cpu list pattern applies to all compute nodes allocated for the job.
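The sketch below builds a CPU list that matches the placement just described: even ranks on cpuids 0 to 7, odd ranks on cpuids 8 to 15, applied to all hosts. It is derived from that description rather than copied from the original setting, and the loop is only there to make the interleaving pattern explicit.

```shell
# Interleave the cpuids of socket 1 (0..7) and socket 2 (8..15)
cpulist=""
i=0
while [ "$i" -lt 8 ]; do
    cpulist="${cpulist}${i},$((i + 8)),"
    i=$((i + 1))
done
cpulist="${cpulist%,}"   # drop the trailing comma

# ":allhosts" applies the pattern to every node allocated to the job
export MPI_DSM_CPULIST="${cpulist}:allhosts"
echo "$MPI_DSM_CPULIST"
```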
Setting the MPI_DSM_VERBOSE variable will direct MPI to print the host placement information to the standard error file:
Note that there is no need to specify the MPI_DSM_CPULIST variable when running 16 MPI tasks on each node.
For a full list of the MPI environment settings, see the 'ENVIRONMENT VARIABLES' section in the man mpi page.
Job Accounting
Jobs are charged for the wall clock time used to finish the job, multiplied by 16 (the number of physical cores per node) and by the number of nodes reserved.
For an overview of the CPU time charged to the account(s) you are a member of, use the cost command. Add --help for a description of the available options:
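As a worked example of this rule, the sketch below computes the charge for a hypothetical job that runs for 2 hours on 10 nodes.

```shell
walltime_hours=2     # wall clock time used by the job (hypothetical)
nodes=10             # nodes reserved (hypothetical)
cores_per_node=16    # physical cores per node

charged=$((walltime_hours * cores_per_node * nodes))
echo "charged CPU hours: $charged"
```

The job is charged 320 CPU hours, regardless of how many cores it actually kept busy.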
PBS (Portable Batch System)
Job Scheduling Policy
Jobs are ordered by submission time and tentatively scheduled in the sorted order, but with backfilling enabled. The scheduler will traverse the list of jobs in order to backfill, i.e. use available resources as long as the estimated start times of earlier jobs in the list are not pushed into the future.
|qsub||Submit a job|
|qdel||Delete a job|
|qstat||Request the status of jobs|
For example, submit a job using a job script
Check the status of my job:
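A hedged sketch of these two steps. The job script name job.pbs is a placeholder, and the guard is added so that the snippet only prints the commands on machines without PBS.

```shell
jobscript=job.pbs   # hypothetical job script name

if command -v qsub >/dev/null 2>&1 && [ -f "$jobscript" ]; then
    # qsub prints the identifier of the new job on stdout
    jobid=$(qsub "$jobscript")
    echo "submitted $jobid"
    # Show the status of that job
    qstat "$jobid"
else
    echo "would run: qsub $jobscript, then qstat <jobid>"
fi
```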
Job Submission Options
Below is a list of some of the options for the qsub command. For a complete list of all options see the 'man qsub' page.
|-N||Specify a name for the job|
|-A||Specify the account name|
|-o||Specify a name for the stdout file|
|-e||Specify a name for the stderr file|
|-m||Specify email notification when the job starts (b), ends (e), or if it aborts (a)|
|-M||Specify the email address to send notification|
|-I||Job is to be run interactively|
|-X||Enable X11 forwarding|
|-l||Specify the set of resources requested for the job (see Consumable Resources below)|
A chunk is the smallest set of resources that will be allocated to a job. Since jobs are accounted for entire nodes, i.e. 16 (physical cores) times the number of nodes, the chunk size should be equal to one node. This means that multiple jobs should not be run on the same node. Requesting resources at node level is done using the "select" specification statement followed by the number of chunks and a set of resources requested for each chunk, separated by colons. Available resources are listed in the table below.
|walltime||The wall clock time limit of the job, format|
|ncpus||Number of cpus in the resource chunk requested|
|mpiprocs||Number of mpi processes for each chunk|
|ompthreads||Number of threads per process in a chunk|
|mem||Memory requested for a chunk. Defaults to|
Since hyper-threading is enabled by default on compute nodes the number of cpus as seen by the operating system is 32 for each node. Therefore, when allocating nodes to a job always specify "ncpus=32".
E.g. requesting 10 nodes with 16 mpi processes per node:
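A sketch of such a request as a job script directive, followed by the arithmetic for the total number of MPI ranks it provides. The variable names are only there for the calculation.

```shell
# Directive for the job script: 10 chunks (nodes), each with 32 cpus and 16 MPI processes
#PBS -l select=10:ncpus=32:mpiprocs=16

nodes=10
mpiprocs=16
total_ranks=$((nodes * mpiprocs))
echo "total MPI ranks: $total_ranks"
```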
PBS supports OpenMP applications by setting the OMP_NUM_THREADS variable automatically based on the resource request of a job. If ompthreads is not used, OMP_NUM_THREADS is set to the value of ncpus. E.g. running 16 threads on one node:
The walltime resource is specified at job-wide level, e.g. asking for 24 hours:
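A sketch of the 24-hour request as a job script directive, with a small helper calculation that formats a whole number of hours as hh:mm:ss. The helper is illustrative and not part of PBS.

```shell
# Directive for the job script: job-wide wall clock limit of 24 hours
#PBS -l walltime=24:00:00

hours=24
walltime=$(printf '%02d:00:00' "$hours")
echo "walltime=$walltime"
```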
Users must specify the wall clock limit for a job. Failing to do so will result in an error:
All jobs on the system will be run in the workq queue. This is the default queue.
Running a number of jobs based on the same job script can be done using the job array feature. A job array represents a collection of subjobs, each with a unique index number. To submit a job array, use the -J <range> option to qsub, e.g. in a job script:
This example will result in 2 subjobs with indices 0 and 1. The PBS_ARRAY_INDEX environment variable gives the subjob index and can be used to run subjobs in individual work directories, load different data files, or any other operation that requires a unique index. E.g. running 2 instances of a program on different data sets
the files can be operated on specifying
in the jobscript.
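A minimal sketch of how a job script might pick a data file from the subjob index. The file naming scheme input_<index>.dat is hypothetical, and the fallback to index 0 is added so the snippet also runs outside PBS.

```shell
# Directive for the job script: two subjobs with indices 0 and 1
#PBS -J 0-1

# PBS sets PBS_ARRAY_INDEX in each subjob; default to 0 when run outside PBS
index="${PBS_ARRAY_INDEX:-0}"

# Hypothetical naming scheme: input_0.dat, input_1.dat
datafile="input_${index}.dat"
echo "subjob $index works on $datafile"
```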
In the output of the qstat command, the job array will be listed with job state B:
The -t option will show all subjobs:
Job arrays can be deleted by specifying the job array identifier:
or for individual subjobs, e.g.:
See sample Job Array Script for a batch job running 4 instances of an MPI executable.
Interactive Batch Job
To run an interactive batch job, add the -I option to qsub. When the job is scheduled, the standard input, output and error are sent to the terminal session in which qsub is running. If you need to use an X11 display from within your job, add the -X option. E.g.:
- See the PBS Pro User Guide (PDF)
Sample Batch Scripts
Sample job running 1024 MPI processes
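The original script is not reproduced here. The following is a sketch of what such a job could look like, written to a file and syntax-checked. The job name, the account name myaccount, the one-hour walltime, the module name mpt and the executable ./myprog are all assumptions; 1024 processes at 16 per node gives 64 nodes.

```shell
cat > mpi_job.pbs <<'EOF'
#!/bin/bash
#PBS -N mpi_job
#PBS -A myaccount
#PBS -l select=64:ncpus=32:mpiprocs=16
#PBS -l walltime=01:00:00

# Assumed module name for SGI MPT
module load mpt

# Start in the directory the job was submitted from
cd "$PBS_O_WORKDIR"

# 64 nodes x 16 processes per node = 1024 MPI processes
mpiexec_mpt -n 1024 ./myprog
EOF

# Check the script for syntax errors without executing it
sh -n mpi_job.pbs && echo "mpi_job.pbs is syntactically valid"
```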
Running 16 threads on one node
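A sketch of a matching OpenMP job script, written to a file and syntax-checked. The job name, account name, walltime and executable are placeholders; the resource request follows the ompthreads mechanism described under OpenMP support above.

```shell
cat > openmp_job.pbs <<'EOF'
#!/bin/bash
#PBS -N openmp_job
#PBS -A myaccount
#PBS -l select=1:ncpus=32:ompthreads=16
#PBS -l walltime=01:00:00

cd "$PBS_O_WORKDIR"

# PBS sets OMP_NUM_THREADS=16 from the ompthreads request
./myprog
EOF

# Check the script for syntax errors without executing it
sh -n openmp_job.pbs && echo "openmp_job.pbs is syntactically valid"
```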
Hybrid MPI/OpenMP Job
Sample job running 128 MPI processes, two on each node, with 8 OpenMP threads per MPI process
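A sketch of such a hybrid job, written to a file and syntax-checked. Two processes per node for 128 processes gives 64 nodes. The job name, account name, walltime, module name and executable are assumptions, as is the use of omplace with its -nt option for thread placement.

```shell
cat > hybrid_job.pbs <<'EOF'
#!/bin/bash
#PBS -N hybrid_job
#PBS -A myaccount
#PBS -l select=64:ncpus=32:mpiprocs=2:ompthreads=8
#PBS -l walltime=01:00:00

# Assumed module name for SGI MPT
module load mpt

cd "$PBS_O_WORKDIR"

# 64 nodes x 2 processes = 128 MPI processes, 8 OpenMP threads each
mpiexec_mpt -n 128 omplace -nt 8 ./myprog
EOF

# Check the script for syntax errors without executing it
sh -n hybrid_job.pbs && echo "hybrid_job.pbs is syntactically valid"
```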
Job Array Script
Application Specific Sample Jobscripts
See the Abaqus page.
See the ADF page.
Ansys Mechanical Job
See the COMSOL page.
See the CP2K page.
See the NAMD page.
See the OpenFOAM page.
See the STAR-CCM+ page.
See the VASP page.
Compiling a VisIt example on Vilje
In the ccmake GUI do the following:
Now compile the code with:
Running the simulation in batch
Create a file run.pbs that looks like this:
Submit the job:
Visualize the simulation
Wait until your job is running and then type:
Use VisIt to visualize the data from the simulation:
Shut down the simulation running in batch:
Shut down VisIt:
ParaView job for visualising OpenFOAM data
Create the job-script
Create a data directory and move your OpenFOAM data to it:
Create the job-script run.pbs:
Make the run script executable: