How do I view Job Resource Usage?

1. Check resource usage for a COMPLETED job with the "sacct" command.

Example with Job ID 19361471:

sacct -j 19361471 --format="JobID,JobName%30,Start,Elapsed,ReqTRES%45,TRESUsageInMax%110,State"

The output is long. The key points for this job are:

  • Job requested 1 GPU: ReqTRES ( gres/gpu=1 )
    But GPU utilization is zero: TRESUsageInMax ( gres/gpumem=0,gres/gpuutil=0 )
  • Job requested 300G memory: ReqTRES ( mem=300G )
    But the maximum used was: TRESUsageInMax ( mem=86980388K ), about 83G
  • CPU utilization for this job is close to the maximum:
    Job requested 3 CPU cores: ReqTRES ( cpu=3 )
    Job ran for: Elapsed ( 2-10:00:08 )
    Job used: TRESUsageInMax ( cpu=6-17:14:38 ) out of 3 x 2-10:00:08 = 7-06:00:24 available, about 93%
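
The arithmetic behind these numbers can be sketched in shell. The helper names below (to_seconds, cpu_efficiency, kib_to_gib) are hypothetical and not part of Slurm; the sketch assumes time strings in the [D-]HH:MM:SS form shown above and memory values in KiB with a trailing K.

```shell
#!/bin/sh
# Hypothetical helper: convert a Slurm time string ([D-]HH:MM:SS) to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }'
}

# Hypothetical helper: CPU efficiency in percent,
# i.e. used CPU time / (requested cores * elapsed wall time).
cpu_efficiency() {
    cores=$1
    elapsed=$(to_seconds "$2")
    used=$(to_seconds "$3")
    awk -v c="$cores" -v e="$elapsed" -v u="$used" \
        'BEGIN { printf "%.1f\n", 100 * u / (c * e) }'
}

# Hypothetical helper: convert a KiB value (with or without trailing K) to GiB.
kib_to_gib() {
    awk -v k="${1%K}" 'BEGIN { printf "%.2f\n", k / 1024 / 1024 }'
}

# Values taken from the sacct output for job 19361471:
cpu_efficiency 3 2-10:00:08 6-17:14:38   # prints 92.7
kib_to_gib 86980388K                     # prints 82.95
```

A result near 100% from cpu_efficiency suggests the requested cores were kept busy; a memory peak far below the request (here about 83G of 300G) suggests the next job can ask for less.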

You can see all collected information about this job with this command:

sacct -j 19361471 --format="ALL"

Some fields are long, so you need to set the column width with a percent sign. For example, %150 sets a width of 150 characters:

sacct -j 19361471 --format="ALL%150"

2. Check resource usage for a running job.

Example: user hpcuser has a running job:

$ squeue -u hpcuser 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          19380572      GPUQ my_job01  hpcuser  R 1-14:49:39      1 idun-04-07

Job ID 19380572 requested: 3 CPU cores, 100G memory and 1 GPU.

$ scontrol show job 19380572 | grep ReqTRES
   ReqTRES=cpu=3,mem=100G,node=1,billing=3,gres/gpu=1
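
If you want to pick a single value out of the ReqTRES line, for example in a script, a small sketch like the following may help. The helper names split_tres and tres_value are hypothetical; the sketch assumes the comma-separated key=value layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print one "key=value" entry per line from a ReqTRES line.
split_tres() {
    echo "$1" | sed 's/^ *ReqTRES=//' | tr ',' '\n'
}

# Hypothetical helper: print the value for a single TRES key.
tres_value() {
    split_tres "$1" | awk -F= -v key="$2" '$1 == key { print $2 }'
}

# Line copied from the scontrol output above:
line="   ReqTRES=cpu=3,mem=100G,node=1,billing=3,gres/gpu=1"
tres_value "$line" mem        # prints 100G
tres_value "$line" gres/gpu   # prints 1
```

On the cluster, the line could be taken directly from the command above, for example: line=$(scontrol show job 19380572 | grep ReqTRES)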

You can log in to the compute node via ssh and check how the job is running:

[hpcuser@idun-login2 ~]$ ssh idun-04-07

[hpcuser@idun-04-07 ~]$ top -u hpcuser 
. . .
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2454197 hpcuser   20   0 2396196   2.0g 888252 R 100.0   0.1   1370:12 python3
2756123 hpcuser   20   0 2396200   2.0g 888236 R 100.0   0.1 402:48.31 python3
2795017 hpcuser   20   0   20560   5348   3920 R   6.2   0.0   0:00.02 top
 571974 hpcuser   20   0    7696   4016   3216 S   0.0   0.0   0:00.00 slurm_script
 576324 hpcuser   20   0 9065364   5.0g   3260 R   0.0   0.3   2:12.09 python3
 582463 hpcuser   20   0 2719500  29016   2544 S   0.0   0.0   0:00.20 python3
 582517 hpcuser   20   0   30004  18856   2212 S   0.0   0.0   0:00.03 python3
2794958 hpcuser   20   0   48652   7004   4716 S   0.0   0.0   0:00.00 sshd
2794959 hpcuser   20   0   17048   4980   3872 S   0.0   0.0   0:00.00 bash

Two processes are each using 100% of a CPU core, but together they use only about 4G of memory (2.0g RES each).

We can check the peak memory usage of these processes since they started (VmPeak is the peak virtual memory size):

[hpcuser@idun-04-07 ~]$ grep VmPeak /proc/2454197/status
VmPeak:  2682100 kB

[hpcuser@idun-04-07 ~]$ grep VmPeak /proc/2756123/status 
VmPeak:  2682096 kB
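
The same status file also has a VmHWM field, which is the peak resident memory and is usually closer to what the job actually occupied in RAM. A minimal sketch that prints both fields in GiB is shown here; the helper name peak_mem is hypothetical, and it assumes the "Field: value kB" layout of /proc/<pid>/status.

```shell
#!/bin/sh
# Hypothetical helper: read a /proc/<pid>/status file on stdin and print
# peak virtual (VmPeak) and peak resident (VmHWM) memory in GiB.
peak_mem() {
    awk '$1 == "VmPeak:" || $1 == "VmHWM:" {
        printf "%s %.1f GiB\n", $1, $2 / 1024 / 1024
    }'
}

# Sample line copied from the output above:
printf 'VmPeak:  2682100 kB\n' | peak_mem   # prints VmPeak: 2.6 GiB
```

On the compute node it could be used as: peak_mem < /proc/2454197/status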

Check GPU utilization with the "nvidia-smi" or "nvtop" command:

[hpcuser@idun-04-07 ~]$ nvidia-smi
Tue May  7 09:21:07 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:89:00.0 Off |                    0 |
| N/A   26C    P0              32W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

The GPU is idle: utilization is 0%, only 4MiB of GPU memory is in use, and no processes are running on it.
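
For a more compact view, nvidia-smi can print selected fields as CSV with its --query-gpu and --format options. The sketch below summarizes such output; the helper name gpu_status is hypothetical, and treating 0% utilization as "idle" is an assumption made for illustration, since a single snapshot can miss short bursts of activity.

```shell
#!/bin/sh
# Hypothetical helper: read lines of "index, utilization %, memory used MiB,
# memory total MiB" on stdin and print a one-line summary per GPU.
gpu_status() {
    awk -F', *' '{
        printf "GPU %s: util=%s%% mem=%s/%s MiB %s\n",
            $1, $2, $3, $4, ($2 + 0 == 0 ? "idle" : "busy")
    }'
}

# Sample line matching the nvidia-smi table above:
echo "0, 0, 4, 40960" | gpu_status   # prints GPU 0: util=0% mem=4/40960 MiB idle
```

On the compute node the input could come from: nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits | gpu_status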
