
Monitor and Manage HPC Jobs

To check Slurm job status, run the command below in the SSH terminal (CLI):


$ squeue -u $USER
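
The `ST` column in the output shows the job state (`R` = running, `PD` = pending). As a sketch of how the output is laid out, here is a hypothetical sample parsed with awk (the job IDs, names, and nodes are made up, not live cluster data):

```shell
# Hypothetical `squeue -u $USER` output
sample='JOBID PARTITION     NAME  USER ST    TIME NODES NODELIST(REASON)
12345       gpu    train alice  R 1:23:45     1 gpu-node01
12346       cpu  preproc alice PD    0:00     1 (Priority)'

# Print the job ID and state of each job (skip the header row)
states=$(echo "$sample" | awk 'NR > 1 {print $1, $5}')
echo "$states"
```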

To cancel a Slurm job, run the command below in the SSH terminal (CLI):


$ scancel <slurm-job-id>
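
To cancel several jobs at once, you can feed job IDs from `squeue` into `scancel`. A minimal sketch using hypothetical job IDs (the live-cluster pipeline is shown commented out, since it needs a running Slurm installation):

```shell
# Hypothetical job IDs, as produced by `squeue -u $USER -h -t PD -o %i`
# (-h suppresses the header, -t PD selects pending jobs, -o %i prints job IDs)
pending_ids='12346
12347'

# On a live cluster this would be:
#   squeue -u $USER -h -t PD -o %i | xargs scancel
# Here we only show the scancel commands that would run:
cmds=$(echo "$pending_ids" | xargs -n1 echo scancel)
echo "$cmds"
```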

How to check the efficiency of a completed Slurm job?


To check the efficiency of a completed Slurm job, run the command below in the SSH terminal (CLI):


$ seff <slurm-job-id>
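
Among other fields, `seff` reports CPU and memory efficiency. As an illustration of what the CPU-efficiency figure means (core-seconds actually used divided by core-seconds allocated), with made-up numbers:

```shell
# Made-up example: 4 cores allocated for 3600 s of walltime,
# 10800 core-seconds of CPU time actually used
used_cpu_seconds=10800
alloc_cores=4
walltime_seconds=3600

# CPU efficiency (%) = used / (cores * walltime) * 100
eff=$(awk -v u="$used_cpu_seconds" -v c="$alloc_cores" -v w="$walltime_seconds" \
  'BEGIN {printf "%.1f", u / (c * w) * 100}')
echo "$eff"
```

A low CPU efficiency usually means the job requested more cores than it could keep busy.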

How to check current CPU and memory usage for a running Slurm job?


To check the current resource usage of a running Slurm job, run the command below in the SSH terminal (CLI):


$ slurm_job_monitor <slurm-job-id>
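
`slurm_job_monitor` is a site-specific wrapper; if it is not available, the standard Slurm tool `sstat` reports similar live statistics for a running job, e.g. `sstat -j <slurm-job-id> --format=JobID,AveCPU,MaxRSS`. A sketch of reading the peak memory from hypothetical `sstat` output:

```shell
# Hypothetical `sstat -j 12345 --format=JobID,AveCPU,MaxRSS -n -P` output
# (-P uses '|' field separators, -n suppresses the header)
sample='12345.batch|01:02:03|4096K'

# Extract the peak resident memory (MaxRSS, third field)
maxrss=$(echo "$sample" | awk -F'|' '{print $3}')
echo "$maxrss"
```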

How to check current GPU usage for a running Slurm job?

  1. Check the node allocated to the running Slurm job, and its job ID, with the command below:

$ squeue -u $USER


  2. Run the command below to check GPU usage on the compute node:

$ nv-smid -h <hostname>

  3. From the output, locate the GPU ID handling the Slurm job by matching its PID.
  4. Read the usage figures for the corresponding GPU.
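
The PID-to-GPU matching described above can be sketched against a hypothetical process table like the one `nvidia-smi` prints (the GPU indices, PIDs, and process names below are invented):

```shell
# Hypothetical rows from a GPU process table: GPU index, PID, process name
sample='0  24680  python3
1  13579  python3'

# Find which GPU the process with PID 13579 is running on
target_pid=13579
gpu_id=$(echo "$sample" | awk -v pid="$target_pid" '$2 == pid {print $1}')
echo "$gpu_id"
```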

Run the command below in the SSH terminal (CLI). For a tutorial on accessing the CLI, please refer to Shell Access and Useful Command.

# Check job history for a user in a specific time range, showing CPU, memory and GPU allocation
$ sacct -u $USER --starttime=<log start time> --endtime=<log end time> -X --format=JobID,JobName,Submit,Start,Elapsed,State,AllocCPUS,ReqMem,AllocTRES%30
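
`--starttime` and `--endtime` accept timestamps in the form `YYYY-MM-DD[THH:MM[:SS]]`. A sketch building a seven-day window with GNU `date`, using a fixed example end date so the result is reproducible (replace it with `date +%Y-%m-%d` for today):

```shell
# Fixed example end date; in real use take today's date instead
end=2024-03-15
start=$(date -d "$end -7 days" +%Y-%m-%d)
echo "$start"

# The sacct call would then be (commented out: needs a live cluster):
#   sacct -u $USER --starttime=$start --endtime=$end -X \
#     --format=JobID,JobName,Submit,Start,Elapsed,State,AllocCPUS,ReqMem,AllocTRES%30
```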


Scenario: Job Still in “Queued” Status for a Long Time


Check the Reason for the Job's “Queued” Status

$ squeue -u $USER
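
For a pending job, the reason code appears in parentheses in the NODELIST(REASON) column (you can also print it directly with `squeue -j <slurm-job-id> -o %R`). A sketch extracting it from a hypothetical output line:

```shell
# Hypothetical pending-job line from `squeue -u $USER`
line=' 12346       cpu  preproc    alice PD       0:00      1 (QOSMaxCpuPerJobLimit)'

# Strip the surrounding parentheses from the last field to get the reason code
reason=$(echo "$line" | awk '{gsub(/[()]/, "", $NF); print $NF}')
echo "$reason"
```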


Reason list (examples); for the full list, please visit: SLURM Job Reason Codes

Reason                 Description
QOSMax*                A portion of the job request exceeds a maximum limit (e.g., PerJob, PerNode) for the requested QOS.
Resources              The resources requested by the job are not available (e.g., already used by other jobs).
Priority               One or more higher-priority jobs exist for the partition associated with the job or for the advanced reservation.
QOSJobLimit            The job's QOS has reached its maximum job count.
QOSMaxCpuPerJobLimit   The CPU request exceeds the maximum each job is allowed to use for the requested QOS.
QOSMaxMemoryPerJob     The memory request exceeds the maximum each job is allowed to use for the requested QOS.
QOSMaxGRESPerJob       The GRES request exceeds the maximum each job is allowed to use for the requested QOS.