vLLM can serve an LLM for inference on a compute node. This example uses the model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.
Create a Python virtual environment and install vLLM:
module load Python/3.12.3-GCCcore-13.3.0
python3 -m venv venv-vllm
source venv-vllm/bin/activate
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129
pip install flashinfer-python
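To check that the installation worked, the vLLM version can be printed from inside the virtual environment (this only verifies that the package imports; it may print warnings about missing GPUs on the login node):
python3 -c "import vllm; print(vllm.__version__)"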
Create a Slurm job script vllm.slurm:
#!/bin/bash
#SBATCH --partition=GPUQ
#SBATCH --account=<GROUP_ACCOUNT>
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem=200G
#SBATCH --gres=gpu:2
### Request GPUs with 80 GB memory only (either A100 or H100) ###
#SBATCH --constraint="gpu80g&(a100|h100)"
#SBATCH --job-name="vLLM"
#SBATCH --output=vllm.log
module load Python/3.12.3-GCCcore-13.3.0
source venv-vllm/bin/activate
export VLLM_CONFIGURE_LOGGING=1
# A100 is compute capability 8.0, H100 is 9.0
export TORCH_CUDA_ARCH_LIST="8.0 9.0"
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --tensor-parallel-size 2 \
    --port 8000 \
    --gpu-memory-utilization 0.97 \
    --max-model-len 8192 \
    --enforce-eager
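The resource requests can be sanity-checked with a back-of-envelope estimate (an approximation that ignores KV-cache and activation memory): the 32B parameters in bf16 take roughly 64 GB, so splitting the weights across two 80 GB GPUs with --tensor-parallel-size 2 leaves each GPU with headroom for the KV cache:
# Back-of-envelope estimate only; ignores KV cache and activation memory.
params = 32e9            # ~32 billion parameters in DeepSeek-R1-Distill-Qwen-32B
bytes_per_param = 2      # bf16/fp16 weights
weights_gb = params * bytes_per_param / 1e9    # ~64 GB of weights in total
per_gpu_gb = weights_gb / 2                    # split across 2 GPUs by tensor parallelism
print(per_gpu_gb)                              # ~32 GB per 80 GB GPU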
Submit the job:
sbatch vllm.slurm
Check that the job is RUNNING:
scontrol show job MY_JOB_ID_NUMBER
Use tail to follow the log in real time:
tail -f vllm.log
Wait until these two messages appear (the first run can take a long time because the model has to be downloaded):
(APIServer pid=74175) INFO: Waiting for application startup.
(APIServer pid=74175) INFO: Application startup complete.
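Once these messages appear, the OpenAI-compatible API can be probed from the login node, for example by listing the served models. A minimal sketch using only the Python standard library; replace idun-xx-xx with the compute node name shown by scontrol show job:
import json
import urllib.request

# /v1/models is part of vLLM's OpenAI-compatible API; it lists the model(s) being served.
with urllib.request.urlopen("http://idun-xx-xx:8000/v1/models") as response:
    print(json.dumps(json.load(response), indent=2))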
vLLM is now reachable on the compute node on port 8000. For example, to chat with the model from the login node:
module load Python/3.12.3-GCCcore-13.3.0
source venv-vllm/bin/activate
vllm chat --url http://idun-xx-xx:8000/v1
The compute node name (idun-xx-xx above) can be found with:
scontrol show job MY_JOB_ID_NUMBER
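Besides the interactive vllm chat client, the server can also be queried programmatically through its OpenAI-compatible API. A minimal sketch, assuming the openai Python package is available in the environment (install it with pip install openai if it is not) and again substituting the real compute node name for idun-xx-xx:
# Query the vLLM OpenAI-compatible API from the login node.
from openai import OpenAI

client = OpenAI(
    base_url="http://idun-xx-xx:8000/v1",  # the vLLM server started by the Slurm job
    api_key="EMPTY",                       # no key is checked; the client just needs a value
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{"role": "user", "content": "Explain tensor parallelism in one sentence."}],
    max_tokens=256,
)
print(response.choices[0].message.content)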