vLLM can serve an LLM for inference on a compute node. This example uses the model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.
Create a Python virtual environment and install vLLM:
module load Python/3.12.3-GCCcore-13.3.0
python3 -m venv venv-vllm
source venv-vllm/bin/activate
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129
pip install flashinfer-python
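To check that the installation worked, the vLLM version can be printed from inside the virtual environment (this only verifies that the package imports; it may print warnings about missing GPUs on the login node):
python3 -c "import vllm; print(vllm.__version__)"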
Create a Slurm job script vllm.slurm:
#!/bin/bash
#SBATCH --partition=GPUQ
#SBATCH --account=<GROUP_ACCOUNT>
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem=200G
#SBATCH --gres=gpu:2
### Request GPUs with 80 GB memory only (either A100 or H100) ###
#SBATCH --constraint="gpu80g&(a100|h100)"
#SBATCH --job-name="vLLM"
#SBATCH --output=vllm.log
module load Python/3.12.3-GCCcore-13.3.0
source venv-vllm/bin/activate
export VLLM_CONFIGURE_LOGGING=1
# A100 is compute capability 8.0, H100 is 9.0
export TORCH_CUDA_ARCH_LIST="8.0 9.0"
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --tensor-parallel-size 2 \
    --port 8000 \
    --gpu-memory-utilization 0.97 \
    --max-model-len 8192 \
    --enforce-eager
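The resource requests can be sanity-checked with a back-of-envelope estimate (an approximation that ignores KV-cache and activation memory): the 32B parameters in bf16 take roughly 64 GB, so splitting the weights across two 80 GB GPUs with --tensor-parallel-size 2 leaves each GPU with headroom for the KV cache:
# Back-of-envelope estimate only; ignores KV cache and activation memory.
params = 32e9            # ~32 billion parameters in DeepSeek-R1-Distill-Qwen-32B
bytes_per_param = 2      # bf16/fp16 weights
weights_gb = params * bytes_per_param / 1e9    # ~64 GB of weights in total
per_gpu_gb = weights_gb / 2                    # split across 2 GPUs by tensor parallelism
print(per_gpu_gb)                              # ~32 GB per 80 GB GPU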
Submit the job:
sbatch vllm.slurm
Check that the job is RUNNING:
scontrol show job MY_JOB_ID_NUMBER
Use tail to follow the log in real time:
tail -f vllm.log
Wait until these two messages appear (the first run can take a long time because the model has to be downloaded):
(APIServer pid=74175) INFO: Waiting for application startup.
(APIServer pid=74175) INFO: Application startup complete.
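Once these messages appear, the OpenAI-compatible API can be probed from the login node, for example by listing the served models. A minimal sketch using only the Python standard library; replace idun-xx-xx with the compute node name shown by scontrol show job:
import json
import urllib.request

# /v1/models is part of vLLM's OpenAI-compatible API; it lists the model(s) being served.
with urllib.request.urlopen("http://idun-xx-xx:8000/v1/models") as response:
    print(json.dumps(json.load(response), indent=2))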
vLLM is now reachable on the compute node on port 8000. For example, to chat with the model from the login node:
module load Python/3.12.3-GCCcore-13.3.0
source venv-vllm/bin/activate
vllm chat --url http://idun-xx-xx:8000/v1
The compute node name (idun-xx-xx above) can be found with:
scontrol show job MY_JOB_ID_NUMBER
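Besides the interactive vllm chat client, the server can also be queried programmatically through its OpenAI-compatible API. A minimal sketch, assuming the openai Python package is available in the environment (install it with pip install openai if it is not) and again substituting the real compute node name for idun-xx-xx:
# Query the vLLM OpenAI-compatible API from the login node.
from openai import OpenAI

client = OpenAI(
    base_url="http://idun-xx-xx:8000/v1",  # the vLLM server started by the Slurm job
    api_key="EMPTY",                       # no key is checked; the client just needs a value
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{"role": "user", "content": "Explain tensor parallelism in one sentence."}],
    max_tokens=256,
)
print(response.choices[0].message.content)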