vLLM (Self-Install)

vLLM is an open-source, high-performance inference engine designed for large language models (LLMs).

It significantly improves throughput and memory efficiency through techniques like PagedAttention, enabling faster serving of models (such as Llama, Mistral, or Qwen) while supporting many concurrent requests. It is widely used for production API endpoints and cost-effective deployment.


To install vLLM, run the commands below in an SSH terminal (CLI). For a tutorial on accessing the CLI, please refer to SSH Shell Access to EdUHK HPC Platform and Cluster (Web-based Shell Access)

# Load Anaconda module
$ module load anaconda
# Create a new Anaconda environment (generic form)
$ conda create -n <Environment Name> <Conda Packages>
# Create a new Anaconda environment (Example)
$ conda create -n vllm_test python=3.12 pip libstdcxx-ng
# Install vLLM and dependencies (generic form)
$ ${CONDA_ENV_PATH}/<Environment Name>/bin/pip install <Python Packages>
# Install vLLM and dependencies (Example)
$ ${CONDA_ENV_PATH}/vllm_test/bin/pip install vllm torch tensorflow
# Create a folder for the vLLM model and cache (Example)
$ mkdir /home/$USER/vllm_test
# Purge modules
$ module purge
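The `vllm_test` folder created above is reused later by the job template, which points `VLLM_CACHE_ROOT` at a `cache` subfolder and `HF_HOME` at a `model` subfolder. A minimal Python sketch that prepares that layout up front (subfolder names taken from the job template; adjust if you use different paths):

```python
import os
from pathlib import Path

def prepare_vllm_dirs(base: str) -> list[Path]:
    """Create the cache/model subfolders that the job template expects."""
    base_path = Path(base)
    subdirs = [base_path / "cache", base_path / "model"]
    for d in subdirs:
        # parents=True builds the base folder too; exist_ok makes re-runs safe
        d.mkdir(parents=True, exist_ok=True)
    return subdirs

if __name__ == "__main__":
    # Mirrors: mkdir /home/$USER/vllm_test (plus the two subfolders)
    for d in prepare_vllm_dirs(os.path.join(os.path.expanduser("~"), "vllm_test")):
        print(d)
```

Running this once before submitting the job avoids first-run failures caused by missing cache directories.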

Pre-configured template path → /home/$USER/job_template/run_vllm.sh

#!/bin/bash
#SBATCH --job-name=run_vllm ## Job Name
#SBATCH --partition=shared_gpu_l40 ## Partition for Running Job
#SBATCH --nodes=1 ## Number of Compute Node
#SBATCH --ntasks=1 ## Number of Tasks
#SBATCH --cpus-per-task=8 ## Number of CPU per task
#SBATCH --time=01:00:00 ## Job Time Limit (i.e. 60 Minutes)
#SBATCH --gres=gpu:l40:1 ## Number of GPUs (i.e. 1 x l40 GPU)
#SBATCH --mem=40GB ## Total Memory for Job
#SBATCH --output=./%x_%j.out ## Output File Path
#SBATCH --error=./%x_%j.err ## Error Log Path
## Initiate Environment Module
source /usr/share/modules/init/profile.sh
## Reset the Environment Module components
module purge
## Load Required Module
module load anaconda
## Setup Environment Variable For vllm
export LD_LIBRARY_PATH="${CONDA_ENV_PATH}/vllm_test/lib:$LD_LIBRARY_PATH"
export HUGGING_FACE_HUB_TOKEN="<Hugging Face Access Token>"
export VLLM_CACHE_ROOT="/home/$USER/vllm_test/cache"
export HF_HOME="/home/$USER/vllm_test/model"
## Run vLLM in batch mode (Example: input test_prompt.json, results saved as results.json)
${CONDA_ENV_PATH}/vllm_test/bin/vllm run-batch \
-i /home/$USER/vllm_test/test_prompt.json \
-o results.json \
--model google/gemma-3-4b-it
## Clear Environment Module Components
module purge
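When the job finishes, `results.json` contains one JSON object per request. A small sketch for pulling the generated text back out, assuming the OpenAI-style batch output schema (each line carrying a `custom_id` and a chat completion under `response.body`); verify the field names against your actual output file:

```python
import json

def extract_completions(path: str) -> dict[str, str]:
    """Map each custom_id to its generated message content."""
    results = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            body = record["response"]["body"]
            # Take the first choice's message content, as in a chat completion
            results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results
```

For example, `extract_completions("results.json")["request-1"]` would return the answer to the first prompt in the file.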

Prompting the AI Model: Example Prompt File (test_prompt.json) for vLLM

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "google/gemma-3-4b-it", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "How to make a pizza?"}],"max_completion_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "google/gemma-3-4b-it", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "How to make a pizza?"}],"max_completion_tokens": 1000}}
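Each line of the prompt file is a standalone JSON object (JSON Lines format). Rather than hand-editing long lines, the file can be generated programmatically; a sketch assuming the same model and request fields as the example above:

```python
import json

def write_prompt_file(path, prompts, model="google/gemma-3-4b-it", max_tokens=1000):
    """Write one chat-completion request per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as fh:
        for i, (system_msg, user_msg) in enumerate(prompts, start=1):
            request = {
                "custom_id": f"request-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [
                        {"role": "system", "content": system_msg},
                        {"role": "user", "content": user_msg},
                    ],
                    "max_completion_tokens": max_tokens,
                },
            }
            fh.write(json.dumps(request) + "\n")
```

For example, `write_prompt_file("test_prompt.json", [("You are a helpful assistant.", "How to make a pizza?")])` reproduces the first line of the file above.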

For guides on submitting HPC jobs, please refer to: HPC Job Submission (For CLI) and HPC Job Submission (For Web Portal)