04 — Production Deployment Guide for L40S x12
1. Pre-Deployment Checklist
Hardware Validation
# 1. Verify all 12 GPUs are detected
nvidia-smi --query-gpu=index,name,memory.total,driver_version --format=csv
# Expected: 12 lines, "NVIDIA L40S", memory.total ≈ 46068 MiB (48 GB)
# 2. Check PCIe link speeds
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv
# Expected: gen=4, width=16 for all GPUs
# 3. Verify NVLink is NOT expected (L40S has none)
nvidia-smi nvlink --status -i 0
# Expected: "NVLink not supported" — this is correct for L40S
# 4. Check PCIe topology (identify GPU pairs on same switch)
nvidia-smi topo -m
# 5. Thermal baseline (before load)
nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv
# GPU temp should be 30-45°C idle
# 6. ECC status
nvidia-smi --query-gpu=index,ecc.mode.current --format=csv
# Enable ECC for production: sudo nvidia-smi -e 1 (requires reboot)
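The checklist above can be automated so it runs before every deploy. A minimal sketch, assuming `nvidia-smi` is on PATH; the field list and thresholds are illustrative:

```python
# preflight.py — sketch of an automated version of the hardware checklist.
# Assumes nvidia-smi is on PATH; thresholds are illustrative, not official limits.
import subprocess

def parse_csv(out: str) -> list[list[str]]:
    """Split nvidia-smi --format=csv,noheader,nounits output into rows of values."""
    return [[v.strip() for v in line.split(",")]
            for line in out.strip().splitlines()]

def query_gpus(fields: str) -> list[list[str]]:
    """Run an nvidia-smi CSV query for the given comma-separated fields."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_csv(out)

def preflight() -> None:
    rows = query_gpus("index,name,pcie.link.gen.current,"
                      "pcie.link.width.current,temperature.gpu")
    assert len(rows) == 12, f"expected 12 GPUs, found {len(rows)}"
    for idx, name, gen, width, temp in rows:
        assert "L40S" in name, f"GPU {idx}: unexpected model {name!r}"
        assert (gen, width) == ("4", "16"), f"GPU {idx}: PCIe gen{gen} x{width}, want gen4 x16"
        assert int(temp) < 60, f"GPU {idx}: {temp} C is high for an idle baseline"
    print("preflight OK")
```

Run `preflight()` from CI or a cron job; any failed assertion names the offending GPU.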
Software Stack
# Required versions for production L40S deployment
nvidia-smi # Driver ≥ 535.x
nvcc -V # CUDA ≥ 12.1
python -c "import torch; print(torch.__version__)" # PyTorch ≥ 2.2
python -c "import vllm; print(vllm.__version__)" # vLLM ≥ 0.4.0
# Install flash-attn (Ada Lovelace compatible)
pip install flash-attn --no-build-isolation
# Verify flash-attn works
python -c "
import torch
from flash_attn import flash_attn_func
q = torch.randn(1, 1, 32, 128, device='cuda', dtype=torch.float16)
k = torch.randn(1, 1, 32, 128, device='cuda', dtype=torch.float16)
v = torch.randn(1, 1, 32, 128, device='cuda', dtype=torch.float16)
out = flash_attn_func(q, k, v)
print('FlashAttention OK')
"
2. Single-GPU Deployment (7B / 13B Models)
For models whose weights plus KV cache fit within a single L40S's 48 GB (the 7B/13B class at BF16), deploy one model per GPU. With 12 GPUs, this gives 12 independent replicas.
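The sizing rule can be sanity-checked with a quick budget calculation. A sketch using the public Llama-3-8B architecture (32 layers, hidden 4096, 32 attention heads, 8 KV heads via GQA); numbers are illustrative and ignore activation/CUDA-graph overhead, which vLLM also reserves space for:

```python
# Rough VRAM budget for one 48 GB L40S — illustrative only.
def kv_cache_gb(num_layers: int, hidden: int, kv_heads: int, heads: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * seq * batch * kv_dim * bytes, in GiB."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim
    return 2 * num_layers * seq_len * batch * kv_dim * bytes_per_elem / 1024**3

# Llama-3-8B: 32 layers, hidden 4096, 32 heads, 8 KV heads (GQA)
weights_gb = 8e9 * 2 / 1024**3   # BF16 weights ≈ 14.9 GiB
kv_gb = kv_cache_gb(32, 4096, 8, 32, seq_len=8192, batch=24)
print(f"weights {weights_gb:.1f} GiB + KV cache {kv_gb:.1f} GiB")
```

At 8k context, roughly two dozen full-length sequences fit alongside the weights, which is why `--gpu-memory-utilization 0.90` with a large `--max-num-seqs` works: most requests are far shorter than the maximum.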
systemd Service for Each GPU
# /etc/systemd/system/vllm-gpu0.service
[Unit]
Description=vLLM Inference Server GPU 0
After=network.target
[Service]
Type=simple
User=mlops
Environment=CUDA_VISIBLE_DEVICES=0
Environment=TRANSFORMERS_CACHE=/models/cache
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 256 \
--port 8000
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Deploy all 12 GPUs (run as root; expects vllm-gpu0.service in the current directory)
for i in $(seq 0 11); do
PORT=$((8000 + i))
sed "s/GPU 0/GPU $i/; s/DEVICES=0/DEVICES=$i/; s/port 8000/port $PORT/" \
vllm-gpu0.service > /etc/systemd/system/vllm-gpu${i}.service
done
systemctl daemon-reload
systemctl enable --now vllm-gpu{0..11}
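After the units come up, a quick smoke test confirms every instance answers on its port. A sketch using only the standard library; host and port range match the layout above:

```python
# smoke_check.py — sketch: poll each instance's /health endpoint after deploy.
import urllib.error
import urllib.request

def check_port(port: int, host: str = "localhost", timeout: float = 5.0) -> bool:
    """Return True if the vLLM instance on this port answers /health with 200."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}/health",
                                    timeout=timeout) as r:
            return r.status == 200
    except (urllib.error.URLError, OSError):
        return False

def check_cluster(ports=range(8000, 8012)) -> dict[int, bool]:
    """Map each port to its health status; unhealthy ports show up as False."""
    return {p: check_port(p) for p in ports}
```

Note that vLLM loads the model before /health responds, so allow a minute or two of startup time before treating False as a failure.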
Docker Compose Alternative
# docker-compose.yml
version: "3.8"
services:
vllm-0:
image: vllm/vllm-openai:latest
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['0']
capabilities: [gpu]
ports: ["8000:8000"]
volumes:
- /models:/models
command: >
--model meta-llama/Meta-Llama-3-8B-Instruct
--dtype bfloat16
--max-model-len 8192
--port 8000
environment:
- HF_TOKEN=${HF_TOKEN}
restart: always
vllm-1:
# ... identical, but with device_ids: ['1'] and ports: ["8001:8000"]
# ... repeat for GPU 2-11
3. Load Balancer Configuration (NGINX)
# /etc/nginx/conf.d/vllm-cluster.conf
upstream vllm_8b {
least_conn;
server 127.0.0.1:8000;
server 127.0.0.1:8001;
server 127.0.0.1:8002;
server 127.0.0.1:8003;
server 127.0.0.1:8004;
server 127.0.0.1:8005;
server 127.0.0.1:8006;
server 127.0.0.1:8007;
server 127.0.0.1:8008;
server 127.0.0.1:8009;
server 127.0.0.1:8010;
server 127.0.0.1:8011;
keepalive 64;
}
server {
listen 80;
server_name inference.yourdomain.com;
location /v1/ {
proxy_pass http://vllm_8b;
proxy_http_version 1.1;
proxy_set_header Connection ""; # keepalive
proxy_read_timeout 300s;
proxy_send_timeout 300s;
# Rate limiting per client
limit_req zone=api_limit burst=100 nodelay;
}
location /health {
proxy_pass http://vllm_8b;
}
}
# Rate limiting zone (must sit in the http context; files under conf.d are included there)
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
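To exercise the balancer end to end, any OpenAI-compatible client works. A minimal standard-library sketch; the base URL and model name are placeholders for your deployment:

```python
# lb_client.py — sketch of a chat request through the NGINX front end.
# base_url and model are placeholders; substitute your own values.
import json
import urllib.request

def build_payload(model: str, prompt: str, max_tokens: int = 128) -> bytes:
    """Encode an OpenAI-style chat completion request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()

def chat(base_url: str, model: str, prompt: str) -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as r:
        return json.loads(r.read())["choices"][0]["message"]["content"]
```

With `least_conn` upstream balancing, repeated calls spread across whichever of the 12 backends currently has the fewest open connections.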
4. Monitoring Stack
Prometheus + Grafana Setup
# prometheus.yml
scrape_configs:
- job_name: 'vllm'
static_configs:
- targets: ['localhost:8000', 'localhost:8001']  # ... continue through localhost:8011 (all 12 instances)
metrics_path: /metrics
- job_name: 'nvidia_gpu'
static_configs:
- targets: ['localhost:9400'] # dcgm-exporter
# Run DCGM exporter for GPU metrics
docker run -d --gpus all \
--rm -p 9400:9400 \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04 \
-f /etc/dcgm-exporter/dcp-metrics-included.csv
Key Metrics to Monitor
# vLLM exposes these Prometheus metrics at /metrics
VLLM_METRICS = {
"vllm:num_requests_running": "Active requests",
"vllm:num_requests_waiting": "Queued requests",
"vllm:gpu_cache_usage_perc": "KV cache usage %",
"vllm:time_to_first_token_seconds": "TTFT histogram",
"vllm:time_per_output_token_seconds": "TPOT histogram",
"vllm:request_success_total": "Successful requests",
"vllm:request_prompt_tokens_total": "Input tokens served",
"vllm:request_generation_tokens_total": "Output tokens served",
}
# GPU metrics (via DCGM)
GPU_METRICS = {
"DCGM_FI_DEV_GPU_UTIL": "GPU utilization %",
"DCGM_FI_DEV_MEM_COPY_UTIL": "Memory bandwidth utilization %",
"DCGM_FI_DEV_FB_USED": "GPU memory used (MB)",
"DCGM_FI_DEV_POWER_USAGE": "Power draw (W)",
"DCGM_FI_DEV_GPU_TEMP": "Temperature (°C)",
}
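These endpoints serve the standard Prometheus text exposition format, which is handy to parse directly for ad-hoc checks without a full Prometheus stack. A minimal sketch that keeps the metric name with its label set as the key and skips comment lines (edge cases like spaces inside label values are ignored):

```python
# metrics_peek.py — tiny parser for Prometheus text-format output (sketch).
def parse_metrics(text: str) -> dict[str, float]:
    """Map each sample line "name{labels} value" to a float value."""
    out: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass  # not a sample line; ignore
    return out
```

Fetching `http://localhost:8000/metrics` and running it through `parse_metrics` gives a quick snapshot of queue depth and KV-cache pressure per instance.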
Alert Rules
# alert_rules.yml
groups:
- name: l40s_inference
rules:
- alert: HighTTFT
expr: histogram_quantile(0.95, sum by (le, instance) (rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 0.5
for: 2m
labels:
severity: warning
annotations:
summary: "P95 TTFT > 500ms on {{ $labels.instance }}"
- alert: GPUHighTemp
expr: DCGM_FI_DEV_GPU_TEMP > 82
for: 1m
labels:
severity: critical
annotations:
summary: "GPU {{ $labels.gpu }} approaching thermal limit"
- alert: HighQueueDepth
expr: vllm:num_requests_waiting > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Request queue backing up on {{ $labels.instance }}"
- alert: LowGPUUtilization
expr: DCGM_FI_DEV_GPU_UTIL < 30
for: 10m
labels:
severity: info
annotations:
summary: "GPU underutilized — consider reducing replicas"
5. Multi-Model Deployment (Mixed Workloads)
Use 12 GPUs to serve multiple models simultaneously:
GPU 0-5 → 6 × Llama-3-8B (BF16, 1 GPU each) — high-volume chat
GPU 6-7 → 1 × Llama-3-70B (8-bit weights, TP=2) — complex reasoning
GPU 8-9 → 1 × Llama-3-70B (8-bit weights, TP=2) — complex reasoning
GPU 10 → 1 × CodeLlama-34B (INT8, 1 GPU) — code generation
GPU 11 → 1 × embedding model — RAG embeddings
Note: a 70B model in BF16 needs ~140 GB for weights alone, which exceeds 2 × 48 GB. TP=2 on L40S therefore requires 8-bit (or lower) quantization; use TP=4 if you need BF16.
# Intelligent router based on task type (sketch; forward() is an
# application-specific stub that relays the request to the chosen backend)
from collections import Counter

ROUTING_CONFIG = {
"chat": {"endpoints": [f"http://localhost:{8000 + i}" for i in range(6)]},
"reasoning": {"endpoints": ["http://localhost:8006", "http://localhost:8007"]},
"code": {"endpoints": ["http://localhost:8010"]},
"embeddings": {"endpoints": ["http://localhost:8011"]},
}
_in_flight = Counter()  # live request count per endpoint

async def route_request(task_type: str, request: dict):
endpoints = ROUTING_CONFIG[task_type]["endpoints"]
endpoint = min(endpoints, key=lambda ep: _in_flight[ep])  # least connections
_in_flight[endpoint] += 1
try:
return await forward(endpoint, request)  # forward() supplied by the app
finally:
_in_flight[endpoint] -= 1
6. Kubernetes Deployment
# GPU-per-pod strategy: 12 pods, 1 GPU each
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-llama3-8b
spec:
replicas: 12
selector:
matchLabels:
app: vllm-8b
template:
metadata:
labels:
app: vllm-8b
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
resources:
limits:
nvidia.com/gpu: "1" # 1 L40S per pod
memory: "64Gi"
cpu: "8"
args:
- "--model=meta-llama/Llama-3-8b-instruct"
- "--dtype=bfloat16"
- "--max-model-len=8192"
- "--gpu-memory-utilization=0.90"
ports:
- containerPort: 8000
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 5
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-credentials
key: token
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
name: vllm-8b-service
spec:
selector:
app: vllm-8b
ports:
- port: 80
targetPort: 8000
type: LoadBalancer
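After applying the manifests, it is worth gating traffic on all 12 replicas being Ready. A sketch assuming `kubectl` is configured for the target cluster; the deployment name matches the manifest above:

```python
# rollout_check.py — sketch: confirm the Deployment reports all replicas Ready.
# Assumes kubectl is on PATH and configured for the target cluster.
import json
import subprocess

def parse_ready(deployment_json: dict) -> int:
    """Extract readyReplicas from `kubectl get deployment -o json` output."""
    return deployment_json.get("status", {}).get("readyReplicas", 0)

def ready_replicas(name: str = "vllm-llama3-8b",
                   namespace: str = "default") -> int:
    out = subprocess.run(
        ["kubectl", "get", "deployment", name, "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True).stdout
    return parse_ready(json.loads(out))
```

A CI gate can then assert `ready_replicas() == 12` before flipping the Service into rotation.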
7. Health Checking and Auto-Recovery
# health_monitor.py — auto-restart unhealthy instances
# (must run with privileges to call systemctl)
import asyncio, logging, subprocess, time
import aiohttp
INSTANCES = [{"gpu": i, "port": 8000 + i} for i in range(12)]
HEALTH_INTERVAL = 30   # seconds between sweeps
RESTART_COOLDOWN = 60  # minimum seconds between restarts of the same instance
async def check_health(session: aiohttp.ClientSession, port: int) -> bool:
try:
async with session.get(f"http://localhost:{port}/health",
timeout=aiohttp.ClientTimeout(total=5)) as r:
return r.status == 200
except (aiohttp.ClientError, asyncio.TimeoutError):
return False
async def restart_instance(gpu_id: int, port: int) -> None:
logging.warning("Restarting vLLM on GPU %d (port %d)", gpu_id, port)
subprocess.run(["systemctl", "restart", f"vllm-gpu{gpu_id}"], check=False)
await asyncio.sleep(RESTART_COOLDOWN)  # give the server time to reload the model
async def monitor() -> None:
restart_times: dict[int, float] = {}
async with aiohttp.ClientSession() as session:
while True:
for inst in INSTANCES:
if not await check_health(session, inst["port"]):
last = restart_times.get(inst["gpu"], 0.0)
if time.monotonic() - last > RESTART_COOLDOWN:
await restart_instance(inst["gpu"], inst["port"])
restart_times[inst["gpu"]] = time.monotonic()
await asyncio.sleep(HEALTH_INTERVAL)
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
asyncio.run(monitor())