02 - Inference Optimization on L40S
1. Why L40S Inference is Different from H200
| Constraint | Impact | Mitigation |
|---|---|---|
| 864 GB/s GDDR6 vs 4.8 TB/s HBM3e | Decode is ~5.6× more memory-bound | Larger batches, quantization |
| PCIe x16 for GPU-GPU | All-reduce is 14× slower | Minimize TP degree, use pipeline parallel |
| 48 GB per GPU | Smaller model shards | More aggressive quantization (INT4) |
| No NVLink | High latency tensor parallel | Prefer pipeline parallelism for multi-GPU |
| FP8 (no TE hardware scaling) | Manual quantization required | Use GPTQ/AWQ offline |
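The bandwidth row can be turned into a rough ceiling on single-stream decode speed. Each generated token has to read every weight once, so tokens/s is at most memory bandwidth divided by weight bytes. The sketch below uses the bandwidth figures from the table and assumes weight reads are the only cost (KV cache traffic and kernel overhead are ignored); the function name is illustrative.

```python
def decode_ceiling_tokens_per_s(params_billion: float, bytes_per_param: float,
                                bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed if weight reads were the only cost."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB
    return bandwidth_gb_s / weight_gb

L40S_BW, H200_BW = 864.0, 4800.0  # GB/s, from the table above

for label, bytes_per_param in [("FP16", 2.0), ("INT4", 0.5)]:
    l40s = decode_ceiling_tokens_per_s(8, bytes_per_param, L40S_BW)
    h200 = decode_ceiling_tokens_per_s(8, bytes_per_param, H200_BW)
    print(f"8B {label}: L40S <= {l40s:.0f} tok/s, H200 <= {h200:.0f} tok/s")
```

Under these assumptions an 8B FP16 model tops out near 54 tokens/s per sequence on L40S, and INT4 raises that ceiling about 4×. This is why both quantization and batching appear in the mitigation column.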
2. Quantization: Essential for L40S
Quantization matters more on L40S than on H200 for two reasons:
1. Smaller per-GPU memory → larger models need more compression
2. Lower memory bandwidth → quantized weights mean fewer bytes read per token, which speeds up memory-bound decode
GPTQ (Post-Training Quantization)
# Install AutoGPTQ
pip install auto-gptq
# Quantize Llama-3 70B to INT4 (GPTQ)
python - <<'EOF'
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "meta-llama/Llama-3-70b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
quantize_config = BaseQuantizeConfig(
    bits=4,             # INT4 quantization
    group_size=128,     # quantization group size (128 is standard)
    damp_percent=0.01,
    desc_act=True,      # activation ordering (better quality)
)
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config,
    device_map="auto",
)
# Calibration data: a single placeholder sample is shown here. For real use,
# supply ~128 representative text samples of up to 2048 tokens each.
examples = [tokenizer("calibration text " * 200, return_tensors="pt")]
model.quantize(examples)
model.save_quantized("/models/llama-3-70b-gptq-int4")
EOF
INT4 GPTQ memory savings:
- FP16: 140 GB (70B model)
- INT4: ~35 GB (70B model) → fits on 1 L40S (with KV cache limits)
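The 140 GB and ~35 GB figures follow from bits per weight. The sketch below also counts the per-group metadata that group-wise quantization stores, assuming one FP16 scale and one 4-bit zero point per group of 128 weights. That storage layout is an assumption made for illustration, and embeddings or other layers left in FP16 are ignored.

```python
def quantized_weight_gb(params_billion: float, bits: int, group_size: int = 128,
                        scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Approximate weight size in GB for group-wise quantization.

    Assumes each group of `group_size` weights stores one scale and one zero point.
    """
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    return params_billion * bits_per_weight / 8  # billions of bytes == GB

fp16_gb = 70 * 16 / 8                      # plain FP16, no group metadata
int4_gb = quantized_weight_gb(70, bits=4)  # INT4 with group_size=128

print(f"FP16: {fp16_gb:.0f} GB, INT4 (g128): {int4_gb:.1f} GB, "
      f"ratio {fp16_gb / int4_gb:.2f}x")
```

With these assumptions the INT4 checkpoint lands slightly above 35 GB, so the practical reduction is a little under 4×. That leaves less headroom for KV cache on a 48 GB card.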
AWQ (Activation-Aware Weight Quantization)
pip install autoawq
python - <<'EOF'
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3-70b"
quant_path = "/models/llama-3-70b-awq-int4"
# Load on CPU: the FP16 70B checkpoint (~140 GB) does not fit on one 48 GB GPU.
# AutoAWQ moves layers to the GPU one at a time during quantization.
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)
quant_config = {
    "zero_point": True,    # zero-point quantization (better quality)
    "q_group_size": 128,   # group size
    "w_bit": 4,            # INT4
    "version": "GEMM",     # GEMM or GEMV kernel
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
EOF
AWQ vs GPTQ comparison:
- AWQ: slightly better perplexity, faster inference (optimized GEMM kernels)
- GPTQ: more control, desc_act=True gives best quality
- Both: ~4× memory reduction vs FP16
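Both methods store weights as group-wise low-bit integers; they differ in how the scales and rounding are chosen. The sketch below shows only the shared storage idea behind `group_size` and `zero_point`, using plain round-to-nearest. It is not the GPTQ or AWQ algorithm, and the helper names are illustrative.

```python
def quantize_group(weights, bits=4):
    """Asymmetric (zero-point) quantization of one group of floats to `bits`-bit codes."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0   # fall back to 1.0 for a constant group
    zero = round(-lo / scale)         # offset so that `lo` maps to code 0
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero

def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]

def quantize_row(row, group_size=128, bits=4):
    """Quantize a weight row in independent groups, each with its own scale and zero."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]

# Two groups of 8: the first holds small values, the second contains outliers.
row = [0.01 * i for i in range(8)] + [5.0, -5.0, 0.1, 0.2, 0.0, 0.3, -0.1, 0.05]
packed = quantize_row(row, group_size=8)
for idx, (codes, scale, zero) in enumerate(packed):
    print(f"group {idx}: scale={scale:.4f} zero={zero} codes={codes}")
```

The outliers inflate the scale only for their own group, so the first group keeps fine resolution. Smaller groups improve accuracy at the cost of more scale and zero-point metadata.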
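FP8 Static Quantization for L40S
# For L40S, use offline FP8 quantization (no hardware TE scaling).
# Loading with torch_dtype=torch.float8_e4m3fn only casts the stored weights;
# stock nn.Linear layers have no FP8 matmul path, so that alone does not give
# FP8 inference. One option, assumed here, is vLLM's FP8 weight/activation
# quantization, which supports Ada (compute capability 8.9).
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-3-8b",
    quantization="fp8",   # quantizes weights to FP8 at load time
)
# For static activation scales, load a checkpoint that was quantized offline instead.
# Note: FP8 on Ada gives ~1.4× speedup vs FP16 (vs ~2× on Hopper with TE)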
Quantization Decision Guide
| Model Size | L40S Strategy | GPUs Needed | Notes |
|---|---|---|---|
| 7B | FP16 or BF16 | 1 | 14 GB, fast, no quality loss |
| 13B | FP16 | 1 | 26 GB, fits with small KV cache |
| 34B | INT8 or AWQ INT4 | 1-2 | INT8: 34 GB (1 GPU), INT4: 17 GB |
| 70B | AWQ/GPTQ INT4 | 1-2 | INT4: 35 GB (1 GPU), max context limited |
| 180B | GPTQ INT4 | 4-5 | 90 GB total, need multi-GPU |
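The "GPUs Needed" column can be sanity-checked with a small helper. This is a rough sketch: the 0.90 utilization and the 4 GB per-GPU KV cache reserve are assumptions chosen for illustration, not measured values, and the function name is hypothetical.

```python
import math

def gpus_needed(weight_gb: float, gpu_gb: float = 48.0,
                utilization: float = 0.90, kv_reserve_gb: float = 4.0) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    Usable memory per GPU = gpu_gb * utilization - kv_reserve_gb.
    """
    usable_per_gpu = gpu_gb * utilization - kv_reserve_gb
    return max(1, math.ceil(weight_gb / usable_per_gpu))

for label, weight_gb in [("7B FP16", 14), ("34B INT8", 34), ("70B INT4", 35),
                         ("180B INT4", 90), ("70B BF16", 140)]:
    print(f"{label:10s} {weight_gb:4d} GB -> {gpus_needed(weight_gb)} GPU(s)")
```

The result is a lower bound. Tensor parallelism usually needs a degree that divides the number of attention heads, so a computed 3 typically rounds up to 4 in practice, and long contexts need a larger KV reserve.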
3. vLLM Configuration for L40S
from vllm import LLM, SamplingParams

# Single GPU, 8B model: standard deployment
llm = LLM(
    model="meta-llama/Llama-3-8b-instruct",
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
    max_num_seqs=128,
)
# Single GPU, 70B INT4 AWQ: fits on 1 L40S
llm = LLM(
    model="/models/llama-3-70b-awq-int4",
    quantization="awq",
    dtype="float16",
    max_model_len=4096,            # limit context due to 48 GB constraint
    gpu_memory_utilization=0.85,   # fraction of GPU memory vLLM may use (weights + KV cache)
    max_num_seqs=64,
)
# Multi-GPU, 70B BF16: 140 GB of weights does not fit in 2 × 48 GB, so use 4 L40S with TP=4
llm = LLM(
    model="meta-llama/Llama-3-70b-instruct",
    tensor_parallel_size=4,        # PCIe limited; keep TP as low as the weights allow
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
)
vLLM Tuning for PCIe Systems
# L40S-specific vLLM launch flags
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8b-instruct \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 32768 \
    --block-size 16 \
    --port 8000
# For GPTQ/AWQ quantized models
python -m vllm.entrypoints.openai.api_server \
    --model /models/llama-3-70b-awq-int4 \
    --quantization awq \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 128
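Once one of the servers above is running, it can be queried over its OpenAI-compatible HTTP API. A minimal standard-library sketch follows; it assumes the server listens on localhost:8000 as in the first launch command, and the helper names are illustrative.

```python
import json
import urllib.request

def build_completion_request(prompt: str,
                             model: str = "meta-llama/Llama-3-8b-instruct",
                             base_url: str = "http://localhost:8000",
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the server's /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": 0.0}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs), timeout=60) as resp:
        return json.load(resp)["choices"][0]["text"]

# Usage, with the server from the first launch command running:
#   print(complete("The L40S GPU is"))
```

The `model` field has to match the value passed to `--model` at launch, so for the quantized server it would be the local checkpoint path.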
4. Continuous Batching Strategies
L40S Optimal Batch Size
On H200, batch size can grow almost freely; on L40S, the 48 GB capacity caps the KV cache and therefore the batch size:
7B model on L40S (48 GB total):
  Weights (BF16): 14 GB
  CUDA reserved:  ~2 GB
  Available KV:   ~32 GB
KV cache per token (Llama-3 8B, FP16):
  2 (K and V) × 32 layers × 8 kv-heads × 128 head-dim × 2 bytes = 131 KB/token
Max concurrent tokens at BS=256, seq=256:
  256 × 256 = 65,536 tokens × 131 KB = ~8.6 GB → fits ✓
Sweet spot for L40S 7B: batch_size=128-256
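The arithmetic above generalizes to other models and batch shapes. The sketch below reproduces the numbers with the Llama-3 8B shape (32 layers, 8 KV heads, head dimension 128) and the ~32 GB KV budget quoted above; the function name is illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)  # FP16
batch_gb = 256 * 256 * per_token / 1e9       # BS=256, 256 tokens per sequence
max_tokens = int(32e9 // per_token)          # tokens that fit in a 32 GB KV budget

print(f"{per_token / 1e3:.0f} KB/token, batch footprint {batch_gb:.1f} GB, "
      f"budget holds {max_tokens:,} tokens")
```

The budget is shared by all active sequences, so the same 32 GB can serve many short sequences or a few long ones.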
# Benchmark batch sizes to find the throughput peak. This times full forward
# passes over 128-token inputs (prefill-style throughput, not per-token decode).
import time
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8b",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
model.eval()
for batch_size in [1, 4, 16, 32, 64, 128]:
    input_ids = torch.randint(0, 32000, (batch_size, 128), device="cuda:0")
    with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
        for _ in range(3):
            model(input_ids)          # warmup
        torch.cuda.synchronize()      # finish warmup before starting the clock
        t0 = time.perf_counter()
        for _ in range(20):
            model(input_ids)
        torch.cuda.synchronize()      # wait for queued kernels before stopping the clock
        elapsed = time.perf_counter() - t0
    tps = batch_size * 128 * 20 / elapsed
    print(f"BS={batch_size:4d}: {tps:8.0f} tokens/s")
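5. Speculative Decoding
Speculative decoding is especially effective on L40S because the decode step is highly memory-bandwidth bound: a small draft model proposes several tokens, and the target model verifies them in a single forward pass.
# L40S speculative decoding setup
# (speculative_model / num_speculative_tokens are the arguments of older vLLM
# releases; newer releases may expect a speculative_config dict instead)
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-3-70b-instruct",              # target model (BF16, 4 GPUs)
    speculative_model="meta-llama/Llama-3-8b-instruct",   # draft model
    num_speculative_tokens=5,
    tensor_parallel_size=4,   # 140 GB of BF16 weights does not fit in 2 × 48 GB
)
Alternative: use a much smaller draft model (~1B) for even larger speedups. The draft must share the target's tokenizer and vocabulary:
llm = LLM(
    model="meta-llama/Llama-3-8b-instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # same Llama 3 tokenizer as the target
    num_speculative_tokens=6,
    speculative_max_model_len=4096,
)
# Typical speedup: 1.5-2.5x on L40S (memory-bound decode benefits most)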
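The speedup range can be related to two quantities: how often the target accepts a draft token, and how cheap the draft model is. The sketch below uses the standard expected-value formula for speculative decoding under the simplifying assumption that each draft token is accepted independently with probability `alpha`. The values 0.8 and 0.1 are illustrative, not measurements.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k draft tokens.

    Geometric series 1 + alpha + ... + alpha**k (accepted prefix plus one
    token that the target model always contributes).
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding; draft_cost = draft step time / target step time."""
    return expected_tokens_per_step(alpha, k) / (1 + k * draft_cost)

# Illustrative inputs: 80% acceptance, 5 draft tokens, draft 10x cheaper than target.
print(f"tokens per step: {expected_tokens_per_step(0.8, 5):.2f}")
print(f"estimated speedup: {estimated_speedup(0.8, 5, 0.1):.2f}x")
```

Raising `num_speculative_tokens` helps only while acceptance stays high; with a low acceptance rate the extra draft steps cost more than they return.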
6. KV Cache Quantization
On L40S (48 GB), KV cache compression is critical for long contexts:
# vLLM with FP8 KV cache
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-3-8b-instruct",
    kv_cache_dtype="fp8",   # cuts KV cache memory by 50%
    max_model_len=32768,    # assumes a checkpoint whose context window covers 32K (stock Llama-3 8B is 8K)
)
# FP8 KV cache impact on L40S 7B (~32 GB KV budget):
# FP16 KV: 131 KB/token → one 32K-token sequence needs ~4.3 GB (budget holds ~244K tokens, about 7 such sequences)
# FP8 KV:   66 KB/token → one 32K-token sequence needs ~2.1 GB (about 2× as many tokens and sequences)
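7. Flash Attention 2 (L40S)
L40S supports Flash Attention 2 (not FA3, which is Hopper-specific):
# PyTorch ≥ 2.2 ships a FlashAttention-2 kernel behind SDPA
import torch
import torch.nn.functional as F
# Example shapes (illustrative): batch 1, 32 heads, 4096 tokens, head_dim 128
q, k, v = (torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.bfloat16) for _ in range(3))
# SDPA selects the flash kernel on L40S (Ada) for eligible FP16/BF16 CUDA inputs
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
# Memory savings: O(N) vs O(N²) for attention map
# Speed: 2-4× faster than naive attention for seq_len > 1024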
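The O(N²) term is easy to quantify. Naive attention materializes one N × N score matrix per head, which the sketch below sizes for a single layer and a single sequence, assuming 32 query heads (as in Llama-3 8B) and FP16 scores. The function name is illustrative.

```python
def naive_scores_gb(seq_len: int, heads: int = 32, batch: int = 1,
                    dtype_bytes: int = 2) -> float:
    """Memory for the full attention score matrix of one layer, in GB."""
    return batch * heads * seq_len * seq_len * dtype_bytes / 1e9

for n in (1024, 8192, 32768):
    print(f"seq_len={n:6d}: {naive_scores_gb(n):8.2f} GB per layer")
```

At 32K tokens the score matrix for one layer alone would exceed the 48 GB on the card. FlashAttention computes attention in tiles and never stores that matrix, which is what makes long contexts feasible on L40S.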
8. Triton Inference Server Setup
For production multi-model deployment across 12 L40S GPUs:
# Model repository structure
model_repo/
├── llama-3-8b/
│   ├── config.pbtxt
│   └── 1/
│       └── model.py
├── llama-3-70b-awq/
│   ├── config.pbtxt
│   └── 1/
│       └── model.py
└── ensemble/
    └── config.pbtxt
# config.pbtxt for vLLM backend on Triton
name: "llama-3-8b"
backend: "vllm"
max_batch_size: 256
model_transaction_policy {
  decoupled: true  # streaming
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]  # assign to GPU 0
  }
]
parameters {
  key: "model"
  value: { string_value: "meta-llama/Llama-3-8b-instruct" }
}
parameters {
  key: "dtype"
  value: { string_value: "bfloat16" }
}
# Launch Triton with 12 L40S GPUs
tritonserver \
    --model-repository=/model_repo \
    --backend-config=vllm,cmdline_args="--max-num-seqs 256" \
    --http-port 8000 \
    --grpc-port 8001 \
    --log-verbose 1