01 - L40S Hardware Architecture
1. Die and Process
- GPU die: AD102 (Ada Lovelace architecture, TSMC 4N)
- Same die as: RTX 4090 (consumer), but different thermal/power/firmware profile
- Transistors: 76.3 billion
- SMs: 142 Streaming Multiprocessors
- CUDA cores: 18,176
- Tensor Cores: 4th generation (FP8, FP16, BF16, TF32, INT8, INT4)
- FP8 TFLOPS: ~1,466 TFLOPS (sparse), ~733 TFLOPS (dense)
- BF16 TFLOPS: ~733 TFLOPS (sparse), ~362 TFLOPS (dense)
- TF32 TFLOPS: ~366 TFLOPS (sparse), ~183 TFLOPS (dense)
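The L40S shares the AD102 die with the L40, which also has FP8 Tensor Cores, but it runs at a higher power limit (350 W vs 300 W) and is rated at roughly twice the Tensor Core throughput, making it better suited to current-generation LLM inference.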
2. GDDR6 Memory Subsystem
| Property | L40S | A10 (prev gen) | A100 PCIe |
|---|---|---|---|
| Memory type | GDDR6 | GDDR6 | HBM2e |
| Capacity | 48 GB | 24 GB | 80 GB |
| Bandwidth | 864 GB/s | 600 GB/s | 1,935 GB/s |
| Interface | 384-bit | 384-bit | 5120-bit |
GDDR6 vs HBM: Key Differences
GDDR6 (L40S):
+ Cheaper to manufacture (chips mounted on a standard PCB, no silicon interposer)
+ Higher capacity per dollar
+ Good bandwidth for inference (memory-bound decode)
− ~5.6× less bandwidth than HBM3e (864 GB/s vs 4.8 TB/s)
− PCIe attachment (shared CPU-GPU bandwidth)
− No NVLink on the L40S (GPU-to-GPU traffic uses PCIe)
HBM3e (H200):
+ Extreme bandwidth (4.8 TB/s per GPU)
+ On-package memory; the SXM platform adds NVLink/NVSwitch for GPU-GPU transfers
− Expensive (specialized packaging)
− Fixed capacity tiers (80 GB, 141 GB)
For LLM decode (the throughput bottleneck), memory bandwidth is the critical metric. L40S achieves 864 GB/s vs H200's 4.8 TB/s, a ~5.6× gap that carries over almost directly to decode throughput for memory-bound workloads.
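The bandwidth argument above can be turned into a rough ceiling on decode speed. The following is a minimal sketch, assuming a single request (batch size 1) where every generated token requires reading all model weights once from GPU memory; it ignores KV cache reads, kernel overhead, and batching, so real numbers will be lower. The function name is illustrative, not part of any library.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on batch-1 decode rate for a memory-bound model.

    Assumes each token reads all weights exactly once from GPU memory.
    """
    return bandwidth_gb_s / weight_gb


L40S_BW_GB_S = 864.0    # from the table above
H200_BW_GB_S = 4800.0   # 4.8 TB/s
WEIGHTS_7B_FP16_GB = 14.0

for name, bw in (("L40S", L40S_BW_GB_S), ("H200", H200_BW_GB_S)):
    rate = decode_tokens_per_second(bw, WEIGHTS_7B_FP16_GB)
    print(f"{name}: <= {rate:.1f} tokens/s for a 7B FP16 model")
# L40S: <= 61.7 tokens/s for a 7B FP16 model
# H200: <= 342.9 tokens/s for a 7B FP16 model
```

Batching raises aggregate throughput because one pass over the weights serves several requests, which is why this bound applies per decode step rather than per server.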
Memory Capacity Planning for 12x L40S
Total GPU memory: 12 × 48 GB = 576 GB
Model weight allocation (FP16):
7B model: 14 GB → fits on 1 GPU (34 GB free for KV cache)
13B model: 26 GB → fits on 1 GPU (22 GB free for KV cache)
34B model: 68 GB → needs 2 GPUs (14 GB/GPU free)
70B model: 140 GB → needs 3-4 GPUs (3 GPUs leave only ~4 GB in total for KV cache; 4 is the practical minimum)
180B model: 360 GB → needs 8-10 GPUs
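The allocation table above follows from simple arithmetic, sketched here. It assumes 2 bytes per parameter (FP16), 48 GB per GPU, and weights split evenly across GPUs; the `plan` helper is a hypothetical name for illustration. It reports the minimum GPU count only, so the free memory it shows is the worst case for KV cache.

```python
import math

GPU_MEMORY_GB = 48  # L40S capacity


def plan(params_billion, bytes_per_param=2, gpu_memory_gb=GPU_MEMORY_GB):
    """Return (weights_gb, min_gpus, free_gb_per_gpu) for an even weight split."""
    weights_gb = params_billion * bytes_per_param
    min_gpus = math.ceil(weights_gb / gpu_memory_gb)
    free_per_gpu = (min_gpus * gpu_memory_gb - weights_gb) / min_gpus
    return weights_gb, min_gpus, free_per_gpu


for size in (7, 13, 34, 70, 180):
    weights, gpus, free = plan(size)
    print(f"{size}B: {weights} GB weights, >= {gpus} GPU(s), "
          f"{free:.1f} GB free per GPU")
```

For 70B and 180B the minimum count leaves only a few GB per GPU, which is why the table gives a range rather than a single number. Tensor-parallel deployments also usually pick a GPU count that divides the model's attention heads evenly, which tends to push the choice to 4 or 8.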
3. PCIe Topology (No NVLink)
The L40S uses PCIe 4.0 x16 as its only host and GPU-to-GPU interconnect. This is the most important architectural constraint to understand.
PCIe Bandwidth vs NVLink
PCIe 4.0 x16: ~32 GB/s per direction (bidirectional: 64 GB/s)
NVLink 4.0: 900 GB/s bidirectional per GPU
Ratio: NVLink offers ~14× the GPU-to-GPU bandwidth (900 vs 64 GB/s bidirectional).
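To see what this ratio means for tensor-parallel inference, the sketch below estimates the bandwidth term of a ring all-reduce, in which each GPU sends and receives 2(N-1)/N of the message. It assumes every hop sustains the full per-direction link rate (32 GB/s for PCIe 4.0 x16, 450 GB/s for NVLink 4.0) and ignores latency; PCIe paths that cross switches or sockets will usually do worse. The function name and the 64 MiB message size are illustrative choices.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_gb_s):
    """Bandwidth-only estimate for ring all-reduce (latency ignored)."""
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic_bytes / (link_gb_s * 1e9)


PCIE4_X16_GB_S = 32.0   # per direction
NVLINK4_GB_S = 450.0    # per direction (900 GB/s bidirectional)

message = 64 * 2**20    # 64 MiB of activations (illustrative size)
t_pcie = ring_allreduce_seconds(message, 4, PCIE4_X16_GB_S)
t_nvlink = ring_allreduce_seconds(message, 4, NVLINK4_GB_S)
print(f"PCIe 4.0: {t_pcie * 1e3:.2f} ms, NVLink 4.0: {t_nvlink * 1e3:.2f} ms, "
      f"ratio {t_pcie / t_nvlink:.1f}x")
```

Because the cost scales with message size and is paid on every layer, tensor parallelism across many L40S cards is far more communication-bound than on NVLink systems.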
12-GPU PCIe Topology (Typical Server)
        CPU 0 (socket 0)                  CPU 1 (socket 1)
               |                                 |
      PCIe Root Complex 0               PCIe Root Complex 1
         |            |                    |            |
     Switch 0      Switch 1            Switch 2      Switch 3
      /    \        /    \              /    \        /    \
   GPU0   GPU1   GPU2   GPU3         GPU4   GPU5   GPU6   GPU7
               |                                 |
             GPU8                              GPU9
             GPU10                             GPU11
Verifying PCIe Topology
# Show full topology including NUMA and PCIe relationship
nvidia-smi topo -m
# Output (simplified):
# GPU0 GPU1 GPU2 GPU3 ... CPU Affinity
# GPU0 X SYS SYS SYS ... 0-15
# GPU1 SYS X SYS SYS ... 0-15
# ...
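# SYS = traverses PCIe plus the inter-socket link between NUMA nodes (highest latency)
# NODE = traverses PCIe plus the interconnect between host bridges within one NUMA node
# PHB = traverses PCIe and a PCIe host bridge (typically the CPU)
# PXB = traverses multiple PCIe bridges without crossing the host bridge
# PIX = traverses at most a single PCIe bridge (lowest latency over PCIe)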
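On a 12-GPU box the matrix is large enough that scanning it by eye is error-prone. The sketch below parses the matrix text and lists GPU pairs connected via SYS, which are the pairs to avoid grouping in one tensor-parallel set. It assumes whitespace-separated columns with the GPU columns first, as in the simplified output above; the `SAMPLE` text is a hypothetical 4-GPU matrix, not captured output, and the helper names are illustrative.

```python
import re

GPU_NAME = re.compile(r"^GPU\d+$")

# Hypothetical matrix in the layout printed by `nvidia-smi topo -m`
SAMPLE = """
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      PIX     SYS     SYS     0-15            0
GPU1    PIX      X      SYS     SYS     0-15            0
GPU2    SYS     SYS      X      PIX     16-31           1
GPU3    SYS     SYS     PIX      X      16-31           1
"""


def gpu_index(name):
    return int(name[3:])


def parse_topo_matrix(text):
    """Return {(row_gpu, col_gpu): link_type} for every GPU pair."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    gpu_rows = [row for row in rows if GPU_NAME.match(row[0])]
    if not gpu_rows:
        return {}
    header, data = gpu_rows[0], gpu_rows[1:]
    columns = []
    for token in header:          # leading GPU columns only
        if not GPU_NAME.match(token):
            break
        columns.append(token)
    links = {}
    for row in data:
        for column, link in zip(columns, row[1:]):
            if column != row[0]:  # skip the "X" self entry
                links[(row[0], column)] = link
    return links


def cross_socket_pairs(links):
    """GPU pairs whose traffic crosses the inter-socket link (SYS)."""
    pairs = [(a, b) for (a, b), link in links.items()
             if link == "SYS" and gpu_index(a) < gpu_index(b)]
    return sorted(pairs, key=lambda p: (gpu_index(p[0]), gpu_index(p[1])))


print(cross_socket_pairs(parse_topo_matrix(SAMPLE)))
```

On a real system the text would come from running `nvidia-smi topo -m`, for example through `subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout`.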
# Measure actual P2P bandwidth
python -c "
import time
import torch

# 256Mi FP16 elements = 512 MiB on the source GPU
a = torch.randn(1024 * 1024 * 256, device='cuda:0', dtype=torch.float16)
# Allocate the destination directly on GPU 1 (no staging copy on GPU 0)
b = torch.empty_like(a, device='cuda:1')
nbytes = a.numel() * a.element_size()

def sync():
    # The copy spans two devices, so wait on both
    torch.cuda.synchronize(0)
    torch.cuda.synchronize(1)

for _ in range(5):  # warm-up
    b.copy_(a)
sync()
t0 = time.perf_counter()
for _ in range(100):
    b.copy_(a)
sync()
bw = nbytes * 100 / (time.perf_counter() - t0) / 1e9
print(f'P2P bandwidth GPU0→GPU1: {bw:.1f} GB/s')
# Expected: 24-30 GB/s (PCIe 4.0, direct switch)
# Poor result: < 10 GB/s (traverses NUMA boundary)
"
4. L40S PCIe Form Factor Advantages
Rack Density
2U server (typical):
4 × L40S @ 350W = 1,400W total
4U server (dense GPU):
8 × L40S @ 350W = 2,800W total
For 12 GPUs:
Option A: 3 × 4U servers (4 GPUs each), cross-server via InfiniBand
Option B: 1 × 6U or 8U super-dense chassis
Option C: 2U + 4U combination
H100/H200 SXM reference:
DGX H100: 8 GPUs, 8U, 10.2 kW max (whole system)
L40S equivalent 8 GPUs: ~4U, ~2.8 kW (GPUs only)
Flexibility
- L40S fits any server with a dual-slot PCIe 4.0 x16 position and adequate power and airflow (no SXM baseboard needed)
- Mix with CPUs, FPGAs, or networking cards in same chassis
- Standard power connectors (PCIe 16-pin, 600W cable)
- Replaceable individually (no SXM module replacement)
5. Key Features for Inference
NVENC / NVDEC (Media Engines)
L40S includes 3× NVENC + 3× NVDEC per GPU (with AV1 encode and decode), which is relevant for multimodal inference pipelines that process video.
Ada Lovelace Shader Execution Reordering (SER)
SER dynamically reorders shader workloads to improve occupancy; it primarily benefits graphics workloads such as ray tracing. For compute/AI, standard CUDA scheduling applies.
Ada FP8 vs Hopper FP8
Ada (L40S) FP8: FP8 Tensor Cores, E4M3 and E5M2 formats
Hopper (H200) FP8: FP8 + Transformer Engine for automated scaling
+ hardware-accelerated amax tracking
For L40S: calibrate FP8 scaling factors offline (PTQ)
No hardware delayed scaling support
GPTQ/AWQ are INT4 weight-only PTQ methods: an alternative to FP8, not a way to produce FP8 weights
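Offline FP8 quantization comes down to choosing a scale from calibration data and rounding scaled values to the E4M3 grid. The pure-Python sketch below shows static per-tensor scaling under these assumptions: the scale is amax / 448 (448 is the largest finite E4M3 value), values are rounded to 3 mantissa bits with a minimum exponent of -6, and out-of-range values saturate. The helper names are hypothetical, and production toolchains use fused GPU kernels and often per-channel scales.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(x):
    """Round a float to the nearest representable E4M3 value (saturating)."""
    if x == 0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)              # a = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)         # subnormals share the minimum exponent
    quantum = 2.0 ** (exponent - 3)   # spacing for 3 mantissa bits
    return sign * min(round(a / quantum) * quantum, E4M3_MAX)


def calibrate_scale(samples):
    """Static per-tensor scale from the largest magnitude seen in calibration."""
    amax = max(abs(v) for v in samples)
    return amax / E4M3_MAX if amax > 0 else 1.0


def fake_quantize(values, scale):
    """Quantize to E4M3 and dequantize again, to inspect rounding error."""
    return [round_to_e4m3(v / scale) * scale for v in values]


weights = [0.5, -2.0, 1.0, 0.013]
scale = calibrate_scale(weights)
for w, q in zip(weights, fake_quantize(weights, scale)):
    print(f"{w:+.4f} -> {q:+.4f} (abs error {abs(w - q):.5f})")
```

Because the scale is fixed after calibration, activations that later exceed the calibrated amax saturate at ±448 after scaling, which is the main risk of static scaling compared with runtime amax tracking.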
6. Power and Cooling
| Property | Value |
|---|---|
| TDP per L40S | 350 W |
| 12-GPU system TDP | ~4,200 W (GPUs) |
| Cooling | Air-cooled (passive heatsink + server fans) |
| Required airflow | Front-to-back, 200+ CFM recommended |
| PCIe power connector | 16-pin ATX 3.0 (600W capable) |
Thermal Management
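# Monitor power/temperature, utilization, clocks, violations, and PCIe throughput every 5 s
# (the L40S is passively cooled, so there is no on-card fan speed to report)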
nvidia-smi dmon -s pucvt -d 5 -i 0,1,2,3,4,5,6,7,8,9,10,11
# Set power limit (if thermal throttling occurs)
sudo nvidia-smi -pl 300 -i 0 # reduce to 300W for GPU 0
# Check throttling reasons
nvidia-smi -q -d PERFORMANCE | grep -i "slowdown"
# "SW Thermal Slowdown : Active" means the GPU is thermally throttling
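For a 12-GPU chassis it can help to script the check rather than read each GPU's output by hand. The sketch below parses the CSV produced by `nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit --format=csv,noheader,nounits` and flags GPUs that look hot or power-capped. The 80 °C and 95% thresholds are assumptions chosen for illustration, not NVIDIA limits, and the `SAMPLE` rows are made-up values rather than measured output.

```python
def flag_hot_gpus(csv_text, temp_limit_c=80.0, power_fraction=0.95):
    """Return indices of GPUs at or above the temperature or power thresholds."""
    flagged = []
    for line in csv_text.strip().splitlines():
        try:
            index, temp, draw, limit = [f.strip() for f in line.split(",")]
            temp, draw, limit = float(temp), float(draw), float(limit)
        except ValueError:
            continue  # skip malformed rows or fields such as "[N/A]"
        if temp >= temp_limit_c or draw >= power_fraction * limit:
            flagged.append(int(index))
    return flagged


# Hypothetical rows: index, temperature (C), power draw (W), power limit (W)
SAMPLE = """
0, 62, 287.41, 350.00
1, 84, 341.10, 350.00
2, 70, 349.02, 350.00
3, 55, [N/A], 350.00
"""

print(flag_hot_gpus(SAMPLE))  # [1, 2]
```

A flagged GPU is a candidate for a lower power limit with `nvidia-smi -pl`, or for an airflow check on its slot.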