06 — Disaggregated Storage: NVMe-oF + OpenFlex + RapidFlex

1. Why Disaggregated Storage for AI

Traditional HPC storage model: storage is local to the compute node.

Compute Node A:    GPU + 8 × NVMe local drives
Compute Node B:    GPU + 8 × NVMe local drives
Compute Node C:    GPU + 8 × NVMe local drives

Problems:
  - Dataset must be replicated to every node (wastes capacity)
  - When a node fails, its local data is unavailable
  - Storage is idle when the node is running a different job
  - Adding more compute means buying more storage too (coupled scaling)

Disaggregated storage model:

Compute Node A:    GPU + NIC (no local NVMe)
Compute Node B:    GPU + NIC (no local NVMe)
Compute Node C:    GPU + NIC (no local NVMe)

Storage Server:    OpenFlex Data24 × N (NVMe pool, shared)
Network:           RoCEv2, 200 Gb/s (NVIDIA SN3700)

Benefits:
  + Dataset stored once, accessed by any compute node
  + Storage fails independently from compute (no coupled failure)
  + Add more storage without adding compute (and vice versa)
  + Higher storage utilization: 90%+ for a shared pool vs roughly 40% with per-node local drives (illustrative figures)
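
The capacity argument above can be made concrete with a small calculation. This is a minimal sketch: the function name `raw_capacity_tb` and the 50 TB dataset size are assumptions chosen for illustration, and the three nodes match the diagram above.

```python
def raw_capacity_tb(dataset_tb: float, num_nodes: int, shared: bool) -> float:
    """Raw capacity needed so that every node can read the full dataset."""
    # Local model: each node keeps its own full copy.
    # Shared model: one copy in the pool, reachable from every node.
    return dataset_tb if shared else dataset_tb * num_nodes


local = raw_capacity_tb(dataset_tb=50, num_nodes=3, shared=False)
pooled = raw_capacity_tb(dataset_tb=50, num_nodes=3, shared=True)
print(f"local replicas: {local:.0f} TB, shared pool: {pooled:.0f} TB")
```

The gap grows linearly with node count, which is why replication becomes the dominant cost on larger clusters.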

2. NVMe-oF Protocol Stack

NVMe over Fabrics (NVMe-oF) extends the NVMe protocol over a network fabric:

Standard NVMe (local):          NVMe-oF (network):

Application                     Application
    │                               │
    │ read(fd, buf, size)           │ read(fd, buf, size)   ← IDENTICAL API
    ▼                               ▼
VFS / Block Layer               VFS / Block Layer
    │                               │
NVMe Driver                     NVMe-oF Host Driver
    │                               │
PCIe Bus                        RDMA transport (RoCEv2 / iWARP / InfiniBand)
    │                               │
NVMe SSD                        Network Switch
                                    │
                                NVMe-oF Target (OpenFlex)
                                    │
                                NVMe SSDs

The GDS driver sees NVMe-oF namespaces as identical to local NVMe — same GDS code works for both.
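
To illustrate the "identical API" point from the application side, here is a minimal sketch using plain POSIX reads rather than GDS. The helper `read_at` is a hypothetical name; the same call would apply to a local device such as `/dev/nvme0n1` or to an NVMe-oF namespace, because both appear as ordinary block devices. A temporary file stands in for the device so the example runs without special hardware.

```python
import os
import tempfile


def read_at(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` from a file or block device."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


# Stand-in for a block device: an 8 KiB temporary file.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"A" * 4096 + b"B" * 4096)
    tmp.flush()
    second_block = read_at(tmp.name, offset=4096, length=4096)
    print(len(second_block), second_block[:1])
```

Nothing in `read_at` knows whether the bytes arrive over PCIe or over RDMA; that decision lives entirely below the block layer.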

3. RapidFlex Adapters: Making Remote Look Local

The WD RapidFlex AIC (Add-In Card) is installed in the compute server and handles the NVMe-oF protocol:

Inside the compute server:

PCIe Bus
├── GPU (x16)
├── NIC / ConnectX-7 (x16)
└── RapidFlex AIC (x8)
    │  Presents remote storage as local NVMe namespaces
    │  /dev/nvme0n1  →  actually OpenFlex drive 0 (over RDMA)
    │  /dev/nvme1n1  →  actually OpenFlex drive 1 (over RDMA)
    └── GDS sees these as standard NVMe devices

From GDS's perspective: remote NVMe-oF storage looks exactly like local NVMe.

This means:
  - Zero application code changes to use disaggregated storage
  - Same cuFileRead() call works for local and remote NVMe
  - Driver handles the NVMe-oF transport transparently

4. OpenFlex Data24 3200: Reference Storage System

WD OpenFlex Data24 3200 Series specifications:

Internal:
  24 × U.2 NVMe SSD slots (PCIe Gen3)
  Internal bandwidth: limited by PCIe Gen3 backplane

Frontend (network-facing):
  6 × 100 Gb/s Ethernet ports (via RapidFlex AIC frontend cards)
  Each port: 12.5 GB/s
  Total aggregate: 75 GB/s  (6 × 12.5 GB/s)
  Protocol: NVMe-oF over RoCEv2

Capacity:
  Scales with NVMe drive selection (e.g., 24 × 15.36 TB = 368 TB raw)
  After redundancy (RAID-like protection): ~300 TB usable
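
The frontend and capacity figures above follow from simple unit arithmetic, sketched below. The helper name `gbit_to_gbyte` is an assumption for illustration. The results are line-rate numbers; achievable throughput will be somewhat lower once Ethernet and RDMA protocol overhead are included.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link speed in Gb/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8


PORTS = 6            # frontend ports per OpenFlex Data24
PORT_GBIT = 100      # Gb/s per port
DRIVES = 24          # U.2 slots
DRIVE_TB = 15.36     # example drive size from the text

port_gbyte = gbit_to_gbyte(PORT_GBIT)
aggregate_gbyte = PORTS * port_gbyte
raw_tb = DRIVES * DRIVE_TB

print(f"per port:  {port_gbyte:.1f} GB/s")
print(f"aggregate: {aggregate_gbyte:.1f} GB/s")
print(f"raw:       {raw_tb:.1f} TB")
```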

The 75 GB/s Ceiling

The OpenFlex's 6 × 100 Gb/s frontend is the key constraint:

4 compute nodes × 4 GPUs each = 16 GPUs total
OpenFlex: 75 GB/s total

Per-GPU bandwidth from OpenFlex: 75 / 16 = 4.7 GB/s
  ← sufficient for most training workloads (dataset read is not usually the bottleneck)

For I/O-bound workloads, where dataset reads ARE the bottleneck:
  → Add more OpenFlex units (linear scaling: 2× units = 2× bandwidth)
  → 3 × OpenFlex = 225 GB/s = 14 GB/s per GPU (over half the 25 GB/s line rate of one 200 Gb/s NIC)
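
The per-GPU share can be tabulated for different unit counts. This is a minimal sketch assuming bandwidth is divided evenly across GPUs and scales linearly with units, as the text states; `per_gpu_bandwidth` is a hypothetical helper name.

```python
def per_gpu_bandwidth(units: int, num_gpus: int, unit_bw: float = 75.0) -> float:
    """Even share of aggregate OpenFlex frontend bandwidth per GPU, in GB/s."""
    if units < 1 or num_gpus < 1:
        raise ValueError("units and num_gpus must be positive")
    return units * unit_bw / num_gpus


# 4 compute nodes x 4 GPUs each, as in the example above.
for units in (1, 2, 3):
    share = per_gpu_bandwidth(units, num_gpus=16)
    print(f"{units} unit(s): {share:.1f} GB/s per GPU")
```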

5. Network Configuration: Lossless RoCEv2

RoCEv2 (RDMA over Converged Ethernet v2) requires a lossless network — any dropped packet causes RDMA retransmission which degrades performance severely.

NVIDIA SN3700 Switch Configuration

# On the Spectrum-2 switch, configure lossless settings:

# 1. Enable Priority Flow Control (PFC) for RDMA priority
# PFC pauses specific traffic classes instead of dropping packets

# 2. Configure ECN (Explicit Congestion Notification)
# Early warning to senders before buffers overflow

# 3. Set DSCP markings for RDMA traffic
# Traffic class 3 (DSCP 26) for RDMA on most deployments

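# Using the Onyx (MLNX-OS) CLI on the SN3700; exact syntax varies by switch OS and release:
dcb priority-flow-control enable force
dcb priority-flow-control priority 3 enable
interface ethernet 1/1 dcb priority-flow-control mode on force
interface ethernet 1/1 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500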
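
Some tools take the full IP ToS / Traffic Class byte rather than the DSCP value, so it helps to know how DSCP 26 maps onto it. The sketch below shows the standard bit layout (DSCP in the upper six bits, ECN in the lower two); the function name `dscp_to_tos` is an assumption for illustration.

```python
def dscp_to_tos(dscp: int, ecn: int = 0) -> int:
    """Pack a 6-bit DSCP and a 2-bit ECN field into the 8-bit ToS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must be in 0..3")
    return (dscp << 2) | ecn


ECT0 = 0b10  # ECN-capable transport codepoint ECT(0)

print(dscp_to_tos(26))        # ToS byte for DSCP 26 with no ECN bits set
print(dscp_to_tos(26, ECT0))  # same DSCP, marked ECN-capable
```

The two values printed are 104 and 106; which one a given tool expects depends on whether it sets the ECN bits itself.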

Server-Side RoCEv2 Configuration

# On each compute server (ConnectX-7):

# Step 1: Set trust mode to DSCP (match switch DSCP markings)
mlnx_qos -i ens1f0 --trust=dscp

# Step 2: Enable PFC for priority 3 (RDMA)
mlnx_qos -i ens1f0 --pfc 0,0,0,1,0,0,0,0   # one flag per priority 0-7; enables PFC on priority 3

# Step 3: Set DSCP→priority mapping
mlnx_qos -i ens1f0 --dscp2prio set,26,3    # DSCP 26 → priority 3

# Step 4: Verify RoCEv2 mode
cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/3
# Expected: RoCE v2

# Step 5: Set the default RoCE mode for RDMA-CM connections to RoCEv2
cma_roce_mode -d mlx5_0 -p 1 -m 2        # mode 2 = RoCEv2

# Test RDMA connectivity
ib_write_bw -d mlx5_0 --report_gbits                       # on the storage-side host (server): no address, waits for a client
ib_write_bw -d mlx5_0 --report_gbits storage-server-ip     # on the compute node (client)
# Expected: ~190 Gb/s (95% of a 200 Gb/s link)
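
Step 4 checks a single GID index by hand. A small script can scan every index instead. This is a minimal sketch assuming the sysfs layout shown in Step 4 (one file per GID index under `gid_attrs/types`); the function name `rocev2_gid_indices` is hypothetical, and read errors are skipped because unpopulated GID slots may not be readable.

```python
from pathlib import Path


def rocev2_gid_indices(types_dir: str) -> list[int]:
    """Return the GID indices whose type file reads 'RoCE v2'."""
    indices = []
    for entry in Path(types_dir).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            kind = entry.read_text().strip()
        except OSError:
            continue  # unpopulated GID slots can fail to read
        if kind == "RoCE v2":
            indices.append(int(entry.name))
    return sorted(indices)


# Usage on a real host (path taken from Step 4 above):
# rocev2_gid_indices("/sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types")
```

An empty result would suggest the port has no RoCEv2 GIDs yet, for example because the interface has no IP address assigned.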

6. Connecting to OpenFlex via nvme-cli

# Step 1: Discover NVMe-oF targets on the OpenFlex
# --traddr:  OpenFlex frontend IP
# --trsvcid: standard NVMe-oF RDMA port (4420)
nvme discover \
    --transport rdma \
    --traddr 192.168.100.10 \
    --trsvcid 4420

# Expected output:
# Discovery Log Number of Records 8, Generation counter 1
# =====Discovery Log Entry 0======
# trtype: rdma
# adrfam: ipv4
# subtype: nvme subsystem
# treq: not specified
# portid: 0
# trsvcid: 4420
# subnqn: nqn.2018-01.org.nvmexpress:wd:data24:ns0
# traddr: 192.168.100.10

# Step 2: Connect to all discovered subsystems
nvme connect-all \
    --transport rdma \
    --traddr 192.168.100.10 \
    --trsvcid 4420

# Step 3: Verify connected namespaces
nvme list
# /dev/nvme4n1   WD OpenFlex Data24  NVMeoF   500GB
# /dev/nvme5n1   WD OpenFlex Data24  NVMeoF   500GB
# ...

# Step 4: Verify GDS sees the NVMe-oF devices
/usr/local/cuda/gds/tools/gdscheck -p
# Should show NVMe-oF devices in the GDS-compatible device list

# Step 5: Create a filesystem and mount it
# ext4 is not a cluster filesystem: mount a given namespace read-write on one host only
mkfs.ext4 /dev/nvme4n1
mkdir -p /mnt/openflex0
mount -o noatime /dev/nvme4n1 /mnt/openflex0

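# Step 6: Test GDS bandwidth over NVMe-oF with gdsio
# -d GPU index, -w worker threads, -s file size, -i I/O size,
# -x 0 = GPUDirect Storage path, -I 1 = sequential write, -I 0 = sequential read
/usr/local/cuda/gds/tools/gdsio -f /mnt/openflex0/test.bin -d 0 -w 4 -s 16G -i 1M -x 0 -I 1   # create the 16 GB test file
/usr/local/cuda/gds/tools/gdsio -f /mnt/openflex0/test.bin -d 0 -w 4 -s 16G -i 1M -x 0 -I 0   # read it back
# Upper bound: 25 GB/s per 200 Gb/s ConnectX-7, or 12.5 GB/s if all traffic uses a single 100 Gb/s OpenFlex port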
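
When scripting the connection steps, the discovery output from Step 1 can be turned into structured records. The parser below is a minimal sketch that assumes the plain-text layout shown in the expected output above; the function name `parse_discovery_log` is hypothetical, and the text format can differ between nvme-cli versions, so a machine-readable output format is preferable where your release offers one.

```python
def parse_discovery_log(text: str) -> list[dict[str, str]]:
    """Split nvme discover text output into one dict per log entry."""
    entries: list[dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("=====Discovery Log Entry"):
            current = {}
            entries.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only: NQNs contain colons themselves.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return entries


sample = """\
Discovery Log Number of Records 8, Generation counter 1
=====Discovery Log Entry 0======
trtype: rdma
adrfam: ipv4
subtype: nvme subsystem
trsvcid: 4420
subnqn: nqn.2018-01.org.nvmexpress:wd:data24:ns0
traddr: 192.168.100.10
"""

entries = parse_discovery_log(sample)
for e in entries:
    print(e["subnqn"], e["traddr"], e["trsvcid"])
```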

7. Linear Scale-Out

The key advantage of disaggregated storage: linear performance and capacity scaling.

1 × OpenFlex Data24:
  Frontend: 75 GB/s
  Capacity: ~300 TB usable
  Serves: up to 12 GPUs at 6 GB/s each

2 × OpenFlex Data24:
  Frontend: 150 GB/s (2 × 75)
  Capacity: ~600 TB usable
  Serves: up to 24 GPUs at 6 GB/s each

N × OpenFlex Data24:
  Frontend: N × 75 GB/s
  Capacity: N × 300 TB
  No central bottleneck — each unit has independent NIC ports

Compare with SAN/NAS:
  Traditional NAS: central controller = single point of failure + bottleneck
  OpenFlex: no central controller, each unit is independent

Storage-Compute Ratio Planning

import math

def plan_gds_storage(
    num_gpus: int,
    per_gpu_io_bw_gbps: float,   # GB/s (gigabytes, not gigabits) needed per GPU during training
    dataset_tb: float,            # total dataset size in TB
    replication_factor: float = 1.5  # some redundancy
) -> int:
    total_io_bw = num_gpus * per_gpu_io_bw_gbps
    openflex_per_unit_bw = 75   # GB/s
    openflex_per_unit_cap = 300  # TB usable

    units_for_bw = total_io_bw / openflex_per_unit_bw
    units_for_cap = (dataset_tb * replication_factor) / openflex_per_unit_cap
    # Round up to whole units. math.ceil (unlike int(x) + 1) does not add a
    # spare unit when the requirement is already an exact multiple.
    units_needed = math.ceil(max(units_for_bw, units_for_cap))

    print(f"GPUs: {num_gpus}")
    print(f"Required I/O bandwidth: {total_io_bw:.0f} GB/s")
    print(f"Units for bandwidth: {units_for_bw:.1f}")
    print(f"Units for capacity ({dataset_tb} TB): {units_for_cap:.1f}")
    print(f"OpenFlex units needed: {units_needed}")
    return units_needed

# Example: 32 GPUs, each needs 10 GB/s, 500 TB dataset
plan_gds_storage(num_gpus=32, per_gpu_io_bw_gbps=10, dataset_tb=500)
# GPUs: 32
# Required I/O bandwidth: 320 GB/s
# Units for bandwidth: 4.3
# Units for capacity (500 TB): 2.5
# OpenFlex units needed: 5
# (bandwidth-limited: 4.3 rounds up to 5; capacity alone would need 3)

8. End-to-End GDS + OpenFlex Architecture

Complete reference architecture (WD technical brief):

┌─────────────────────────────────────────────────────────────────┐
│                      Compute Cluster                            │
│                                                                 │
│  Node 0                           Node 1                        │
│  ┌──────────────────────┐         ┌──────────────────────┐      │
│  │ 4 × A100 GPU         │         │ 4 × A100 GPU         │      │
│  │ 6 × ConnectX-7       │         │ 6 × ConnectX-7       │      │
│  │   (3 per NUMA node)  │         │   (3 per NUMA node)  │      │
│  │ Local NVMe (8-bay)   │         │ Local NVMe (8-bay)   │      │
│  │   → hot tier cache   │         │   → hot tier cache   │      │
│  └──────────┬───────────┘         └──────────┬───────────┘      │
└─────────────┼───────────────────────────────┼──────────────────┘
              │  200 Gb/s RoCEv2               │
              │  (24 × DAC cables)             │
              ▼                                ▼
┌─────────────────────────────────────────────────────────────────┐
│           NVIDIA SN3700 Spectrum-2 Switch                       │
│           32 × 200 Gb/s, lossless RoCEv2                        │
└─────────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────┐
│           WD OpenFlex Data24 3200                               │
│                                                                 │
│  ┌─────────────────────────────────────────────┐               │
│  │ RapidFlex AIC (6 × 100 Gb/s Ethernet)       │               │
│  │ → Exposes NVMe namespaces via NVMe-oF/RoCEv2 │               │
│  └─────────────────────────────────────────────┘               │
│                                                                 │
│  24 × U.2 NVMe SSD (PCIe Gen3 internal)                         │
│  Max throughput: 75 GB/s total                                  │
└─────────────────────────────────────────────────────────────────┘

Data flow (GDS active):
  Training data on OpenFlex → RoCEv2 → ConnectX-7 → GPU HBM
  No CPU bounce buffer in the data path (the CPU still issues the I/O requests)
  Host sees OpenFlex namespaces as local /dev/nvmeXnY via NVMe-oF
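
To see which stage of the diagram limits throughput, the line-rate figures can be compared directly. This is a minimal sketch using only numbers from the architecture above (two nodes with six 200 Gb/s NICs each, a 32-port 200 Gb/s switch, one OpenFlex with six 100 Gb/s ports); the `bottleneck` helper and the stage labels are assumptions for illustration, and real throughput will sit below line rate.

```python
def bottleneck(stages: dict[str, float]) -> tuple[str, float]:
    """Return the (name, GB/s) of the slowest stage in the data path."""
    name = min(stages, key=stages.get)
    return name, stages[name]


GBIT_PER_GBYTE = 8

stages = {
    "compute NICs (2 nodes x 6 x 200 Gb/s)": 2 * 6 * 200 / GBIT_PER_GBYTE,
    "SN3700 switch (32 x 200 Gb/s)": 32 * 200 / GBIT_PER_GBYTE,
    "OpenFlex frontend (6 x 100 Gb/s)": 6 * 100 / GBIT_PER_GBYTE,
}

name, bw = bottleneck(stages)
print(f"bottleneck: {name} at {bw:.0f} GB/s")
```

With a single OpenFlex unit the storage frontend is the limiting stage, which matches the 75 GB/s ceiling discussed in section 4.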