Skip to content

AI Hardware Engineer Roadmap

05 Performance Tuning

ai-hpc/ai-hardware-engineer-roadmap

05 — GDS Performance Tuning¶

1. The GDS Performance Equation¶

GDS Throughput = min(NVMe_BW, PCIe_BW, NIC_BW, GPU_HBM_BW)

For the reference system:
  NVMe (local, 8 drives × 7 GB/s):     56 GB/s
  PCIe Gen4 (per x16 slot):            32 GB/s
  NIC (3 × ConnectX-7 @ 25 GB/s):     75 GB/s  (NVMe-oF path)
  GPU HBM (A100):                    2000 GB/s  (never the bottleneck for I/O)
  OpenFlex storage side:               75 GB/s  (6 × 100 Gb/s)

Bottleneck for local NVMe:   PCIe bandwidth per GPU (32 GB/s)
Bottleneck for NVMe-oF:      OpenFlex storage output (75 GB/s for all GPUs)
Per-GPU NVMe-oF practical:   ~25 GB/s (75/3 with 3 NICs per NUMA node)

2. Alignment Tuning¶

Alignment is the single most impactful low-level tuning parameter:

GDS minimum alignment:    512 bytes (sector size)
Optimal alignment:        4096 bytes (filesystem block size)
Large transfer alignment: 1 MB (maximizes DMA efficiency)

Penalty for misalignment:
  512-byte aligned read of 1 GB:   6.8 GB/s
  Non-aligned read of 1 GB:        GDS FALLS BACK to compat mode
                                   → 2.1 GB/s (CPU-mediated)
  → 3.2× performance drop for alignment violations

// Always allocate GPU buffers with 512-byte alignment
void* allocate_aligned_gpu_buffer(size_t size) {
    // Round size up to 512-byte boundary
    size_t aligned_size = (size + 511) & ~511ULL;

    // Standard cudaMalloc guarantees 256-byte alignment on most GPUs
    // For 512-byte guarantee, use cuMemAlloc:
    CUdeviceptr d_ptr;
    CUresult result = cuMemAlloc(&d_ptr, aligned_size);
    assert(result == CUDA_SUCCESS);
    assert((d_ptr & 511) == 0);   // verify alignment
    return (void*)d_ptr;
}

// For numpy/Python interop, use 4096-byte alignment:
import numpy as np
arr = np.empty(N, dtype=np.float32)
arr = np.require(arr, requirements=['C_CONTIGUOUS', 'ALIGNED'])
# Or:
arr = np.empty(N + 4096//4, dtype=np.float32)
# Manually align: offset = (-arr.ctypes.data % 4096) // 4

3. Transfer Size Optimization¶

GDS throughput vs transfer size (local NVMe, reference system):

  4 KB:     0.8 GB/s   ← tiny transfers, DMA setup overhead dominates
  64 KB:    2.1 GB/s
  256 KB:   4.5 GB/s
  1 MB:     6.0 GB/s
  4 MB:     6.6 GB/s
  16 MB:    6.8 GB/s  ← near peak (PCIe Gen4 x4)
  64 MB:    6.8 GB/s  ← peak (no further gain)

Recommendation: use ≥ 1 MB per GDS transfer
                For large files: split into 4-16 MB chunks for batch I/O

OPTIMAL_CHUNK_SIZE = 4 * 1024 * 1024   # 4 MB

def chunked_gds_read(filepath: str, total_size: int, gpu_buf) -> None:
    with cufile.open(filepath, "r") as f:
        offset = 0
        while offset < total_size:
            chunk = min(OPTIMAL_CHUNK_SIZE, total_size - offset)
            f.read(gpu_buf, size=chunk, file_offset=offset, buf_offset=offset)
            offset += chunk

4. Queue Depth and Parallelism¶

NVMe drives have internal queue depth — sending multiple requests simultaneously keeps the drive busy:

NVMe queue depth: up to 64K per namespace, typically use 128
GDS optimal queue depth: 32-128 concurrent operations

Without queue depth (sequential):
  [Submit read 0][Wait complete][Submit read 1][Wait complete]...
  Drive utilization: ~40% (idle between submissions)
  Throughput: 3.5 GB/s

With queue depth 32 (batch I/O):
  [Submit 32 reads simultaneously]
  Drive utilization: ~90%
  Throughput: 6.5 GB/s

// /etc/cufile.json — tune queue depth
{
  "execution": {
    "max_io_queue_depth": 128,     // max concurrent GDS ops per file handle
    "num_io_threads": 8,           // I/O worker threads in daemon
    "max_batch_io_timeout_msecs": 5
  }
}

// Set queue depth per file handle
CUfileDescr_t cf_descr = {};
cf_descr.handle.fd = fd;
cf_descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
// Queue depth is managed by the cuFile daemon based on cufile.json

5. Multi-Stream I/O Overlap¶

Use separate CUDA streams to pipeline I/O with compute:

import torch
import cufile

def train_with_gds_overlap(model, data_dir, num_steps, device="cuda:0"):
    """
    Double-buffered training:
    - While GPU processes batch N (compute stream)
    - GDS loads batch N+1 (io stream)
    Both run in parallel → I/O latency hidden by compute
    """
    files = sorted(glob.glob(f"{data_dir}/*.bin"))

    # Two GPU buffers: ping-pong
    buffers = [
        torch.empty(BATCH_SHAPE, dtype=torch.bfloat16, device=device),
        torch.empty(BATCH_SHAPE, dtype=torch.bfloat16, device=device),
    ]

    # Pre-load first file
    with cufile.open(files[0], "r") as f:
        f.read(buffers[0])

    compute_stream = torch.cuda.Stream(device=device)
    # cufile reads are managed by GDS daemon (effectively async to GPU compute)

    for step in range(num_steps):
        cur = step % 2
        nxt = (step + 1) % 2

        # Launch compute on current buffer (non-blocking)
        with torch.cuda.stream(compute_stream):
            loss = model(buffers[cur])
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        # Start loading NEXT batch while compute runs
        if step + 1 < num_steps:
            # GDS read (runs in background via daemon)
            with cufile.open(files[(step + 1) % len(files)], "r") as f:
                f.read(buffers[nxt])

        # Wait for compute to finish before next iteration
        torch.cuda.current_stream().wait_stream(compute_stream)

        if step % 10 == 0:
            print(f"Step {step}: loss={loss.item():.4f}")

6. Buffer Registration Caching¶

cuFileBufRegister() is expensive (~1 ms). For repeated reads, register once:

class GDSBufferPool {
    struct RegisteredBuffer {
        void*  ptr;
        size_t size;
        bool   in_use;
    };

    std::vector<RegisteredBuffer> pool;
    size_t buffer_size;

public:
    GDSBufferPool(int num_buffers, size_t buf_size) : buffer_size(buf_size) {
        for (int i = 0; i < num_buffers; i++) {
            void* ptr;
            cudaMalloc(&ptr, buf_size);
            cuFileBufRegister(ptr, buf_size, 0);   // register once at startup
            pool.push_back({ptr, buf_size, false});
        }
    }

    void* acquire() {
        for (auto& buf : pool) {
            if (!buf.in_use) { buf.in_use = true; return buf.ptr; }
        }
        return nullptr;  // pool exhausted
    }

    void release(void* ptr) {
        for (auto& buf : pool) {
            if (buf.ptr == ptr) { buf.in_use = false; return; }
        }
    }

    ~GDSBufferPool() {
        for (auto& buf : pool) {
            cuFileBufDeregister(buf.ptr);
            cudaFree(buf.ptr);
        }
    }
};

// One-time setup at process start
GDSBufferPool pool(16, 64 * 1024 * 1024);  // 16 × 64 MB buffers, pre-registered

// In training loop: acquire → read → process → release (no registration cost)
void* buf = pool.acquire();
cuFileRead(fh, buf, size, offset, 0);
processKernel<<<grid, block>>>(buf);
pool.release(buf);

7. Benchmarking GDS Performance¶

Using gds_bandwidth (Built-in)¶

# Sequential read benchmark
/usr/local/cuda/gds/tools/gds_bandwidth \
    --file=/mnt/nvme0/test_file.bin \
    --size=4096M \                        # 4 GB test file
    --gpu_id=0 \
    --num_threads=1 \
    --pattern=sequential

# Random read benchmark (important for datasets with shuffling)
/usr/local/cuda/gds/tools/gds_bandwidth \
    --file=/mnt/nvme0/test_file.bin \
    --size=4096M \
    --gpu_id=0 \
    --pattern=random \
    --block_size=65536                    # 64 KB random blocks

# Expected sequential read (reference system, GDS active):
# GDS Read:    6.5–6.8 GB/s  (local NVMe, PCIe Gen4 x4)
# Compat Read: 1.8–2.1 GB/s  (CPU path)
# GDS 3.2× faster

# Expected NVMe-oF read (via ConnectX-7, OpenFlex):
# GDS RDMA:   22–25 GB/s  (per GPU, limited by NIC)

Custom Benchmark¶

import time
import torch
import cufile

def benchmark_gds(filepath: str, size_mb: int, gpu_id: int, n_iter: int = 20):
    size = size_mb * 1024 * 1024
    buf = torch.empty(size // 4, dtype=torch.float32, device=f"cuda:{gpu_id}")

    # Warmup
    with cufile.open(filepath, "r") as f:
        for _ in range(3):
            f.read(buf)
    torch.cuda.synchronize()

    # Benchmark
    t0 = time.perf_counter()
    with cufile.open(filepath, "r") as f:
        for _ in range(n_iter):
            f.read(buf)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    bw = size * n_iter / elapsed / 1e9
    print(f"GDS Read Bandwidth: {bw:.2f} GB/s ({size_mb} MB × {n_iter} iterations)")
    return bw

benchmark_gds("/mnt/nvme0/test.bin", size_mb=1024, gpu_id=0)

8. Performance Targets (Reference System)¶

Configuration	Path	Expected Throughput	Bottleneck
1 GPU, 1 local NVMe (x4)	GDS direct	6.5–6.8 GB/s	PCIe Gen4 x4
1 GPU, 4 local NVMe (x4) striped	GDS direct	20–25 GB/s	PCIe switch bandwidth
1 GPU, NVMe-oF via CX-7	GDS RDMA	22–25 GB/s	NIC 25 GB/s
4 GPUs, OpenFlex 75 GB/s	GDS RDMA	~18 GB/s/GPU	OpenFlex 75/4
1 GPU, CPU path (no GDS)	compat mode	1.8–2.1 GB/s	CPU DRAM BW

References¶