NVIDIA CUDA-X Libraries¶
Parent: Phase 5 — High Performance Computing
Timeline: Ongoing reference — study each category as your projects demand it.
Prerequisites: Phase 1 §4 (C++/CUDA), Phase 4 Track B (Jetson, CUDA runtime), Phase 4 Track C Part 2 (DL inference optimization).
What CUDA-X is¶
CUDA-X is NVIDIA's full suite of GPU-accelerated libraries built on top of the CUDA runtime. While CUDA gives you the programming model (kernels, streams, memory), CUDA-X gives you production-grade implementations of the algorithms you'd otherwise spend months writing: linear algebra, FFTs, neural network primitives, graph analytics, video codecs, multi-GPU communication, and more.
Every layer of the AI chip stack consumes CUDA-X libraries:

- L1 (Application): cuDNN, TensorRT, DALI, CV-CUDA, DeepStream
- L2 (Compiler): CUTLASS, FlashInfer, cuDNN primitives as codegen targets
- L3 (Runtime): cuBLAS, cuFFT, NCCL, NVSHMEM, GPUDirect Storage
- L1–L3 (Data): cuDF, cuML, cuGraph, cuVS, NeMo Curator
Understanding what each library does — and when to use it vs. writing a custom kernel — is essential for any GPU infrastructure role.
CUDA Math Libraries¶
The foundation for compute-intensive workloads: molecular dynamics, CFD, medical imaging, seismic exploration, and the matrix math inside every neural network.
| Library | What it does | Key APIs / Concepts | When to use it |
|---|---|---|---|
| cuBLAS | GPU-accelerated BLAS (Basic Linear Algebra Subprograms) | GEMM (cublasSgemm, cublasGemmEx), batched GEMM, mixed precision (FP16/BF16/FP8 accumulating to FP32) | Any dense matrix multiply — the single most important library for AI inference and training |
| cuFFT | Fast Fourier Transform on GPU | 1D/2D/3D FFTs, real-to-complex, batched FFTs, multi-GPU FFT | Signal processing, audio, spectral analysis, conv via FFT |
| cuRAND | Random number generation on GPU | Pseudorandom (XORWOW, MRG32k3a, Philox) and quasirandom (Sobol) generators; host and device API | Monte Carlo simulations, dropout in training, data augmentation |
| cuSOLVER | Dense and sparse direct solvers | LU, Cholesky, QR, SVD, eigenvalue decomposition; sparse Cholesky | Linear systems, least squares, PCA, matrix factorization |
| cuSPARSE | Sparse matrix operations | SpMV, SpMM, sparse triangular solve; CSR, CSC, COO formats | GNNs, sparse attention, scientific computing with sparse matrices |
| cuTENSOR | Tensor contractions and operations | Tensor contraction, reduction, element-wise ops on multi-dimensional arrays | Quantum chemistry, tensor networks, high-dimensional contractions |
| cuDSS | Direct sparse solver | Sparse LU/Cholesky with GPU acceleration | Large sparse systems in structural analysis, circuit simulation |
| CUDA Math API | Standard math functions | sin, cos, exp, log, sqrt — GPU-optimized, single and double precision | Any kernel that needs standard math — these compile to intrinsics, not library calls |
| AmgX | Algebraic multigrid solver | AMG preconditioner + Krylov solvers for implicit unstructured methods | CFD, structural mechanics, any PDE solver using implicit methods |
| nvmath-python | Python interface to CUDA math | nvmath.fft, nvmath.linalg — NumPy-compatible but GPU-accelerated | Rapid prototyping of GPU math in Python without writing C++ |
What to study first¶
cuBLAS — if you understand GEMM (shapes, leading dimensions, transposition, batching, mixed precision), you understand 80% of AI compute. Start here.
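The GEMM contract is easiest to internalize by modeling it. Below is a minimal NumPy sketch of the cublasSgemm semantics (C = alpha * op(A) @ op(B) + beta * C, where op() optionally transposes); note that cuBLAS itself is column-major and takes leading-dimension arguments (lda/ldb/ldc), which this CPU model deliberately omits:

```python
import numpy as np

def gemm(trans_a, trans_b, alpha, A, B, beta, C):
    """Model the cublasSgemm contract: C = alpha * op(A) @ op(B) + beta * C.
    CPU sketch only: real cuBLAS is column-major, which is why frameworks
    bridging from row-major code often call it with swapped operands."""
    opA = A.T if trans_a else A
    opB = B.T if trans_b else B
    return alpha * (opA @ opB) + beta * C

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 64)).astype(np.float32)
B = rng.standard_normal((256, 64)).astype(np.float32)   # stored transposed
C = rng.standard_normal((128, 256)).astype(np.float32)

# C = 1.0 * A @ B^T + 0.5 * C  — the op(B) = transpose path (CUBLAS_OP_T)
out = gemm(False, True, 1.0, A, B, 0.5, C)
print(out.shape)   # (128, 256)
```

Once the shape and transpose bookkeeping is second nature here, the cuBLAS call itself is just this contract plus memory layout and precision flags.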
Projects¶
- cuBLAS GEMM benchmark — Benchmark cublasGemmEx for FP32, FP16, and INT8 on your GPU. Plot TFLOPS vs matrix size. Find the crossover where the GPU beats CPU BLAS.
- cuFFT spectral analysis — Apply a 2D FFT to an image, filter out high frequencies, and inverse-FFT. Compare GPU vs CPU time for 4K images.
- cuSOLVER SVD — Compute SVD of a large matrix on GPU. Use it for image compression (keep top-k singular values). Benchmark vs NumPy.
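For the cuSOLVER project, the math can be prototyped on CPU first. A NumPy sketch of rank-k compression via truncated SVD (the GPU version would go through cuSOLVER's dense SVD routines; this is just the reference computation to validate against):

```python
import numpy as np

def svd_compress(img, k):
    """Rank-k approximation of a matrix: keep the top-k singular values.
    CPU reference for the cuSOLVER SVD project."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
img = rng.standard_normal((256, 256)).astype(np.float32)
for k in (8, 32, 128):
    approx = svd_compress(img, k)
    err = np.linalg.norm(img - approx) / np.linalg.norm(img)
    print(f"k={k:4d}  relative error {err:.3f}")   # error shrinks as k grows
```

On a real image the error drops much faster with k than on this random matrix, which is exactly why top-k SVD works as a compression scheme.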
Scientific Computing Libraries¶
For scientific workloads on GPU: geometry-aware neural networks that respect mathematical symmetries (molecular structures, proteins, materials science), computational lithography, and electronic-structure codes.
| Library | What it does | Domain |
|---|---|---|
| cuEquivariance | Accelerate geometry-aware neural networks (rotation/translation equivariance in 3D) | Molecular dynamics, protein folding, materials discovery |
| NVIDIA ALCHEMI | NIM microservices for chemical and materials discovery | Drug discovery, battery materials, catalysts |
| cuLitho | Computational lithography acceleration | Semiconductor manufacturing (L7/L8 connection) |
| cuEST | Electronic structure calculations on GPU | Quantum chemistry, DFT |
Physics Libraries¶
GPU-accelerated physics simulation — relevant for robotics simulation, autonomous vehicle testing, and digital twins.
| Library | What it does | Use case |
|---|---|---|
| NVIDIA Warp | Python framework for GPU physics kernels | Simulation AI, robotics, differentiable physics |
| NVIDIA PhysicsNeMo | Training and fine-tuning physics AI models | Weather prediction, CFD surrogate models |
| NVIDIA Earth-2 | Weather and climate AI models | Climate modeling, weather forecasting |
Quantum Computing Libraries¶
| Library | What it does |
|---|---|
| cuQuantum | Accelerate quantum circuit simulation on GPUs |
| cuPQC | Post-quantum cryptography acceleration |
| CUDA-Q QEC | Quantum error correction simulation |
| CUDA-Q Solvers | Hybrid quantum-classical optimization |
Deep Learning Core Libraries¶
The libraries that directly implement neural network inference and training. These are the most critical for AI hardware engineers — they are the primary consumers of GPU compute.
| Library | What it does | Layer | Key concepts |
|---|---|---|---|
| cuDNN | Neural network primitives: conv, attention, normalization, pooling, activation | L1/L2 | cudnnConvolutionForward, cudnnMultiHeadAttnForward, graph API for fusion; auto-tuning selects fastest algorithm per hardware |
| TensorRT | Inference optimizer + runtime: graph optimization, layer fusion, precision calibration, engine serialization | L2/L3 | Build phase (optimize graph) → serialize → deploy phase (execute). INT8/FP8 calibration, dynamic shapes, DLA offload |
| TensorRT-LLM | LLM-specific inference engine: in-flight batching, paged KV-cache, tensor/pipeline parallelism, speculative decoding, FP8/INT4 | L2/L3 | Compiles LLMs to optimized TRT engines with custom Hopper/Blackwell kernels. Python API for build + deploy. Integrates with Triton for production serving. See Phase 4C Part 2 Unit 05 for deep dive |
| CUTLASS | C++ template library for custom GEMM, conv, and attention kernels targeting Tensor Cores | L2/L3 | CuTe layout DSL, tile-based programming, epilogue fusion, warp-level MMA. The reference for how production GEMM kernels are structured |
| FlashInfer | Optimized attention and MoE kernels for LLM inference via Python API | L2 | Paged KV-cache attention, variable-length sequences, speculative decoding support |
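The paged KV-cache idea shared by TensorRT-LLM and FlashInfer is worth understanding in isolation: a global pool of fixed-size pages plus a per-sequence page table, so sequences can grow without reallocating contiguous buffers. A toy NumPy model (class and field names here are illustrative, not the real APIs):

```python
import numpy as np

PAGE, HEADS, DIM = 16, 4, 8          # tokens per page, heads, head dim

class PagedKVCache:
    """Toy paged KV-cache: a shared page pool plus a per-sequence page
    table, mirroring the indirection behind TensorRT-LLM / FlashInfer
    paged attention. Illustrative only; real engines do this on-GPU."""
    def __init__(self, num_pages):
        self.k = np.zeros((num_pages, PAGE, HEADS, DIM), np.float32)
        self.v = np.zeros_like(self.k)
        self.free = list(range(num_pages))
        self.tables = {}                 # seq_id -> (page list, token count)

    def append(self, seq, k_tok, v_tok):
        pages, length = self.tables.setdefault(seq, ([], 0))
        if length % PAGE == 0:           # current page full: grab a free one
            pages.append(self.free.pop())
        p, off = pages[-1], length % PAGE
        self.k[p, off], self.v[p, off] = k_tok, v_tok
        self.tables[seq] = (pages, length + 1)

    def gather(self, seq):
        """Materialize one sequence's K/V by following its page table."""
        pages, length = self.tables[seq]
        k = self.k[pages].reshape(-1, HEADS, DIM)[:length]
        v = self.v[pages].reshape(-1, HEADS, DIM)[:length]
        return k, v

cache = PagedKVCache(num_pages=8)
for t in range(20):                      # 20 tokens -> spans 2 pages of 16
    cache.append("seq0", np.full((HEADS, DIM), t, np.float32),
                         np.full((HEADS, DIM), -t, np.float32))
k, v = cache.gather("seq0")
print(k.shape)   # (20, 4, 8)
```

The real kernels never materialize the gather; attention reads through the page table directly, which is what makes variable-length batching cheap.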
What to study first¶
- cuDNN — understand what primitives exist and how frameworks (PyTorch, TensorRT) call them
- TensorRT — end-to-end inference optimization for vision/small models (covered in Phase 4 Track C Part 2 and Track B §8)
- TensorRT-LLM — LLM-specific inference: in-flight batching, paged KV-cache, multi-GPU serving (covered in Phase 4 Track C Part 2 Unit 05)
- CUTLASS — when you need to write or modify GEMM/attention kernels (covered in Phase 4 Track C Part 2, unit 02)
Projects¶
- cuDNN algorithm benchmark — For a specific conv layer (ResNet-50's first conv), enumerate all cuDNN algorithms with cudnnFindConvolutionForwardAlgorithm. Compare time and workspace requirements for each. Understand why auto-tuning matters.
- CUTLASS custom GEMM — Build a CUTLASS GEMM with a fused bias+ReLU epilogue. Benchmark against cuBLAS. Study the CuTe layout to understand tiling.
- FlashInfer attention — Run FlashInfer paged attention on a transformer model. Compare latency vs standard PyTorch attention for 4K, 16K, 64K context lengths.
Parallel Algorithm Libraries (CCCL)¶
CUDA Core Compute Libraries — fundamental building blocks for writing GPU algorithms in C++ and Python.
| Library | What it does | When to use it |
|---|---|---|
| Thrust | C++ STL-like parallel algorithms: sort, reduce, scan, transform | High-level parallel algorithms without writing raw CUDA kernels |
| CUB | Warp-wide, block-wide, and device-wide cooperative primitives | Inside custom CUDA kernels — block reduce, block scan, radix sort |
| cuda.compute | Python interface to CCCL device-level algorithms | GPU-accelerated algorithms from Python |
| cuda.parallel | Standardized sort, scan, reduction primitives | Distributed and local parallel patterns |
What to study first¶
CUB — if you write custom CUDA kernels, you'll use CUB's BlockReduce, BlockScan, and WarpReduce constantly. It's the building block layer.
Projects¶
- Thrust vs custom kernel — Implement parallel prefix sum (scan) using Thrust, CUB, and a hand-written kernel. Benchmark all three. Understand the abstraction cost.
- CUB histogram — Build a GPU histogram using CUB's BlockHistogram. Compare with an atomicAdd-based approach.
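Before touching CUB, it helps to see what a parallel scan does per step. Here is a Python model of the Hillis-Steele inclusive scan, one of the classic strategies behind primitives like BlockScan (CUB selects among several algorithms internally, so treat this as concept, not implementation):

```python
import numpy as np

def inclusive_scan(x):
    """Hillis-Steele inclusive scan: O(n log n) work but only O(log n)
    synchronized rounds. Each loop iteration models one parallel step in
    which every lane i >= shift adds the value from lane i - shift."""
    out = np.asarray(x, dtype=np.int64).copy()
    shift = 1
    while shift < len(out):
        out[shift:] = out[shift:] + out[:-shift]   # one parallel round
        shift *= 2
    return out

print(inclusive_scan(np.arange(1, 9)))   # cumulative sums 1, 3, 6, ..., 36
```

The trade-off this exposes (extra work for fewer synchronization steps) is exactly the kind of decision CUB makes for you per architecture, and it is why hand-rolled scans rarely beat it.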
Data Processing Libraries¶
GPU-accelerated data pipelines — critical for feeding models fast enough that the GPU doesn't starve.
| Library | What it does | Replaces |
|---|---|---|
| cuDF | GPU-accelerated DataFrames | pandas, Polars, Spark (zero code changes) |
| cuVS | GPU vector search (nearest neighbors, CAGRA algorithm) | FAISS, Annoy — for RAG, recommendation, semantic search |
| cuML | GPU-accelerated ML algorithms | scikit-learn, UMAP, HDBSCAN (zero code changes) |
| cuOpt | Decision optimization engine (millions of variables) | OR-Tools, Gurobi for routing/scheduling |
| cuGraph | GPU graph analytics | NetworkX — for knowledge graphs, social networks |
| NeMo Curator | Data pipeline for training/fine-tuning LLMs | Custom data cleaning scripts — text, image, video at scale |
| Morpheus | Cybersecurity AI pipeline | Custom SIEM analysis pipelines |
| nvComp | GPU-accelerated compression/decompression | zstd, lz4 — for training data I/O bottlenecks |
| GPUDirect Storage | Direct path: NVMe → GPU memory, bypassing CPU | Standard read() → cudaMemcpy() — eliminates bounce buffers |
| Dask | Distributed computing framework with RAPIDS GPU support | Single-node pandas/scikit-learn — scales to clusters |
Projects¶
- cuDF vs pandas — Load a 10M-row CSV with pandas and cuDF. Benchmark groupby, join, filter operations. Measure speedup.
- GPUDirect Storage benchmark — Compare model loading time with standard I/O vs GDS direct-to-GPU path. Measure throughput in GB/s.
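A CPU baseline for the cuDF-vs-pandas project is sketched below; with RAPIDS installed, the claim to verify is that the same operations run on GPU via cuDF's pandas accelerator (invoked as `python -m cudf.pandas script.py`) with no other code changes. Row count and column names here are arbitrary choices for illustration:

```python
import time
import numpy as np
import pandas as pd

# Pandas CPU baseline: the three operations the project benchmarks.
# Scale N toward 10M+ rows for a meaningful GPU comparison.
N = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 1_000, N),
    "val": rng.standard_normal(N),
})
dim = pd.DataFrame({"key": np.arange(1_000), "label": np.arange(1_000) % 7})

t0 = time.perf_counter()
agg = df.groupby("key")["val"].mean()      # groupby-aggregate
joined = df.merge(dim, on="key")           # hash join against a dimension table
filtered = df[df["val"] > 0]               # predicate filter
elapsed = time.perf_counter() - t0
print(f"groupby+join+filter: {elapsed:.3f}s, {len(joined)} joined rows")
```

Record the same three timings under cuDF and the speedup is your benchmark; keeping the data generation identical (fixed seed) makes the comparison apples-to-apples.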
Image and Video Libraries¶
Hardware-accelerated encode/decode and vision preprocessing — these feed the inference pipeline.
| Library | What it does | Use case |
|---|---|---|
| nvImageCodec | GPU image encode/decode (JPEG, PNG, WebP) | High-throughput data loading for training |
| NVIDIA DALI | GPU-accelerated data loading and preprocessing for DL | Training pipelines — replaces CPU-bound DataLoader |
| CV-CUDA | GPU pre/post-processing for vision AI pipelines | Production inference: resize, normalize, color convert on GPU |
| cuCIM | Accelerated image processing for biomedical/geospatial | Whole-slide imaging, satellite imagery |
| NPP (Performance Primitives) | 2D image and signal processing primitives | Filtering, color conversion, morphological operations |
| Video Codec SDK | Hardware NVDEC/NVENC encode and decode | Video analytics, transcoding, DeepStream input |
| Optical Flow SDK | Hardware-accelerated optical flow | Motion estimation, video stabilization, frame interpolation |
Projects¶
- DALI training pipeline — Replace PyTorch DataLoader with DALI for ImageNet training. Measure throughput (images/sec) improvement.
- CV-CUDA inference preprocessing — Build a preprocessing pipeline (resize + normalize + color convert) using CV-CUDA. Compare latency vs OpenCV CPU.
- Video decode pipeline — Decode a 4K video stream using NVDEC (Video Codec SDK) → feed frames to a detection model via TensorRT. Measure end-to-end FPS.
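The preprocessing that CV-CUDA moves onto the GPU is conceptually simple; a NumPy CPU sketch of a typical pipeline (BGR-to-RGB color convert, scale, per-channel normalize, HWC-to-CHW layout) is useful as a correctness reference when porting. The mean/std values are the common ImageNet constants; everything else is illustrative:

```python
import numpy as np

def preprocess(frame, mean, std):
    """CPU reference for a vision preprocessing pipeline: BGR->RGB,
    scale to [0,1], per-channel normalize, HWC->CHW for the engine."""
    rgb = frame[..., ::-1].astype(np.float32) / 255.0     # BGR -> RGB
    norm = (rgb - mean) / std                             # per-channel normalize
    return np.ascontiguousarray(norm.transpose(2, 0, 1))  # HWC -> CHW

frame = np.random.default_rng(2).integers(0, 256, (224, 224, 3), dtype=np.uint8)
mean = np.array([0.485, 0.456, 0.406], np.float32)        # ImageNet mean (RGB)
std  = np.array([0.229, 0.224, 0.225], np.float32)        # ImageNet std (RGB)
out = preprocess(frame, mean, std)
print(out.shape, out.dtype)   # (3, 224, 224) float32
```

When you rebuild this with CV-CUDA ops, diffing the GPU output against this reference (within float tolerance) catches layout and color-order bugs immediately.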
Communication Libraries¶
Multi-GPU and multi-node communication — the bottleneck at scale. These libraries determine how fast your cluster can train and serve models.
| Library | What it does | Key operations | When to use it |
|---|---|---|---|
| NCCL | Multi-GPU/multi-node collectives optimized for NVIDIA hardware | AllReduce, AllGather, ReduceScatter, Broadcast, Send/Recv | Distributed training (gradient sync), model parallelism, inference sharding |
| NVSHMEM | Partitioned global address space across GPU memories | nvshmem_put, nvshmem_get — one-sided remote memory access | Fine-grained GPU-to-GPU communication without collective overhead |
| NIXL | Low-latency inference transfer library | KV-cache migration, tensor transfer between GPUs/memory tiers | LLM inference serving — moving KV-cache for disaggregated serving |
What to study first¶
NCCL — if you work on distributed training or multi-GPU inference, NCCL is the library you'll interact with most. Understand AllReduce (data parallelism), AllGather (model parallelism), and how to overlap communication with compute.
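The schedule behind NCCL's ring AllReduce (a reduce-scatter phase followed by an all-gather phase) can be simulated without any GPUs. Below is a Python model under the simplifying assumption of a synchronous ring; real NCCL also uses tree and other algorithms depending on topology and message size:

```python
import numpy as np

def ring_allreduce(values):
    """Simulate ring AllReduce over n ranks: reduce-scatter, then
    all-gather. Each rank's vector is split into n chunks and each rank
    sends 2*(n-1) chunks total — the bandwidth-optimal schedule on a
    ring. Pure-Python model; no GPUs or NCCL involved."""
    n = len(values)
    data = [np.array_split(np.asarray(v), n) for v in values]
    # reduce-scatter: after n-1 steps, rank r owns the full sum of one chunk
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n                 # chunk leaving rank r this step
            dst = (r + 1) % n
            data[dst][c] = data[dst][c] + data[r][c]
    # all-gather: circulate each completed chunk around the ring
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n             # completed chunk at rank r
            data[(r + 1) % n][c] = data[r][c]
    return [np.concatenate(d) for d in data]

ranks = [np.arange(8) * (r + 1) for r in range(4)]   # 4 "GPUs", 8 elements each
out = ring_allreduce(ranks)
assert all(np.array_equal(o, sum(ranks)) for o in out)
```

Counting the sends explains the nccl-tests "bus bandwidth" metric: per rank, 2*(n-1)/n of the buffer crosses the wire regardless of ring size, which is why ring AllReduce scales so well for large messages.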
Projects¶
- NCCL AllReduce benchmark — Run nccl-tests on 2+ GPUs. Measure bandwidth and latency for AllReduce across different message sizes. Compare NVLink vs PCIe topology.
- Communication-compute overlap — In a simple 2-GPU training loop, overlap the gradient AllReduce with the next forward pass. Measure the throughput improvement vs synchronous AllReduce.
Partner Libraries¶
Community and third-party libraries with GPU acceleration via CUDA.
| Library | What it does |
|---|---|
| OpenCV | Computer vision, image processing, ML — GPU-accelerated modules |
| FFmpeg | Multimedia framework — NVDEC/NVENC hardware codec integration |
| ArrayFire | High-level C++/Python GPU array library |
| CuPy | NumPy/SciPy-compatible GPU array library for Python |
| MAGMA | GPU linear algebra for heterogeneous architectures |
| Gunrock | GPU graph processing library |
Which CUDA-X Libraries to Learn by Role¶
| Your role | Must know | Should know | Nice to have |
|---|---|---|---|
| ML Inference Optimization Engineer | cuDNN, TensorRT, CUTLASS | cuBLAS, NCCL, CV-CUDA, DALI | FlashInfer, nvComp |
| AI Compiler Engineer | CUTLASS, cuDNN (as codegen target), cuBLAS | TensorRT, FlashInfer | CUB, Thrust |
| GPU Runtime Engineer | cuBLAS, NCCL, NVSHMEM, GPUDirect Storage | CUB, Thrust, CUDA Math API | nvComp, NIXL |
| GPU Infrastructure / HPC Engineer | NCCL, NVSHMEM, GPUDirect Storage, Slurm | cuBLAS, cuFFT, nvComp | Dask, cuDF |
| Edge AI / Jetson Engineer | TensorRT, cuDNN, VPI, DeepStream | CV-CUDA, Video Codec SDK, NPP | DALI, GStreamer |
| AI Accelerator Architect | CUTLASS (understand what hardware must support), cuDNN | cuBLAS, TensorRT | FlashInfer, NCCL |
Resources¶
| Resource | URL |
|---|---|
| CUDA-X Library Overview | https://developer.nvidia.com/gpu-accelerated-libraries |
| CUDA Toolkit Documentation | https://docs.nvidia.com/cuda/ |
| cuBLAS Documentation | https://docs.nvidia.com/cuda/cublas/ |
| cuDNN Documentation | https://docs.nvidia.com/deeplearning/cudnn/ |
| CUTLASS GitHub | https://github.com/NVIDIA/cutlass |
| NCCL Documentation | https://docs.nvidia.com/deeplearning/nccl/ |
| TensorRT Documentation | https://docs.nvidia.com/deeplearning/tensorrt/ |
| FlashInfer Documentation | https://docs.flashinfer.ai/ |
| RAPIDS (cuDF, cuML, cuGraph) | https://rapids.ai/ |
| NVIDIA Developer Program | https://developer.nvidia.com/developer-program |