NVIDIA CUDA-X Libraries¶
Parent: Phase 5 — High Performance Computing
Timeline: Ongoing reference — study each category as your projects demand it.
Prerequisites: Phase 1 §4 (C++/CUDA), Phase 4 Track B (Jetson, CUDA runtime), Phase 4 Track C Part 2 (DL inference optimization).
What CUDA-X is¶
CUDA-X is NVIDIA's full suite of GPU-accelerated libraries built on top of the CUDA runtime. While CUDA gives you the programming model (kernels, streams, memory), CUDA-X gives you production-grade implementations of the algorithms you'd otherwise spend months writing: linear algebra, FFTs, neural network primitives, graph analytics, video codecs, multi-GPU communication, and more.
Every layer of the AI chip stack consumes CUDA-X libraries:

- L1 (Application): cuDNN, TensorRT, DALI, CV-CUDA, DeepStream
- L2 (Compiler): CUTLASS, FlashInfer, cuDNN primitives as codegen targets
- L3 (Runtime): cuBLAS, cuFFT, NCCL, NVSHMEM, GPUDirect Storage
- L1–L3 (Data): cuDF, cuML, cuGraph, cuVS, NeMo Curator
Understanding what each library does — and when to use it vs. writing a custom kernel — is essential for any GPU infrastructure role.
CUDA Math Libraries¶
The foundation for compute-intensive workloads: molecular dynamics, CFD, medical imaging, seismic exploration, and the matrix math inside every neural network.
| Library | What it does | Key APIs / Concepts | When to use it |
|---|---|---|---|
| cuBLAS | GPU-accelerated BLAS (Basic Linear Algebra Subprograms) | GEMM (cublasSgemm, cublasGemmEx), batched GEMM, mixed precision (FP16/BF16/FP8 accumulating to FP32) | Any dense matrix multiply — the single most important library for AI inference and training |
| cuFFT | Fast Fourier Transform on GPU | 1D/2D/3D FFTs, real-to-complex, batched FFTs, multi-GPU FFT | Signal processing, audio, spectral analysis, conv via FFT |
| cuRAND | Random number generation on GPU | Pseudorandom (XORWOW, MRG32k3a, Philox) and quasirandom (Sobol) generators; host and device API | Monte Carlo simulations, dropout in training, data augmentation |
| cuSOLVER | Dense and sparse direct solvers | LU, Cholesky, QR, SVD, eigenvalue decomposition; sparse Cholesky | Linear systems, least squares, PCA, matrix factorization |
| cuSPARSE | Sparse matrix operations | SpMV, SpMM, sparse triangular solve; CSR, CSC, COO formats | GNNs, sparse attention, scientific computing with sparse matrices |
| cuTENSOR | Tensor contractions and operations | Tensor contraction, reduction, element-wise ops on multi-dimensional arrays | Quantum chemistry, tensor networks, high-dimensional contractions |
| cuDSS | Direct sparse solver | Sparse LU/Cholesky with GPU acceleration | Large sparse systems in structural analysis, circuit simulation |
| CUDA Math API | Standard math functions | sin, cos, exp, log, sqrt — GPU-optimized, single and double precision | Any kernel that needs standard math — these compile to intrinsics, not library calls |
| AmgX | Algebraic multigrid solver | AMG preconditioner + Krylov solvers for implicit unstructured methods | CFD, structural mechanics, any PDE solver using implicit methods |
| nvmath-python | Python interface to CUDA math | nvmath.fft, nvmath.linalg — NumPy-compatible but GPU-accelerated | Rapid prototyping of GPU math in Python without writing C++ |
What to study first¶
cuBLAS — if you understand GEMM (shapes, leading dimensions, transposition, batching, mixed precision), you understand 80% of AI compute. Start here.
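The GEMM contract is easiest to internalize by modeling it. Below is a minimal NumPy sketch of the cublasSgemm semantics (C = alpha * op(A) @ op(B) + beta * C, where op() optionally transposes); note that cuBLAS itself is column-major and takes leading-dimension arguments (lda/ldb/ldc), which this CPU model deliberately omits:

```python
import numpy as np

def gemm(trans_a, trans_b, alpha, A, B, beta, C):
    """Model the cublasSgemm contract: C = alpha * op(A) @ op(B) + beta * C.
    CPU sketch only: real cuBLAS is column-major, which is why frameworks
    bridging from row-major code often call it with swapped operands."""
    opA = A.T if trans_a else A
    opB = B.T if trans_b else B
    return alpha * (opA @ opB) + beta * C

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 64)).astype(np.float32)
B = rng.standard_normal((256, 64)).astype(np.float32)   # stored transposed
C = rng.standard_normal((128, 256)).astype(np.float32)

# C = 1.0 * A @ B^T + 0.5 * C  — the op(B) = transpose path (CUBLAS_OP_T)
out = gemm(False, True, 1.0, A, B, 0.5, C)
print(out.shape)   # (128, 256)
```

Once the shape and transpose bookkeeping is second nature here, the cuBLAS call itself is just this contract plus memory layout and precision flags.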
Projects¶
- cuBLAS GEMM benchmark — Benchmark cublasGemmEx for FP32, FP16, and INT8 on your GPU. Plot TFLOPS vs matrix size. Find the crossover where the GPU beats CPU BLAS.
- cuFFT spectral analysis — Apply a 2D FFT to an image, filter out high frequencies, and inverse-FFT. Compare GPU vs CPU time for 4K images.
- cuSOLVER SVD — Compute SVD of a large matrix on GPU. Use it for image compression (keep top-k singular values). Benchmark vs NumPy.
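For the cuSOLVER project, the math can be prototyped on CPU first. A NumPy sketch of rank-k compression via truncated SVD (the GPU version would go through cuSOLVER's dense SVD routines; this is just the reference computation to validate against):

```python
import numpy as np

def svd_compress(img, k):
    """Rank-k approximation of a matrix: keep the top-k singular values.
    CPU reference for the cuSOLVER SVD project."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
img = rng.standard_normal((256, 256)).astype(np.float32)
for k in (8, 32, 128):
    approx = svd_compress(img, k)
    err = np.linalg.norm(img - approx) / np.linalg.norm(img)
    print(f"k={k:4d}  relative error {err:.3f}")   # error shrinks as k grows
```

On a real image the error drops much faster with k than on this random matrix, which is exactly why top-k SVD works as a compression scheme.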
Scientific Computing Libraries¶
For scientific workloads on GPU: geometry-aware neural networks that respect mathematical symmetries (molecular structures, proteins, materials science), computational lithography, and electronic-structure codes.
| Library | What it does | Domain |
|---|---|---|
| cuEquivariance | Accelerate geometry-aware neural networks (rotation/translation equivariance in 3D) | Molecular dynamics, protein folding, materials discovery |
| NVIDIA ALCHEMI | NIM microservices for chemical and materials discovery | Drug discovery, battery materials, catalysts |
| cuLitho | Computational lithography acceleration | Semiconductor manufacturing (L7/L8 connection) |
| cuEST | Electronic structure calculations on GPU | Quantum chemistry, DFT |
Physics Libraries¶
GPU-accelerated physics simulation — relevant for robotics simulation, autonomous vehicle testing, and digital twins.
| Library | What it does | Use case |
|---|---|---|
| NVIDIA Warp | Python framework for GPU physics kernels | Simulation AI, robotics, differentiable physics |
| NVIDIA PhysicsNeMo | Training and fine-tuning physics AI models | Weather prediction, CFD surrogate models |
| NVIDIA Earth-2 | Weather and climate AI models | Climate modeling, weather forecasting |
Quantum Computing Libraries¶
| Library | What it does |
|---|---|
| cuQuantum | Accelerate quantum circuit simulation on GPUs |
| cuPQC | Post-quantum cryptography acceleration |
| CUDA-Q QEC | Quantum error correction simulation |
| CUDA-Q Solvers | Hybrid quantum-classical optimization |
Deep Learning Core Libraries¶
The libraries that directly implement neural network inference and training. These are the most critical for AI hardware engineers — they are the primary consumers of GPU compute.
| Library | What it does | Layer | Key concepts |
|---|---|---|---|
| cuDNN | Neural network primitives: conv, attention, normalization, pooling, activation | L1/L2 | cudnnConvolutionForward, cudnnMultiHeadAttnForward, graph API for fusion; auto-tuning selects fastest algorithm per hardware |
| TensorRT | Inference optimizer + runtime: graph optimization, layer fusion, precision calibration, engine serialization | L2/L3 | Build phase (optimize graph) → serialize → deploy phase (execute). INT8/FP8 calibration, dynamic shapes, DLA offload |
| TensorRT-LLM | LLM-specific inference engine: in-flight batching, paged KV-cache, tensor/pipeline parallelism, speculative decoding, FP8/INT4 | L2/L3 | Compiles LLMs to optimized TRT engines with custom Hopper/Blackwell kernels. Python API for build + deploy. Integrates with Triton for production serving. See Phase 4C Part 2 Unit 05 for deep dive |
| CUTLASS | C++ template library for custom GEMM, conv, and attention kernels targeting Tensor Cores | L2/L3 | CuTe layout DSL, tile-based programming, epilogue fusion, warp-level MMA. The reference for how production GEMM kernels are structured |
| FlashInfer | Optimized attention and MoE kernels for LLM inference via Python API | L2 | Paged KV-cache attention, variable-length sequences, speculative decoding support |
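The paged KV-cache idea shared by TensorRT-LLM and FlashInfer is worth understanding in isolation: a global pool of fixed-size pages plus a per-sequence page table, so sequences can grow without reallocating contiguous buffers. A toy NumPy model (class and field names here are illustrative, not the real APIs):

```python
import numpy as np

PAGE, HEADS, DIM = 16, 4, 8          # tokens per page, heads, head dim

class PagedKVCache:
    """Toy paged KV-cache: a shared page pool plus a per-sequence page
    table, mirroring the indirection behind TensorRT-LLM / FlashInfer
    paged attention. Illustrative only; real engines do this on-GPU."""
    def __init__(self, num_pages):
        self.k = np.zeros((num_pages, PAGE, HEADS, DIM), np.float32)
        self.v = np.zeros_like(self.k)
        self.free = list(range(num_pages))
        self.tables = {}                 # seq_id -> (page list, token count)

    def append(self, seq, k_tok, v_tok):
        pages, length = self.tables.setdefault(seq, ([], 0))
        if length % PAGE == 0:           # current page full: grab a free one
            pages.append(self.free.pop())
        p, off = pages[-1], length % PAGE
        self.k[p, off], self.v[p, off] = k_tok, v_tok
        self.tables[seq] = (pages, length + 1)

    def gather(self, seq):
        """Materialize one sequence's K/V by following its page table."""
        pages, length = self.tables[seq]
        k = self.k[pages].reshape(-1, HEADS, DIM)[:length]
        v = self.v[pages].reshape(-1, HEADS, DIM)[:length]
        return k, v

cache = PagedKVCache(num_pages=8)
for t in range(20):                      # 20 tokens -> spans 2 pages of 16
    cache.append("seq0", np.full((HEADS, DIM), t, np.float32),
                         np.full((HEADS, DIM), -t, np.float32))
k, v = cache.gather("seq0")
print(k.shape)   # (20, 4, 8)
```

The real kernels never materialize the gather; attention reads through the page table directly, which is what makes variable-length batching cheap.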
What to study first¶
- cuDNN — understand what primitives exist and how frameworks (PyTorch, TensorRT) call them
- TensorRT — end-to-end inference optimization for vision/small models (covered in Phase 4 Track C Part 2 and Track B §8)
- TensorRT-LLM — LLM-specific inference: in-flight batching, paged KV-cache, multi-GPU serving (covered in Phase 4 Track C Part 2 Unit 05)
- CUTLASS — when you need to write or modify GEMM/attention kernels (covered in Phase 4 Track C Part 2, unit 02)
Projects¶
- cuDNN algorithm benchmark — For a specific conv layer (ResNet-50's first conv), enumerate all cuDNN algorithms with cudnnFindConvolutionForwardAlgorithm. Compare time and workspace requirements for each. Understand why auto-tuning matters.
- CUTLASS custom GEMM — Build a CUTLASS GEMM with a fused bias+ReLU epilogue. Benchmark against cuBLAS. Study the CuTe layout to understand tiling.
- FlashInfer attention — Run FlashInfer paged attention on a transformer model. Compare latency vs standard PyTorch attention for 4K, 16K, 64K context lengths.
Parallel Algorithm Libraries (CCCL)¶
CUDA Core Compute Libraries — fundamental building blocks for writing GPU algorithms in C++ and Python.
| Library | What it does | When to use it |
|---|---|---|
| Thrust | C++ STL-like parallel algorithms: sort, reduce, scan, transform | High-level parallel algorithms without writing raw CUDA kernels |
| CUB | Warp-wide, block-wide, and device-wide cooperative primitives | Inside custom CUDA kernels — block reduce, block scan, radix sort |
| cuda.compute | Python interface to CCCL device-level algorithms | GPU-accelerated algorithms from Python |
| cuda.parallel | Standardized sort, scan, reduction primitives | Distributed and local parallel patterns |
What to study first¶
CUB — if you write custom CUDA kernels, you'll use CUB's BlockReduce, BlockScan, and WarpReduce constantly. It's the building block layer.
Projects¶
- Thrust vs custom kernel — Implement parallel prefix sum (scan) using Thrust, CUB, and a hand-written kernel. Benchmark all three. Understand the abstraction cost.
- CUB histogram — Build a GPU histogram using CUB's BlockHistogram. Compare with an atomicAdd-based approach.
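Before touching CUB, it helps to see what a parallel scan does per step. Here is a Python model of the Hillis-Steele inclusive scan, one of the classic strategies behind primitives like BlockScan (CUB selects among several algorithms internally, so treat this as concept, not implementation):

```python
import numpy as np

def inclusive_scan(x):
    """Hillis-Steele inclusive scan: O(n log n) work but only O(log n)
    synchronized rounds. Each loop iteration models one parallel step in
    which every lane i >= shift adds the value from lane i - shift."""
    out = np.asarray(x, dtype=np.int64).copy()
    shift = 1
    while shift < len(out):
        out[shift:] = out[shift:] + out[:-shift]   # one parallel round
        shift *= 2
    return out

print(inclusive_scan(np.arange(1, 9)))   # cumulative sums 1, 3, 6, ..., 36
```

The trade-off this exposes (extra work for fewer synchronization steps) is exactly the kind of decision CUB makes for you per architecture, and it is why hand-rolled scans rarely beat it.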
Data Processing Libraries¶
GPU-accelerated data pipelines — critical for feeding models fast enough that the GPU doesn't starve.
| Library | What it does | Replaces |
|---|---|---|
| cuDF | GPU-accelerated DataFrames | pandas, Polars, Spark (zero code changes) |
| cuVS | GPU vector search (nearest neighbors, CAGRA algorithm) | FAISS, Annoy — for RAG, recommendation, semantic search |
| cuML | GPU-accelerated ML algorithms | scikit-learn, UMAP, HDBSCAN (zero code changes) |
| cuOpt | Decision optimization engine (millions of variables) | OR-Tools, Gurobi for routing/scheduling |
| cuGraph | GPU graph analytics | NetworkX — for knowledge graphs, social networks |
| NeMo Curator | Data pipeline for training/fine-tuning LLMs | Custom data cleaning scripts — text, image, video at scale |
| Morpheus | Cybersecurity AI pipeline | Custom SIEM analysis pipelines |
| nvComp | GPU-accelerated compression/decompression | zstd, lz4 — for training data I/O bottlenecks |
| GPUDirect Storage | Direct path: NVMe → GPU memory, bypassing CPU | Standard read() → cudaMemcpy() — eliminates bounce buffers |
| Dask | Distributed computing framework with RAPIDS GPU support | Single-node pandas/scikit-learn — scales to clusters |
Projects¶
- cuDF vs pandas — Load a 10M-row CSV with pandas and cuDF. Benchmark groupby, join, filter operations. Measure speedup.
- GPUDirect Storage benchmark — Compare model loading time with standard I/O vs GDS direct-to-GPU path. Measure throughput in GB/s.
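A CPU baseline for the cuDF-vs-pandas project is sketched below; with RAPIDS installed, the claim to verify is that the same operations run on GPU via cuDF's pandas accelerator (invoked as `python -m cudf.pandas script.py`) with no other code changes. Row count and column names here are arbitrary choices for illustration:

```python
import time
import numpy as np
import pandas as pd

# Pandas CPU baseline: the three operations the project benchmarks.
# Scale N toward 10M+ rows for a meaningful GPU comparison.
N = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 1_000, N),
    "val": rng.standard_normal(N),
})
dim = pd.DataFrame({"key": np.arange(1_000), "label": np.arange(1_000) % 7})

t0 = time.perf_counter()
agg = df.groupby("key")["val"].mean()      # groupby-aggregate
joined = df.merge(dim, on="key")           # hash join against a dimension table
filtered = df[df["val"] > 0]               # predicate filter
elapsed = time.perf_counter() - t0
print(f"groupby+join+filter: {elapsed:.3f}s, {len(joined)} joined rows")
```

Record the same three timings under cuDF and the speedup is your benchmark; keeping the data generation identical (fixed seed) makes the comparison apples-to-apples.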
Image and Video Libraries¶
Hardware-accelerated encode/decode and vision preprocessing — these feed the inference pipeline.
| Library | What it does | Use case |
|---|---|---|
| nvImageCodec | GPU image encode/decode (JPEG, PNG, WebP) | High-throughput data loading for training |
| NVIDIA DALI | GPU-accelerated data loading and preprocessing for DL | Training pipelines — replaces CPU-bound DataLoader |
| CV-CUDA | GPU pre/post-processing for vision AI pipelines | Production inference: resize, normalize, color convert on GPU |
| cuCIM | Accelerated image processing for biomedical/geospatial | Whole-slide imaging, satellite imagery |
| NPP (Performance Primitives) | 2D image and signal processing primitives | Filtering, color conversion, morphological operations |
| Video Codec SDK | Hardware NVDEC/NVENC encode and decode | Video analytics, transcoding, DeepStream input |
| Optical Flow SDK | Hardware-accelerated optical flow | Motion estimation, video stabilization, frame interpolation |
Projects¶
- DALI training pipeline — Replace PyTorch DataLoader with DALI for ImageNet training. Measure throughput (images/sec) improvement.
- CV-CUDA inference preprocessing — Build a preprocessing pipeline (resize + normalize + color convert) using CV-CUDA. Compare latency vs OpenCV CPU.
- Video decode pipeline — Decode a 4K video stream using NVDEC (Video Codec SDK) → feed frames to a detection model via TensorRT. Measure end-to-end FPS.
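The preprocessing that CV-CUDA moves onto the GPU is conceptually simple; a NumPy CPU sketch of a typical pipeline (BGR-to-RGB color convert, scale, per-channel normalize, HWC-to-CHW layout) is useful as a correctness reference when porting. The mean/std values are the common ImageNet constants; everything else is illustrative:

```python
import numpy as np

def preprocess(frame, mean, std):
    """CPU reference for a vision preprocessing pipeline: BGR->RGB,
    scale to [0,1], per-channel normalize, HWC->CHW for the engine."""
    rgb = frame[..., ::-1].astype(np.float32) / 255.0     # BGR -> RGB
    norm = (rgb - mean) / std                             # per-channel normalize
    return np.ascontiguousarray(norm.transpose(2, 0, 1))  # HWC -> CHW

frame = np.random.default_rng(2).integers(0, 256, (224, 224, 3), dtype=np.uint8)
mean = np.array([0.485, 0.456, 0.406], np.float32)        # ImageNet mean (RGB)
std  = np.array([0.229, 0.224, 0.225], np.float32)        # ImageNet std (RGB)
out = preprocess(frame, mean, std)
print(out.shape, out.dtype)   # (3, 224, 224) float32
```

When you rebuild this with CV-CUDA ops, diffing the GPU output against this reference (within float tolerance) catches layout and color-order bugs immediately.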
Communication Libraries¶
Multi-GPU and multi-node communication — the bottleneck at scale. These libraries determine how fast your cluster can train and serve models.
| Library | What it does | Key operations | When to use it |
|---|---|---|---|
| NCCL | Multi-GPU/multi-node collectives optimized for NVIDIA hardware | AllReduce, AllGather, ReduceScatter, Broadcast, Send/Recv | Distributed training (gradient sync), model parallelism, inference sharding |
| NVSHMEM | Partitioned global address space across GPU memories | nvshmem_put, nvshmem_get — one-sided remote memory access | Fine-grained GPU-to-GPU communication without collective overhead |
| NIXL | Low-latency inference transfer library | KV-cache migration, tensor transfer between GPUs/memory tiers | LLM inference serving — moving KV-cache for disaggregated serving |
What to study first¶
NCCL — if you work on distributed training or multi-GPU inference, NCCL is the library you'll interact with most. Understand AllReduce (data parallelism), AllGather (model parallelism), and how to overlap communication with compute.
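The schedule behind NCCL's ring AllReduce (a reduce-scatter phase followed by an all-gather phase) can be simulated without any GPUs. Below is a Python model under the simplifying assumption of a synchronous ring; real NCCL also uses tree and other algorithms depending on topology and message size:

```python
import numpy as np

def ring_allreduce(values):
    """Simulate ring AllReduce over n ranks: reduce-scatter, then
    all-gather. Each rank's vector is split into n chunks and each rank
    sends 2*(n-1) chunks total — the bandwidth-optimal schedule on a
    ring. Pure-Python model; no GPUs or NCCL involved."""
    n = len(values)
    data = [np.array_split(np.asarray(v), n) for v in values]
    # reduce-scatter: after n-1 steps, rank r owns the full sum of one chunk
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n                 # chunk leaving rank r this step
            dst = (r + 1) % n
            data[dst][c] = data[dst][c] + data[r][c]
    # all-gather: circulate each completed chunk around the ring
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n             # completed chunk at rank r
            data[(r + 1) % n][c] = data[r][c]
    return [np.concatenate(d) for d in data]

ranks = [np.arange(8) * (r + 1) for r in range(4)]   # 4 "GPUs", 8 elements each
out = ring_allreduce(ranks)
assert all(np.array_equal(o, sum(ranks)) for o in out)
```

Counting the sends explains the nccl-tests "bus bandwidth" metric: per rank, 2*(n-1)/n of the buffer crosses the wire regardless of ring size, which is why ring AllReduce scales so well for large messages.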
Projects¶
- NCCL AllReduce benchmark — Run nccl-tests on 2+ GPUs. Measure bandwidth and latency for AllReduce across different message sizes. Compare NVLink vs PCIe topology.
- Communication-compute overlap — In a simple 2-GPU training loop, overlap the gradient AllReduce with the next forward pass. Measure the throughput improvement vs synchronous AllReduce.
Partner Libraries¶
Community and third-party libraries with GPU acceleration via CUDA.
| Library | What it does |
|---|---|
| OpenCV | Computer vision, image processing, ML — GPU-accelerated modules |
| FFmpeg | Multimedia framework — NVDEC/NVENC hardware codec integration |
| ArrayFire | High-level C++/Python GPU array library |
| CuPy | NumPy/SciPy-compatible GPU array library for Python |
| MAGMA | GPU linear algebra for heterogeneous architectures |
| Gunrock | GPU graph processing library |
Which CUDA-X Libraries to Learn by Role¶
| Your role | Must know | Should know | Nice to have |
|---|---|---|---|
| ML Inference Optimization Engineer | cuDNN, TensorRT, CUTLASS | cuBLAS, NCCL, CV-CUDA, DALI | FlashInfer, nvComp |
| AI Compiler Engineer | CUTLASS, cuDNN (as codegen target), cuBLAS | TensorRT, FlashInfer | CUB, Thrust |
| GPU Runtime Engineer | cuBLAS, NCCL, NVSHMEM, GPUDirect Storage | CUB, Thrust, CUDA Math API | nvComp, NIXL |
| GPU Infrastructure / HPC Engineer | NCCL, NVSHMEM, GPUDirect Storage, Slurm | cuBLAS, cuFFT, nvComp | Dask, cuDF |
| Edge AI / Jetson Engineer | TensorRT, cuDNN, VPI, DeepStream | CV-CUDA, Video Codec SDK, NPP | DALI, GStreamer |
| AI Accelerator Architect | CUTLASS (understand what hardware must support), cuDNN | cuBLAS, TensorRT | FlashInfer, NCCL |
Resources¶
| Resource | URL |
|---|---|
| CUDA-X Library Overview | https://developer.nvidia.com/gpu-accelerated-libraries |
| CUDA Toolkit Documentation | https://docs.nvidia.com/cuda/ |
| cuBLAS Documentation | https://docs.nvidia.com/cuda/cublas/ |
| cuDNN Documentation | https://docs.nvidia.com/deeplearning/cudnn/ |
| CUTLASS GitHub | https://github.com/NVIDIA/cutlass |
| NCCL Documentation | https://docs.nvidia.com/deeplearning/nccl/ |
| TensorRT Documentation | https://docs.nvidia.com/deeplearning/tensorrt/ |
| FlashInfer Documentation | https://docs.flashinfer.ai/ |
| RAPIDS (cuDF, cuML, cuGraph) | https://rapids.ai/ |
| NVIDIA Developer Program | https://developer.nvidia.com/developer-program |