C++ and SIMD (Phase 1 §4 — Sub-Track 1)¶
Parent: C++ and Parallel Computing
Modern C++ gives you high-level syntax with low-level performance. Master the language features first — they are the building blocks of every parallel framework that follows.
Prerequisites: Basic C programming (from Phase 1 §3 OS work). Familiarity with pointers, arrays, functions.
Why Modern C++ First¶
Every parallel computing technology in this curriculum — SIMD intrinsics, OpenMP, oneTBB, CUDA, HIP, SYCL — is built on C++. But not the C++ of the 1990s. Modern C++ (C++11 through C++17) introduced features specifically designed for performance, safety, and parallel programming:
- Lambdas → used in oneTBB, SYCL, modern CUDA, parallel STL
- Move semantics → zero-copy data transfer, critical for GPU memory
- Templates → generic CUDA kernels, CUTLASS, CuTe
- Parallel STL → built-in SIMD + threading with one line of code
constexpr→ compile-time computation for HPC optimization
Learn these features now. You'll use every single one in Sub-Tracks 2–5.
Section 1: Modern C++17 for Parallel Computing¶
1.1 Type Deduction (auto, decltype)¶
auto — let the compiler figure out the type:
auto x = 10; // int
auto y = 3.14; // double
auto z = vec.begin(); // std::vector<int>::iterator — much cleaner
auto reduces verbosity and prevents type mismatch bugs. Use it for local variables, especially with complex types (iterators, template results).
The problem decltype solves¶
In a generic multiply function, what is the return type?
template<typename T, typename U>
??? multiply(T a, U b) {
return a * b;
}
// int * int → int
// int * double → double
// float * double → double
Pre-C++11 workarounds — both inadequate:
// Option 1: force a type — loses precision, not generic
template<typename T, typename U>
double multiply(T a, U b) { return a * b; }
// Option 2: std::common_type — verbose, breaks for custom operators
template<typename T, typename U>
typename std::common_type<T, U>::type multiply(T a, U b) { return a * b; }
C++11 solution — trailing return type with decltype:
The compiler deduces the exact type of the expression a * b — works for built-in types, user-defined operators, and arbitrarily complex template expressions.
C++14 simplification — decltype inferred from the return statement:
template<typename T, typename U>
auto multiply(T a, U b) { // return type deduced automatically
return a * b;
}
No trailing -> decltype(...) needed. The compiler infers from the return expression.
auto vs decltype — the key difference¶
auto strips references and const. decltype preserves exact type semantics.
int x = 5;
int& ref = x;
auto a = ref; // a is int — reference stripped
decltype(ref) b = x; // b is int& — reference preserved
const int c = 10;
auto d = c; // d is int — const stripped
decltype(c) e = c; // e is const int — const preserved
Rule of thumb:
- Use auto → 90% of the time (local variables, loop iterators, lambda captures)
- Use decltype → when you need exact type semantics (template return types, forwarding, trait-style code)
Comparison: old vs modern¶
| Pre-C++11 | C++11 decltype |
C++14 auto return |
|
|---|---|---|---|
| Return type deduction | Manual, error-prone | Exact expression type | Inferred from return |
| Generic correctness | Limited | Exact | Exact |
| Custom operator support | Hard | Yes | Yes |
| Readability | Verbose | Clean | Cleanest |
Why it matters for parallel computing: CUTLASS, CuTe, oneAPI DPC++, and ROCm all use auto + decltype extensively because:
- Kernels are templated on precision (float, half, int8_t)
- Types of intermediate expressions depend on template parameters
- decltype lets the library express "whatever type a*b produces" without hardcoding it
You will see patterns like decltype(auto), std::declval<T>(), and trailing return types throughout CUDA template libraries — recognizing them now saves significant confusion later.
1.2 Smart Pointers (Memory Safety via RAII)¶
The problem with raw pointers:
int* p = new int(5);
// ... 200 lines of code ...
// Did you remember to delete p? Memory leak.
// Did you delete it twice? Undefined behavior.
The solution — RAII (Resource Acquisition Is Initialization):
#include <memory>
// Exclusive ownership — automatically freed when scope ends
auto p = std::make_unique<int>(5);
// No delete needed. Ever.
// Shared ownership — freed when last reference dies
auto q = std::make_shared<int>(10);
auto r = q; // Reference count = 2
// Freed when both q and r go out of scope
Three smart pointer types:
| Type | Ownership | Overhead | Use when |
|---|---|---|---|
std::unique_ptr |
Exclusive (one owner) | Zero (same as raw pointer) | Default choice. Single owner. |
std::shared_ptr |
Shared (reference counted) | Atomic ref count (control block overhead, impl-dependent) | Multiple owners needed |
std::weak_ptr |
Non-owning observer | Minimal | Break circular references |
Rule: Never use new/delete in modern C++. Use make_unique and make_shared.
Breaking Cycles with std::weak_ptr¶
shared_ptr uses reference counting. The count only reaches zero — and the memory only frees — when no shared_ptr points to the object. If two objects hold shared_ptr to each other, neither count ever reaches zero. This is a reference cycle and it silently leaks memory forever.
The cycle problem:
struct Node {
std::shared_ptr<Node> next; // strong reference
int value;
};
auto a = std::make_shared<Node>(); // a: refcount = 1
auto b = std::make_shared<Node>(); // b: refcount = 1
a->next = b; // b refcount = 2
b->next = a; // a refcount = 2
// a and b go out of scope:
// a's refcount drops to 1 (b->next still holds it)
// b's refcount drops to 1 (a->next still holds it)
// Neither reaches 0 → MEMORY LEAK
The fix — weak_ptr for the back-edge:
weak_ptr observes a shared_ptr without incrementing the reference count. The observed object can still be destroyed normally. Before using a weak_ptr, you must lock() it — this returns a shared_ptr that is either valid (object alive) or empty (object was destroyed).
struct Node {
std::shared_ptr<Node> next; // strong: owns the next node
std::weak_ptr<Node> prev; // weak: observes without owning
int value;
};
auto a = std::make_shared<Node>(); // a: refcount = 1
auto b = std::make_shared<Node>(); // b: refcount = 1
a->next = b; // b refcount = 2 (strong)
b->prev = a; // a refcount still = 1 (weak — no increment)
// a and b go out of scope:
// a's refcount drops to 0 → destroyed → a->next released
// b's refcount drops to 0 → destroyed
// No leak.
// Accessing through weak_ptr — always check if still alive:
if (auto owner = b->prev.lock()) { // lock() returns shared_ptr or nullptr
std::cout << "prev value: " << owner->value << "\n";
} else {
std::cout << "prev was destroyed\n";
}
When cycles appear in practice:
| Pattern | Cycle edge | Fix |
|---|---|---|
| Doubly-linked list | prev pointer |
prev as weak_ptr |
| Tree with parent pointer | parent pointer |
parent as weak_ptr |
| Observer / event system | Subscriber holds reference to publisher | Subscriber stores weak_ptr to publisher |
| Computation graph (ML) | Back-edge from output node to input | Back-edges as weak_ptr |
weak_ptr API:
auto sp = std::make_shared<int>(42);
std::weak_ptr<int> wp = sp; // observe, no refcount increment
wp.expired(); // true if the object was destroyed
wp.use_count(); // same as sp.use_count() (0 if expired)
auto sp2 = wp.lock(); // returns shared_ptr<int> or nullptr — ALWAYS use this
if (sp2) { /* safe to use */ }
Do not dereference a
weak_ptrdirectly. Always call.lock()and check the result. The object could be destroyed betweenexpired()returningfalseand your next line.
Why it matters for parallel computing:
- GPU memory (cudaMalloc/cudaFree) follows the same RAII pattern — wrap in a smart pointer or custom RAII class
- Thread-safe reference counting (shared_ptr) is essential for multi-threaded data sharing
- Computation graphs (PyTorch autograd, JAX jaxpr) are graphs with cycles — real implementations use weak back-edges to avoid memory leaks
- Memory leaks in HPC = your 80 GB GPU runs out of memory mid-training
1.3 Range-Based Loops¶
std::vector<float> data = {1.0, 2.0, 3.0, 4.0};
// Modern (clean, less error-prone)
for (auto& x : data) {
x *= 2.0f;
}
// Old style (verbose, index bugs)
for (int i = 0; i < data.size(); i++) {
data[i] *= 2.0f;
}
Why it matters: Range-based loops work naturally with STL algorithms and parallel STL (std::execution::par). They're also easier for the compiler to auto-vectorize.
1.4 Structured Bindings (C++17)¶
Unpack tuples, pairs, and structs directly:
// Pair unpacking
std::pair<int, float> p = {1, 2.5f};
auto [id, value] = p; // id = 1, value = 2.5f
// Map iteration (clean)
std::map<std::string, int> scores = {{"alice", 95}, {"bob", 87}};
for (auto& [name, score] : scores) {
std::cout << name << ": " << score << "\n";
}
// Function returning multiple values
auto [min_val, max_val] = std::minmax_element(data.begin(), data.end());
Why it matters: Cleaner data handling in performance-critical code. Used in profiling results, benchmark data, multi-return-value functions.
1.5 Lambda Functions (Critical for Parallel Computing)¶
Lambdas are the single most important C++ feature for parallel programming. Every parallel framework uses them.
Syntax:
-captures → which variables from the outer scope to bring in
- parameters → like a normal function
- return_type → optional, usually inferred
- body → code block
Basic lambda:
Capture Modes¶
| Mode | Meaning | Thread-safe? | Use when |
|---|---|---|---|
[=] |
All outer vars by value (copy) | Yes | Default for parallel callbacks |
[&] |
All outer vars by reference | No | Single-threaded only |
[var] |
Specific var by value | Yes | Explicit, recommended |
[&var] |
Specific var by reference | No | Read-only, single-threaded |
[=, &var] |
Mixed: most by value, one by ref | Partial | Fine-grained control |
[var = std::move(obj)] |
Move object into lambda | Yes | Transfer ownership (C++14) |
int multiplier = 10;
auto scale_copy = [=](int x) { return x * multiplier; }; // copy — safe in threads
auto scale_ref = [&](int x) { return x * multiplier; }; // ref — dangerous in parallel
auto scale_specific = [multiplier](int x) { return x * multiplier; }; // best practice
Mutable Lambdas¶
By default, values captured by value are const inside the lambda. Use mutable to modify the internal copy without touching the original:
int a = 10;
auto f = [a]() mutable {
a += 5;
std::cout << a; // prints 15
};
f();
std::cout << a; // still 10 — original unchanged
mutable modifies the lambda's own copy, not the outer variable.
Generic Lambdas (C++14)¶
Use auto parameters to make a lambda templated automatically — no explicit template<> needed:
auto add = [](auto a, auto b) { return a + b; };
add(3, 4); // int + int = 7
add(3.14, 2.71); // double + double = 5.85
add(1.0f, 2.0f); // float + float
This is used heavily in parallel STL, oneTBB, and SYCL to write type-generic kernels.
Thread Safety Rules¶
Capture by value [x] — each thread gets its own copy. Safe.
Capture by reference [&x] — all threads share the same variable. Race condition if any thread writes to it.
std::vector<int> data = {1, 2, 3, 4};
int multiplier = 2;
// Safe: each element read is independent, multiplier captured by value
std::for_each(std::execution::par, data.begin(), data.end(),
[multiplier](int& x) { x *= multiplier; });
// Unsafe: if multiplier were captured by ref and modified by any thread
std::for_each(std::execution::par, data.begin(), data.end(),
[&](int& x) { x *= multiplier; }); // race condition if multiplier changes
Rule: prefer
[=]or named value captures in parallel loops. Use[&]only in single-threaded code or when the captured variables are read-only.
Lambda Lifetime — Dangling Reference Trap¶
If a lambda outlives the scope of its captured references, you get a dangling reference:
std::function<int()> make_lambda() {
int local = 42;
return [&]() { return local; }; // DANGLING — local is destroyed when function returns
}
auto f = make_lambda();
f(); // undefined behavior
Fix 1 — capture by value:
Fix 2 — move ownership into the lambda (C++14):
auto buf = std::make_unique<int>(42);
std::thread t([buf = std::move(buf)]() {
std::cout << *buf; // safe: ownership moved into lambda
});
t.join();
[buf = std::move(buf)] transfers ownership of the unique_ptr into the lambda. The lambda now owns the resource — it cannot outlive it.
Move-Into-Lambda — Deep Dive¶
This is worth understanding completely because it appears constantly in parallel and async code.
Why unique_ptr cannot be captured by value:
std::unique_ptr<int> ptr = std::make_unique<int>(42);
auto f = [ptr]() { std::cout << *ptr; }; // DOES NOT COMPILE
// unique_ptr has no copy constructor — copying would violate unique ownership
unique_ptr expresses "exactly one owner". Capture-by-value would require copying the pointer, creating two owners. The compiler forbids it.
Solution — capture by move (C++14):
auto f = [p = std::move(ptr)]() {
std::cout << *p << "\n"; // lambda owns it, safe to use
};
// What happened to ptr?
std::cout << (ptr ? "not empty" : "empty") << "\n"; // "empty" — ptr is now nullptr
f(); // prints 42
Ownership transfer visualized:
Before move:
ptr ──────────────► [ 42 ] (heap)
After [p = std::move(ptr)]:
ptr ──► nullptr
lambda.p ─────────► [ 42 ] (same heap allocation, new owner)
No copy. No new allocation. The lambda took the backpack — the original holder is now empty.
Thread ownership — each thread gets its own resource:
#include <thread>
#include <memory>
int main() {
auto buf = std::make_unique<int[]>(1000); // 4 KB buffer
std::thread t([b = std::move(buf)]() {
b[0] = 42;
std::cout << b[0] << "\n"; // lambda owns the buffer exclusively
});
t.join();
// buf is empty here — the thread owned and (after join) released it
}
The thread lambda owns buf exclusively. No data race. No need for a mutex. When the thread finishes, the lambda destructs and unique_ptr frees the memory automatically.
Moving other moveable types:
Any type with a move constructor works — not just unique_ptr:
std::vector<float> weights(1000000); // 4 MB
std::string config = load_config();
auto handle = open_gpu_context(); // hypothetical RAII GPU handle
// Move all three into a thread lambda — zero copying
std::thread worker([
w = std::move(weights),
cfg = std::move(config),
ctx = std::move(handle)
]() {
// worker owns all three resources
run_inference(w, cfg, ctx);
});
worker.join();
| What you move | Why |
|---|---|
unique_ptr<T> |
Exclusive GPU/CPU resource handle |
vector<float> |
Large weight/activation buffer |
string |
Config or serialized model data |
std::thread |
Transfer thread ownership to lambda |
| Custom RAII handle | GPU context, file, socket |
Rule: If a type is non-copyable (or copying is expensive), and the lambda needs to own it or outlive the current scope, use [x = std::move(obj)].
Device Lambdas (CUDA / SYCL)¶
Modern CUDA and SYCL support lambdas on GPU device code:
// CUDA — device lambda inside kernel
__global__ void kernel(float* a, float* b, int N) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) {
auto square = [=](float x) { return x * x; }; // [=] only — no refs on device
a[i] = square(b[i]);
}
}
// SYCL — lambda IS the kernel
q.parallel_for(sycl::range{N}, [=](sycl::id<1> i) {
C[i] = A[i] + B[i];
});
[&]is not allowed in GPU device lambdas. Device code cannot reference host memory addresses. Always use[=](capture by value) for GPU lambdas.
Lambda Quick Reference¶
| Feature | Syntax | Notes |
|---|---|---|
| Capture all by value | [=] |
Safe in threads |
| Capture all by ref | [&] |
Dangerous in parallel |
| Specific var by value | [var] |
Best practice |
| Specific var by ref | [&var] |
Single-threaded only |
| Generic (auto params) | [](auto x, auto y){} |
C++14, type-generic |
| Mutable | [x]() mutable {} |
Modify internal copy |
| Move into lambda | [x = std::move(obj)] |
Transfer ownership, C++14 |
| Immediately invoked | [&]() { ... }() |
IIFE — run in-place |
Where lambdas are used in this curriculum:
| Framework | Lambda usage |
|---|---|
| Parallel STL | std::sort(std::execution::par, v.begin(), v.end(), [](auto a, auto b) { return a > b; }); |
| oneTBB | tbb::parallel_for(0, N, [&](int i) { C[i] = A[i] + B[i]; }); |
| OpenMP (C++17) | #pragma omp parallel for + lambda tasks: omp_set_num_threads(N); #pragma omp task |
| CUDA (modern) | Device lambdas in __device__ context — [=] only |
| SYCL | q.parallel_for(range, [=](id<1> i) { C[i] = A[i] + B[i]; }); |
Key message: If you don't understand lambdas and captures, you cannot write parallel code in modern C++.
1.6 std::optional, std::variant, std::any (C++17)¶
std::optional — a value that may or may not exist:
#include <optional>
std::optional<int> find_index(const std::vector<int>& v, int target) {
for (int i = 0; i < v.size(); i++) {
if (v[i] == target) return i;
}
return std::nullopt; // "not found" — no magic numbers like -1
}
auto result = find_index(data, 42);
if (result) {
std::cout << "Found at index " << *result << "\n";
}
std::variant — type-safe union:
union vs std::variant — why the old way is dangerous¶
The old C union stores multiple types in the same memory, but has no idea which type is currently active:
union Data {
int i;
float f;
};
Data d;
d.i = 42;
std::cout << d.f << "\n"; // undefined behavior — reading int bits as float
Memory layout of a union — 4 bytes, no type information:
+------------------+
| 42 (int bits) | <- what's stored
+------------------+
^ no type tag <- compiler has no idea which member is active
Reading the wrong member is undefined behavior. No compiler error, no runtime check — just wrong results or crashes, often appearing only in release builds.
std::variant adds a type tag alongside the stored value:
+------------------+
| 3.14f (float) | <- stored data
+------------------+
type tag: float <- runtime always knows which type is active
#include <variant>
std::variant<int, float, std::string> v;
v = 42; // holds int — type tag = int
v = 3.14f; // holds float — type tag = float
v = "hello"; // holds string — type tag = string
// std::visit dispatches to the right lambda branch based on the type tag
std::visit([](auto&& val) { std::cout << val << "\n"; }, v);
union vs std::variant comparison¶
union |
std::variant |
|
|---|---|---|
| Type tracking | None — programmer's responsibility | Runtime type tag stored alongside value |
| Wrong-type read | Undefined behavior, silent | std::bad_variant_access exception |
| Size overhead | Zero (just the max member size) | Max member size + small tag (usually 1–8 bytes) |
| Parallel safety | Dangerous — type confusion across threads | Safe — each element carries its own type info |
| Use in HPC | Avoid in new code | Heterogeneous arrays, command dispatch, AST nodes |
std::visit — type-dispatching with a lambda¶
std::visit calls your lambda with the actually stored type, not a generic variant:
std::variant<int, float> v = 3.14f;
// Generic lambda — works for any type in the variant
std::visit([](auto&& x) {
std::cout << x << "\n"; // prints 3.14
std::cout << typeid(x).name() << "\n"; // prints "f" (float)
}, v);
// Typed overloads using overload pattern (C++17)
std::visit(overloaded{
[](int x) { std::cout << "int: " << x << "\n"; },
[](float x) { std::cout << "float: " << x << "\n"; },
}, v); // prints: float: 3.14
Parallel example — mixed-type array processed safely¶
#include <variant>
#include <vector>
#include <algorithm>
#include <execution>
std::vector<std::variant<int, float>> data = {1, 2.5f, 3, 4.5f};
// Double every element in parallel — type-safe, no UB
std::for_each(std::execution::par, data.begin(), data.end(),
[](auto& val) {
std::visit([](auto&& x) { x *= 2; }, val);
});
// Output: 2 5 6 9
for (auto& v : data)
std::visit([](auto&& x) { std::cout << x << " "; }, v);
Each element carries its own type tag, so parallel threads never confuse int and float even when accessing adjacent elements.
Why it matters for HPC: Compiler IR nodes, kernel dispatch tables, and mixed-precision computation graphs (FP32/FP16/INT8 hybrid inference) all need to store heterogeneous types safely. std::variant + std::visit replaces the old void*-with-a-type-enum pattern with zero undefined behavior.
1.7 Move Semantics (Performance Core)¶
The problem — expensive copies:
std::vector<float> create_large_buffer() {
std::vector<float> buf(10000000); // 40 MB
// ... fill buffer ...
return buf; // Does this copy 40 MB? NO — move semantics!
}
Copy vs move:
std::vector<int> a = {1, 2, 3, 4, 5};
// Copy: duplicates all data (expensive)
std::vector<int> b = a; // b is a copy, a unchanged
// Move: transfers ownership (cheap — just pointer swap)
std::vector<int> c = std::move(a); // c owns the data, a is now empty
How move works internally:
Before move:
a → [heap: 1, 2, 3, 4, 5] (owns the buffer)
After std::move(a) → c:
a → nullptr (empty, moved-from)
c → [heap: 1, 2, 3, 4, 5] (now owns the buffer)
No data was copied. Just three pointer/size assignments. O(1) instead of O(n).
When to use std::move:
- Returning large objects from functions (compiler often does this automatically — NRVO)
- Transferring ownership of buffers to another thread
- Moving data into containers: vec.push_back(std::move(large_object));
Why it matters for parallel computing:
- GPU memory transfers are expensive. Move semantics = zero-copy mindset.
- Thread pools move work items between threads without copying.
- CUDA's cudaMemcpyAsync + pinned memory is the GPU equivalent of move semantics.
1.8 Multithreading Basics (std::thread, std::mutex)¶
Launching a thread:
#include <thread>
void compute(int id) {
std::cout << "Thread " << id << " running\n";
}
int main() {
std::thread t1(compute, 1);
std::thread t2(compute, 2);
t1.join(); // Wait for t1 to finish
t2.join(); // Wait for t2 to finish
}
Protecting shared data with mutex:
#include <mutex>
std::mutex mtx;
int shared_counter = 0;
void increment(int n) {
for (int i = 0; i < n; i++) {
std::lock_guard<std::mutex> lock(mtx); // RAII lock
shared_counter++;
} // lock released here automatically
}
Better: std::scoped_lock (C++17) for multiple mutexes:
std::mutex m1, m2;
{
std::scoped_lock lock(m1, m2); // Locks both, prevents deadlock
// ... critical section ...
}
std::atomic for simple shared variables:
#include <atomic>
std::atomic<int> counter{0};
void increment(int n) {
for (int i = 0; i < n; i++) {
counter++; // Thread-safe, no mutex needed
}
}
Why it matters: std::thread and std::mutex are the primitives that OpenMP and oneTBB abstract over. Understanding them helps you debug race conditions and deadlocks in any parallel framework.
1.9 STL Algorithms and Parallel STL (C++17)¶
STL algorithms are pre-built, optimized functions that operate on any container. They eliminate manual loops, prevent index bugs, and — with C++17 execution policies — run in parallel with a single extra argument.
Core STL Algorithms¶
| Algorithm | What it does | Example |
|---|---|---|
std::sort |
Sort elements | sort(v.begin(), v.end()) |
std::for_each |
Apply function to every element | for_each(v.begin(), v.end(), f) |
std::transform |
Apply function, write to output | transform(a, a+n, b, out, f) |
std::reduce |
Parallel-safe sum/product/etc. | reduce(v.begin(), v.end()) |
std::find_if |
Search by predicate | find_if(v.begin(), v.end(), pred) |
std::minmax_element |
Min and max in one pass | auto [lo, hi] = minmax_element(...) |
std::count_if |
Count matching elements | count_if(v.begin(), v.end(), pred) |
std::copy_if |
Filtered copy | copy_if(src, src+n, dst, pred) |
Sequential example — sort and iterate¶
#include <vector>
#include <algorithm>
#include <iostream>
std::vector<int> v = {4, 2, 7, 1, 5};
std::sort(v.begin(), v.end());
for (auto x : v) std::cout << x << " "; // 1 2 4 5 7
Lambda + algorithm — custom behavior injected¶
std::vector<int> v = {1, 2, 3, 4, 5};
// Multiply every element by 2 — no loop index, no off-by-one
std::for_each(v.begin(), v.end(), [](int& x) { x *= 2; });
// v = {2, 4, 6, 8, 10}
// Transform: square each element into a new vector
std::vector<int> sq(v.size());
std::transform(v.begin(), v.end(), sq.begin(), [](int x) { return x * x; });
// Find first element > 5
auto it = std::find_if(v.begin(), v.end(), [](int x) { return x > 5; });
Parallel STL — one argument changes everything¶
C++17 execution policies parallelize any STL algorithm:
#include <algorithm>
#include <execution>
#include <vector>
std::vector<float> data(10'000'000);
// Sequential (default, single core)
std::sort(std::execution::seq, data.begin(), data.end());
// Parallel (multi-threaded across all cores)
std::sort(std::execution::par, data.begin(), data.end());
// Parallel + SIMD (threads + vectorization)
std::sort(std::execution::par_unseq, data.begin(), data.end());
Execution policies:
| Policy | Behavior | When to use |
|---|---|---|
seq |
Sequential | Small data, debugging |
par |
Multi-threaded | Large data, independent elements |
par_unseq |
Multi-threaded + SIMD | Maximum throughput, no dependencies |
unseq (C++20) |
SIMD only (single thread) | Vectorizable, single core |
Parallel transform, reduce, for_each¶
// Parallel transform: c[i] = a[i] + b[i]
std::transform(std::execution::par,
a.begin(), a.end(), b.begin(), c.begin(),
[](float x, float y) { return x + y; });
// Parallel reduce: sum all elements
float total = std::reduce(std::execution::par, data.begin(), data.end());
// Parallel reduce with custom op: product
float product = std::reduce(std::execution::par,
data.begin(), data.end(), 1.0f, std::multiplies<float>{});
// Parallel for_each + SIMD: sqrt every element
std::for_each(std::execution::par_unseq, data.begin(), data.end(),
[](float& x) { x = std::sqrt(x); });
Manual loop vs STL vs Parallel STL¶
| Manual loop | STL + lambda | Parallel STL | |
|---|---|---|---|
| Code length | Long | Short | Short |
| Index bug risk | Medium | None | None |
| Parallelism | Manual (std::thread) |
Manual | Automatic |
| SIMD | Compiler may vectorize | Compiler may vectorize | par_unseq forces it |
| Debugging | Hard | Easy | Harder (non-deterministic) |
par_unseqrequirements: Loop body must have no data races AND no loop-carried dependencies (a[i]must not reada[i-1]). Violating either produces undefined behavior — not a compile error, not even consistent wrong results.Warning — not always faster: Parallel STL has overhead (thread pool spin-up, synchronization). For small arrays (< ~100K elements),
seqis often faster. In HPC and production ML systems, teams frequently prefer OpenMP or oneTBB for more predictable, tunable, and debuggable parallelism. Benchmark before committing toparorpar_unseq.
Compiler support: GCC 9+ (with -ltbb), MSVC 2017+, Intel DPC++. Clang support is improving.
1.10 Templates & Generic Programming¶
Function templates:
template<typename T>
T add(T a, T b) {
return a + b;
}
add(3, 4); // T = int
add(3.14, 2.71); // T = double
Class templates:
template<typename T, int N>
struct Vector {
T data[N];
T& operator[](int i) { return data[i]; }
};
Vector<float, 4> v; // 4-element float vector
Why templates matter for parallel computing:
- CUDA kernels are often templated on data type and tile size: matmul<float, 32><<<grid, block>>>(...)
- CUTLASS is entirely template-based — configurable GEMM dimensions, precisions, and tiling
- CuTe (CUDA Template Engine) uses templates for layout and copy abstractions
- Zero-cost abstraction: templates are resolved at compile time, no runtime overhead
1.11 constexpr (Compile-Time Computation)¶
Move computation from runtime to compile time:
constexpr int square(int x) {
return x * x;
}
constexpr int tile_size = square(16); // Computed at compile time = 256
// Use in array declarations, template parameters, etc.
float buffer[tile_size]; // float buffer[256] — no runtime cost
constexpr + if (C++17):
template<typename T>
void process(T value) {
if constexpr (std::is_integral_v<T>) {
// Integer-specific code — compiled away for floats
} else {
// Float-specific code — compiled away for integers
}
}
Why it matters for HPC:
- Tile sizes, block dimensions, and buffer sizes in GPU kernels are often compile-time constants
- constexpr enables the compiler to optimize aggressively (unroll loops, eliminate branches)
- Used extensively in CUTLASS for configurable kernel parameters
1.12 How C++ Features Map to Parallel Frameworks¶
| C++ Feature | Used in | How |
|---|---|---|
| Lambdas | oneTBB, SYCL, Parallel STL, modern CUDA | Kernel functions, task bodies, parallel_for callbacks |
| Move semantics | GPU memory, thread pools | Zero-copy buffer transfer, work item passing |
| Templates | CUDA, CUTLASS, CuTe, CK | Configurable kernels, generic algorithms |
constexpr |
CUDA, HLS, HPC libraries | Compile-time tile sizes, buffer dimensions |
auto |
Everywhere | Complex return types, template deduction |
| Smart pointers | Resource management | RAII wrappers for GPU memory, file handles |
| Parallel STL | CPU parallelism | One-line SIMD + threading |
std::thread/mutex |
Manual threading | Foundation for OpenMP, oneTBB internals |
std::atomic |
Lock-free algorithms | Counters, flags in multi-threaded code |
| Structured bindings | Data processing | Clean iteration over results, tuples |
Section 2: SIMD on CPUs¶
The gateway to GPU thinking — same operation, multiple data.
Now that you have the C++ foundation, apply it to the first level of parallelism: SIMD vector instructions.
What SIMD Is¶
CPU vector instructions that process 4, 8, 16, or 32 data elements in one instruction:
| Instruction Set | Width | Data per instruction | Platform |
|---|---|---|---|
| SSE (1999) | 128-bit | 4x float32 | x86 |
| AVX (2011) | 256-bit | 8x float32 | x86 |
| AVX-512 (2017) | 512-bit | 16x float32 | x86 (server) |
| ARM NEON | 128-bit | 4x float32 | ARM (Jetson, mobile) |
From SIMD to GPU — One Idea, Three Expressions¶
Every level of the parallel stack applies the same operation to multiple data elements at once. What changes is the width, the programming model, and who manages the indices.
Take the simplest possible operation: c[i] = a[i] + b[i]. Here is how you express it at each level:
Scalar (1 element at a time):
c[0] = a[0] + b[0]
c[1] = a[1] + b[1] ← you write a loop, CPU executes one add per cycle
c[2] = a[2] + b[2]
...
CPU SIMD / AVX (8 elements at a time):
c[0..7] = a[0..7] + b[0..7] ← you write i += 8, CPU runs 8 adds in one instruction
c[8..15] = ...
GPU warp / CUDA (32 elements at a time):
each thread handles one i ← you write kernel for one element,
32 threads run in lock-step GPU runs 32 threads simultaneously (one warp)
c[0..31] = a[0..31] + b[0..31]
GPU grid (thousands at a time):
all threads launch at once ← GPU runs thousands of warps across SM clusters
// ── Scalar ──────────────────────────────────────────────────────
for (int i = 0; i < n; i++)
c[i] = a[i] + b[i]; // 1 add/cycle
// ── CPU SIMD (AVX) ───────────────────────────────────────────────
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
_mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb)); // 8 adds/cycle
}
// ── GPU (CUDA) ───────────────────────────────────────────────────
__global__ void add(float* a, float* b, float* c, int n) {
int i = threadIdx.x + blockIdx.x * blockDim.x; // each thread owns one i
if (i < n) c[i] = a[i] + b[i]; // 32 threads run this simultaneously
}
The key differences — not a new idea, just new mechanics:
| Scalar | CPU SIMD (AVX) | GPU (CUDA warp) | |
|---|---|---|---|
| Width | 1 element | 8 floats | 32 threads |
Who picks i |
Your loop counter | Your stride (+= 8) |
Hardware (threadIdx + blockIdx) |
| Fast memory | L1/L2 cache | L1/L2 cache | Shared memory (48 KB/SM) |
| Slow memory | RAM | RAM | Global memory (VRAM) |
| Hiding latency | Out-of-order execution | ILP across loop iterations | Warp switching |
| Divergence cost | Branch misprediction | Breaks vectorization | Serializes warp lanes |
The scaling ladder:
AVX2: 8 floats × ~4 GHz × 2 (FMA) = ~64 GFLOPS / core
GPU warp: 32 threads × thousands of warps → tens of TFLOPS
SIMD is the mental foundation. Once you understand why i += 8 processes 8 elements at once, the GPU model — "just make every thread a lane, and launch millions" — follows naturally.
Auto-Vectorization (Compiler Does It)¶
The easiest path — write a simple loop, let the compiler vectorize:
// This loop is auto-vectorizable
void add_vectors(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
# Check if compiler vectorized your loop
g++ -O2 -march=native -fopt-info-vec-optimized my_code.cpp
# Output: "loop vectorized using 32 byte vectors" = success
Rules for auto-vectorization:
- No loop-carried dependencies (a[i] = a[i-1] + 1 cannot vectorize)
- Simple data types (float, int — not complex objects)
- Aligned memory helps (alignas(32) float data[1024];)
- Use -O2 or -O3 with -march=native
Manual Intrinsics (Full Control)¶
When auto-vectorization fails or you need specific instructions:
#include <immintrin.h>
void add_vectors_avx(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]); // Load 8 floats from a
__m256 vb = _mm256_load_ps(&b[i]); // Load 8 floats from b
__m256 vc = _mm256_add_ps(va, vb); // Add 8 pairs at once
_mm256_store_ps(&c[i], vc); // Store 8 results to c
}
}
Key intrinsics patterns:
| Operation | Intrinsic | What it does |
|---|---|---|
| Load aligned | _mm256_load_ps(ptr) |
Load 8 32-byte-aligned floats — crashes if misaligned |
| Load unaligned | _mm256_loadu_ps(ptr) |
Load 8 floats (any alignment) — safer default |
| Store | _mm256_store_ps(ptr, v) |
Store 8 floats (aligned) |
| Stream store | _mm256_stream_ps(ptr, v) |
Write bypassing cache — use for large write-once buffers |
| Add | _mm256_add_ps(a, b) |
8 parallel additions |
| Multiply | _mm256_mul_ps(a, b) |
8 parallel multiplications |
| FMA | _mm256_fmadd_ps(a, b, c) |
8 parallel a*b + c — one instruction, not two; also more precise |
| Broadcast | _mm256_set1_ps(x) |
Fill all 8 lanes with same value |
| Horizontal add | _mm256_hadd_ps(a, b) |
Add adjacent pairs within each 128-bit half |
| Compare | _mm256_cmp_ps(a, b, _CMP_GT_OS) |
Per-lane compare → all-0s or all-1s mask |
| Blend | _mm256_blendv_ps(a, b, mask) |
Select per-lane: mask[i] ? b[i] : a[i] |
| Permute (cross-lane) | _mm256_permutevar8x32_ps(v, idx) |
Full cross-lane reorder by index vector |
| Gather | _mm256_i32gather_ps(base, idx, 4) |
Load from non-contiguous addresses |
_mm256_load_psalignment: Ifptris not 32-byte aligned, this generates a segfault (#GPfault) at runtime — not a compile error. When in doubt, use_mm256_loadu_ps(theu= unaligned). On modern CPUs the performance difference is negligible.FMA is the core of deep learning:
_mm256_fmadd_ps(a, b, c)computesa*b + cin a single fused instruction. GEMM (matrix multiplication), convolutions, and transformer attention inner loops are entirely composed of FMA. This is why FLOPS counts are usually reported as "FLOPS of FMA." On AVX2, 8-wide FMA at 2 ops/cycle = 16 GFLOPS/GHz per core.
Aligned memory:
// Aligned allocation (required for _mm256_load_ps)
alignas(32) float data[1024];
// Or dynamic allocation
float* data = (float*)aligned_alloc(32, 1024 * sizeof(float));
Top 10 Most-Used Intrinsics in Real Codebases¶
Analysis of SIMD usage across GitHub repositories (production ML frameworks, databases, parsers) shows the same ~10 intrinsics appear in 80%+ of vectorized code. These are the ones worth memorizing first.
| Rank | Intrinsic | ISA | What it does | Why it's everywhere |
|---|---|---|---|---|
| 1 | _mm256_add_ps(a, b) |
AVX | Add 8 float32 | Core of every FP loop |
| 2 | _mm256_loadu_ps(ptr) |
AVX | Load 8 floats, unaligned | Default load — no alignment constraint |
| 3 | _mm256_storeu_ps(ptr, v) |
AVX | Store 8 floats, unaligned | Default store |
| 4 | _mm256_fmadd_ps(a, b, c) |
FMA+AVX2 | a*b + c in one instruction |
GEMM, convolution, attention |
| 5 | _mm_add_ps(a, b) |
SSE | Add 4 float32 (128-bit) | SSE compat code, horizontal ops |
| 6 | _mm_set1_epi32(x) |
SSE2 | Broadcast int to all 4 lanes | Loading integer constants |
| 7 | _mm256_cmp_ps(a, b, pred) |
AVX | Compare → bitmask | Conditional selection, NaN handling |
| 8 | _mm_loadu_si128(ptr) |
SSE2 | Load 128-bit integer vector | String/byte processing |
| 9 | _mm_shuffle_epi32(v, imm) |
SSE2 | Reorder 32-bit int lanes | Data rearrangement, AoS→SoA |
| 10 | _mm_popcnt_u32(x) |
SSE4.2 | Count set bits in 32-bit int | Database filters, Hamming distance |
Patterns behind the rankings:
_ps(packed single) dominates — float32 is the workhorse of ML inference and scientific computing. Learn_psvariants first.- Unaligned loads/stores are preferred (
_loadu_,_storeu_) — on Haswell and newer, the penalty for cache-line-crossing loads is ≤1 cycle. The code simplicity is worth it. - SSE2 128-bit still common — many codebases use SSE2 for compatibility or for 128-bit sub-operations (horizontal reduction, remainder loops, byte processing).
_mm_shuffle_epi32is the workhorse for integer rearrangement — appears in virtually every AoS→SoA conversion, hash function, and string parser._mm_popcnt_u32/u64is an outlier — not strictly a SIMD register instruction (operates on scalar registers), but uses the SSE4.2 ISA feature bit and dominates database and bit-manipulation code (Hamming distance, counting set bits in filters).
// The most common inner loop pattern in real ML code (ranks 2, 4, 3):
for (int i = 0; i < n; i += 8) {
__m256 a = _mm256_loadu_ps(&A[i]); // rank 2
__m256 b = _mm256_loadu_ps(&B[i]); // rank 2
acc = _mm256_fmadd_ps(a, b, acc); // rank 4
}
_mm256_storeu_ps(out, acc); // rank 3
Source: Usage statistics from simd.info intrinsic statistics and analysis of GitHub repositories including production ML frameworks, databases, and codec libraries.
Intrinsics by Functional Category¶
A different lens — grouped by what you're trying to do, with both SSE (128-bit) and AVX (256-bit) versions side by side. This is how you look them up when solving a problem.
1. Data Movement & Initialization¶
// Load — unaligned is the default today (negligible penalty on modern CPUs)
__m128 a = _mm_loadu_ps(ptr); // SSE: 4 floats
__m256 b = _mm256_loadu_ps(ptr); // AVX: 8 floats
// Store — write register back to memory
_mm_storeu_ps(ptr, a); // SSE
_mm256_storeu_ps(ptr, b); // AVX
// Broadcast — fill all lanes with one scalar (scaling, bias addition)
__m128 scale = _mm_set1_ps(2.0f); // SSE: [2, 2, 2, 2]
__m256 scale = _mm256_set1_ps(2.0f); // AVX: [2, 2, 2, 2, 2, 2, 2, 2]
// Set individual lanes (note: highest lane first)
__m256 v = _mm256_set_ps(7,6,5,4,3,2,1,0); // reverse order
__m256 v = _mm256_setr_ps(0,1,2,3,4,5,6,7); // natural order (r = reversed param order)
2. Arithmetic¶
// Add / Subtract / Multiply / Divide
__m256 r = _mm256_add_ps(a, b); // a[i] + b[i]
__m256 r = _mm256_sub_ps(a, b); // a[i] - b[i]
__m256 r = _mm256_mul_ps(a, b); // a[i] * b[i]
__m256 r = _mm256_div_ps(a, b); // a[i] / b[i] (slow — ~20 cycles)
// FMA — the most important arithmetic intrinsic
__m256 r = _mm256_fmadd_ps(a, b, c); // a[i]*b[i] + c[i], single instruction
// Integer arithmetic (add 8 × int32)
__m256i r = _mm256_add_epi32(a, b);
__m256i r = _mm256_mullo_epi32(a, b); // low 32 bits of each 32×32 product
_ps= packed single (float32)._pd= packed double._epi32= packed int32. Learn the suffix system once and every intrinsic name becomes self-describing.
3. Comparison & Selection (Branchless Conditionals)¶
SIMD has no branches. The fundamental pattern: compare → mask → blend.
// Step 1: Compare → produces per-lane all-1s (true) or all-0s (false)
__m256 mask = _mm256_cmp_ps(a, b, _CMP_GT_OS); // mask[i] = a[i] > b[i] ? 0xFFFFFFFF : 0
// SSE equivalent with explicit predicate:
__m128 mask = _mm_cmplt_ps(a, b); // mask[i] = a[i] < b[i]
// Step 2: Blend — select elements using mask
__m256 result = _mm256_blendv_ps(if_false, if_true, mask);
// result[i] = mask[i] ? if_true[i] : if_false[i]
// Practical example: clamp to [lo, hi] without any branch
__m256 clamped = _mm256_min_ps(_mm256_max_ps(v, lo), hi);
// Practical example: absolute value without branch
__m256 sign_mask = _mm256_set1_ps(-0.0f); // sign bit only
__m256 abs_v = _mm256_andnot_ps(sign_mask, v); // clear sign bit = |v|
Comparison predicates for
_mm256_cmp_ps:_CMP_EQ_OS,_CMP_LT_OS,_CMP_LE_OS,_CMP_GT_OS,_CMP_GE_OS,_CMP_NEQ_OS,_CMP_UNORD_Q(NaN check). The_OSsuffix = ordered, signaling (raises exception on NaN). Use_OQfor quiet (no exception).
4. Data Shuffling & Reduction¶
// Shuffle — rearrange within 128-bit half (compile-time imm8 control)
__m128 s = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3,2,1,0)); // SSE: pick 4 elements from a and b
__m256 s = _mm256_shuffle_ps(a, b, imm8); // AVX: operates per 128-bit half!
// Horizontal add — adds adjacent pairs (use sparingly, ~3 cycles each)
__m128 h = _mm_hadd_ps(a, b); // [a0+a1, a2+a3, b0+b1, b2+b3]
// Better horizontal sum pattern (see "Horizontal sum" section above)
// hadd is slow — only use it 1-2 times at the very end of a reduction loop
_mm_hadd_pswarning: Horizontal add is one of the slowest SIMD instructions (~3-5 cycles vs 0.5 for_mm_add_ps). Never put it inside a loop. Use it once at the end to collapse an accumulator register — or better, use thehsumidiom from the Inspecting SIMD Registers section.
SSE vs AVX Side-by-Side Reference¶
| Category | SSE (128-bit, 4× float) | AVX (256-bit, 8× float) | Purpose |
|---|---|---|---|
| Load | _mm_loadu_ps(ptr) |
_mm256_loadu_ps(ptr) |
Read from memory |
| Store | _mm_storeu_ps(ptr, v) |
_mm256_storeu_ps(ptr, v) |
Write to memory |
| Broadcast | _mm_set1_ps(x) |
_mm256_set1_ps(x) |
Fill all lanes with scalar |
| Add | _mm_add_ps(a, b) |
_mm256_add_ps(a, b) |
Element-wise addition |
| Multiply | _mm_mul_ps(a, b) |
_mm256_mul_ps(a, b) |
Element-wise multiply |
| FMA | _mm_fmadd_ps(a, b, c) |
_mm256_fmadd_ps(a, b, c) |
a*b + c |
| Compare | _mm_cmplt_ps(a, b) |
_mm256_cmp_ps(a, b, pred) |
Vectorized compare → mask |
| Blend | _mm_blendv_ps(a, b, m) |
_mm256_blendv_ps(a, b, m) |
Vectorized if/else |
| Shuffle | _mm_shuffle_ps(a, b, i) |
_mm256_shuffle_ps(a, b, i) |
Rearrange elements |
| Hadd | _mm_hadd_ps(a, b) |
_mm256_hadd_ps(a, b) |
Adjacent-pair add |
When to use SSE vs AVX:
- AVX for throughput — 2× the data per instruction, same latency
- SSE for compatibility — SSE2 is available on every x86-64 CPU since ~2003
- SSE for remainder loops — after the main AVX loop processes n/8*8 elements, handle the last n%8 with SSE or scalar
Minimal Examples — Each Intrinsic in Isolation¶
Before combining intrinsics, see each one do exactly one thing:
#include <immintrin.h>
// ── Data Movement ──────────────────────────────────────────────────────────
// loadu: load 8 floats from any address
float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
__m256 v = _mm256_loadu_ps(data); // v = [1, 2, 3, 4, 5, 6, 7, 8]
// set1: broadcast scalar to every lane
__m256 v_pi = _mm256_set1_ps(3.14f); // v_pi = [3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14]
// storeu: write register back to array
float out[8];
_mm256_storeu_ps(out, v); // out[] = {1, 2, 3, 4, 5, 6, 7, 8}
// ── Arithmetic ─────────────────────────────────────────────────────────────
__m256 v1 = _mm256_set1_ps(2.0f); // [2, 2, 2, 2, 2, 2, 2, 2]
__m256 v2 = _mm256_set1_ps(3.0f); // [3, 3, 3, 3, 3, 3, 3, 3]
__m256 sum = _mm256_add_ps(v1, v2); // [5, 5, 5, 5, 5, 5, 5, 5]
__m256 product = _mm256_mul_ps(v1, v2); // [6, 6, 6, 6, 6, 6, 6, 6]
__m256 fma_res = _mm256_fmadd_ps(v1, v2, v1); // v1*v2 + v1 = [8, 8, 8, 8, 8, 8, 8, 8]
// ── Comparison & Selection (vectorized if/else) ────────────────────────────
__m256 a = _mm256_setr_ps(1,5,3,7,2,6,4,8);
__m256 b = _mm256_set1_ps(4.0f);
__m256 mask = _mm256_cmp_ps(a, b, _CMP_GT_OS); // mask[i] = a[i] > 4 ? 0xFFFFFFFF : 0
// [0, 1, 0, 1, 0, 1, 0, 1]
__m256 result = _mm256_blendv_ps(b, a, mask); // result[i] = mask[i] ? a[i] : b[i]
// [4, 5, 4, 7, 4, 6, 4, 8]
// ── Shuffling ──────────────────────────────────────────────────────────────
__m256 sv = _mm256_setr_ps(0,1,2,3,4,5,6,7);
// _MM_SHUFFLE(d,c,b,a) → lane 0 = src[a], lane 1 = src[b], lane 2 = src[c], lane 3 = src[d]
__m256 shuffled = _mm256_shuffle_ps(sv, sv, _MM_SHUFFLE(0,1,2,3));
// Result: [3,2,1,0, 7,6,5,4] — reversed within each 128-bit half
// ── Horizontal add (slow — only at loop end) ───────────────────────────────
__m256 hv = _mm256_setr_ps(1,2,3,4,5,6,7,8);
__m256 hsum = _mm256_hadd_ps(hv, hv);
// [1+2, 3+4, 1+2, 3+4 | 5+6, 7+8, 5+6, 7+8] = [3, 7, 3, 7, 11, 15, 11, 15]
Complete Example: Scaled Vector Addition¶
All five core intrinsics working together in a realistic loop — the pattern used in neural network activations, image processing, and signal scaling:
#include <immintrin.h>
// Computes: c[i] = (a[i] + b[i]) * scale for all i
// Processes 8 elements per iteration using AVX
void scaled_add(const float* a, const float* b, float* c, float scale, int n) {
// 1. set1: broadcast the scalar constant once, outside the loop
__m256 v_scale = _mm256_set1_ps(scale);
int i = 0;
for (; i <= n - 8; i += 8) {
// 2. loadu: load 8 elements from each input array
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
// 3. add_ps: element-wise addition
__m256 vsum = _mm256_add_ps(va, vb);
// 4. mul_ps: scale the result
// (alternatively: _mm256_fmadd_ps(va, v_scale, _mm256_mul_ps(vb, v_scale)))
__m256 vres = _mm256_mul_ps(vsum, v_scale);
// 5. storeu: write 8 results back to memory
_mm256_storeu_ps(&c[i], vres);
}
// Scalar remainder for elements that don't fill a full AVX register
for (; i < n; i++) {
c[i] = (a[i] + b[i]) * scale;
}
}
What this demonstrates:
- _mm256_set1_ps outside the loop — compute constants once, reuse every iteration
- _mm256_loadu_ps — no alignment constraint needed
- _mm256_add_ps → _mm256_mul_ps — two arithmetic ops, 8 elements each, back-to-back
- _mm256_storeu_ps — write results
- Remainder loop — handle n % 8 trailing elements cleanly
FMA version — if a and b have already been scaled separately, fmadd collapses two ops:
Cross-Lane Limitations (The AVX2 Gotcha)¶
Most AVX2 "256-bit" instructions actually execute as two independent 128-bit halves. This is the single most surprising architectural quirk in AVX2 and causes subtle bugs when you expect full cross-lane operations.
// shuffle_ps looks like it works on 256-bit, but it only shuffles within each 128-bit half:
__m256 v = {7, 6, 5, 4, 3, 2, 1, 0}; // lanes 0-7
__m256 s = _mm256_shuffle_ps(v, v, 0b00011011);
// Result: {4,5,6,7, 0,1,2,3} — reversed within each half, NOT across the full register
// To truly cross lanes, you need explicit cross-lane moves:
// Swap the two 128-bit halves:
__m256 swapped = _mm256_permute2f128_ps(v, v, 0x01);
// Full cross-lane element reorder (AVX2) — the most flexible option:
__m256i idx = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7); // reverse order
__m256 reversed = _mm256_permutevar8x32_ps(v, idx); // true 8-element permute
| Operation | Cross-lane? | Notes |
|---|---|---|
_mm256_shuffle_ps |
No | Within each 128-bit half only |
_mm256_hadd_ps |
No | Adjacent pairs within each half |
_mm256_permute2f128_ps |
Yes | Swaps/copies 128-bit halves |
_mm256_permutevar8x32_ps |
Yes | Full 8-element permute (AVX2 only) |
Rule of thumb: If an intrinsic says "256-bit" but was available in AVX (not AVX2), suspect it operates on two 128-bit halves. Verify with Intel Intrinsics Guide or uops.info.
Inspecting SIMD Registers (Debugging)¶
You can't printf a __m256. The standard pattern is a union, which lets you read individual lanes:
// Union for safe lane inspection (from hands-on-simd-programming)
union float8 {
__m256 v;
float a[8];
float8() : v(_mm256_setzero_ps()) {}
float8(__m256 x) : v(x) {}
};
// Usage
float8 result(_mm256_fmadd_ps(va, vb, vc));
printf("lane 0: %f, lane 3: %f\n", result.a[0], result.a[3]);
// Or with structured binding in C++17:
auto [l0,l1,l2,l3,l4,l5,l6,l7] = result.a;
Horizontal sum — reducing a vector to a scalar:
This pattern appears everywhere: dot products, attention scores, reductions.
// Sum all 8 floats in a __m256 to a scalar
float hsum(__m256 v) {
// Step 1: add pairs within each 128-bit half → [a+e, b+f, c+g, d+h | a+e, b+f, c+g, d+h]
__m128 lo = _mm256_castps256_ps128(v); // lower 128 bits
__m128 hi = _mm256_extractf128_ps(v, 1); // upper 128 bits
__m128 sum = _mm_add_ps(lo, hi); // add the two halves
// Step 2: horizontal adds until scalar
sum = _mm_hadd_ps(sum, sum); // [a+b+e+f, c+d+g+h, ...]
sum = _mm_hadd_ps(sum, sum); // [total, total, ...]
return _mm_cvtss_f32(sum); // extract lane 0
}
Stream stores for large write-once buffers:
// Non-temporal store: bypasses CPU cache entirely
// Use when writing large output buffers you won't read back soon
// (e.g., preprocessing output, frame buffer writes)
for (int i = 0; i < n; i += 8) {
__m256 result = /* compute */;
_mm256_stream_ps(&out[i], result); // skip cache, write direct to memory
}
_mm_sfence(); // memory fence — ensure all stream writes are visible
// When to use: write-once buffers > L3 cache size
// When NOT: if you'll read the data back soon (defeats the purpose)
Gather — loading from non-contiguous addresses:
// Gather: load from base_ptr + indices[i] * scale
float table[1024] = { /* lookup table */ };
__m256i indices = _mm256_set_epi32(7, 3, 15, 0, 42, 8, 1, 100); // arbitrary indices
__m256 gathered = _mm256_i32gather_ps(table, indices, 4); // scale=4 (sizeof float)
// Note: gather throughput ≈ scalar loads — no bandwidth benefit for the gather itself.
// The value: the *computation* after gather runs vectorized (8 FMAs instead of 8 scalar ops).
// Good for: embedding lookups, sparse vector ops, non-sequential access patterns.
Cache Lines and Memory Alignment¶
The CPU fetches memory in 64-byte cache lines — not individual bytes. This is the physical reason SIMD alignment matters and the root cause of most false-sharing bugs in multithreaded code.
Cache line = 64 bytes = 16 floats = 8 doubles = 2 AVX registers
[float 0][float 1]...[float 15] ← one cache line fetch
└─ AVX load ──┘└─ AVX load ──┘ ← if misaligned, spans two lines
Why it matters:
| Scenario | Cache line impact |
|---|---|
_mm256_load_ps aligned |
One cache line touched per load |
_mm256_loadu_ps misaligned |
May touch two cache lines — up to 2× memory traffic |
| False sharing (threads) | Two threads write different variables in same 64-byte line → cache ping-pong |
alignas(64) struct |
Guarantees struct starts at cache line boundary |
// False sharing — avoid this
struct Workers {
int counter_a; // same 64-byte line as counter_b
int counter_b;
};
// Fix: pad to cache line boundary
struct alignas(64) WorkerA { int counter; };
struct alignas(64) WorkerB { int counter; };
Rule: Align SIMD buffers to 32 bytes (alignas(32)) for AVX. Align per-thread data to 64 bytes (alignas(64)) to prevent false sharing.
Data Layout: AoS vs SoA¶
How you arrange data in memory is often the single highest-leverage optimization in SIMD and GPU code — more impactful than instruction selection.
Array of Structs (AoS) — natural, object-oriented, SIMD-unfriendly¶
Each particle is one contiguous object. Easy to pass around, easy to reason about.
struct Particle {
float x, y, z; // position
float vx, vy, vz; // velocity
};
Particle particles[4];
Memory layout — fields interleaved across all particles:
particle 0 particle 1 particle 2 particle 3
[x][y][z][vx][vy][vz][x][y][z][vx][vy][vz][x][y][z][vx][vy][vz][x][y][z][vx][vy][vz]
When you want to add dx to all x-positions:
// AoS: stride between consecutive x values = sizeof(Particle) = 24 bytes
for (int i = 0; i < N; i++)
particles[i].x += dx;
// An AVX load at &particles[0].x picks up: x0,y0,z0,vx0,vy0,vz0,x1,y1
// ^^^ garbage fields in the register
// You wanted: x0,x1,x2,x3,x4,x5,x6,x7 — but they're 24 bytes apart, not contiguous
The x-values are scattered in memory. SIMD cannot load them efficiently — it would need gather (_mm256_i32gather_ps), which has the same throughput as scalar loads.
Struct of Arrays (SoA) — SIMD-friendly¶
Each field gets its own contiguous array. All x values together, all y values together.
struct Particles {
float x[N], y[N], z[N];
float vx[N], vy[N], vz[N];
};
Particles p;
p.x[0]=1; p.x[1]=4; p.x[2]=7; // all x values contiguous
p.y[0]=2; p.y[1]=5; p.y[2]=8;
Memory layout — each field is a flat contiguous array:
x: [x0][x1][x2][x3][x4][x5][x6][x7]... ← one AVX load = 8 x-values
y: [y0][y1][y2][y3][y4][y5][y6][y7]...
z: [z0][z1][z2][z3]...
vx: [vx0][vx1][vx2]...
Now the SIMD loop is clean — one load, one add, one store:
// SoA: x values are contiguous → direct sequential AVX load
__m256 vdx = _mm256_set1_ps(dx);
for (int i = 0; i < N; i += 8) {
__m256 vx = _mm256_loadu_ps(&p.x[i]); // load x0..x7 — perfectly contiguous
_mm256_storeu_ps(&p.x[i], _mm256_add_ps(vx, vdx)); // add dx to all 8 at once
}
Side-by-side loop comparison¶
// AoS — compiler cannot auto-vectorize the x update (stride too large)
for (int i = 0; i < N; i++)
particles[i].x += dx; // stride = 24 bytes between x values
// SoA — compiler auto-vectorizes, or you write AVX explicitly
for (int i = 0; i < N; i++)
p.x[i] += dx; // stride = 4 bytes — perfectly sequential
Layout comparison¶
| AoS | SoA | AoSoA | |
|---|---|---|---|
| Memory pattern | xyzxyzxyz... |
xxx...yyy...zzz... |
[xxxx yyyy][xxxx yyyy]... |
| SIMD efficiency | Low — gather/scatter needed | High — sequential load | High + cache-friendly |
| GPU coalescing | Poor | Good | Good |
| Code readability | High — p[i].x |
Medium — p.x[i] |
Low — tiled indexing |
| Best for | Small structs, OOP | Physics, ML kernels, SIMD | Intel oneDNN, CUTLASS tiling |
When to choose each: - AoS — small datasets, all fields used together, readability matters more than throughput - SoA — large arrays, per-field operations, SIMD loops, GPU kernels - AoSoA — when you need SoA efficiency but also want cache-line-sized tiles for L1 reuse (oneDNN weight format, CUTLASS fragment layouts)
Deep learning connection: cuDNN NCHW vs NHWC is exactly this choice. NCHW = SoA over spatial dimensions (all red pixels, then all green). NHWC = AoS over channels (all channels for one pixel together). Tensor Core layouts (e.g.,
mma.syncfragments) use AoSoA — 16×16 tiles packed for register-level reuse. Choosing the wrong layout forces expensive transposes that dominate runtime.
std::simd (C++26, Experimental)¶
The future of portable SIMD — no intrinsics, no platform-specific code:
#include <experimental/simd>
namespace stdx = std::experimental;
void add_vectors(float* a, float* b, float* c, int n) {
using V = stdx::native_simd<float>; // Auto-selects best width
for (int i = 0; i < n; i += V::size()) {
V va(&a[i], stdx::element_aligned);
V vb(&b[i], stdx::element_aligned);
V vc = va + vb;
vc.copy_to(&c[i], stdx::element_aligned);
}
}
Available in GCC 11+ with -std=c++23 -lstdc++exp. Not yet production-ready but worth tracking.
Measurement First¶
"Measure, don't guess. The bottleneck is never where you think it is."
Before writing a single intrinsic, profile to confirm what kind of bottleneck you actually have. Optimizing the wrong thing wastes time.
# perf stat — hardware counter overview
perf stat -e instructions,cycles,cache-misses,fp_arith_inst_retired.256b_packed_single \
./my_program
# Key ratios to check:
# instructions/cycles < 2.0 → probably stalled on memory
# cache-misses high → data layout problem (try SoA)
# 256b_packed_single low → auto-vectorization failed, add -fopt-info-vec
# Confirm vectorization happened
g++ -O2 -march=native -fopt-info-vec-optimized my_code.cpp
# "loop vectorized using 32 byte vectors" = success
# VTune (Intel) — visual hotspots and vectorization analysis
vtune -collect hotspots ./my_program
Roofline Model¶
The Roofline model tells you whether your kernel is compute-bound or memory-bound — and therefore whether SIMD helps.
Peak FLOPS = freq × cores × SIMD_width × FMA_factor
Peak BW = DRAM bandwidth (GB/s)
AI = FLOPs / bytes_accessed (arithmetic intensity)
If AI > Peak FLOPS / Peak BW → compute-bound (SIMD / FMA helps)
If AI < Peak FLOPS / Peak BW → memory-bound (better layout / caching helps)
Example for a modern desktop CPU:
Peak AVX2 FP32 = 3.6 GHz × 8 lanes × 2 (FMA) = ~58 GFLOPS/core
Peak DRAM BW = ~50 GB/s
Ridge point AI = 58 / 50 ≈ 1.2 FLOP/byte
Vector add: AI = 4 bytes out / 8 bytes in = 0.5 FLOP/byte → memory-bound
Dot product: AI = 2N FLOP / 2N×4 bytes = 0.25 FLOP/byte → memory-bound
Matrix mult: AI = O(N³) FLOP / O(N²) bytes → compute-bound for large N ✓
Implication: SIMD gives the full 8× speedup only on compute-bound kernels. For memory-bound kernels (like vector add), the bottleneck is DRAM bandwidth — SIMD won't help much beyond reducing loop overhead.
SIMD → GPU: Mental Model Bridge¶
The jump from SIMD to CUDA looks large but the underlying idea is the same. The key difference is who picks the element index:
// ── SIMD mindset: one thread, one instruction covers N elements ──
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]);
__m256 vb = _mm256_load_ps(&b[i]);
_mm256_store_ps(&c[i], _mm256_add_ps(va, vb));
}
// You write the stride (i += 8). The hardware runs 8 lanes silently.
// ── GPU / SIMT mindset: one thread per element, hardware spawns thousands ──
__global__ void add(float* a, float* b, float* c, int n) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < n) c[i] = a[i] + b[i];
}
// You write one element. The hardware runs 32 (warp) or thousands in parallel.
| Aspect | AVX (CPU SIMD) | CUDA warp (GPU SIMT) |
|---|---|---|
| Parallel width | 8× float32 | 32 threads |
| Register file | __m256 (256-bit) |
32× 32-bit registers per thread |
| Memory | L1/L2 cache | Shared memory / global memory |
| Divergence cost | Branch kills vectorization | Warp divergence serializes threads |
| Programmer model | Explicit lanes (intrinsics) | Implicit (thread index) |
Connection to the Stack¶
- L1 (Application): cuDNN and CUTLASS use vectorized memory loads internally
- L2 (Compiler): MLIR's
vectordialect (Phase 4C) targets exactly this abstraction level - L5 (Architecture): When you design an accelerator's vector unit, you're designing custom SIMD hardware
- Sub-Track 3 (CUDA): GPU "SIMT" is SIMD scaled to thousands of threads — same mental model
Performance Traps and Common Mistakes¶
These are the issues that consume weeks of debugging in real HPC and ML systems. Learn them here.
| Trap | What happens | Fix |
|---|---|---|
shared_ptr in hot loops |
Atomic ref-count inc/dec every iteration → cache line contention | Use raw pointer or unique_ptr inside hot paths; bump shared_ptr count once outside the loop |
| False sharing | Two threads update different variables in the same 64-byte cache line → invisible serialization | Pad thread-local data to alignas(64) |
Unaligned _mm256_load_ps |
Segfault (#GP) at runtime, not compile time |
Use _mm256_loadu_ps unless you've verified 32-byte alignment with alignas(32) |
| Branching inside vectorized loops | Compiler cannot vectorize; scalar fallback runs | Move conditionals outside the loop; use SIMD blend instructions (_mm256_blendv_ps) |
| Memory-bound SIMD | SIMD adds complexity, negligible speedup | Profile first (Roofline); restructure data layout (SoA) before adding intrinsics |
par_unseq with dependencies |
Silent data corruption or undefined behavior | Only use when elements are completely independent — no reads from neighboring indices |
| Ignoring NUMA | Multi-socket systems: thread allocates memory on socket 0, socket 1 thread reads it → 2× bandwidth | Allocate memory on the NUMA node that will use it (numactl, mbind) |
Projects¶
- Modern C++ warmup — Write a matrix class using
std::vector, move semantics, templates, and operator overloading. Verify that returning a large matrix from a function uses move (not copy) by checking with-fno-elide-constructors. - Lambda benchmark — Implement the same computation three ways: raw loop,
std::for_eachwith lambda, andstd::transformwith lambda. Benchmark all three with-O2. Verify the compiler generates identical assembly (-S). - Parallel STL sort — Sort 10M floats with
std::execution::seq,par, andpar_unseq. Measure speedup. Plot time vs array size. - Auto-vectorization — Write a dot product loop. Compile with
-O2 -march=native -fopt-info-vec. Check if the compiler vectorized it. If not, restructure the loop until it does. - AVX intrinsics dot product — Implement dot product using
_mm256_fmadd_ps. Benchmark against scalar and auto-vectorized versions. Measure GFLOPS. - Aligned vs unaligned — Benchmark
_mm256_load_ps(aligned) vs_mm256_loadu_ps(unaligned) for a large array. Measure the performance difference. - AoS → SoA benchmark — Implement vector dot product for an array of
Vec3structs (AoS) and compare to the SoA layout version. Observe the speedup gap (expect 10–40× from layout alone). - Horizontal sum — Implement a dot product that uses
_mm256_fmadd_psto accumulate 8 at a time, then uses the horizontal sum idiom to reduce to a scalar. Verify against scalar result. - Tiny MHA block — Implement a batched multi-head attention forward pass for seq=8, head_dim=16, heads=4 using raw AVX2 intrinsics. Pre-transpose weight matrices. Benchmark against scalar version. Observe that the speedup is ~2–3×, not 8× — explain why (memory-bound).
Resources¶
Learning¶
| Resource | What it covers |
|---|---|
| A Tour of C++ (Stroustrup) | Modern C++ overview, concise |
| cppreference.com | Definitive C++ reference |
| C++ Concurrency in Action (Williams) | Threads, atomics, memory model |
| SIMD for C++ Developers (const.me) | Best practitioner tutorial: loads, arithmetic, shuffles, masking, cross-lane gotchas. 23 pages. Read this first. |
| hands-on-simd-programming | Progressive labs from basic intrinsics → image processing → full MHA block → quantized GPT decoder. Run ./runme.sh to see benchmarks. |
| awesome-simd | Curated index of real-world SIMD libraries, tools, blogs, and references |
| Agner Fog Optimization Guides | Definitive reference on CPU micro-architecture, instruction latencies, calling conventions |
Reference and Tooling¶
| Resource | What it covers |
|---|---|
| Intel Intrinsics Guide | Searchable reference for every x86 SIMD intrinsic |
| uops.info | Instruction latency, throughput, and port usage for every CPU micro-architecture |
| Compiler Explorer (godbolt.org) | See assembly output for any C++ code — verify vectorization happened |
| SIMD-Visualiser | Browser-based visual diagram of what each intrinsic does to register contents |
| Felix Cloutier x86 reference | HTML x86/x64 instruction reference |
Portable SIMD Libraries (skip raw intrinsics)¶
| Library | Language | What it is |
|---|---|---|
| Google Highway | C++ | Best portable SIMD: length-agnostic, runtime dispatch, SSE/AVX/AVX-512/NEON/SVE |
| xsimd | C++ | Header-only wrappers for SSE/AVX/AVX-512/NEON |
| SIMDe | C/C++ | Emulates x86 intrinsics on ARM and other targets |
| Intel ISPC | ISPC | C-like language that compiles to optimal SIMD for SSE/AVX/AVX-512/NEON/PS5/Xbox |
Real-World SIMD Libraries to Study¶
| Library | What it does |
|---|---|
| simdjson | JSON parsing at >2 GB/s using SIMD — canonical example of real-world vectorized parsing |
| SimSIMD | SIMD-accelerated similarity measures (cosine, L2) for embedding vectors |
| ncnn | On-device neural network inference with hand-tuned SIMD for ARM and x86 |
| StringZilla | SIMD substring search, fuzzy matching, sorting |
Blogs (practitioners writing production SIMD code)¶
| Blog | Author |
|---|---|
| lemire.me/blog | Daniel Lemire — simdjson, SIMDComp, practical vectorization |
| branchfree.org | Geoff Langdale — branchless and SIMD techniques |
| 0x80.pl/notesen | Wojciech Muła — low-level SIMD algorithms |
Next¶
→ Sub-Track 2 — OpenMP and oneTBB — scale from SIMD to multi-core CPU parallelism.