C++ and SIMD (Phase 1 §4 — Sub-Track 1)

Parent: C++ and Parallel Computing

Modern C++ gives you high-level syntax with low-level performance. Master the language features first — they are the building blocks of every parallel framework that follows.

Prerequisites: Basic C programming (from Phase 1 §3 OS work). Familiarity with pointers, arrays, functions.


Why Modern C++ First

Every parallel computing technology in this curriculum — SIMD intrinsics, OpenMP, oneTBB, CUDA, HIP, SYCL — is built on C++. But not the C++ of the 1990s. Modern C++ (C++11 through C++17) introduced features specifically designed for performance, safety, and parallel programming:

  • Lambdas → used in oneTBB, SYCL, modern CUDA, parallel STL
  • Move semantics → zero-copy data transfer, critical for GPU memory
  • Templates → generic CUDA kernels, CUTLASS, CuTe
  • Parallel STL → built-in SIMD + threading with one line of code
  • constexpr → compile-time computation for HPC optimization

Learn these features now. You'll use every single one in Sub-Tracks 2–5.


Section 1: Modern C++17 for Parallel Computing

1.1 Type Deduction (auto, decltype)

auto — let the compiler figure out the type:

auto x = 10;          // int
auto y = 3.14;        // double
auto z = vec.begin(); // std::vector<int>::iterator — much cleaner

auto reduces verbosity and prevents type mismatch bugs. Use it for local variables, especially with complex types (iterators, template results).


The problem decltype solves

In a generic multiply function, what is the return type?

template<typename T, typename U>
??? multiply(T a, U b) {
    return a * b;
}
// int   * int    → int
// int   * double → double
// float * double → double

Workarounds without decltype — both inadequate:

// Option 1: force a type — loses precision, not generic
template<typename T, typename U>
double multiply(T a, U b) { return a * b; }

// Option 2: std::common_type — verbose, breaks for custom operators
template<typename T, typename U>
typename std::common_type<T, U>::type multiply(T a, U b) { return a * b; }

C++11 solution — trailing return type with decltype:

template<typename T, typename U>
auto multiply(T a, U b) -> decltype(a * b) {
    return a * b;
}

The compiler deduces the exact type of the expression a * b — works for built-in types, user-defined operators, and arbitrarily complex template expressions.

C++14 simplification — decltype inferred from the return statement:

template<typename T, typename U>
auto multiply(T a, U b) {   // return type deduced automatically
    return a * b;
}

No trailing -> decltype(...) needed. The compiler infers from the return expression.


auto vs decltype — the key difference

auto strips references and const. decltype preserves exact type semantics.

int x = 5;
int& ref = x;

auto        a = ref;  // a is int   — reference stripped
decltype(ref) b = x;  // b is int&  — reference preserved

const int c = 10;
auto        d = c;    // d is int        — const stripped
decltype(c) e = c;    // e is const int  — const preserved

Rule of thumb:

  • Use auto → 90% of the time (local variables, loop iterators, lambda captures)
  • Use decltype → when you need exact type semantics (template return types, forwarding, trait-style code)


Comparison: old vs modern

| | Pre-C++11 | C++11 decltype | C++14 auto return |
|---|---|---|---|
| Return type deduction | Manual, error-prone | Exact expression type | Inferred from return |
| Generic correctness | Limited | Exact | Exact |
| Custom operator support | Hard | Yes | Yes |
| Readability | Verbose | Clean | Cleanest |

Why it matters for parallel computing: CUTLASS, CuTe, oneAPI DPC++, and ROCm all use auto + decltype extensively because:

  • Kernels are templated on precision (float, half, int8_t)
  • Types of intermediate expressions depend on template parameters
  • decltype lets the library express "whatever type a*b produces" without hardcoding it

You will see patterns like decltype(auto), std::declval<T>(), and trailing return types throughout CUDA template libraries — recognizing them now saves significant confusion later.
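
A minimal sketch of both patterns so they are recognizable later (the function and alias names here are illustrative):

#include <type_traits>
#include <utility>
#include <vector>

// decltype(auto): deduce the return type WITH references and const preserved.
// Plain auto would strip the reference and return a copy.
template<typename Container>
decltype(auto) first_element(Container& c) {
    return c[0];   // for std::vector<T>, this returns T&, not T
}

// std::declval<T>(): name a hypothetical T value in unevaluated contexts,
// here to spell "the type of a*b" without constructing anything
template<typename T, typename U>
using product_t = decltype(std::declval<T>() * std::declval<U>());

static_assert(std::is_same_v<product_t<int, double>, double>);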


1.2 Smart Pointers (Memory Safety via RAII)

The problem with raw pointers:

int* p = new int(5);
// ... 200 lines of code ...
// Did you remember to delete p? Memory leak.
// Did you delete it twice? Undefined behavior.

The solution — RAII (Resource Acquisition Is Initialization):

#include <memory>

// Exclusive ownership — automatically freed when scope ends
auto p = std::make_unique<int>(5);
// No delete needed. Ever.

// Shared ownership — freed when last reference dies
auto q = std::make_shared<int>(10);
auto r = q;  // Reference count = 2
// Freed when both q and r go out of scope

Three smart pointer types:

| Type | Ownership | Overhead | Use when |
|---|---|---|---|
| std::unique_ptr | Exclusive (one owner) | Zero (same as raw pointer) | Default choice. Single owner. |
| std::shared_ptr | Shared (reference counted) | Atomic ref count (control block overhead, impl-dependent) | Multiple owners needed |
| std::weak_ptr | Non-owning observer | Minimal | Break circular references |

Rule: Never use new/delete in modern C++. Use make_unique and make_shared.


Breaking Cycles with std::weak_ptr

shared_ptr uses reference counting. The count only reaches zero — and the memory only frees — when no shared_ptr points to the object. If two objects hold shared_ptr to each other, neither count ever reaches zero. This is a reference cycle and it silently leaks memory forever.

The cycle problem:

struct Node {
    std::shared_ptr<Node> next;   // strong reference
    int value;
};

auto a = std::make_shared<Node>();  // a: refcount = 1
auto b = std::make_shared<Node>();  // b: refcount = 1

a->next = b;   // b refcount = 2
b->next = a;   // a refcount = 2

// a and b go out of scope:
//   a's refcount drops to 1 (b->next still holds it)
//   b's refcount drops to 1 (a->next still holds it)
//   Neither reaches 0 → MEMORY LEAK

The fix — weak_ptr for the back-edge:

weak_ptr observes a shared_ptr without incrementing the reference count. The observed object can still be destroyed normally. Before using a weak_ptr, you must lock() it — this returns a shared_ptr that is either valid (object alive) or empty (object was destroyed).

struct Node {
    std::shared_ptr<Node> next;   // strong: owns the next node
    std::weak_ptr<Node>   prev;   // weak: observes without owning
    int value;
};

auto a = std::make_shared<Node>();  // a: refcount = 1
auto b = std::make_shared<Node>();  // b: refcount = 1

a->next = b;        // b refcount = 2  (strong)
b->prev = a;        // a refcount still = 1  (weak — no increment)

// a and b go out of scope:
//   a's refcount drops to 0 → destroyed → a->next released
//   b's refcount drops to 0 → destroyed
//   No leak.

// Accessing through weak_ptr — always check if still alive:
if (auto owner = b->prev.lock()) {   // lock() returns shared_ptr or nullptr
    std::cout << "prev value: " << owner->value << "\n";
} else {
    std::cout << "prev was destroyed\n";
}

When cycles appear in practice:

| Pattern | Cycle edge | Fix |
|---|---|---|
| Doubly-linked list | prev pointer | prev as weak_ptr |
| Tree with parent pointer | parent pointer | parent as weak_ptr |
| Observer / event system | Subscriber holds reference to publisher | Subscriber stores weak_ptr to publisher |
| Computation graph (ML) | Back-edge from output node to input | Back-edges as weak_ptr |

weak_ptr API:

auto sp = std::make_shared<int>(42);
std::weak_ptr<int> wp = sp;          // observe, no refcount increment

wp.expired();                        // true if the object was destroyed
wp.use_count();                      // same as sp.use_count() (0 if expired)
auto sp2 = wp.lock();                // returns shared_ptr<int> or nullptr — ALWAYS use this
if (sp2) { /* safe to use */ }

Do not dereference a weak_ptr directly. Always call .lock() and check the result. The object could be destroyed between expired() returning false and your next line.


Why it matters for parallel computing:

  • GPU memory (cudaMalloc/cudaFree) follows the same RAII pattern — wrap it in a smart pointer or custom RAII class (see the sketch below)
  • Thread-safe reference counting (shared_ptr) is essential for multi-threaded data sharing
  • Computation graphs (PyTorch autograd, JAX jaxpr) are graphs with cycles — real implementations use weak back-edges to avoid memory leaks
  • Memory leaks in HPC = your 80 GB GPU runs out of memory mid-training
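
A minimal sketch of that RAII pattern for device memory, using unique_ptr with a custom deleter (error handling omitted; compiling this requires the CUDA toolkit):

#include <cstddef>
#include <memory>
#include <cuda_runtime.h>   // cudaMalloc / cudaFree

// Deleter type: unique_ptr calls this automatically when the owner is destroyed
struct CudaDeleter {
    void operator()(float* p) const { cudaFree(p); }
};

using DeviceBuffer = std::unique_ptr<float, CudaDeleter>;

DeviceBuffer make_device_buffer(std::size_t n) {
    float* raw = nullptr;
    cudaMalloc(&raw, n * sizeof(float));   // check the return code in real code
    return DeviceBuffer(raw);              // from here on, cudaFree is guaranteed
}

The same shape works for any acquire/release pair: file handles, pinned host memory, GPU streams.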


1.3 Range-Based Loops

std::vector<float> data = {1.0, 2.0, 3.0, 4.0};

// Modern (clean, less error-prone)
for (auto& x : data) {
    x *= 2.0f;
}

// Old style (verbose, index bugs)
for (int i = 0; i < data.size(); i++) {
    data[i] *= 2.0f;
}

Why it matters: Range-based loops work naturally with STL algorithms and parallel STL (std::execution::par). They're also easier for the compiler to auto-vectorize.


1.4 Structured Bindings (C++17)

Unpack tuples, pairs, and structs directly:

// Pair unpacking
std::pair<int, float> p = {1, 2.5f};
auto [id, value] = p;   // id = 1, value = 2.5f

// Map iteration (clean)
std::map<std::string, int> scores = {{"alice", 95}, {"bob", 87}};
for (auto& [name, score] : scores) {
    std::cout << name << ": " << score << "\n";
}

// Function returning multiple values
auto [min_val, max_val] = std::minmax_element(data.begin(), data.end());

Why it matters: Cleaner data handling in performance-critical code. Used in profiling results, benchmark data, multi-return-value functions.


1.5 Lambda Functions (Critical for Parallel Computing)

Lambdas are the single most important C++ feature for parallel programming. Every parallel framework uses them.

Syntax:

[ captures ] ( parameters ) -> return_type { body }

  • captures → which variables from the outer scope to bring in
  • parameters → like a normal function
  • return_type → optional, usually inferred
  • body → code block

Basic lambda:

auto add = [](int a, int b) {
    return a + b;
};

int result = add(3, 4);  // 7

Capture Modes

| Mode | Meaning | Thread-safe? | Use when |
|---|---|---|---|
| [=] | All outer vars by value (copy) | Yes | Default for parallel callbacks |
| [&] | All outer vars by reference | No | Single-threaded only |
| [var] | Specific var by value | Yes | Explicit, recommended |
| [&var] | Specific var by reference | No | Read-only, single-threaded |
| [=, &var] | Mixed: most by value, one by ref | Partial | Fine-grained control |
| [var = std::move(obj)] | Move object into lambda | Yes | Transfer ownership (C++14) |

Examples:

int multiplier = 10;

auto scale_copy     = [=](int x)       { return x * multiplier; };  // copy — safe in threads
auto scale_ref      = [&](int x)       { return x * multiplier; };  // ref — dangerous in parallel
auto scale_specific = [multiplier](int x) { return x * multiplier; };  // best practice

Mutable Lambdas

By default, values captured by value are const inside the lambda. Use mutable to modify the internal copy without touching the original:

int a = 10;

auto f = [a]() mutable {
    a += 5;
    std::cout << a;  // prints 15
};
f();
std::cout << a;  // still 10 — original unchanged

mutable modifies the lambda's own copy, not the outer variable.


Generic Lambdas (C++14)

Use auto parameters to make a lambda templated automatically — no explicit template<> needed:

auto add = [](auto a, auto b) { return a + b; };

add(3, 4);        // int + int = 7
add(3.14, 2.71);  // double + double = 5.85
add(1.0f, 2.0f);  // float + float

This is used heavily in parallel STL, oneTBB, and SYCL to write type-generic kernels.


Thread Safety Rules

Capture by value [x] — each thread gets its own copy. Safe.

Capture by reference [&x] — all threads share the same variable. Race condition if any thread writes to it.

std::vector<int> data = {1, 2, 3, 4};
int multiplier = 2;

// Safe: each element read is independent, multiplier captured by value
std::for_each(std::execution::par, data.begin(), data.end(),
              [multiplier](int& x) { x *= multiplier; });

// Unsafe: if multiplier were captured by ref and modified by any thread
std::for_each(std::execution::par, data.begin(), data.end(),
              [&](int& x) { x *= multiplier; });  // race condition if multiplier changes

Rule: prefer [=] or named value captures in parallel loops. Use [&] only in single-threaded code or when the captured variables are read-only.


Lambda Lifetime — Dangling Reference Trap

If a lambda outlives the scope of its captured references, you get a dangling reference:

std::function<int()> make_lambda() {
    int local = 42;
    return [&]() { return local; };  // DANGLING — local is destroyed when function returns
}

auto f = make_lambda();
f();  // undefined behavior

Fix 1 — capture by value:

return [local]() { return local; };  // safe copy

Fix 2 — move ownership into the lambda (C++14):

auto buf = std::make_unique<int>(42);
std::thread t([buf = std::move(buf)]() {
    std::cout << *buf;  // safe: ownership moved into lambda
});
t.join();

[buf = std::move(buf)] transfers ownership of the unique_ptr into the lambda. The lambda now owns the resource — it cannot outlive it.


Move-Into-Lambda — Deep Dive

This is worth understanding completely because it appears constantly in parallel and async code.

Why unique_ptr cannot be captured by value:

std::unique_ptr<int> ptr = std::make_unique<int>(42);

auto f = [ptr]() { std::cout << *ptr; };  // DOES NOT COMPILE
// unique_ptr has no copy constructor — copying would violate unique ownership

unique_ptr expresses "exactly one owner". Capture-by-value would require copying the pointer, creating two owners. The compiler forbids it.

Solution — capture by move (C++14):

auto f = [p = std::move(ptr)]() {
    std::cout << *p << "\n";   // lambda owns it, safe to use
};

// What happened to ptr?
std::cout << (ptr ? "not empty" : "empty") << "\n";  // "empty" — ptr is now nullptr
f();  // prints 42

Ownership transfer visualized:

Before move:
  ptr ──────────────► [ 42 ]   (heap)

After [p = std::move(ptr)]:
  ptr ──► nullptr
  lambda.p ─────────► [ 42 ]   (same heap allocation, new owner)

No copy. No new allocation. The lambda took the backpack — the original holder is now empty.

Thread ownership — each thread gets its own resource:

#include <thread>
#include <memory>

int main() {
    auto buf = std::make_unique<int[]>(1000);  // 4 KB buffer

    std::thread t([b = std::move(buf)]() {
        b[0] = 42;
        std::cout << b[0] << "\n";  // lambda owns the buffer exclusively
    });

    t.join();
    // buf is empty here — the thread owned and (after join) released it
}

The thread lambda owns buf exclusively. No data race. No need for a mutex. When the thread finishes, the lambda destructs and unique_ptr frees the memory automatically.

Moving other moveable types:

Any type with a move constructor works — not just unique_ptr:

std::vector<float> weights(1000000);   // 4 MB
std::string config = load_config();
auto handle = open_gpu_context();      // hypothetical RAII GPU handle

// Move all three into a thread lambda — zero copying
std::thread worker([
    w = std::move(weights),
    cfg = std::move(config),
    ctx = std::move(handle)
]() {
    // worker owns all three resources
    run_inference(w, cfg, ctx);
});
worker.join();

| What you move | Why |
|---|---|
| unique_ptr<T> | Exclusive GPU/CPU resource handle |
| vector<float> | Large weight/activation buffer |
| string | Config or serialized model data |
| std::thread | Transfer thread ownership to lambda |
| Custom RAII handle | GPU context, file, socket |

Rule: If a type is non-copyable (or copying is expensive), and the lambda needs to own it or outlive the current scope, use [x = std::move(obj)].


Device Lambdas (CUDA / SYCL)

Modern CUDA and SYCL support lambdas on GPU device code:

// CUDA — device lambda inside kernel
__global__ void kernel(float* a, float* b, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) {
        auto square = [=](float x) { return x * x; };  // [=] only — no refs on device
        a[i] = square(b[i]);
    }
}

// SYCL — lambda IS the kernel
q.parallel_for(sycl::range{N}, [=](sycl::id<1> i) {
    C[i] = A[i] + B[i];
});

[&] is not allowed in GPU device lambdas. Device code cannot reference host memory addresses. Always use [=] (capture by value) for GPU lambdas.


Lambda Quick Reference

| Feature | Syntax | Notes |
|---|---|---|
| Capture all by value | [=] | Safe in threads |
| Capture all by ref | [&] | Dangerous in parallel |
| Specific var by value | [var] | Best practice |
| Specific var by ref | [&var] | Single-threaded only |
| Generic (auto params) | [](auto x, auto y){} | C++14, type-generic |
| Mutable | [x]() mutable {} | Modify internal copy |
| Move into lambda | [x = std::move(obj)] | Transfer ownership, C++14 |
| Immediately invoked | [&]() { ... }() | IIFE — run in-place (see example below) |
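
The last row deserves a quick example: an immediately invoked lambda is the standard way to give a const variable a multi-step initialization.

#include <vector>

// IIFE: build the table in place, then never mutate it again
const std::vector<int> squares = [] {
    std::vector<int> t(256);
    for (int i = 0; i < 256; i++) t[i] = i * i;
    return t;
}();   // trailing () means the lambda runs immediately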

Where lambdas are used in this curriculum:

| Framework | Lambda usage |
|---|---|
| Parallel STL | std::sort(std::execution::par, v.begin(), v.end(), [](auto a, auto b) { return a > b; }); |
| oneTBB | tbb::parallel_for(0, N, [&](int i) { C[i] = A[i] + B[i]; }); |
| OpenMP (C++17) | #pragma omp parallel for over loops; lambdas as #pragma omp task bodies |
| CUDA (modern) | Device lambdas in __device__ context — [=] only |
| SYCL | q.parallel_for(range, [=](id<1> i) { C[i] = A[i] + B[i]; }); |

Key message: If you don't understand lambdas and captures, you cannot write parallel code in modern C++.


1.6 std::optional, std::variant, std::any (C++17)

std::optional — a value that may or may not exist:

#include <optional>

std::optional<int> find_index(const std::vector<int>& v, int target) {
    for (int i = 0; i < (int)v.size(); i++) {
        if (v[i] == target) return i;
    }
    return std::nullopt;  // "not found" — no magic numbers like -1
}

auto result = find_index(data, 42);
if (result) {
    std::cout << "Found at index " << *result << "\n";
}

std::variant — type-safe union:

union vs std::variant — why the old way is dangerous

The old C union stores multiple types in the same memory, but has no idea which type is currently active:

union Data {
    int   i;
    float f;
};

Data d;
d.i = 42;
std::cout << d.f << "\n";  // undefined behavior — reading int bits as float

Memory layout of a union — 4 bytes, no type information:

+------------------+
|  42 (int bits)   |   <- what's stored
+------------------+
  ^ no type tag     <- compiler has no idea which member is active

Reading the wrong member is undefined behavior. No compiler error, no runtime check — just wrong results or crashes, often appearing only in release builds.

std::variant adds a type tag alongside the stored value:

+------------------+
|  3.14f (float)   |   <- stored data
+------------------+
  type tag: float       <- runtime always knows which type is active

#include <variant>

std::variant<int, float, std::string> v;
v = 42;         // holds int   — type tag = int
v = 3.14f;      // holds float — type tag = float
v = "hello";    // holds string — type tag = string

// std::visit dispatches to the right lambda branch based on the type tag
std::visit([](auto&& val) { std::cout << val << "\n"; }, v);

union vs std::variant comparison

| | union | std::variant |
|---|---|---|
| Type tracking | None — programmer's responsibility | Runtime type tag stored alongside value |
| Wrong-type read | Undefined behavior, silent | std::bad_variant_access exception |
| Size overhead | Zero (just the max member size) | Max member size + small tag (usually 1–8 bytes) |
| Parallel safety | Dangerous — type confusion across threads | Safe — each element carries its own type info |
| Use in HPC | Avoid in new code | Heterogeneous arrays, command dispatch, AST nodes |
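
The "wrong-type read" row in action: variant's checked-access API, a short sketch.

#include <variant>

std::variant<int, float> v = 42;            // type tag = int

int i = std::get<int>(v);                   // OK, tag matches
// std::get<float>(v);                      // throws std::bad_variant_access

if (std::holds_alternative<int>(v)) {       // query the tag first
    // safe to call std::get<int>(v) here
}
if (auto* p = std::get_if<float>(&v)) {     // pointer form: nullptr on mismatch
    // not reached: v currently holds int
}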

std::visit — type-dispatching with a lambda

std::visit calls your lambda with the actually stored type, not a generic variant:

std::variant<int, float> v = 3.14f;

// Generic lambda — works for any type in the variant
std::visit([](auto&& x) {
    std::cout << x << "\n";                           // prints 3.14
    std::cout << typeid(x).name() << "\n";            // prints "f" (float)
}, v);

// Typed overloads using the overload pattern (C++17) — needs this two-line helper:
template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;  // deduction guide

std::visit(overloaded{
    [](int   x) { std::cout << "int: "   << x << "\n"; },
    [](float x) { std::cout << "float: " << x << "\n"; },
}, v);  // prints: float: 3.14

Parallel example — mixed-type array processed safely

#include <variant>
#include <vector>
#include <algorithm>
#include <execution>

std::vector<std::variant<int, float>> data = {1, 2.5f, 3, 4.5f};

// Double every element in parallel — type-safe, no UB
std::for_each(std::execution::par, data.begin(), data.end(),
    [](auto& val) {
        std::visit([](auto&& x) { x *= 2; }, val);
    });

// Output: 2  5  6  9
for (auto& v : data)
    std::visit([](auto&& x) { std::cout << x << " "; }, v);

Each element carries its own type tag, so parallel threads never confuse int and float even when accessing adjacent elements.

Why it matters for HPC: Compiler IR nodes, kernel dispatch tables, and mixed-precision computation graphs (FP32/FP16/INT8 hybrid inference) all need to store heterogeneous types safely. std::variant + std::visit replaces the old void*-with-a-type-enum pattern with zero undefined behavior.
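
A minimal sketch of that dispatch pattern. The command types here are illustrative stand-ins for real kernel-launch descriptors:

#include <cstddef>
#include <iostream>
#include <type_traits>
#include <variant>

struct Gemm { int m, n, k; };               // hypothetical launch descriptors
struct Copy { std::size_t bytes; };

using Command = std::variant<Gemm, Copy>;

void dispatch(const Command& cmd) {
    std::visit([](const auto& c) {
        using T = std::decay_t<decltype(c)>;
        if constexpr (std::is_same_v<T, Gemm>)
            std::cout << "launch GEMM " << c.m << "x" << c.n << "x" << c.k << "\n";
        else
            std::cout << "copy " << c.bytes << " bytes\n";
    }, cmd);
}

// dispatch(Gemm{128, 128, 64});   // prints: launch GEMM 128x128x64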


1.7 Move Semantics (Performance Core)

The problem — expensive copies:

std::vector<float> create_large_buffer() {
    std::vector<float> buf(10000000);  // 40 MB
    // ... fill buffer ...
    return buf;  // Does this copy 40 MB? NO — move semantics!
}

Copy vs move:

std::vector<int> a = {1, 2, 3, 4, 5};

// Copy: duplicates all data (expensive)
std::vector<int> b = a;           // b is a copy, a unchanged

// Move: transfers ownership (cheap — just pointer swap)
std::vector<int> c = std::move(a); // c owns the data, a is now empty

How move works internally:

Before move:
  a → [heap: 1, 2, 3, 4, 5]    (owns the buffer)

After std::move(a) → c:
  a → nullptr                    (empty, moved-from)
  c → [heap: 1, 2, 3, 4, 5]    (now owns the buffer)

No data was copied. Just three pointer/size assignments. O(1) instead of O(n).

When to use std::move:

  • Returning large objects from functions (the compiler often elides the copy automatically — NRVO)
  • Transferring ownership of buffers to another thread
  • Moving data into containers: vec.push_back(std::move(large_object));

Why it matters for parallel computing:

  • GPU memory transfers are expensive. Move semantics = zero-copy mindset.
  • Thread pools move work items between threads without copying.
  • CUDA's cudaMemcpyAsync + pinned memory is the GPU equivalent of move semantics.


1.8 Multithreading Basics (std::thread, std::mutex)

Launching a thread:

#include <thread>

void compute(int id) {
    std::cout << "Thread " << id << " running\n";
}

int main() {
    std::thread t1(compute, 1);
    std::thread t2(compute, 2);

    t1.join();  // Wait for t1 to finish
    t2.join();  // Wait for t2 to finish
}

Protecting shared data with mutex:

#include <mutex>

std::mutex mtx;
int shared_counter = 0;

void increment(int n) {
    for (int i = 0; i < n; i++) {
        std::lock_guard<std::mutex> lock(mtx);  // RAII lock
        shared_counter++;
    }  // lock released here automatically
}

Better: std::scoped_lock (C++17) for multiple mutexes:

std::mutex m1, m2;
{
    std::scoped_lock lock(m1, m2);  // Locks both, prevents deadlock
    // ... critical section ...
}

std::atomic for simple shared variables:

#include <atomic>

std::atomic<int> counter{0};

void increment(int n) {
    for (int i = 0; i < n; i++) {
        counter++;  // Thread-safe, no mutex needed
    }
}

Why it matters: std::thread and std::mutex are the primitives that OpenMP and oneTBB abstract over. Understanding them helps you debug race conditions and deadlocks in any parallel framework.
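
To make that concrete, here is a minimal hand-rolled parallel_for built only on std::thread — a rough sketch of the loop-splitting OpenMP and oneTBB implement (real runtimes reuse a thread pool instead of spawning threads per call):

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Static chunking: split [begin, end) into one contiguous chunk per thread
void parallel_for(int begin, int end, unsigned num_threads,
                  const std::function<void(int)>& body) {
    num_threads = std::max(1u, num_threads);
    std::vector<std::thread> workers;
    int chunk = (end - begin + (int)num_threads - 1) / (int)num_threads;
    for (unsigned t = 0; t < num_threads; t++) {
        int lo = begin + (int)t * chunk;
        int hi = std::min(end, lo + chunk);
        if (lo >= hi) break;
        workers.emplace_back([lo, hi, &body] {   // each thread owns one chunk
            for (int i = lo; i < hi; i++) body(i);
        });
    }
    for (auto& w : workers) w.join();            // barrier: wait for every chunk
}

// Usage: parallel_for(0, N, std::thread::hardware_concurrency(),
//                     [&](int i) { c[i] = a[i] + b[i]; });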


1.9 STL Algorithms and Parallel STL (C++17)

STL algorithms are pre-built, optimized functions that operate on any container. They eliminate manual loops, prevent index bugs, and — with C++17 execution policies — run in parallel with a single extra argument.

Core STL Algorithms

| Algorithm | What it does | Example |
|---|---|---|
| std::sort | Sort elements | sort(v.begin(), v.end()) |
| std::for_each | Apply function to every element | for_each(v.begin(), v.end(), f) |
| std::transform | Apply function, write to output | transform(a, a+n, b, out, f) |
| std::reduce | Parallel-safe sum/product/etc. | reduce(v.begin(), v.end()) |
| std::find_if | Search by predicate | find_if(v.begin(), v.end(), pred) |
| std::minmax_element | Min and max in one pass | auto [lo, hi] = minmax_element(...) |
| std::count_if | Count matching elements | count_if(v.begin(), v.end(), pred) |
| std::copy_if | Filtered copy | copy_if(src, src+n, dst, pred) |

Sequential example — sort and iterate

#include <vector>
#include <algorithm>
#include <iostream>

std::vector<int> v = {4, 2, 7, 1, 5};

std::sort(v.begin(), v.end());

for (auto x : v) std::cout << x << " ";  // 1 2 4 5 7

Lambda + algorithm — custom behavior injected

std::vector<int> v = {1, 2, 3, 4, 5};

// Multiply every element by 2 — no loop index, no off-by-one
std::for_each(v.begin(), v.end(), [](int& x) { x *= 2; });
// v = {2, 4, 6, 8, 10}

// Transform: square each element into a new vector
std::vector<int> sq(v.size());
std::transform(v.begin(), v.end(), sq.begin(), [](int x) { return x * x; });

// Find first element > 5
auto it = std::find_if(v.begin(), v.end(), [](int x) { return x > 5; });

Parallel STL — one argument changes everything

C++17 execution policies parallelize any STL algorithm:

#include <algorithm>
#include <execution>
#include <vector>

std::vector<float> data(10'000'000);

// Sequential (default, single core)
std::sort(std::execution::seq,      data.begin(), data.end());

// Parallel (multi-threaded across all cores)
std::sort(std::execution::par,      data.begin(), data.end());

// Parallel + SIMD (threads + vectorization)
std::sort(std::execution::par_unseq, data.begin(), data.end());

Execution policies:

| Policy | Behavior | When to use |
|---|---|---|
| seq | Sequential | Small data, debugging |
| par | Multi-threaded | Large data, independent elements |
| par_unseq | Multi-threaded + SIMD | Maximum throughput, no dependencies |
| unseq (C++20) | SIMD only (single thread) | Vectorizable, single core |

Parallel transform, reduce, for_each

// Parallel transform: c[i] = a[i] + b[i]
std::transform(std::execution::par,
    a.begin(), a.end(), b.begin(), c.begin(),
    [](float x, float y) { return x + y; });

// Parallel reduce: sum all elements
float total = std::reduce(std::execution::par, data.begin(), data.end());

// Parallel reduce with custom op: product
float product = std::reduce(std::execution::par,
    data.begin(), data.end(), 1.0f, std::multiplies<float>{});

// Parallel for_each + SIMD: sqrt every element
std::for_each(std::execution::par_unseq, data.begin(), data.end(),
    [](float& x) { x = std::sqrt(x); });

Manual loop vs STL vs Parallel STL

| | Manual loop | STL + lambda | Parallel STL |
|---|---|---|---|
| Code length | Long | Short | Short |
| Index bug risk | Medium | None | None |
| Parallelism | Manual (std::thread) | Manual | Automatic |
| SIMD | Compiler may vectorize | Compiler may vectorize | par_unseq forces it |
| Debugging | Hard | Easy | Harder (non-deterministic) |

par_unseq requirements: Loop body must have no data races AND no loop-carried dependencies (a[i] must not read a[i-1]). Violating either produces undefined behavior — not a compile error, not even consistent wrong results.

Warning — not always faster: Parallel STL has overhead (thread pool spin-up, synchronization). For small arrays (< ~100K elements), seq is often faster. In HPC and production ML systems, teams frequently prefer OpenMP or oneTBB for more predictable, tunable, and debuggable parallelism. Benchmark before committing to par or par_unseq.
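
A minimal benchmarking sketch with std::chrono; run each policy on your real data sizes and compare:

#include <chrono>

// Time a callable once, in milliseconds. For stable numbers, run several
// repetitions and keep the best.
template<typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage (v and w are independent copies, since sort mutates its input):
// double t_seq = time_ms([&] { std::sort(std::execution::seq, v.begin(), v.end()); });
// double t_par = time_ms([&] { std::sort(std::execution::par, w.begin(), w.end()); });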

Compiler support: GCC 9+ (with -ltbb), MSVC 2017+, Intel DPC++. Clang support is improving.


1.10 Templates & Generic Programming

Function templates:

template<typename T>
T add(T a, T b) {
    return a + b;
}

add(3, 4);        // T = int
add(3.14, 2.71);  // T = double

Class templates:

template<typename T, int N>
struct Vector {
    T data[N];
    T& operator[](int i) { return data[i]; }
};

Vector<float, 4> v;  // 4-element float vector

Why templates matter for parallel computing (see the sketch below):

  • CUDA kernels are often templated on data type and tile size: matmul<float, 32><<<grid, block>>>(...)
  • CUTLASS is entirely template-based — configurable GEMM dimensions, precisions, and tiling
  • CuTe (CUDA Template Engine) uses templates for layout and copy abstractions
  • Zero-cost abstraction: templates are resolved at compile time, no runtime overhead
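
A small CPU-side sketch of that pattern: one function templated on element type and a compile-time tile size (the function name is illustrative):

// TILE is a template parameter, so the inner loop bound is a compile-time
// constant: the compiler can fully unroll and vectorize it
template<typename T, int TILE>
void scale_tiled(T* data, int n, T factor) {
    int i = 0;
    for (; i + TILE <= n; i += TILE)
        for (int j = 0; j < TILE; j++)
            data[i + j] *= factor;
    for (; i < n; i++)                 // remainder elements
        data[i] *= factor;
}

// Each instantiation is a separate, fully specialized function:
// scale_tiled<float, 8>(buf, n, 2.0f);
// scale_tiled<double, 4>(dbuf, n, 0.5);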


1.11 constexpr (Compile-Time Computation)

Move computation from runtime to compile time:

constexpr int square(int x) {
    return x * x;
}

constexpr int tile_size = square(16);  // Computed at compile time = 256

// Use in array declarations, template parameters, etc.
float buffer[tile_size];  // float buffer[256] — no runtime cost

constexpr + if (C++17):

template<typename T>
void process(T value) {
    if constexpr (std::is_integral_v<T>) {
        // Integer-specific code — compiled away for floats
    } else {
        // Float-specific code — compiled away for integers
    }
}

Why it matters for HPC:

  • Tile sizes, block dimensions, and buffer sizes in GPU kernels are often compile-time constants
  • constexpr enables the compiler to optimize aggressively (unroll loops, eliminate branches)
  • Used extensively in CUTLASS for configurable kernel parameters


1.12 How C++ Features Map to Parallel Frameworks

| C++ Feature | Used in | How |
|---|---|---|
| Lambdas | oneTBB, SYCL, Parallel STL, modern CUDA | Kernel functions, task bodies, parallel_for callbacks |
| Move semantics | GPU memory, thread pools | Zero-copy buffer transfer, work item passing |
| Templates | CUDA, CUTLASS, CuTe, CK | Configurable kernels, generic algorithms |
| constexpr | CUDA, HLS, HPC libraries | Compile-time tile sizes, buffer dimensions |
| auto | Everywhere | Complex return types, template deduction |
| Smart pointers | Resource management | RAII wrappers for GPU memory, file handles |
| Parallel STL | CPU parallelism | One-line SIMD + threading |
| std::thread/mutex | Manual threading | Foundation for OpenMP, oneTBB internals |
| std::atomic | Lock-free algorithms | Counters, flags in multi-threaded code |
| Structured bindings | Data processing | Clean iteration over results, tuples |

Section 2: SIMD on CPUs

The gateway to GPU thinking — same operation, multiple data.

Now that you have the C++ foundation, apply it to the first level of parallelism: SIMD vector instructions.

What SIMD Is

CPU vector instructions that process 4, 8, 16, or 32 data elements in one instruction:

| Instruction Set | Width | Data per instruction | Platform |
|---|---|---|---|
| SSE (1999) | 128-bit | 4x float32 | x86 |
| AVX (2011) | 256-bit | 8x float32 | x86 |
| AVX-512 (2017) | 512-bit | 16x float32 | x86 (server) |
| ARM NEON | 128-bit | 4x float32 | ARM (Jetson, mobile) |

From SIMD to GPU — One Idea, Three Expressions

Every level of the parallel stack applies the same operation to multiple data elements at once. What changes is the width, the programming model, and who manages the indices.

Take the simplest possible operation: c[i] = a[i] + b[i]. Here is how you express it at each level:

Scalar (1 element at a time):
  c[0] = a[0] + b[0]
  c[1] = a[1] + b[1]   ← you write a loop, CPU executes one add per cycle
  c[2] = a[2] + b[2]
  ...

CPU SIMD / AVX (8 elements at a time):
  c[0..7] = a[0..7] + b[0..7]  ← you write i += 8, CPU runs 8 adds in one instruction
  c[8..15] = ...

GPU warp / CUDA (32 elements at a time):
  each thread handles one i      ← you write kernel for one element,
  32 threads run in lock-step       GPU runs 32 threads simultaneously (one warp)
  c[0..31] = a[0..31] + b[0..31]

GPU grid (thousands at a time):
  all threads launch at once     ← GPU runs thousands of warps across SM clusters

The same operation at all three levels, in code:
// ── Scalar ──────────────────────────────────────────────────────
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];            // 1 add/cycle

// ── CPU SIMD (AVX) ───────────────────────────────────────────────
for (int i = 0; i < n; i += 8) {
    __m256 va = _mm256_loadu_ps(&a[i]);
    __m256 vb = _mm256_loadu_ps(&b[i]);
    _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));  // 8 adds/cycle
}

// ── GPU (CUDA) ───────────────────────────────────────────────────
__global__ void add(float* a, float* b, float* c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // each thread owns one i
    if (i < n) c[i] = a[i] + b[i];                  // 32 threads run this simultaneously
}

The key differences — not a new idea, just new mechanics:

| | Scalar | CPU SIMD (AVX) | GPU (CUDA warp) |
|---|---|---|---|
| Width | 1 element | 8 floats | 32 threads |
| Who picks i | Your loop counter | Your stride (+= 8) | Hardware (threadIdx + blockIdx) |
| Fast memory | L1/L2 cache | L1/L2 cache | Shared memory (48 KB/SM) |
| Slow memory | RAM | RAM | Global memory (VRAM) |
| Hiding latency | Out-of-order execution | ILP across loop iterations | Warp switching |
| Divergence cost | Branch misprediction | Breaks vectorization | Serializes warp lanes |

The scaling ladder:

AVX2:       8 floats  × ~4 GHz × 2 (FMA) = ~64 GFLOPS / core
GPU warp:  32 threads × thousands of warps  → tens of TFLOPS

SIMD is the mental foundation. Once you understand why i += 8 processes 8 elements at once, the GPU model — "just make every thread a lane, and launch millions" — follows naturally.


Auto-Vectorization (Compiler Does It)

The easiest path — write a simple loop, let the compiler vectorize:

// This loop is auto-vectorizable
void add_vectors(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

# Check if compiler vectorized your loop
g++ -O2 -march=native -fopt-info-vec-optimized my_code.cpp
# Output: "loop vectorized using 32 byte vectors" = success

Rules for auto-vectorization (see the sketch below):

  • No loop-carried dependencies (a[i] = a[i-1] + 1 cannot vectorize)
  • Simple data types (float, int — not complex objects)
  • Aligned memory helps (alignas(32) float data[1024];)
  • Use -O2 or -O3 with -march=native
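
To make the first rule concrete, two nearly identical loops, only one of which can vectorize:

// Vectorizable: every iteration is independent
void scale(float* x, int n) {
    for (int i = 0; i < n; i++)
        x[i] *= 2.0f;                 // 8 iterations can run at once
}

// NOT vectorizable: loop-carried dependency
void running_inc(float* x, int n) {
    for (int i = 1; i < n; i++)
        x[i] = x[i - 1] + 1.0f;       // needs the UPDATED x[i-1], must run in order
}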

Manual Intrinsics (Full Control)

When auto-vectorization fails or you need specific instructions:

#include <immintrin.h>

void add_vectors_avx(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);   // Load 8 floats from a
        __m256 vb = _mm256_load_ps(&b[i]);   // Load 8 floats from b
        __m256 vc = _mm256_add_ps(va, vb);   // Add 8 pairs at once
        _mm256_store_ps(&c[i], vc);           // Store 8 results to c
    }
}

Key intrinsics patterns:

| Operation | Intrinsic | What it does |
|---|---|---|
| Load aligned | _mm256_load_ps(ptr) | Load 8 32-byte-aligned floats — crashes if misaligned |
| Load unaligned | _mm256_loadu_ps(ptr) | Load 8 floats (any alignment) — safer default |
| Store | _mm256_store_ps(ptr, v) | Store 8 floats (aligned) |
| Stream store | _mm256_stream_ps(ptr, v) | Write bypassing cache — use for large write-once buffers |
| Add | _mm256_add_ps(a, b) | 8 parallel additions |
| Multiply | _mm256_mul_ps(a, b) | 8 parallel multiplications |
| FMA | _mm256_fmadd_ps(a, b, c) | 8 parallel a*b + c — one instruction, not two; also more precise |
| Broadcast | _mm256_set1_ps(x) | Fill all 8 lanes with same value |
| Horizontal add | _mm256_hadd_ps(a, b) | Add adjacent pairs within each 128-bit half |
| Compare | _mm256_cmp_ps(a, b, _CMP_GT_OS) | Per-lane compare → all-0s or all-1s mask |
| Blend | _mm256_blendv_ps(a, b, mask) | Select per-lane: mask[i] ? b[i] : a[i] |
| Permute (cross-lane) | _mm256_permutevar8x32_ps(v, idx) | Full cross-lane reorder by index vector |
| Gather | _mm256_i32gather_ps(base, idx, 4) | Load from non-contiguous addresses |

_mm256_load_ps alignment: If ptr is not 32-byte aligned, this generates a segfault (#GP fault) at runtime — not a compile error. When in doubt, use _mm256_loadu_ps (the u = unaligned). On modern CPUs the performance difference is negligible.

FMA is the core of deep learning: _mm256_fmadd_ps(a, b, c) computes a*b + c in a single fused instruction. GEMM (matrix multiplication), convolutions, and transformer attention inner loops are entirely composed of FMA. This is why peak FLOPS numbers usually assume FMA (each FMA counts as two FLOPs). On AVX2, one 8-wide FMA per cycle = 16 FLOPs/cycle = 16 GFLOPS per GHz per core, and double that on cores that issue two FMAs per cycle.

Aligned memory:

// Aligned allocation (required for _mm256_load_ps)
alignas(32) float data[1024];

// Or dynamic allocation (remember to free() it when done)
float* data = (float*)aligned_alloc(32, 1024 * sizeof(float));

Top 10 Most-Used Intrinsics in Real Codebases

Analysis of SIMD usage across GitHub repositories (production ML frameworks, databases, parsers) shows the same ~10 intrinsics appear in 80%+ of vectorized code. These are the ones worth memorizing first.

| Rank | Intrinsic | ISA | What it does | Why it's everywhere |
|---|---|---|---|---|
| 1 | _mm256_add_ps(a, b) | AVX | Add 8 float32 | Core of every FP loop |
| 2 | _mm256_loadu_ps(ptr) | AVX | Load 8 floats, unaligned | Default load — no alignment constraint |
| 3 | _mm256_storeu_ps(ptr, v) | AVX | Store 8 floats, unaligned | Default store |
| 4 | _mm256_fmadd_ps(a, b, c) | FMA+AVX2 | a*b + c in one instruction | GEMM, convolution, attention |
| 5 | _mm_add_ps(a, b) | SSE | Add 4 float32 (128-bit) | SSE compat code, horizontal ops |
| 6 | _mm_set1_epi32(x) | SSE2 | Broadcast int to all 4 lanes | Loading integer constants |
| 7 | _mm256_cmp_ps(a, b, pred) | AVX | Compare → bitmask | Conditional selection, NaN handling |
| 8 | _mm_loadu_si128(ptr) | SSE2 | Load 128-bit integer vector | String/byte processing |
| 9 | _mm_shuffle_epi32(v, imm) | SSE2 | Reorder 32-bit int lanes | Data rearrangement, AoS→SoA |
| 10 | _mm_popcnt_u32(x) | SSE4.2 | Count set bits in 32-bit int | Database filters, Hamming distance |

Patterns behind the rankings:

  • _ps (packed single) dominates — float32 is the workhorse of ML inference and scientific computing. Learn _ps variants first.
  • Unaligned loads/stores are preferred (_loadu_, _storeu_) — on Haswell and newer, the penalty for cache-line-crossing loads is ≤1 cycle. The code simplicity is worth it.
  • SSE2 128-bit still common — many codebases use SSE2 for compatibility or for 128-bit sub-operations (horizontal reduction, remainder loops, byte processing).
  • _mm_shuffle_epi32 is the workhorse for integer rearrangement — appears in virtually every AoS→SoA conversion, hash function, and string parser.
  • _mm_popcnt_u32/u64 is an outlier — not strictly a SIMD register instruction (operates on scalar registers), but uses the SSE4.2 ISA feature bit and dominates database and bit-manipulation code (Hamming distance, counting set bits in filters).

// The most common inner loop pattern in real ML code (ranks 2, 4, 3):
__m256 acc = _mm256_setzero_ps();          // 8 running partial sums
for (int i = 0; i < n; i += 8) {
    __m256 a = _mm256_loadu_ps(&A[i]);     // rank 2
    __m256 b = _mm256_loadu_ps(&B[i]);     // rank 2
    acc      = _mm256_fmadd_ps(a, b, acc); // rank 4
}
_mm256_storeu_ps(out, acc);                // rank 3 — then reduce the 8 partials

Source: intrinsic usage statistics from simd.info and analysis of GitHub repositories, including production ML frameworks, databases, and codec libraries.


Intrinsics by Functional Category

A different lens — grouped by what you're trying to do, with both SSE (128-bit) and AVX (256-bit) versions side by side. This is how you look them up when solving a problem.

1. Data Movement & Initialization

// Load — unaligned is the default today (negligible penalty on modern CPUs)
__m128 a = _mm_loadu_ps(ptr);       // SSE:  4 floats
__m256 b = _mm256_loadu_ps(ptr);    // AVX:  8 floats

// Store — write register back to memory
_mm_storeu_ps(ptr, a);              // SSE
_mm256_storeu_ps(ptr, b);           // AVX

// Broadcast — fill all lanes with one scalar (scaling, bias addition)
__m128 scale = _mm_set1_ps(2.0f);        // SSE:  [2, 2, 2, 2]
__m256 scale = _mm256_set1_ps(2.0f);     // AVX:  [2, 2, 2, 2, 2, 2, 2, 2]

// Set individual lanes (note: highest lane first)
__m256 v = _mm256_set_ps(7,6,5,4,3,2,1,0);   // reverse order
__m256 v = _mm256_setr_ps(0,1,2,3,4,5,6,7);  // natural order (r = reversed param order)

2. Arithmetic

// Add / Subtract / Multiply / Divide
__m256 r = _mm256_add_ps(a, b);     // a[i] + b[i]
__m256 r = _mm256_sub_ps(a, b);     // a[i] - b[i]
__m256 r = _mm256_mul_ps(a, b);     // a[i] * b[i]
__m256 r = _mm256_div_ps(a, b);     // a[i] / b[i]  (slow — ~20 cycles)

// FMA — the most important arithmetic intrinsic
__m256 r = _mm256_fmadd_ps(a, b, c); // a[i]*b[i] + c[i], single instruction

// Integer arithmetic (add 8 × int32)
__m256i r = _mm256_add_epi32(a, b);
__m256i r = _mm256_mullo_epi32(a, b);  // low 32 bits of each 32×32 product

_ps = packed single (float32). _pd = packed double. _epi32 = packed int32. Learn the suffix system once and every intrinsic name becomes self-describing.

3. Comparison & Selection (Branchless Conditionals)

SIMD has no branches. The fundamental pattern: compare → mask → blend.

// Step 1: Compare → produces per-lane all-1s (true) or all-0s (false)
__m256 mask = _mm256_cmp_ps(a, b, _CMP_GT_OS);   // mask[i] = a[i] > b[i] ? 0xFFFFFFFF : 0

// SSE equivalent with explicit predicate:
__m128 mask = _mm_cmplt_ps(a, b);                 // mask[i] = a[i] < b[i]

// Step 2: Blend — select elements using mask
__m256 result = _mm256_blendv_ps(if_false, if_true, mask);
// result[i] = mask[i] ? if_true[i] : if_false[i]

// Practical example: clamp to [lo, hi] without any branch
__m256 clamped = _mm256_min_ps(_mm256_max_ps(v, lo), hi);

// Practical example: absolute value without branch
__m256 sign_mask = _mm256_set1_ps(-0.0f);         // sign bit only
__m256 abs_v     = _mm256_andnot_ps(sign_mask, v); // clear sign bit = |v|

Comparison predicates for _mm256_cmp_ps: _CMP_EQ_OS, _CMP_LT_OS, _CMP_LE_OS, _CMP_GT_OS, _CMP_GE_OS, _CMP_NEQ_OS, _CMP_UNORD_Q (NaN check). The _OS suffix = ordered, signaling (raises exception on NaN). Use _OQ for quiet (no exception).
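
A minimal sketch using the quiet unordered predicate to count NaNs in a register (NaN is the only value for which x == x is false):

#include <cmath>        // NAN
#include <immintrin.h>

__m256 v    = _mm256_setr_ps(1.0f, NAN, 3.0f, 4.0f, NAN, 6.0f, 7.0f, 8.0f);
__m256 mask = _mm256_cmp_ps(v, v, _CMP_UNORD_Q);  // lane = all-1s where v[i] is NaN
int    bits = _mm256_movemask_ps(mask);           // one bit per lane → 0b00010010
int    nans = _mm_popcnt_u32(bits);               // = 2 here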

4. Data Shuffling & Reduction

// Shuffle — rearrange within 128-bit half (compile-time imm8 control)
__m128 s = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3,2,1,0));  // SSE: pick 4 elements from a and b
__m256 s = _mm256_shuffle_ps(a, b, imm8);                // AVX: operates per 128-bit half!

// Horizontal add — adds adjacent pairs (use sparingly, ~3 cycles each)
__m128 h = _mm_hadd_ps(a, b);   // [a0+a1, a2+a3, b0+b1, b2+b3]

// Better horizontal sum pattern: see hsum() in "Inspecting SIMD Registers" below
// hadd is slow — only use it 1-2 times at the very end of a reduction loop

_mm_hadd_ps warning: Horizontal add is one of the slowest SIMD instructions (multiple µops with several cycles of latency, versus single-cycle throughput for _mm_add_ps). Never put it inside a loop. Use it once at the end to collapse an accumulator register — or better, use the hsum idiom from the Inspecting SIMD Registers section below.


SSE vs AVX Side-by-Side Reference

| Category | SSE (128-bit, 4× float) | AVX (256-bit, 8× float) | Purpose |
|---|---|---|---|
| Load | _mm_loadu_ps(ptr) | _mm256_loadu_ps(ptr) | Read from memory |
| Store | _mm_storeu_ps(ptr, v) | _mm256_storeu_ps(ptr, v) | Write to memory |
| Broadcast | _mm_set1_ps(x) | _mm256_set1_ps(x) | Fill all lanes with scalar |
| Add | _mm_add_ps(a, b) | _mm256_add_ps(a, b) | Element-wise addition |
| Multiply | _mm_mul_ps(a, b) | _mm256_mul_ps(a, b) | Element-wise multiply |
| FMA | _mm_fmadd_ps(a, b, c) | _mm256_fmadd_ps(a, b, c) | a*b + c |
| Compare | _mm_cmplt_ps(a, b) | _mm256_cmp_ps(a, b, pred) | Vectorized compare → mask |
| Blend | _mm_blendv_ps(a, b, m) | _mm256_blendv_ps(a, b, m) | Vectorized if/else |
| Shuffle | _mm_shuffle_ps(a, b, i) | _mm256_shuffle_ps(a, b, i) | Rearrange elements |
| Hadd | _mm_hadd_ps(a, b) | _mm256_hadd_ps(a, b) | Adjacent-pair add |

When to use SSE vs AVX:

  • AVX for throughput — 2× the data per instruction, same latency
  • SSE for compatibility — SSE2 is available on every x86-64 CPU since ~2003
  • SSE for remainder loops — after the main AVX loop processes n/8*8 elements, handle the last n%8 with SSE or scalar


Minimal Examples — Each Intrinsic in Isolation

Before combining intrinsics, see each one do exactly one thing:

#include <immintrin.h>

// ── Data Movement ──────────────────────────────────────────────────────────

// loadu: load 8 floats from any address
float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
__m256 v = _mm256_loadu_ps(data);           // v = [1, 2, 3, 4, 5, 6, 7, 8]

// set1: broadcast scalar to every lane
__m256 v_pi = _mm256_set1_ps(3.14f);        // v_pi = [3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14, 3.14]

// storeu: write register back to array
float out[8];
_mm256_storeu_ps(out, v);                   // out[] = {1, 2, 3, 4, 5, 6, 7, 8}

// ── Arithmetic ─────────────────────────────────────────────────────────────

__m256 v1 = _mm256_set1_ps(2.0f);           // [2, 2, 2, 2, 2, 2, 2, 2]
__m256 v2 = _mm256_set1_ps(3.0f);           // [3, 3, 3, 3, 3, 3, 3, 3]

__m256 sum     = _mm256_add_ps(v1, v2);     // [5, 5, 5, 5, 5, 5, 5, 5]
__m256 product = _mm256_mul_ps(v1, v2);     // [6, 6, 6, 6, 6, 6, 6, 6]
__m256 fma_res = _mm256_fmadd_ps(v1, v2, v1); // v1*v2 + v1 = [8, 8, 8, 8, 8, 8, 8, 8]

// ── Comparison & Selection (vectorized if/else) ────────────────────────────

__m256 a    = _mm256_setr_ps(1,5,3,7,2,6,4,8);
__m256 b    = _mm256_set1_ps(4.0f);
__m256 mask = _mm256_cmp_ps(a, b, _CMP_GT_OS); // mask[i] = a[i] > 4 ? 0xFFFFFFFF : 0
                                                // [0, 1, 0, 1, 0, 1, 0, 1]
__m256 result = _mm256_blendv_ps(b, a, mask);   // result[i] = mask[i] ? a[i] : b[i]
                                                // [4, 5, 4, 7, 4, 6, 4, 8]

// ── Shuffling ──────────────────────────────────────────────────────────────

__m256 sv = _mm256_setr_ps(0,1,2,3,4,5,6,7);
// _MM_SHUFFLE(d,c,b,a) → lane 0 = src[a], lane 1 = src[b], lane 2 = src[c], lane 3 = src[d]
__m256 shuffled = _mm256_shuffle_ps(sv, sv, _MM_SHUFFLE(0,1,2,3));
// Result: [3,2,1,0, 7,6,5,4] — reversed within each 128-bit half

// ── Horizontal add (slow — only at loop end) ───────────────────────────────

__m256 hv   = _mm256_setr_ps(1,2,3,4,5,6,7,8);
__m256 hsum = _mm256_hadd_ps(hv, hv);
// [1+2, 3+4, 1+2, 3+4 | 5+6, 7+8, 5+6, 7+8] = [3, 7, 3, 7, 11, 15, 11, 15]

Complete Example: Scaled Vector Addition

All five core intrinsics working together in a realistic loop — the pattern used in neural network activations, image processing, and signal scaling:

#include <immintrin.h>

// Computes: c[i] = (a[i] + b[i]) * scale  for all i
// Processes 8 elements per iteration using AVX
void scaled_add(const float* a, const float* b, float* c, float scale, int n) {
    // 1. set1: broadcast the scalar constant once, outside the loop
    __m256 v_scale = _mm256_set1_ps(scale);

    int i = 0;
    for (; i <= n - 8; i += 8) {
        // 2. loadu: load 8 elements from each input array
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);

        // 3. add_ps: element-wise addition
        __m256 vsum = _mm256_add_ps(va, vb);

        // 4. mul_ps: scale the result
        //    (alternatively: _mm256_fmadd_ps(va, v_scale, _mm256_mul_ps(vb, v_scale)))
        __m256 vres = _mm256_mul_ps(vsum, v_scale);

        // 5. storeu: write 8 results back to memory
        _mm256_storeu_ps(&c[i], vres);
    }

    // Scalar remainder for elements that don't fill a full AVX register
    for (; i < n; i++) {
        c[i] = (a[i] + b[i]) * scale;
    }
}

What this demonstrates:

  • _mm256_set1_ps outside the loop — compute constants once, reuse every iteration
  • _mm256_loadu_ps — no alignment constraint needed
  • _mm256_add_ps + _mm256_mul_ps — two arithmetic ops, 8 elements each, back-to-back
  • _mm256_storeu_ps — write results
  • Remainder loop — handle the n % 8 trailing elements cleanly

FMA version — if a and b have already been scaled separately, fmadd collapses two ops:

// c[i] = a[i] * scale + b[i]
__m256 vres = _mm256_fmadd_ps(va, v_scale, vb);


Cross-Lane Limitations (The AVX2 Gotcha)

Most AVX2 "256-bit" instructions actually execute as two independent 128-bit halves. This is the single most surprising architectural quirk in AVX2 and causes subtle bugs when you expect full cross-lane operations.

// shuffle_ps looks like it works on 256-bit, but it only shuffles within each 128-bit half:
__m256 v = _mm256_setr_ps(7, 6, 5, 4, 3, 2, 1, 0);  // lanes 0-7
__m256 s = _mm256_shuffle_ps(v, v, 0b00011011);
// Result: {4,5,6,7, 0,1,2,3} — reversed within each half, NOT across the full register

// To truly cross lanes, you need explicit cross-lane moves:
// Swap the two 128-bit halves:
__m256 swapped = _mm256_permute2f128_ps(v, v, 0x01);

// Full cross-lane element reorder (AVX2) — the most flexible option:
__m256i idx = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7);  // reverse order
__m256 reversed = _mm256_permutevar8x32_ps(v, idx);       // true 8-element permute

| Operation | Cross-lane? | Notes |
|---|---|---|
| _mm256_shuffle_ps | No | Within each 128-bit half only |
| _mm256_hadd_ps | No | Adjacent pairs within each half |
| _mm256_permute2f128_ps | Yes | Swaps/copies 128-bit halves |
| _mm256_permutevar8x32_ps | Yes | Full 8-element permute (AVX2 only) |

Rule of thumb: If an intrinsic says "256-bit" but was available in AVX (not AVX2), suspect it operates on two 128-bit halves. Verify with Intel Intrinsics Guide or uops.info.


Inspecting SIMD Registers (Debugging)

You can't printf a __m256. The standard pattern is a union, which lets you read individual lanes:

// Union for safe lane inspection (from hands-on-simd-programming)
union float8 {
    __m256  v;
    float   a[8];
    float8() : v(_mm256_setzero_ps()) {}
    float8(__m256 x) : v(x) {}
};

// Usage
float8 result(_mm256_fmadd_ps(va, vb, vc));
printf("lane 0: %f, lane 3: %f\n", result.a[0], result.a[3]);

// Or with structured binding in C++17:
auto [l0,l1,l2,l3,l4,l5,l6,l7] = result.a;

Horizontal sum — reducing a vector to a scalar:

This pattern appears everywhere: dot products, attention scores, reductions.

// Sum all 8 floats in a __m256 to a scalar
float hsum(__m256 v) {
    // Step 1: add the upper 128-bit half to the lower → 128-bit [a+e, b+f, c+g, d+h]
    __m128 lo  = _mm256_castps256_ps128(v);          // lower 128 bits
    __m128 hi  = _mm256_extractf128_ps(v, 1);        // upper 128 bits
    __m128 sum = _mm_add_ps(lo, hi);                 // add the two halves
    // Step 2: horizontal adds until scalar
    sum = _mm_hadd_ps(sum, sum);                     // [a+b+e+f, c+d+g+h, ...]
    sum = _mm_hadd_ps(sum, sum);                     // [total, total, ...]
    return _mm_cvtss_f32(sum);                       // extract lane 0
}

Stream stores for large write-once buffers:

// Non-temporal store: bypasses CPU cache entirely
// Use when writing large output buffers you won't read back soon
// (e.g., preprocessing output, frame buffer writes)
for (int i = 0; i < n; i += 8) {
    __m256 result = /* compute */;
    _mm256_stream_ps(&out[i], result);   // skip cache, write direct to memory
}
_mm_sfence();  // memory fence — ensure all stream writes are visible

// When to use: write-once buffers > L3 cache size
// When NOT: if you'll read the data back soon (defeats the purpose)

Gather — loading from non-contiguous addresses:

// Gather: load from base_ptr + indices[i] * scale
float table[1024] = { /* lookup table */ };
__m256i indices = _mm256_set_epi32(7, 3, 15, 0, 42, 8, 1, 100);  // arbitrary indices
__m256 gathered = _mm256_i32gather_ps(table, indices, 4);          // scale=4 (sizeof float)

// Note: gather throughput ≈ scalar loads — no bandwidth benefit for the gather itself.
// The value: the *computation* after gather runs vectorized (8 FMAs instead of 8 scalar ops).
// Good for: embedding lookups, sparse vector ops, non-sequential access patterns.

Cache Lines and Memory Alignment

The CPU fetches memory in 64-byte cache lines — not individual bytes. This is the physical reason SIMD alignment matters and the root cause of most false-sharing bugs in multithreaded code.

Cache line = 64 bytes = 16 floats = 8 doubles = 2 AVX registers

[float 0][float 1] ... [float 7] [float 8] ... [float 15]    ← one 64-byte cache line
└────── aligned AVX load ──────┘ └────── aligned AVX load ──────┘
         └────── misaligned AVX load: spans two cache lines ──────┘

Why it matters:

| Scenario | Cache line impact |
|---|---|
| _mm256_load_ps aligned | One cache line touched per load |
| _mm256_loadu_ps misaligned | May touch two cache lines — up to 2× memory traffic |
| False sharing (threads) | Two threads write different variables in same 64-byte line → cache ping-pong |
| alignas(64) struct | Guarantees struct starts at cache line boundary |

// False sharing — avoid this
struct Workers {
    int counter_a;  // same 64-byte line as counter_b
    int counter_b;
};

// Fix: pad to cache line boundary
struct alignas(64) WorkerA { int counter; };
struct alignas(64) WorkerB { int counter; };

Rule: Align SIMD buffers to 32 bytes (alignas(32)) for AVX. Align per-thread data to 64 bytes (alignas(64)) to prevent false sharing.


Data Layout: AoS vs SoA

How you arrange data in memory is often the single highest-leverage optimization in SIMD and GPU code — more impactful than instruction selection.

Array of Structs (AoS) — natural, object-oriented, SIMD-unfriendly

Each particle is one contiguous object. Easy to pass around, easy to reason about.

struct Particle {
    float x, y, z;    // position
    float vx, vy, vz; // velocity
};

Particle particles[4];

Memory layout — fields interleaved across all particles:

particle 0          particle 1          particle 2          particle 3
[x][y][z][vx][vy][vz][x][y][z][vx][vy][vz][x][y][z][vx][vy][vz][x][y][z][vx][vy][vz]

When you want to add dx to all x-positions:

// AoS: stride between consecutive x values = sizeof(Particle) = 24 bytes
for (int i = 0; i < N; i++)
    particles[i].x += dx;
// An AVX load at &particles[0].x picks up: x0,y0,z0,vx0,vy0,vz0,x1,y1
//                                                           ^^^ garbage fields in the register
// You wanted: x0,x1,x2,x3,x4,x5,x6,x7 — but they're 24 bytes apart, not contiguous

The x-values are scattered in memory. SIMD cannot load them efficiently — it would need gather (_mm256_i32gather_ps), which has the same throughput as scalar loads.

Struct of Arrays (SoA) — SIMD-friendly

Each field gets its own contiguous array. All x values together, all y values together.

struct Particles {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
};

Particles p;
p.x[0]=1; p.x[1]=4; p.x[2]=7;   // all x values contiguous
p.y[0]=2; p.y[1]=5; p.y[2]=8;

Memory layout — each field is a flat contiguous array:

x:  [x0][x1][x2][x3][x4][x5][x6][x7]...  ← one AVX load = 8 x-values
y:  [y0][y1][y2][y3][y4][y5][y6][y7]...
z:  [z0][z1][z2][z3]...
vx: [vx0][vx1][vx2]...

Now the SIMD loop is clean — one load, one add, one store:

// SoA: x values are contiguous → direct sequential AVX load
__m256 vdx = _mm256_set1_ps(dx);
for (int i = 0; i < N; i += 8) {
    __m256 vx = _mm256_loadu_ps(&p.x[i]);       // load x0..x7 — perfectly contiguous
    _mm256_storeu_ps(&p.x[i], _mm256_add_ps(vx, vdx));  // add dx to all 8 at once
}

Side-by-side loop comparison

// AoS — compiler cannot auto-vectorize the x update (stride too large)
for (int i = 0; i < N; i++)
    particles[i].x += dx;           // stride = 24 bytes between x values

// SoA — compiler auto-vectorizes, or you write AVX explicitly
for (int i = 0; i < N; i++)
    p.x[i] += dx;                   // stride = 4 bytes — perfectly sequential

Layout comparison

| | AoS | SoA | AoSoA |
|---|---|---|---|
| Memory pattern | xyzxyzxyz... | xxx...yyy...zzz... | [xxxx yyyy][xxxx yyyy]... |
| SIMD efficiency | Low — gather/scatter needed | High — sequential load | High + cache-friendly |
| GPU coalescing | Poor | Good | Good |
| Code readability | High — p[i].x | Medium — p.x[i] | Low — tiled indexing |
| Best for | Small structs, OOP | Physics, ML kernels, SIMD | Intel oneDNN, CUTLASS tiling |

When to choose each:

  • AoS — small datasets, all fields used together, readability matters more than throughput
  • SoA — large arrays, per-field operations, SIMD loops, GPU kernels
  • AoSoA — when you need SoA efficiency but also want cache-line-sized tiles for L1 reuse (oneDNN weight format, CUTLASS fragment layouts)

Deep learning connection: cuDNN NCHW vs NHWC is exactly this choice. NCHW = SoA over spatial dimensions (all red pixels, then all green). NHWC = AoS over channels (all channels for one pixel together). Tensor Core layouts (e.g., mma.sync fragments) use AoSoA — 16×16 tiles packed for register-level reuse. Choosing the wrong layout forces expensive transposes that dominate runtime.
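
When you cannot control the input format, a one-time conversion pays off if the data is processed many times afterward. A scalar sketch for the Particle/Particles types defined above:

// One-time AoS → SoA transpose; the O(n) cost is amortized over every
// later SIMD pass across the field arrays
void aos_to_soa(const Particle* in, Particles& out, int n) {
    for (int i = 0; i < n; i++) {
        out.x[i]  = in[i].x;   out.y[i]  = in[i].y;   out.z[i]  = in[i].z;
        out.vx[i] = in[i].vx;  out.vy[i] = in[i].vy;  out.vz[i] = in[i].vz;
    }
}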


std::simd (C++26, Experimental)

The future of portable SIMD — no intrinsics, no platform-specific code:

#include <experimental/simd>
namespace stdx = std::experimental;

void add_vectors(float* a, float* b, float* c, int n) {
    using V = stdx::native_simd<float>;  // Auto-selects best width
    for (int i = 0; i < n; i += V::size()) {
        V va(&a[i], stdx::element_aligned);
        V vb(&b[i], stdx::element_aligned);
        V vc = va + vb;
        vc.copy_to(&c[i], stdx::element_aligned);
    }
}

Available in GCC 11+ via <experimental/simd> (Parallelism TS v2). Not yet production-ready but worth tracking.

Measurement First

"Measure, don't guess. The bottleneck is never where you think it is."

Before writing a single intrinsic, profile to confirm what kind of bottleneck you actually have. Optimizing the wrong thing wastes time.

# perf stat — hardware counter overview
perf stat -e instructions,cycles,cache-misses,fp_arith_inst_retired.256b_packed_single \
    ./my_program

# Key ratios to check:
#   instructions/cycles < 2.0  → probably stalled on memory
#   cache-misses high           → data layout problem (try SoA)
#   256b_packed_single low      → auto-vectorization failed, add -fopt-info-vec

# Confirm vectorization happened
g++ -O2 -march=native -fopt-info-vec-optimized my_code.cpp
# "loop vectorized using 32 byte vectors" = success

# VTune (Intel) — visual hotspots and vectorization analysis
vtune -collect hotspots ./my_program

Roofline Model

The Roofline model tells you whether your kernel is compute-bound or memory-bound — and therefore whether SIMD helps.

Peak FLOPS  = freq × cores × SIMD_width × FMA_factor
Peak BW     = DRAM bandwidth (GB/s)
AI          = FLOPs / bytes_accessed  (arithmetic intensity)

If  AI > Peak FLOPS / Peak BW  → compute-bound  (SIMD / FMA helps)
If  AI < Peak FLOPS / Peak BW  → memory-bound   (better layout / caching helps)

Example for a modern desktop CPU:

Peak AVX2 FP32 = 3.6 GHz × 8 lanes × 2 (FMA) = ~58 GFLOPS/core
Peak DRAM BW   = ~50 GB/s
Ridge point AI = 58 / 50 ≈ 1.2 FLOP/byte

Vector add:  AI = 1 FLOP / 12 bytes (2 loads + 1 store) ≈ 0.08 FLOP/byte → memory-bound
Dot product: AI = 2N FLOP / 2N×4 bytes    = 0.25 FLOP/byte → memory-bound
Matrix mult: AI = O(N³) FLOP / O(N²) bytes → compute-bound for large N ✓

Implication: SIMD gives the full 8× speedup only on compute-bound kernels. For memory-bound kernels (like vector add), the bottleneck is DRAM bandwidth — SIMD won't help much beyond reducing loop overhead.
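
The same back-of-the-envelope arithmetic as a tiny program. The peak numbers are the illustrative ones from above, not measurements — substitute your own machine's:

#include <cstdio>

int main() {
    const double peak_gflops = 58.0;                     // per-core AVX2 FP32 FMA (example)
    const double peak_gbps   = 50.0;                     // DRAM bandwidth (example)
    const double ridge       = peak_gflops / peak_gbps;  // ≈ 1.2 FLOP/byte

    // Vector add c[i] = a[i] + b[i]: 1 FLOP per element, 12 bytes moved (2 loads + 1 store)
    const double ai_vector_add = 1.0 / 12.0;

    std::printf("ridge point    : %.2f FLOP/byte\n", ridge);
    std::printf("vector add  AI : %.2f FLOP/byte -> %s\n", ai_vector_add,
                ai_vector_add < ridge ? "memory-bound" : "compute-bound");
}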

SIMD → GPU: Mental Model Bridge

The jump from SIMD to CUDA looks large but the underlying idea is the same. The key difference is who picks the element index:

// ── SIMD mindset: one thread, one instruction covers N elements ──
for (int i = 0; i < n; i += 8) {
    __m256 va = _mm256_load_ps(&a[i]);
    __m256 vb = _mm256_load_ps(&b[i]);
    _mm256_store_ps(&c[i], _mm256_add_ps(va, vb));
}
// You write the stride (i += 8). The hardware runs 8 lanes silently.

// ── GPU / SIMT mindset: one thread per element, hardware spawns thousands ──
__global__ void add(float* a, float* b, float* c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) c[i] = a[i] + b[i];
}
// You write one element. The hardware runs 32 (warp) or thousands in parallel.

| Aspect | AVX (CPU SIMD) | CUDA warp (GPU SIMT) |
| --- | --- | --- |
| Parallel width | 8× float32 | 32 threads |
| Register file | __m256 (256-bit) | 32× 32-bit registers per thread |
| Memory | L1/L2 cache | Shared memory / global memory |
| Divergence cost | Branch kills vectorization | Warp divergence serializes threads |
| Programmer model | Explicit lanes (intrinsics) | Implicit (thread index) |

Connection to the Stack

  • L1 (Application): cuDNN and CUTLASS use vectorized memory loads internally
  • L2 (Compiler): MLIR's vector dialect (Phase 4C) targets exactly this abstraction level
  • L5 (Architecture): When you design an accelerator's vector unit, you're designing custom SIMD hardware
  • Sub-Track 3 (CUDA): GPU "SIMT" is SIMD scaled to thousands of threads — same mental model

Performance Traps and Common Mistakes

These are the issues that consume weeks of debugging in real HPC and ML systems. Learn them here.

| Trap | What happens | Fix |
| --- | --- | --- |
| shared_ptr in hot loops | Atomic ref-count inc/dec on every iteration → cache-line contention | Use a raw pointer or unique_ptr inside hot paths; bump the shared_ptr count once outside the loop |
| False sharing | Two threads update different variables in the same 64-byte cache line → invisible serialization | Pad thread-local data with alignas(64) |
| Unaligned _mm256_load_ps | Crash (#GP, reported as a segfault) at runtime, not compile time | Use _mm256_loadu_ps unless you've verified 32-byte alignment with alignas(32) |
| Branching inside vectorized loops | Compiler cannot vectorize; scalar fallback runs | Move conditionals outside the loop, or use SIMD blends (_mm256_blendv_ps) — see the sketch below |
| Memory-bound SIMD | SIMD adds complexity, negligible speedup | Profile first (Roofline); restructure the data layout (SoA) before adding intrinsics |
| par_unseq with dependencies | Silent data corruption or undefined behavior | Only use it when elements are completely independent — no reads from neighboring indices |
| Ignoring NUMA | On multi-socket systems, a thread on socket 0 allocates memory that a socket-1 thread then reads → roughly 2× slower remote accesses | Allocate memory on the NUMA node that will use it (numactl, mbind) |
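
As an example of the branch-removal fix from the table, here is a hedged sketch that replaces the scalar if (a[i] > 0) c[i] = a[i]; else c[i] = b[i]; with a compare-and-blend (assumes AVX and, for brevity, n divisible by 8):

#include <immintrin.h>

// Branchless SIMD select: c[i] = (a[i] > 0.0f) ? a[i] : b[i]
// Sketch assumes n is a multiple of 8; add a scalar tail otherwise.
void select_positive(const float* a, const float* b, float* c, int n) {
    const __m256 zero = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va   = _mm256_loadu_ps(&a[i]);
        __m256 vb   = _mm256_loadu_ps(&b[i]);
        __m256 mask = _mm256_cmp_ps(va, zero, _CMP_GT_OQ);        // all-ones lanes where a > 0
        _mm256_storeu_ps(&c[i], _mm256_blendv_ps(vb, va, mask));  // pick a where mask set, b otherwise
    }
}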

Projects

  1. Modern C++ warmup — Write a matrix class using std::vector, move semantics, templates, and operator overloading. Verify that returning a large matrix from a function uses move (not copy) by checking with -fno-elide-constructors.
  2. Lambda benchmark — Implement the same computation three ways: raw loop, std::for_each with lambda, and std::transform with lambda. Benchmark all three with -O2. Verify the compiler generates identical assembly (-S).
  3. Parallel STL sort — Sort 10M floats with std::execution::seq, par, and par_unseq. Measure speedup. Plot time vs array size.
  4. Auto-vectorization — Write a dot product loop. Compile with -O2 -march=native -fopt-info-vec. Check if the compiler vectorized it. If not, restructure the loop until it does.
  5. AVX intrinsics dot product — Implement dot product using _mm256_fmadd_ps. Benchmark against scalar and auto-vectorized versions. Measure GFLOPS.
  6. Aligned vs unaligned — Benchmark _mm256_load_ps (aligned) vs _mm256_loadu_ps (unaligned) for a large array. Measure the performance difference.
  7. AoS → SoA benchmark — Implement vector dot product for an array of Vec3 structs (AoS) and compare to the SoA layout version. Observe the speedup gap (expect 10–40× from layout alone).
  8. Horizontal sum — Implement a dot product that uses _mm256_fmadd_ps to accumulate 8 at a time, then uses the horizontal sum idiom to reduce to a scalar. Verify against scalar result.
  9. Tiny MHA block — Implement a batched multi-head attention forward pass for seq=8, head_dim=16, heads=4 using raw AVX2 intrinsics. Pre-transpose weight matrices. Benchmark against scalar version. Observe that the speedup is ~2–3×, not 8× — explain why (memory-bound).

Resources

Learning

| Resource | What it covers |
| --- | --- |
| A Tour of C++ (Stroustrup) | Modern C++ overview, concise |
| cppreference.com | Definitive C++ reference |
| C++ Concurrency in Action (Williams) | Threads, atomics, memory model |
| SIMD for C++ Developers (const.me) | Best practitioner tutorial: loads, arithmetic, shuffles, masking, cross-lane gotchas. 23 pages. Read this first. |
| hands-on-simd-programming | Progressive labs from basic intrinsics → image processing → full MHA block → quantized GPT decoder. Run ./runme.sh to see benchmarks. |
| awesome-simd | Curated index of real-world SIMD libraries, tools, blogs, and references |
| Agner Fog Optimization Guides | Definitive reference on CPU micro-architecture, instruction latencies, calling conventions |

Reference and Tooling

| Resource | What it covers |
| --- | --- |
| Intel Intrinsics Guide | Searchable reference for every x86 SIMD intrinsic |
| uops.info | Instruction latency, throughput, and port usage for every CPU micro-architecture |
| Compiler Explorer (godbolt.org) | See assembly output for any C++ code — verify vectorization happened |
| SIMD-Visualiser | Browser-based visual diagram of what each intrinsic does to register contents |
| Felix Cloutier x86 reference | HTML x86/x64 instruction reference |

Portable SIMD Libraries (skip raw intrinsics)

| Library | Language | What it is |
| --- | --- | --- |
| Google Highway | C++ | Best portable SIMD: length-agnostic, runtime dispatch, SSE/AVX/AVX-512/NEON/SVE |
| xsimd | C++ | Header-only wrappers for SSE/AVX/AVX-512/NEON |
| SIMDe | C/C++ | Emulates x86 intrinsics on ARM and other targets |
| Intel ISPC | ISPC | C-like language that compiles to optimal SIMD for SSE/AVX/AVX-512/NEON/PS5/Xbox |

Real-World SIMD Libraries to Study

| Library | What it does |
| --- | --- |
| simdjson | JSON parsing at >2 GB/s using SIMD — canonical example of real-world vectorized parsing |
| SimSIMD | SIMD-accelerated similarity measures (cosine, L2) for embedding vectors |
| ncnn | On-device neural network inference with hand-tuned SIMD for ARM and x86 |
| StringZilla | SIMD substring search, fuzzy matching, sorting |

Blogs (practitioners writing production SIMD code)

| Blog | Author |
| --- | --- |
| lemire.me/blog | Daniel Lemire — simdjson, SIMDComp, practical vectorization |
| branchfree.org | Geoff Langdale — branchless and SIMD techniques |
| 0x80.pl/notesen | Wojciech Muła — low-level SIMD algorithms |

Next

Sub-Track 2 — OpenMP and oneTBB — scale from SIMD to multi-core CPU parallelism.