
OpenMP and oneTBB

Part of Phase 1 section 4 — C++ and Parallel Computing.

Goal: Shared-memory CPU parallelism with OpenMP (directive-based) and oneTBB (task-based algorithms and flow graphs) so structured parallel patterns feel familiar before CUDA.


1. Baseline: std::thread and Synchronization

Before frameworks, understand what they abstract over:

#include <thread>
#include <mutex>
#include <atomic>

// Raw thread
std::thread t([]{ /* work */ });
t.join();

// Mutex for shared state
std::mutex mtx;
std::lock_guard<std::mutex> lock(mtx);  // RAII, released on scope exit

// Atomic for simple counters (no mutex needed)
std::atomic<int> counter{0};
counter.fetch_add(1, std::memory_order_relaxed);

Key concepts:

- Data races — two threads access the same memory, at least one writes, no synchronization → undefined behavior. Not a crash, just undefined.
- Lock granularity — coarse locks are safe but serialize; fine locks are fast but deadlock-prone.
- Atomics — cheaper than mutexes for single-variable shared state. Same concept as CUDA's atomicAdd.
- Profiling first — find hotspots with perf stat or VTune before parallelizing. The bottleneck is rarely the obvious loop.

OpenMP and oneTBB both build on these primitives. Understanding std::thread helps debug race conditions in any framework.


2. OpenMP

2.0 Mental Model: Fork-Join

OpenMP uses the fork-join model. The program starts with one thread (the master). When it hits a parallel region, it forks into a team of threads. When the region ends, all threads join back into one.

main thread ────────────────────────────────────────────────►
                    │                               │
            ┌───────┴──────────────────────────┐   │
            │    #pragma omp parallel           │   │
            │                                   │   │
thread 0 ───┤── work ── work ── work ────────── ┤───┤
thread 1 ───┤── work ── work ── work ────────── ┤───┤
thread 2 ───┤── work ── work ── work ────────── ┤───┤
thread 3 ───┤── work ── work ── work ────────── ┤───┤
            │                                   │   │
            └─────────────── implicit barrier ──┘   │
main thread continues ───────────────────────────────►

The compiler does this for you: #pragma omp parallel for is essentially a loop split + thread pool dispatch + barrier, all generated automatically.

Compile: g++ -fopenmp -O2 my_code.cpp


2.1 Your First Parallel Loop

#include <omp.h>
#include <vector>

int N = 1'000'000;
std::vector<float> a(N), b(N), c(N);

// Sequential: thread 0 does all N iterations
for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];

// Parallel: 8 threads each do ~125,000 iterations
#pragma omp parallel for
for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];

That's it. One line added. The compiler splits [0, N) into chunks, assigns each chunk to a thread, and inserts a barrier at the end.

What each thread sees:

Thread 0:  i = 0 … 124,999
Thread 1:  i = 125,000 … 249,999
Thread 2:  i = 250,000 … 374,999
...
Thread 7:  i = 875,000 … 999,999

This is safe here because each i writes to a different c[i]. No two threads touch the same memory.


2.2 The Classic Race Condition

Wrong — data race:

int sum = 0;

#pragma omp parallel for
for (int i = 0; i < N; i++)
    sum += a[i];   // ← RACE: multiple threads read-modify-write sum simultaneously

// sum is wrong. Possibly different on every run.

Why it breaks:

Thread 0 reads sum = 5
Thread 1 reads sum = 5     ← same value, before thread 0 wrote back
Thread 0 writes sum = 5 + a[0] = 7
Thread 1 writes sum = 5 + a[1] = 6   ← overwrites thread 0's result!

One update is lost. This is a classic read-modify-write race.

Fix: reduction clause

int sum = 0;

#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++)
    sum += a[i];   // each thread accumulates its own private sum
                   // all private sums are added together at the end

Each thread gets its own private copy of sum (initialized to 0). After the loop, OpenMP adds all private copies into the original sum. No race.


2.3 Data Sharing Clauses

Every variable referenced inside a parallel region is either shared (one copy, all threads see it) or private (each thread has its own copy). OpenMP's default: variables declared outside the region are shared.

int x = 10;       // shared by default (dangerous if written!)
int result = 0;   // shared — we want to write it safely with reduction

// Note: a variable may appear in at most ONE data-sharing clause, so
// private(x) and firstprivate(x) are alternatives, not combinable.
// private(x) would leave each thread's copy uninitialized;
// firstprivate(x) initializes each thread's copy to 10.
#pragma omp parallel for shared(a, b, c) firstprivate(x) reduction(+:result)
for (int i = 0; i < N; i++) {
    x = compute(i);           // safe: each thread has its own x
    result += a[i] * b[i];
}

| Clause | Meaning |
|---|---|
| shared(var) | All threads share one copy — reads OK, writes need sync |
| private(var) | Each thread gets its own copy, uninitialized |
| firstprivate(var) | Each thread gets its own copy, initialized from original |
| lastprivate(var) | Private during loop, value from last iteration copied back after |
| reduction(op:var) | Private accumulator per thread, combined with op at end |

Rule of thumb: If all threads only read a variable → shared. If each thread writes it independently → private. If they accumulate into it → reduction.


2.4 Schedules — How Work Is Divided

// Static (default): divide evenly before the loop starts
// Thread 0 gets [0, N/P), thread 1 gets [N/P, 2N/P), ...
// Best when: each iteration costs the same amount of time
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) { /* uniform work */ }

// Static with chunk: interleave chunks of k iterations
// Thread 0 gets 0-7, 32-39, 64-71, ...   (chunk=8, 4 threads)
// Helps with cache locality in some patterns
#pragma omp parallel for schedule(static, 8)
for (int i = 0; i < N; i++) { /* */ }

// Dynamic: each thread takes the next k iterations when it becomes free
// Overhead: ~synchronization per chunk fetch
// Best when: iterations have unpredictable/varying cost
#pragma omp parallel for schedule(dynamic, 64)
for (int i = 0; i < N; i++) { /* variable-cost work */ }

// Guided: starts with large chunks, shrinks over time
// Reduces scheduling overhead while handling tail imbalance
// Best when: later iterations are lighter than earlier ones
#pragma omp parallel for schedule(guided)
for (int i = 0; i < N; i++) { /* */ }

// Runtime: schedule determined by OMP_SCHEDULE env variable
// OMP_SCHEDULE="dynamic,32" ./my_program
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < N; i++) { /* */ }

When to pick which:

All iterations take the same time?     → static (lowest overhead)
Iterations have wildly different cost? → dynamic
Don't know, want adaptive?             → guided
Tuning at runtime without recompile?   → runtime

2.5 Reductions

Supported operators out of the box:

int sum = 0, product = 1, max_val = INT_MIN;   // INT_MIN needs <climits>

#pragma omp parallel for reduction(+:sum) reduction(*:product) reduction(max:max_val)
for (int i = 0; i < N; i++) {
    sum      += a[i];
    product  *= b[i];
    max_val   = std::max(max_val, a[i]);
}

Built-in operators: +, *, -, &, |, ^, &&, ||, min, max.

Custom reduction (C++ only, OpenMP 4.0+):

struct Vec3 {
    float x, y, z;
    Vec3& operator+=(const Vec3& o) { x += o.x; y += o.y; z += o.z; return *this; }
};

#pragma omp declare reduction(vec_add : Vec3 : \
    omp_out.x += omp_in.x; \
    omp_out.y += omp_in.y; \
    omp_out.z += omp_in.z) \
    initializer(omp_priv = Vec3{0,0,0})

Vec3 total{0,0,0};

#pragma omp parallel for reduction(vec_add:total)
for (int i = 0; i < N; i++)
    total += forces[i];

2.6 Collapse — Parallelizing Nested Loops

// Without collapse: only the outer loop is parallelized
// If outer loop has fewer iterations than threads → wasted threads
#pragma omp parallel for
for (int i = 0; i < 4; i++)
    for (int j = 0; j < 1000; j++)
        A[i][j] = B[i][j] * C[i][j];

// With collapse(2): outer * inner = 4000 iterations are parallelized
// Each thread gets a portion of the flattened 4000-iteration space
#pragma omp parallel for collapse(2)
for (int i = 0; i < 4; i++)
    for (int j = 0; j < 1000; j++)
        A[i][j] = B[i][j] * C[i][j];

Use collapse when the outer loop count is small relative to the thread count.


2.7 Critical Sections and Atomics

When reduction is not enough (complex shared state):

// critical: only one thread at a time
// Correct, but slow (acts like a mutex around the block)
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    #pragma omp critical
    {
        global_map[key(i)] += value(i);
    }
}

// atomic: faster for single memory operations (uses hardware atomics)
int counter = 0;
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    #pragma omp atomic
    counter++;           // ← hardware atomic, no mutex overhead
}

// atomic with operation
#pragma omp atomic update
total += a[i];

#pragma omp atomic read
int val = shared_var;

#pragma omp atomic write
shared_var = computed;

Critical vs Atomic:

- #pragma omp critical — mutex, any code block, higher overhead
- #pragma omp atomic — hardware instruction, single read/write/update only, much faster


2.8 Barrier and Nowait

#pragma omp parallel
{
    // Threads do independent work
    do_phase_one(omp_get_thread_num());

    // An 'omp for' normally ends with an implicit barrier — all threads
    // wait there by default before continuing

    // 'nowait' removes the barrier — threads proceed immediately when done
    #pragma omp for nowait
    for (int i = 0; i < N; i++)
        a[i] = compute(i);

    // threads that finish early do NOT wait — they proceed here
    // only safe if subsequent code doesn't depend on all threads finishing

    // Explicit barrier — synchronize all threads at a specific point
    #pragma omp barrier
    // all threads are here now
}

2.9 Tasks — Irregular Parallelism

Tasks are for work that doesn't fit a regular loop: recursive algorithms, tree traversal, linked lists.

// Recursive parallel tree sum using tasks
int sum_tree_impl(Node* node) {
    if (!node) return 0;

    int left_sum, right_sum;

    #pragma omp task shared(left_sum)
    left_sum = sum_tree_impl(node->left);

    #pragma omp task shared(right_sum)
    right_sum = sum_tree_impl(node->right);

    #pragma omp taskwait  // wait for both tasks before using results
    return node->value + left_sum + right_sum;
}

int sum_tree(Node* root) {
    int total;
    #pragma omp parallel   // create the thread team ONCE, at the top level
    #pragma omp single     // only ONE thread creates tasks (others execute them)
    total = sum_tree_impl(root);
    return total;
}

// Note: parallel/single live in the wrapper, not in the recursion — creating a
// new parallel region on every recursive call would be a serious mistake.

Fibonacci (classic task example):

int fib(int n) {
    if (n < 2) return n;

    int x, y;
    #pragma omp task shared(x) firstprivate(n)
    x = fib(n - 1);

    #pragma omp task shared(y) firstprivate(n)
    y = fib(n - 2);

    #pragma omp taskwait
    return x + y;
}

// Call from inside a parallel region:
int result;
#pragma omp parallel
#pragma omp single
result = fib(30);

single is critical: without it, every thread in the team would execute the task-creating code, spawning duplicate tasks and causing an explosion. single ensures only one thread spawns the top-level tasks; the rest of the team executes them.


2.10 Sections — Different Code on Different Threads

#pragma omp parallel sections
{
    #pragma omp section
    {
        printf("Thread %d: loading data\n", omp_get_thread_num());
        load_data();
    }

    #pragma omp section
    {
        printf("Thread %d: initializing config\n", omp_get_thread_num());
        init_config();
    }

    #pragma omp section
    {
        printf("Thread %d: warming up cache\n", omp_get_thread_num());
        warmup();
    }
}
// All three run in parallel, all done here

Use sections for a fixed, small number of different concurrent operations. For dynamic work, use tasks.


2.11 SIMD Pragma

// Ask the compiler to vectorize (SIMD) one loop on a single thread
#pragma omp simd
for (int i = 0; i < N; i++)
    c[i] = a[i] * b[i] + d[i];

// Parallelize across threads AND vectorize within each thread's chunk
#pragma omp parallel for simd
for (int i = 0; i < N; i++)
    c[i] = a[i] * b[i];

// simd with reduction (e.g. sum with SIMD accumulation)
float sum = 0.0f;
#pragma omp simd reduction(+:sum)
for (int i = 0; i < N; i++)
    sum += a[i];

The simd pragma is a hint. The compiler still decides if vectorization is safe. Add -fopt-info-vec to see what was vectorized.


2.12 Thread Info and Environment

// Query runtime info
int n_threads = omp_get_num_threads();   // inside parallel region
int thread_id = omp_get_thread_num();    // 0 to n_threads-1
int max_threads = omp_get_max_threads(); // outside parallel region
int n_procs = omp_get_num_procs();       // hardware thread count

// Set thread count
omp_set_num_threads(8);

// Timing
double t0 = omp_get_wtime();
// ... work ...
double elapsed = omp_get_wtime() - t0;

Environment variables (set before running):

OMP_NUM_THREADS=8             # number of threads
OMP_SCHEDULE="dynamic,64"     # schedule for 'runtime' clauses
OMP_PROC_BIND=close           # bind threads to nearby hardware (NUMA)
OMP_PLACES=cores              # binding granularity: threads, cores, sockets
GOMP_SPINCOUNT=100000         # how long to spin before sleeping (GNU)

2.13 Nested Parallelism

omp_set_max_active_levels(2);  // enable nested parallel regions
                               // (omp_set_nested(1) still works but is
                               // deprecated since OpenMP 5.0)

void outer_task() {
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 4; i++) {
        // Inner parallel region: each of 4 threads spawns 2 more
        #pragma omp parallel for num_threads(2)
        for (int j = 0; j < N; j++)
            compute(i, j);
    }
}

Nested parallelism often over-subscribes cores. Usually better to collapse(2) instead.


2.14 Common Pitfalls

| Pitfall | Symptom | Fix |
|---|---|---|
| Shared write without sync | Wrong results, non-deterministic | reduction, atomic, or critical |
| Capturing i by reference in tasks | Task reads wrong i | firstprivate(i) on the task |
| single missing in task-creation code | Task explosion | Add #pragma omp single around task creation |
| Calling omp_get_num_threads() outside parallel | Returns 1 always | Call inside the parallel region |
| Missing taskwait before using task results | Use before ready | #pragma omp taskwait |
| nowait on dependent loops | Uses wrong data | Remove nowait or add explicit barrier |
| Long critical sections | Serializes threads | Narrow the critical section, use atomic for simple ops |

2.15 OpenMP vs oneTBB

| | OpenMP | oneTBB |
|---|---|---|
| API style | Compiler directives (#pragma) | C++ templates and lambdas |
| Learning curve | Lower — one pragma per feature | Higher — need to know template types |
| Granularity | Loop-level | Task-level |
| Load balancing | Static/dynamic/guided schedules | Work-stealing (automatic, adaptive) |
| Flow graphs | No | Yes (flow_graph) |
| Per-thread storage | threadprivate / private clause | enumerable_thread_specific |
| Nested parallelism | Manual | Built-in |
| SIMD hints | #pragma omp simd | Manual intrinsics |
| Best for | Existing loops, Fortran interop, scientific HPC | New C++ code, complex graphs, irregular tasks |

Resources: OpenMP specifications · OpenMP API Reference Card (PDF)


3. oneTBB (oneAPI Threading Building Blocks)

3.0 Mental Model: Work-Stealing Scheduler

oneTBB is a task-based parallel programming library. Instead of managing threads directly, you express what can run in parallel — the runtime distributes work using a work-stealing scheduler.

How work-stealing works:

Thread 0's deque:  [task A] [task B] [task C]  ← pushes/pops from right (LIFO, cache-friendly)
Thread 1's deque:  [] ← empty, idle
Thread 2's deque:  [task X]
Thread 3's deque:  [] ← empty, idle

Step 1: Thread 0 pops task C from its own deque (right end, hot in cache)
Step 2: Thread 1 is idle → STEALS task A from Thread 0's LEFT end
Step 3: Thread 3 is idle → STEALS task X from Thread 2

Why steal from the left?

- Owner pops from the right (LIFO) — processes its own tasks in stack order, cache-friendly
- Stealer pops from the left (FIFO) — takes the biggest, oldest tasks, maximizing stolen work size

Result: Threads never sit idle as long as any thread has work. No need to manually tune thread counts for imbalanced workloads.

Install:

# Ubuntu/Debian
sudo apt install libtbb-dev

# CMake integration
find_package(TBB REQUIRED)
target_link_libraries(my_target TBB::tbb)

Header:

#include "oneapi/tbb.h"
using namespace oneapi::tbb;

3.1 parallel_for — Parallel Loop

The fundamental building block. Splits a range into chunks and executes each chunk on an available thread.

Simple 1D integer form:

const int N = 1'000'000;
std::vector<float> a(N), b(N), c(N);   // heap-allocated; float a[N] would be a
                                       // non-standard VLA and overflow the stack

// Compact form — runtime picks chunk size automatically
parallel_for(0, N, [&](int i) {
    c[i] = a[i] + b[i];
});

blocked_range form — gives you the subrange, loop inside:

// Better: lets you write inner loops with fewer lambda calls
parallel_for(blocked_range<int>(0, N),
    [&](const blocked_range<int>& r) {
        for (int i = r.begin(); i < r.end(); i++) {
            c[i] = a[i] + b[i];
        }
    }
);

blocked_range<T>(begin, end) is the half-open interval [begin, end). The lambda receives a subrange r — call .begin() and .end() to iterate it.

Why blocked_range over the compact form?

- Avoids per-element lambda call overhead
- Lets you initialize buffers once per chunk (instead of per element)
- Required for custom chunk-level logic (SIMD, temp allocation)

2D range — matrix or image processing:

parallel_for(blocked_range2d<int>(0, rows, 0, cols),
    [&](const blocked_range2d<int>& r) {
        for (int i = r.rows().begin(); i < r.rows().end(); i++)
            for (int j = r.cols().begin(); j < r.cols().end(); j++)
                out[i][j] = process(in[i][j]);
    }
);

3D range:

parallel_for(blocked_range3d<int>(0, D, 0, H, 0, W),
    [&](const blocked_range3d<int>& r) {
        for (int d = r.pages().begin(); d < r.pages().end(); d++)
            for (int h = r.rows().begin(); h < r.rows().end(); h++)
                for (int w = r.cols().begin(); w < r.cols().end(); w++)
                    vol[d][h][w] = f(d, h, w);
    }
);

Partitioners — Control Chunking

The third optional argument controls how work is divided:

// auto_partitioner (default): runtime tunes chunk size automatically
parallel_for(blocked_range<int>(0, N), body);

// affinity_partitioner: reuses same data → same thread (cache-warm)
// Declare static so it persists between calls
static affinity_partitioner ap;
parallel_for(blocked_range<int>(0, N), body, ap);

// simple_partitioner: chunk = grainsize exactly, no dynamic balancing
parallel_for(blocked_range<int>(0, N, 1000), body, simple_partitioner());

// static_partitioner: divide evenly upfront, no stealing
// Deterministic: same thread always gets same range
parallel_for(blocked_range<int>(0, N), body, static_partitioner());

| Partitioner | Chunk size | Load balancing | Use when |
|---|---|---|---|
| auto_partitioner | Adaptive | Work-stealing (default) | Most cases |
| affinity_partitioner | Adaptive | Dynamic + cache-aware | Re-running same loop over same data repeatedly |
| simple_partitioner | = grainsize | None | Each chunk needs temp buffer, or N is huge |
| static_partitioner | Uniform | None | Need reproducible execution for debugging |

Grainsize tuning: Each chunk should take ≥ ~100,000 clock cycles (~50 µs at 2 GHz). Too-small chunks = scheduling overhead dominates. Rule: if loop body takes 10 ns, grainsize of 10,000+ is reasonable.


3.2 parallel_reduce — Parallel Reduction

For loops that accumulate a result: sum, min, max, dot product, histogram.

Lambda form (most common):

// Sum of array
float total = parallel_reduce(
    blocked_range<int>(0, N),
    0.0f,                                        // identity value
    [&](const blocked_range<int>& r, float init) -> float {
        for (int i = r.begin(); i < r.end(); i++)
            init += a[i];
        return init;
    },
    [](float x, float y) -> float {             // combine partial results
        return x + y;
    }
);

Arguments:

1. Range to iterate
2. Identity value (initial value for each partial result)
3. Body: takes a subrange + running partial → returns new partial
4. Combine: merges two partial results into one

Dot product:

float dot = parallel_reduce(
    blocked_range<int>(0, N), 0.0f,
    [&](const blocked_range<int>& r, float init) -> float {
        for (int i = r.begin(); i < r.end(); i++)
            init += a[i] * b[i];
        return init;
    },
    std::plus<float>{}
);

Find minimum with index:

struct MinResult { float val; int idx; };

auto result = parallel_reduce(
    blocked_range<int>(0, N),
    MinResult{FLT_MAX, -1},
    [&](const blocked_range<int>& r, MinResult curr) {
        for (int i = r.begin(); i < r.end(); i++)
            if (a[i] < curr.val) curr = {a[i], i};
        return curr;
    },
    [](MinResult a, MinResult b) {
        return a.val < b.val ? a : b;
    }
);

Common mistake: Do not reset the accumulator to zero inside the body — init carries prior partial results. Resetting it discards prior work.

Deterministic reduce — same result every run regardless of scheduling:

float result = parallel_deterministic_reduce(
    blocked_range<int>(0, N, 1000),  // explicit grainsize required
    0.0f,
    [&](const blocked_range<int>& r, float init) {
        return std::accumulate(&a[r.begin()], &a[r.end()], init);
    },
    std::plus<float>{}
);
// Floating-point result identical on every run

3.3 parallel_scan — Prefix Sum

Parallel prefix scan: each element out[i] = f(out[i-1], in[i]).

#include "oneapi/tbb/parallel_scan.h"

// Parallel inclusive prefix sum
std::vector<float> in(N), out(N);

parallel_scan(
    blocked_range<int>(0, N),
    0.0f,                                         // initial value
    [&](const blocked_range<int>& r, float running, bool is_final) {
        for (int i = r.begin(); i < r.end(); i++) {
            running += in[i];
            if (is_final)
                out[i] = running;
        }
        return running;
    },
    [](float a, float b) { return a + b; }       // combine
);
// out[i] = in[0] + in[1] + ... + in[i]

is_final: The scan body runs twice per chunk — once to compute partial sums (not final), once with the carry-in to fill output. is_final == true means "write to output now".

Use cases: cumulative sums, exclusive scan for compaction, CDF computation.


3.4 parallel_sort

#include "oneapi/tbb/parallel_sort.h"

std::vector<int> data(N);
// ...fill data...

// In-place sort (like std::sort but parallel)
parallel_sort(data.begin(), data.end());

// Custom comparator
parallel_sort(data.begin(), data.end(), std::greater<int>{});

// Sort struct by field
parallel_sort(records.begin(), records.end(),
    [](const Record& a, const Record& b) {
        return a.score > b.score;
    }
);

Uses a parallel quicksort variant with work-stealing. Typically 4–6× faster than std::sort on 8+ cores for large N.

Note: parallel_sort is not stable (equal elements may reorder). For stable ordering use std::stable_sort (oneTBB itself has no parallel stable sort; libstdc++'s std::execution::par policy, which is implemented on top of TBB, is one parallel option).


3.5 parallel_for_each — Unknown Iteration Space

For containers without random-access iterators (linked lists, sets) or when the iteration space grows during execution:

std::list<WorkItem> items = get_work();

parallel_for_each(items.begin(), items.end(),
    [](WorkItem& item) {
        process(item);
    }
);

Dynamic work addition with feeder — BFS / tree traversal:

parallel_for_each(roots.begin(), roots.end(),
    [](Node* node, feeder<Node*>& f) {
        process(node);
        for (Node* child : node->children)
            f.add(child);             // adds work dynamically to the pool
    }
);
// Runs until all nodes processed, including dynamically added ones

3.6 parallel_pipeline — Assembly Line

Pipeline parallelism: data flows through stages. Parallel stages process multiple items concurrently; serial stages maintain order.

#include "oneapi/tbb/parallel_pipeline.h"

const int max_tokens = 16;   // max items in-flight simultaneously

parallel_pipeline(max_tokens,
    // Stage 1: serial input (reads one item at a time)
    make_filter<void, InputData*>(
        filter_mode::serial_in_order,
        [&](flow_control& fc) -> InputData* {
            InputData* data = read_next();
            if (!data) { fc.stop(); return nullptr; }
            return data;
        }
    ) &
    // Stage 2: parallel processing (multiple items at once)
    make_filter<InputData*, OutputData*>(
        filter_mode::parallel,
        [](InputData* in) -> OutputData* {
            return transform(in);
        }
    ) &
    // Stage 3: serial output (writes in original order)
    make_filter<OutputData*, void>(
        filter_mode::serial_in_order,
        [&](OutputData* out) {
            write_result(out);
            delete out;
        }
    )
);

Filter modes:

| Mode | Ordering | Concurrency | Use when |
|---|---|---|---|
| serial_in_order | Preserved | 1 at a time | I/O, ordered output |
| serial_out_of_order | Not preserved | 1 at a time | Single-threaded transform |
| parallel | Not preserved | Multiple | CPU-bound transform, independent items |

max_tokens limits memory: the pipeline never has more than this many items in-flight simultaneously. If processing is fast but output is slow, tokens pile up — limit them to bound memory.

Throughput law: throughput is capped at 1 / (latency of the slowest serial stage), no matter how fast the parallel stages are. max_tokens only needs to be large enough to keep the parallel stages fed; raising it further cannot push throughput past that serial cap.


3.7 parallel_invoke and task_group — Explicit Tasks

For a fixed number of independent tasks (fork-join):

#include "oneapi/tbb/parallel_invoke.h"

// Run two functions in parallel, wait for both
parallel_invoke(
    []{ sort_left_half(); },
    []{ sort_right_half(); }
);
// Both are done here

// Up to N functions
parallel_invoke(
    []{ task_a(); },
    []{ task_b(); },
    []{ task_c(); },
    []{ task_d(); }
);

For a dynamic number of tasks:

#include "oneapi/tbb/task_group.h"

task_group tg;

for (auto& item : work_units) {
    tg.run([&item]{ process(item); });
}

tg.wait();  // blocks until all tasks complete

// task_group::run() is thread-safe — tasks can add more tasks
task_group tg2;
tg2.run([&]{
    tg2.run([&]{ subtask_a(); });  // recursive is fine
    tg2.run([&]{ subtask_b(); });
});
tg2.wait();

3.8 enumerable_thread_specific — Per-Thread Storage

One of the most important oneTBB features for avoiding false sharing and locks. Gives each thread its own copy of a variable.

Problem without it:

// WRONG: all threads write to the same histogram → data race
std::vector<int> hist(256, 0);
parallel_for(0, N, [&](int i) {
    hist[image[i]]++;   // race condition!
});

Fix with enumerable_thread_specific:

#include "oneapi/tbb/enumerable_thread_specific.h"

// Each thread gets its own private histogram
enumerable_thread_specific<std::vector<int>> local_hist(
    []{ return std::vector<int>(256, 0); }  // factory: how to create each copy
);

parallel_for(0, N, [&](int i) {
    local_hist.local()[image[i]]++;   // .local() returns THIS thread's copy
});

// Combine all per-thread histograms into one
std::vector<int> global_hist(256, 0);
for (auto& h : local_hist) {          // iterate over all thread-local copies
    for (int k = 0; k < 256; k++)
        global_hist[k] += h[k];
}

Mechanics:

- .local() returns a reference to this thread's copy, creating it on first call.
- Iterating over local_hist gives one copy per thread that called .local().
- Zero lock contention during the parallel phase.

Another example — thread-local sum:

enumerable_thread_specific<float> thread_sum(0.0f);

parallel_for(blocked_range<int>(0, N), [&](const blocked_range<int>& r) {
    float& my_sum = thread_sum.local();
    for (int i = r.begin(); i < r.end(); i++)
        my_sum += a[i];
});

float total = 0.0f;
for (float s : thread_sum) total += s;

Construction options:

// Default-constructed value
enumerable_thread_specific<int> ets1;          // each copy = int{}

// Value-initialized
enumerable_thread_specific<int> ets2(42);      // each copy = 42

// Factory function (for non-copyable or expensive init)
enumerable_thread_specific<std::vector<int>> ets3(
    []{ return std::vector<int>(1024, 0); }
);

3.9 combinable<T> — Simpler Per-Thread Accumulation

A simplified version of enumerable_thread_specific specifically for accumulating a single value:

#include "oneapi/tbb/combinable.h"

combinable<float> partial_sum;

parallel_for(0, N, [&](int i) {
    partial_sum.local() += a[i];   // thread-local accumulate
});

// Combine all locals with a binary op
float total = partial_sum.combine([](float a, float b) {
    return a + b;
});

vs enumerable_thread_specific:

- combinable<T> — simpler API, designed specifically for reduction patterns
- enumerable_thread_specific<T> — more flexible, supports iteration, factory init, non-trivial types

Use combinable when you just want a thread-local accumulator. Use enumerable_thread_specific when you need to inspect each thread's value separately after the parallel phase.


3.10 Flow Graph — Data Flow and Dependence Graphs

For expressing complex parallel patterns as a graph of nodes and edges. The runtime automatically runs nodes when their inputs are ready — no manual synchronization.

#include "oneapi/tbb/flow_graph.h"
using namespace oneapi::tbb::flow;

graph g;

// function_node<In, Out>: receives In, produces Out
// Second arg = max concurrency (1=serial, unlimited=max)
function_node<int, int> square(g, unlimited, [](int x) { return x * x; });
function_node<int, int> cube  (g, unlimited, [](int x) { return x * x * x; });

// join_node: wait for one input per port, emit as tuple
join_node<std::tuple<int,int>> join(g);

function_node<std::tuple<int,int>, void> printer(g, 1,
    [](const std::tuple<int,int>& t) {
        printf("square=%d  cube=%d\n", std::get<0>(t), std::get<1>(t));
    }
);

// Wire the graph
make_edge(square, input_port<0>(join));
make_edge(cube,   input_port<1>(join));
make_edge(join,   printer);

// Send inputs — both nodes run in parallel
square.try_put(5);
cube.try_put(5);

g.wait_for_all();  // always wait before graph goes out of scope

All Key Node Types

| Node | Description |
|---|---|
| function_node<In, Out> | Transforms In → Out, configurable concurrency |
| input_node<Out> (replaces deprecated source_node) | Generates tokens from a function; starts the graph |
| broadcast_node<T> | Fans out: sends one input to ALL successors |
| join_node<tuple<...>> | Fans in: waits for one from each port, emits tuple |
| split_node<tuple<...>> | Splits tuple: sends element N to port N |
| buffer_node<T> | Buffers messages until a successor requests them |
| queue_node<T> | FIFO buffer; passes messages to any available successor |
| priority_queue_node<T> | Priority-ordered buffer |
| sequencer_node<T> | Reorders out-of-order items back to sequence number order |
| limiter_node<T> | Limits in-flight messages (backpressure) |
| overwrite_node<T> | Stores last value; new successors get it immediately |
| write_once_node<T> | Stores first value; broadcasts to all registered receivers |
| indexer_node<T...> | Tagged union input: accepts any type from set, tags messages |

Fan-Out with broadcast_node

broadcast_node<int> broadcaster(g);
function_node<int, void> worker_a(g, unlimited, [](int x){ do_a(x); });
function_node<int, void> worker_b(g, unlimited, [](int x){ do_b(x); });

make_edge(broadcaster, worker_a);
make_edge(broadcaster, worker_b);  // same input goes to both

broadcaster.try_put(42);  // both worker_a and worker_b receive 42
g.wait_for_all();

Reordering with sequencer_node

// parallel_for may complete items out of order — use sequencer to restore order
sequencer_node<Frame> reorder(g, [](const Frame& f) {
    return f.sequence_number;   // sequencer uses this to order outputs
});

function_node<Frame, Frame> processor(g, unlimited, [](Frame f) {
    f.data = process(f.data);   // runs in parallel, out of order
    return f;
});

function_node<Frame, void> writer(g, serial, [](const Frame& f) {
    write_in_order(f);          // receives frames in sequence order
});

make_edge(processor, reorder);
make_edge(reorder, writer);

Backpressure with limiter_node

// Prevent fast producer from overwhelming slow consumer
limiter_node<int> limiter(g, 8);   // max 8 messages in flight

make_edge(producer, limiter);
make_edge(limiter, slow_consumer);
make_edge(slow_consumer, limiter.decrementer()); // consumer emits continue_msg when done → unblocks limiter

Conditional Routing with indexer_node

using IndexerType = indexer_node<int, float>;   // accepts int OR float

IndexerType idx(g);
function_node<IndexerType::output_type, void> router(g, unlimited,
    [](const IndexerType::output_type& msg) {
        if (msg.tag() == 0)       // int arrived
            handle_int(cast_to<int>(msg));
        else                       // float arrived
            handle_float(cast_to<float>(msg));
    }
);

make_edge(idx, router);
input_port<0>(idx).try_put(42);     // send int
input_port<1>(idx).try_put(3.14f);  // send float

Always call g.wait_for_all() before the graph object is destroyed. Destroying a live graph is undefined behavior.


3.11 Concurrent Containers

Thread-safe containers for parallel access — no external locking needed.

concurrent_hash_map

#include "oneapi/tbb/concurrent_hash_map.h"

concurrent_hash_map<std::string, int> freq;

parallel_for_each(words.begin(), words.end(), [&](const std::string& w) {
    concurrent_hash_map<std::string, int>::accessor acc;  // write lock
    freq.insert(acc, w);   // acc locks this bucket
    acc->second++;         // safe: bucket locked by accessor
});                        // acc destructor releases the lock

// Read-only access: const_accessor (shared lock, multiple readers OK)
concurrent_hash_map<std::string, int>::const_accessor cacc;
if (freq.find(cacc, "hello"))
    printf("hello appears %d times\n", cacc->second);

accessor (write lock) and const_accessor (read lock) give per-bucket locking — much finer granularity than a single std::mutex over a std::unordered_map.

concurrent_vector

Safe for concurrent push_back and element access. Iterators and references remain valid after growth.

#include "oneapi/tbb/concurrent_vector.h"

concurrent_vector<Result> results;

parallel_for(0, N, [&](int i) {
    if (passes_filter(i))
        results.push_back(compute(i));  // thread-safe
});
// results has all passing items (order not guaranteed)

concurrent_queue / concurrent_bounded_queue

#include "oneapi/tbb/concurrent_queue.h"

concurrent_queue<WorkItem> queue;

// Producer thread (can be called from multiple threads)
queue.push(item);

// Consumer thread (non-blocking)
WorkItem item;
if (queue.try_pop(item)) {
    process(item);
}

// Bounded queue — push blocks when full (backpressure).
// Capacity is set via set_capacity, not the constructor.
concurrent_bounded_queue<WorkItem> bounded;
bounded.set_capacity(100);  // max 100 items in flight
bounded.push(item);   // blocks if full
bounded.pop(item);    // blocks if empty

concurrent_priority_queue

#include "oneapi/tbb/concurrent_priority_queue.h"

concurrent_priority_queue<int> pq;

pq.push(10);
pq.push(5);
pq.push(20);

int top;
pq.try_pop(top);  // top = 20 (max by default)

3.12 Scalable Memory Allocator

Standard malloc/free serialize threads on shared allocator state (locks or contended arenas) — a bottleneck when many threads allocate simultaneously. TBB's scalable allocator instead uses per-thread memory pools:

#include "oneapi/tbb/scalable_allocator.h"

// Use scalable_allocator with STL containers
std::vector<float, tbb::scalable_allocator<float>> vec(N);

// Use scalable_malloc directly (drop-in for malloc)
float* buf = (float*)scalable_malloc(N * sizeof(float));
scalable_free(buf);

// Replace all allocations globally (link with -ltbbmalloc_proxy)
// Or: set LD_PRELOAD=libtbbmalloc_proxy.so before running

Most impactful when many threads frequently allocate/free small objects (task objects, intermediate buffers in a pipeline).


3.13 task_arena — Control Thread Pool

task_arena lets you isolate work into a dedicated thread pool, control thread count, and bind to specific NUMA nodes.

#include "oneapi/tbb/task_arena.h"

// Create a pool of 4 threads (instead of using the global pool)
task_arena arena(4);

arena.execute([&]{
    parallel_for(0, N, [&](int i) {
        c[i] = a[i] + b[i];
    });
});
// parallel_for ran on at most 4 threads inside arena

// NUMA-aware: pin arena to socket 1
task_arena numa_arena(task_arena::constraints{}
    .set_numa_id(1)           // socket 1
    .set_max_concurrency(8)   // 8 threads
);

numa_arena.execute([&]{
    parallel_for(0, N, body);
});

Isolation: Work inside arena.execute() does not mix with work in the global arena. This is important when you have multiple independent parallel subsystems that shouldn't steal from each other.

// Two independent subsystems with separate thread pools
task_arena system_a(4), system_b(4);

// These run in their own isolated pools — no work stealing between them
std::thread t1([&]{ system_a.execute([&]{ parallel_for(0, N, body_a); }); });
std::thread t2([&]{ system_b.execute([&]{ parallel_for(0, N, body_b); }); });
t1.join(); t2.join();

3.14 global_control — Runtime Configuration

#include "oneapi/tbb/global_control.h"

// Limit maximum number of worker threads globally
global_control ctrl(
    global_control::max_allowed_parallelism, 4
);
// All TBB algorithms now use at most 4 threads
// Restored when ctrl goes out of scope (RAII)

// Stack size for worker threads
global_control stack_ctrl(
    global_control::thread_stack_size, 8 * 1024 * 1024  // 8 MB
);

3.15 Exception Handling and Cancellation

// Exceptions propagate from worker threads to the calling thread
try {
    parallel_for(0, N, [](int i) {
        if (bad_condition(i))
            throw std::runtime_error("bad input");
    });
} catch (const std::exception& e) {
    // Caught here — remaining iterations are cancelled
    std::cerr << e.what() << "\n";
}

// Explicit cancellation via task_group_context
task_group_context ctx;
parallel_for(blocked_range<int>(0, N), body, auto_partitioner(), ctx);

// From another thread:
ctx.cancel_group_execution();  // remaining tasks will not start

3.16 Work Isolation

By default, a thread waiting on an inner parallel_for may execute tasks from the outer scope. Use this_task_arena::isolate to prevent this:

this_task_arena::isolate([&]{
    parallel_for(0, N, [](int i) { /* inner work */ });
});
// Inner loop uses only threads from the current arena,
// not tasks from outer parallel regions

Use when inner and outer work share data structures that can't be accessed concurrently.


3.17 Design Patterns Summary

Reduce:

float total = parallel_reduce(range, 0.0f, body, std::plus<float>{});

Divide and Conquer (recursive):

constexpr int THRESHOLD = 2048;  // cutoff to serial sort — tune empirically

void parallel_mergesort(int* a, int n) {
    if (n < THRESHOLD) { std::sort(a, a + n); return; }
    parallel_invoke(
        [&]{ parallel_mergesort(a, n/2); },
        [&]{ parallel_mergesort(a + n/2, n - n/2); }
    );
    std::inplace_merge(a, a + n/2, a + n);
}

Map (elementwise):

parallel_for(blocked_range<int>(0, N), [&](const blocked_range<int>& r) {
    for (int i = r.begin(); i < r.end(); i++)
        out[i] = f(in[i]);
});

Histogram (per-thread local + merge):

enumerable_thread_specific<std::vector<int>> local_hist(
    []{ return std::vector<int>(256, 0); });

parallel_for(0, N, [&](int i) {
    local_hist.local()[data[i]]++;
});

std::vector<int> hist(256, 0);
for (auto& h : local_hist)
    for (int k = 0; k < 256; k++) hist[k] += h[k];

Prefix scan:

parallel_scan(range, 0.0f, scan_body, std::plus<float>{});

Assembly line:

parallel_pipeline(max_tokens, input_filter & transform_filter & output_filter);

3.18 Connection to GPU Programming

oneTBB patterns map directly to GPU frameworks:

oneTBB concept | GPU equivalent
--- | ---
parallel_for over blocked_range | CUDA kernel launch over a thread grid
Work-stealing scheduler | Warp scheduler (hardware)
affinity_partitioner | L2 cache locality hints in CUDA
enumerable_thread_specific | Per-block shared memory allocation
combinable<T> | atomicAdd into shared memory, then reduce
concurrent_queue | CUDA stream (async task queue)
parallel_pipeline | CUDA multi-stream pipeline
flow_graph nodes | CUDA graph nodes (cudaGraph)
task_group | CUDA dynamic parallelism
task_arena | CUDA stream priority + stream isolation
Scalable allocator | cudaMallocAsync (stream-ordered pool)

Learning oneTBB's task decomposition makes CUDA's thread/block/grid hierarchy intuitive — they solve the same problem at different scales.


Resources

Resource | What it covers
--- | ---
oneTBB Developer Guide | Full tutorial: parallel_for, reduce, pipeline, flow graph, design patterns
oneTBB API Reference | Complete API
Intel oneTBB Get Started | Installation, CMake, first steps
OpenMP specifications | Official OpenMP standard
OpenMP API Quick Reference Card | All clauses at a glance
Intel Threading Building Blocks (Reinders) | Book: TBB patterns in depth

Next

CUDA and SIMT — GPU parallelism: threads, warps, blocks, and memory hierarchy.