Lecture Note 05 (L10, L11): Lock-Free Programming (RCU, Atomics) & Deadlock / Priority Inversion¶
Combines: Lecture L10 (Lock-Free: RCU, Atomics, Memory Ordering) and Lecture L11 (Deadlock, Priority Inversion, PI Mutexes).
How This Note Is Organized¶
- Part 1 — Lock-free: Why avoid locks; C11 memory model; atomics and CAS; RCU; SPSC ring buffer; hazard pointers.
- Part 2 — Deadlock & priority inversion: Coffman conditions; prevention (lock ordering, trylock); priority inversion and Mars Pathfinder; PI and priority ceiling; lockdep; watchdog.
Part 1: Lock-Free Programming — RCU, Atomics & Memory Ordering¶
Context: Locks add cost even when uncontended: cache-line bouncing, context switches, priority inversion. On hot paths (camera pipeline, sensor fusion, model config), lock-free techniques use atomics and RCU to coordinate without mutual exclusion.
Why Lock-Free?¶
- Cache-line bouncing: When multiple CPUs repeatedly acquire and release the same lock, the lock's memory location is transferred back and forth between their caches. This transfer (the "bounce") causes delays of ~100–300 cycles on NUMA systems, as each CPU must fetch the latest copy from another CPU's cache before proceeding.
- Context switches: Blocked threads pay scheduler overhead (~1–10 µs).
- Priority inversion: Low-priority holder delays high-priority waiter (see Part 2).
- Convoying: Many threads queue behind one slow holder.
Lock-free programming means coordinating between threads without using traditional locks (like mutexes). Instead, it relies on atomic hardware instructions (such as compare-and-swap, or CAS) and mechanisms like RCU. "Lock-free" doesn't mean there's no coordination—rather, the coordination is done using atomics, which generally provides more predictable and bounded costs compared to locks.
C11/C++11 Memory Model¶
Modern CPUs and compilers often rearrange (reorder) memory operations to improve performance, which can lead to one thread's updates not being visible to another thread right away, unless explicitly controlled. The C11/C++11 memory model describes how and when such reordering is allowed and gives tools (std::atomic<T> in C++ / _Atomic T in C) to specify the kind of visibility and ordering guarantees required between threads.
| Memory Order | What It Guarantees |
|---|---|
memory_order_relaxed |
Only guarantees that this operation is atomic (can't be interrupted), but makes no promises at all about the order of any memory accesses. |
memory_order_acquire |
Guarantees that all loads and stores written after this operation in your code will really happen after it (on this CPU). |
memory_order_release |
Guarantees that all loads and stores written before this operation in your code will really happen before it (on this CPU). |
memory_order_acq_rel |
Combines both acquire and release: useful for read-modify-write (like CAS), guarantees before+after ordering on this operation. |
memory_order_seq_cst |
Strongest: behaves as if there is a single total order for all such operations amongst all threads; the usual default for atomics. |
When and why to use memory_order_relaxed effectively?¶
memory_order_relaxed is used when you need atomicity (no races on a variable), but you do not need to synchronize or order memory between threads. It gives no guarantees about visibility or ordering of other memory—only that each operation is atomic (indivisible).
Effective use cases:¶
- Counters, statistics, ID generators: When multiple threads increment a shared counter, but the exact sequence or timing of increments doesn't matter, only that no increments are lost.
- Flags or progress markers that do NOT control publication of other data: For example, reporting heartbeats, progress, or status updates, where missing the very latest value is acceptable and other data does not need to be synchronized with the flag.
- Random number generators (atomic seed update): You just want atomic update, consistency doesn't depend on timing with other memory.
When NOT to use:¶
- Producer–consumer or hand-off, where one thread must see all previous writes from another before proceeding. (See earlier example —
memory_order_relaxedcannot guarantee this and can break correctness.)
Example: Safe with memory_order_relaxed (atomic counter)¶
std::atomic<uint64_t> requests_handled{0};
void worker_thread() {
// ... handle a request ...
requests_handled.fetch_add(1, std::memory_order_relaxed);
// No need to synchronize with any other memory.
}
Here, memory_order_relaxed is correct and most efficient: it prevents lost updates (atomicity), but doesn't incur expensive memory fences since ordering is irrelevant.
Key takeaway:¶
- Use
memory_order_relaxedonly when you want atomicity, not cross-thread visibility or ordering of other data. - For publishing data or signaling between threads, use
memory_order_release(producer) andmemory_order_acquire(consumer).
In summary:
memory_order_relaxedis fastest, but safe only for cases where you do not care about the order or visibility of anything except the atomic variable itself.
Acquire-release pairing (Full Example):
A release store to variable X establishes a "happens-before" relationship with a subsequent acquire load from X in another thread. This means that all memory writes performed before the release store in one thread are guaranteed to be visible to the thread that does the acquire load after observing the value.
Example: Producer–Consumer with C++11 Atomics¶
#include <atomic>
#include <cassert>
#include <thread>
#include <iostream>
std::atomic<int> flag{0};
int data = 0;
void producer() {
data = 42; // (1) Store data first
flag.store(1, std::memory_order_release); // (2) Signal with release store
}
void consumer() {
while (flag.load(std::memory_order_acquire) != 1) {
// spin/wait until producer signals
}
// (3) After acquire load observes "1" in flag, all writes before the release are visible
assert(data == 42); // Always succeeds!
std::cout << "Consumer sees data: " << data << std::endl;
}
int main() {
std::thread t1(producer);
std::thread t2(consumer);
t1.join();
t2.join();
return 0;
}
What this demonstrates:
- The producer writes to
data, then does aflag.store(..., memory_order_release) - The consumer spins until it sees
flag == 1viaflag.load(..., memory_order_acquire) - After seeing the flag set, the consumer is guaranteed to see
data == 42, as all writes before the release store are visible after the acquire load across threads.
Common pitfall: Using
memory_order_relaxedfor a "data ready" flag only guarantees atomicity of the flag — it does not guarantee that the payload written before the flag store is visible to the reader. Always use release (producer) / acquire (consumer) for producer-consumer signaling.
Example: memory_order_acq_rel with compare-exchange¶
Read-modify-write operations (e.g. CAS) need to both publish the new value to other threads and observe prior writes. memory_order_acq_rel does both on the same operation: on success, the CAS acts as acquire (sees the latest state) and release (makes the update visible). Use it when the atomic is the single synchronization point for a shared structure (e.g. lock-free counter or stack pointer).
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>
std::atomic<int> counter{0};
void increment() {
int expected = counter.load(std::memory_order_relaxed);
while (!counter.compare_exchange_strong(
expected, expected + 1,
std::memory_order_acq_rel, // success: publish our write + see others'
std::memory_order_acquire)) // failure: only acquire (expected gets current)
{
// expected was updated; retry with new value
}
}
int main() {
const int num_threads = 4;
const int increments_per_thread = 100000;
std::vector<std::thread> threads;
for (int i = 0; i < num_threads; ++i) {
threads.emplace_back([&]() {
for (int j = 0; j < increments_per_thread; ++j) {
increment();
}
});
}
for (auto& t : threads)
t.join();
std::cout << "counter = " << counter.load() << " (expected "
<< num_threads * increments_per_thread << ")\n";
return 0;
}
How the CAS loop works:
**compare_exchange_strong(expected, desired, success_order, failure_order)** does one atomic step:- If
counter’s current value equalsexpected: storedesired(i.e.expected + 1) intocounterand returntrue. - Else: leave
counterunchanged, write the current value ofcounterintoexpected, and returnfalse. **while (! ...)**means: keep retrying until the exchange succeeds. On each failure, another thread changedcounterbefore we did, and the CPU already put that new value intoexpected, so the next iteration uses an up-to-dateexpected(we don’t need an extra load).- Success order
memory_order_acq_rel: When we successfully storeexpected + 1, we both release (so our write is visible to others) and acquire (so we see all prior writes by other threads). That keeps the counter consistent with any other shared state tied to it. - Failure order
memory_order_acquire: When the compare fails, we don’t modifycounter; we only read it. Acquire is enough so we see the latest value (andexpectedis updated). We don’t need release on failure because we didn’t write.
Without the loop, a single CAS could fail (e.g. another thread incremented between our load and our CAS), and we would have skipped an increment. The loop retries until our increment is the one that “wins.”
Example: memory_order_seq_cst — single total order¶
memory_order_seq_cst gives a single total order of all seq_cst operations across all threads. Every thread agrees on the order of seq_cst loads and stores. This is useful when you want a simple, globally consistent view of a small coordination point. It is the default for std::atomic operations when you don’t specify an order.
#include <atomic>
#include <iostream>
#include <thread>
std::atomic<bool> stop{false};
std::atomic<int> next_id{0};
void worker() {
while (!stop.load(std::memory_order_seq_cst)) {
// do work
}
}
void request_stop() {
stop.store(true, std::memory_order_seq_cst);
}
int allocate_id() {
return next_id.fetch_add(1, std::memory_order_seq_cst);
}
int main() {
std::thread t1(worker);
std::thread t2(worker);
std::cout << "Allocated id: " << allocate_id() << "\n";
request_stop();
t1.join();
t2.join();
return 0;
}
What this shows:
stopis a simple shutdown flag that many worker threads can check.request_stop()sets the flag once, and every worker eventually sees it and exits.allocate_id()usesfetch_add(1)to hand out a strictly increasing ID across threads.
Real-world example: imagine a camera or web server processing many tasks at once. Each incoming request or frame needs a unique ID so logs, traces, and results can be matched later. allocate_id() acts like a ticket dispenser:
- thread A gets ticket 0
- thread B gets ticket 1
- thread C gets ticket 2
Even if several threads call it at the same time, no two of them get the same number. That is much safer than doing id = next_id; next_id = next_id + 1;, which can race and produce duplicates.
Real-world shutdown example: in a blockchain miner, the stop flag tells all mining threads to stop cleanly when there is no more PoW work to do:
std::atomic<bool> stop{false};
void worker() {
while (!stop.load(std::memory_order_seq_cst)) {
// try a nonce, test PoW, submit a share if found
}
}
void miner_manager() {
// wait for new block template / work
while (!stop.load(std::memory_order_seq_cst)) {
// dispatch work to miners
// if no new work is available and mining should pause/stop:
// stop.store(true, std::memory_order_seq_cst);
}
}
void request_stop() {
stop.store(true, std::memory_order_seq_cst);
}
stop == falsemeans miners should keep searching nonces.request_stop()sets the flag once when the manager decides to stop, pause, or exit.- Each worker checks the flag in its loop, finishes the current attempt, and exits cleanly.
This avoids wasting CPU on useless hashes, leaving threads running after the pool is empty, or hard-killing workers in the middle of shared-state updates. So yes, one manager thread can set a shared atomic flag, and all mining threads will stop on the next check.
Why seq_cst fits here:
- It is easy to reason about.
- Every thread sees one global order of the atomic operations.
- It avoids subtle bugs when the flag is part of a larger concurrent protocol.
In real systems, seq_cst is often chosen for:
- shutdown flags
- “ready” signals
- global counters
- debug builds and correctness-first code
- small coordination points where performance is not the main concern
If you need more performance and can tolerate a weaker guarantee, acquire/release is often enough for a shutdown flag. But for simple control paths, seq_cst is the safest and clearest choice.
Hardware Memory Models & Kernel Barriers¶
What does this mean?
Different CPUs can observe memory operations in different orders unless the programmer uses explicit controls to enforce order. This is called the hardware memory model. On some CPUs (like x86-64), memory accesses appear mostly in the order written in the program. On others (like ARM64 and RISC-V), they can be freely reordered for performance unless you add explicit synchronization.
Because of these differences, in kernel (or low-level) code you often need to insert memory barriers: special instructions that tell the CPU, "do not reorder memory accesses across this point." These can be explicit (special instructions) or implicit (using C11 atomics with specific orderings).
| Architecture | Memory model | What this means / Barrier requirements |
|---|---|---|
| x86-64 | TSO (Total Store Order) — strong | Almost always keeps stores/loads in order; few explicit barriers needed. Atomic operations use LOCK prefix. |
| ARM64 | Weakly ordered | Loads/stores can reorder; you must use barriers: DMB/DSB (memory barrier instructions), and LDAR/STLR (acquire/release forms of load/store) for correct synchronization. |
| RISC-V | Weak (RVWMO) | Also can reorder; synchronize with FENCE instructions and use LR/SC for LL/SC-style atomics. |
Linux kernel barrier macros: (portable APIs that expand to the right instructions for the architecture)
smp_mb()— Memory Barrier: Ensures both loads and stores before the barrier are globally observed before any after it.smp_rmb()— Read (load) Memory Barrier: Prevent reordering of loads across the barrier.smp_wmb()— Write (store) Memory Barrier: Prevent reordering of stores across the barrier.
Practical example/meaning:
On ARM64 (e.g. Jetson Orin), even "simple" multithreaded code that works on x86 (because of its strong memory ordering) can break unless you use atomics/barriers — the CPU might reorder updates and readers could see stale or inconsistent data. That's why kernel (and C11/C++11) code needs explicit atomic operations or memory barriers for shared variables: to guarantee all threads see updates in the intended order and prevent bugs that only show up on weakly ordered chips.
Example: Why memory barriers are needed
Suppose you have a simple producer/consumer flag protocol:
// Producer
data = 123;
flag = 1;
// Consumer
if (flag == 1) {
assert(data == 123); // Could fail!
}
On x86, this almost always works as expected. But on ARM64 or RISC-V, the CPU may reorder the store to flag before data, so the consumer could see flag == 1 but data still having an old or garbage value.
Fix with barriers:
// Producer (ARM64-style)
data = 123;
smp_wmb(); // Ensure data is visible before flag
flag = 1;
// Consumer
if (flag == 1) {
smp_rmb(); // Ensure load of data happens after seeing the flag
assert(data == 123); // Now guaranteed
}
By adding smp_wmb() before setting the flag (producer) and smp_rmb() after checking the flag (consumer), you prevent reordering and ensure correct visibility of data when flag indicates it is ready.
Atomic Operations (Kernel)¶
Map to single indivisible hardware instructions: LOCK ADD/XADD/CMPXCHG on x86, LDADD/CAS on ARMv8.1 LSE.
atomic_t counter = ATOMIC_INIT(0);
atomic_inc(&counter); // LOCK XADD / LDADD
atomic_dec_and_test(&counter); // decrement; return true if zero (refcounts)
int old = atomic_cmpxchg(&counter, 5, 10); // if counter==5 set to 10; return old value
atomic_add_return(n, &counter); // add n, return new value (rate limiting)
atomic_tis 32-bit;atomic64_tfor 64-bit. Use for: reference counts, flags, per-CPU statistics.
Compare-and-Swap (CAS): Meaning¶
Meaning:
Compare-and-Swap (CAS) is a hardware-supported atomic instruction that checks if a memory location contains an expected value, and if so, updates it to a new value—all in one indivisible step. This is the basic building block for implementing data structures and algorithms that allow multiple threads to coordinate without using locks. With CAS, threads can safely update shared data despite running at the same time, because the operation guarantees that no two threads can both succeed at the same update.
Practical use:
You use CAS to implement things like lock-free queues, stacks, reference counters, and other algorithms where multiple threads might try to update the same value at once. If two threads race to update a value, only one will succeed. The other will see that it lost the race and can try again.
Code example:
std::atomic<int> val{0};
int expected = 0;
bool ok = val.compare_exchange_strong(expected, 1,
std::memory_order_acq_rel, // On success: this thread sees all prior writes/updates, and its own write is visible to others.
std::memory_order_acquire); // On failure: just read latest, no write.
- If
valcurrently equalsexpected(0), set it to 1, andokis true. - If not,
valis unchanged,expectedis updated to whatever the value was,okis false, and you typically retry. - Inside a loop, you usually use
compare_exchange_weak(can give false negatives on some hardware but may be faster).
Pitfall — the ABA problem (meaning):
Suppose a pointer's value is A, then it gets changed to B, then back to A (but now A points to a different object). CAS only checks for "A", and if it sees A, it assumes nothing changed—but in reality, the object at A may have been deleted and reused, causing a use-after-free bug.
Meaning of ABA Solutions:
- Version tags: Attach a counter to the pointer, so any change increments the count (need double-wide atomic compare for this, e.g. 128-bit CAS).
- Hazard pointers: Readers advertise "I'm using this pointer"; writers only free memory after no reader is using it.
- RCU: Updates ensure old values are not freed until all prior readers have finished, preventing the bug.
Example — how version tags fix ABA:
Store both the pointer and a version number together. If the pointer changes away from A and later comes back to A, the version will still be different, so CAS will fail instead of being fooled by the same address.
struct TaggedPtr {
Node* ptr;
uint64_t version;
};
std::atomic<TaggedPtr> head;
void update_head(Node* old_ptr, Node* new_ptr) {
TaggedPtr expected = {old_ptr, old_ptr ? old_ptr->version : 0};
TaggedPtr desired = {new_ptr, expected.version + 1};
// Fails if either the pointer OR the version changed.
head.compare_exchange_strong(expected, desired,
std::memory_order_acq_rel,
std::memory_order_acquire);
}
If another thread changes head from A to B and then back to A, the pointer value may look the same, but the version will be higher. That tells us the object was modified in between, so the stale CAS does not succeed.
In summary:
CAS is "atomic if equal, else update and retry", forming the foundation of safe, efficient lock-free programming by providing a way for threads to agree on shared memory changes without locks—just using a built-in check-and-swap step that never halfway updates a value.
SPSC Ring Buffer (Lock-Free): In-Depth with MengRao's SPSC_Queue¶
The lock-free SPSC ("Single Producer, Single Consumer") queue is an efficient queue for communication between exactly one producer thread and one consumer thread. Let's break down how it works, inspired by and referencing the widely respected MengRao/SPSC_Queue implementation, which is a gold standard for high-performance SPSC queues.
Key Properties¶
- Lock-Free: No thread ever blocks or waits; all queue operations are non-blocking and progress independently.
- No False Sharing: Producer and consumer only touch separate synchronization variables, reducing cache line bouncing.
- Cache Efficiency: Payload buffer is never "locked" by both sides at once, so high throughput is possible (critical for high-rate pipelines like camera frames or audio).
Basic Structure¶
MengRao's queue, and most fast SPSC queues, use the following design:
- Buffer: A fixed-size circular buffer (array), where all slots are preallocated and accessed modulo the buffer size (which is a power of 2).
- Separate Indices: Two indices, one for the producer (
tailorwrite_idx) and one for the consumer (headorread_idx), each owned and only updated by one thread. - Fences: Use of proper memory orderings (
release,acquire) ensures that data written by the producer is visible to the consumer only after the consumer observes the index update, and vice versa.
Reference Structure — MengRao SPSC_Queue¶
Here is an annotated version, blending the original style with C/C++ conventions (for full detail see MengRao/SPSC_Queue/queue_spsc.h):
template<typename T, size_t N = 256>
class SPSCQueue {
alignas(64) T buf[N]; // Circular buffer for data; cache-line aligned to avoid false sharing
alignas(64) std::atomic<size_t> head = 0; // Read index — owned only by consumer
alignas(64) std::atomic<size_t> tail = 0; // Write index — owned only by producer
public:
// Called only from the producer thread
bool enqueue(const T &item) {
size_t tail_cache = tail.load(std::memory_order_relaxed);
size_t head_cache = head.load(std::memory_order_acquire); // see what consumer has consumed
if (((tail_cache + 1) & (N - 1)) == (head_cache & (N - 1))) {
// Buffer is full (one slot left empty to avoid overwrite ambiguity)
return false;
}
buf[tail_cache & (N - 1)] = item;
tail.store(tail_cache + 1, std::memory_order_release); // publish
return true;
}
// Called only from the consumer thread
bool dequeue(T &item) {
size_t head_cache = head.load(std::memory_order_relaxed);
size_t tail_cache = tail.load(std::memory_order_acquire); // see what producer has added
if ((head_cache & (N - 1)) == (tail_cache & (N - 1))) {
// Buffer is empty
return false;
}
item = buf[head_cache & (N - 1)];
head.store(head_cache + 1, std::memory_order_release);
return true;
}
};
Key Points:¶
- Single Ownership Principle: Only the consumer thread changes
head, and only the producer changestail. Each thread reads (but does not write) the other's index. - Wrap-around: Buffer size is a power of two — wrapping is done with a bitmask (
& (N - 1)) for speed. - Full vs. Empty: We keep one slot empty intentionally, so
tail + 1 == headmeans full,tail == headmeans empty (this avoids ambiguity). - Memory Barriers:
store(..., memory_order_release)ensures that all prior stores to the buffer are visible before publishing the index.load(..., memory_order_acquire)ensures that the subsequent buffer reads see the latest data.
Why Is This Better Than Locks or Naive Atomics?¶
- Zero contention on data: Only the relevant thread ever touches its own index; the buffer slot is always written then the index updated.
- No locks, no blocking: No mutex, semaphore, or heavy synchronization.
- Scalable for high-throughput: On modern CPUs, with careful alignment and use of explicit atomics, this design can achieve tens or hundreds of millions of messages per second with minimal cache traffic.
Practical Usage (like OpenPilot VisionIPC, ML data streaming, etc.)¶
- Used in high-throughput situations like camera frame streaming, audio pipelines, or neural model pipes, where classic locks or blocking sync would be a performance bottleneck.
- Constraint: You must have exactly one producer and one consumer. If you need multiple, see MPMC (multi-producer multi-consumer) queues, which are much more complex.
Summary Table¶
| Feature | SPSC MengRao Queue |
|---|---|
| Threads supported | 1 producer, 1 consumer |
| Lock required? | No |
| Blocking/Waiting? | Never |
| Throughput | Extremely high |
| Caveat | Exactly 1 prod/1 cons |
| Used in | Networking, vision, ML |
See more:
- MengRao/SPSC_Queue - queue_spsc.h — See also the README for benchmarking and usage.
RCU (Read-Copy-Update)¶
Linux kernel’s primary mechanism for read-mostly shared data. O(1) read-side cost: no locks, no atomics, no cache-line writes — only preempt disable/enable.
Reader side¶
rcu_read_lock();
ptr = rcu_dereference(gp); // READ_ONCE + compiler barrier
/* use *ptr — guaranteed valid until rcu_read_unlock */
rcu_read_unlock();
Writer side (copy → modify → publish → wait → free)¶
- Allocate and populate a new version of the object.
- Publish:
rcu_assign_pointer(gp, new_obj)(smp_wmb + pointer store). - Wait for a grace period: all pre-existing readers have finished.
- Free the old object (no reader can still hold it).
new_obj = kmalloc(sizeof(*new_obj), GFP_KERNEL);
*new_obj = *old_obj;
new_obj->field = updated_value;
rcu_assign_pointer(gp, new_obj);
synchronize_rcu(); // blocks until grace period
kfree(old_obj);
call_rcu(&old->rcu_head, free_fn): asynchronous form; writer does not block; used in interrupt context or when writer must not block.
Grace period¶
A grace period is the interval after rcu_assign_pointer() until every CPU has passed a quiescent state (context switch, entry to idle, or return to user space). The kernel tracks these events; when all CPUs have quiesced, no pre-existing reader can still hold the old pointer.
RCU variants¶
| Variant | Read section can sleep? | Use case |
|---|---|---|
| Classic RCU | No | Routing tables, task_struct lookup, module parameters |
| SRCU (Sleepable RCU) | Yes | Notifier chains, subsystem registrations |
| PREEMPT_RCU | No (but preemptible) | CONFIG_PREEMPT kernels |
rcu_nocbs=<cpulist>: Offloads call_rcu callbacks to rcuoc kthreads on non-isolated CPUs. On an isolated inference core, normal RCU would run callbacks there at unpredictable times; rcu_nocbs= removes that latency source.
kfifo (Kernel SPSC FIFO)¶
DECLARE_KFIFO(my_fifo, int, 64); // size power of 2
INIT_KFIFO(my_fifo);
kfifo_put(&my_fifo, value); // producer
kfifo_get(&my_fifo, &value); // consumer
kfifo_len(&my_fifo);
Used in kernel drivers for RX buffers between ISR (producer) and process context (consumer). Kernel’s standard lock-free SPSC FIFO.
Hazard Pointers¶
Userspace alternative to RCU for lock-free reclamation:
- Reader publishes the pointer it is about to dereference into a per-thread hazard pointer slot.
- Before freeing, the reclaimer scans all hazard pointer slots for that pointer.
- If found → defer free; if not found → free is safe.
Used in Folly (Meta), Java java.util.concurrent; std::hazard_pointer in C++26.
Part 1 Summary Table¶
| Technique | Reader cost | Writer cost | Safe in ISR? | Limitation |
|---|---|---|---|---|
| Spinlock | Cache write + spin | Cache write | Yes | Wasted CPU; no sleep |
| CAS loop | Atomic RMW + retry | Atomic RMW + retry | Yes | ABA; retry cost |
| SPSC ring | Acquire load | Release store | Yes (producer or consumer) | Single producer AND consumer only |
| RCU | Preempt disable | Copy + grace period | No (synchronize_rcu blocks) | Read-mostly; writer pays |
| Hazard pointers | Publish + load | Scan HP slots | No | Reclamation overhead |
| kfifo | Acquire load | Release store | Yes | Kernel-only; SPSC only |
Part 1 — Conceptual review¶
- Why is relaxed wrong for producer-consumer signaling? Relaxed guarantees only atomicity of the atomic variable; it does not guarantee that data written before the store is visible to the reader. Use release/acquire for “data ready” flags.
- Why does RCU work without reader-side locks? The kernel detects quiescent states (context switch, idle, return to user). A grace period ends only after every CPU has quiesced, so no pre-existing reader can still hold the old pointer.
- When does CAS fail and what should the retry loop do? CAS fails when another thread changed the value. On failure, the
expectedvariable is updated with the current value; the loop should re-read dependent state, recompute the new value, and retry — not retry with stale computation. - RCU vs rwlock for model config? With rwlock, writers block new readers. With RCU, readers are never blocked; the writer works on a copy and publishes atomically. For a 100 Hz inference loop reading config every cycle, RCU is the right choice.
Part 2: Deadlock, Priority Inversion & PI Mutexes¶
Context: Deadlock = tasks waiting for each other forever. Priority inversion = high-priority task blocked by a low-priority one (via a shared resource). Both require design-time and runtime measures (ordering, PI, lockdep).
Deadlock: Definition & Coffman Conditions¶
Deadlock: A set of processes are each waiting for a resource held by another in the set; no process can ever make progress.
Process A holds L1, waiting for L2 ──► Process B holds L2, waiting for L1
Neither can proceed. Both wait forever.
Coffman conditions (all four must hold for deadlock to be possible; break any one to prevent it):
| Condition | Definition |
|---|---|
| Mutual exclusion | At least one resource is non-sharable — only one process may hold it at a time |
| Hold-and-wait | A process holds at least one resource while waiting to acquire another |
| No preemption | Resources cannot be forcibly taken; only voluntary release |
| Circular wait | A cycle exists in the resource-allocation graph: P1→R1→P2→R2→P1 |
Deadlock Prevention¶
Attack one of the four conditions at design time:
| Condition to break | Technique | Trade-off |
|---|---|---|
| Hold-and-wait | Acquire all locks at once; release all if any acquisition fails | Less concurrency; retry logic |
| Circular wait | Global lock ordering: always acquire in a fixed, documented order | Discipline at all call sites; lockdep enforces |
| No preemption | mutex_trylock() with randomized exponential backoff |
Retry overhead; livelock risk |
| Mutual exclusion | Lock-free data structures | Higher implementation complexity |
Global lock ordering (step-by-step):
- Enumerate all mutexes/locks in the subsystem.
- Assign each a numeric rank (e.g. Lock A = 1, Lock B = 2).
- Mandate: always acquire in ascending rank order.
- Document the ordering at the lock declaration site.
- Use
lockdep_set_classso lockdep can verify ordering automatically. - In code review, reject any patch that acquires a lower-ranked lock while holding a higher-ranked one.
Pitfall: Ordering breaks when new locks are added without updating the global order — two developers can introduce an A→C→D and A→D cycle. Use lockdep in CI to catch cycles.
Deadlock Detection & Recovery¶
- Resource allocation graph: Nodes = processes (P) and resources (R); edge P→R = P waits for R; edge R→P = R held by P. A cycle = deadlock.
- Wait-for graph: Simplified; edge P→Q means P waits for something held by Q; cycle = deadlock.
Recovery options after detection: (1) Abort one process and free its resources (choose victim by priority, runtime, resources held). (2) Preempt a resource from one process (requires rollback support). (3) Roll back to a safe checkpoint (requires checkpointing). lockdep in Linux detects potential deadlock at runtime when a new lock ordering is first seen — before an actual hang.
Priority Inversion¶
Mechanism (four steps):
- L (low priority) acquires mutex M.
- H (high priority) wakes and tries to acquire M — blocks waiting for L.
- M (medium priority) wakes and preempts L (M > L).
- L never runs → cannot release M → H waits indefinitely despite being highest priority.
H is effectively running at L’s priority — inverted. Duration of inversion is unbounded without PI.
Time ─────────────────────────────────────────────────────────►
H (high): [woken]──[BLOCKED on mutex held by L]────────────────►
M (med): ─────────────────────[RUNNING]────────────────────────►
L (low): [holds mutex]──[PREEMPTED by M]──────[never runs]
Insight: Inversion can occur even when every task is correct. It’s an emergent effect of interactions; code review alone cannot catch it.
Mars Pathfinder (1997)¶
Rover had periodic resets ~18 hours after landing. Root cause: unbounded priority inversion.
| Task | Priority | Role |
|---|---|---|
| bc_dist | High | Data distribution bus; needed the mutex |
| bc_sched | Low | Bus scheduler; held the mutex |
| ASI_MET | Medium | Meteorological data; CPU-intensive |
Sequence: bc_sched (low) held mutex → bc_dist (high) blocked → ASI_MET (medium) preempted bc_sched and ran → bc_sched never ran → bc_dist missed its watchdog → VxWorks watchdog fired → full system reset. Fix: Enable priority inheritance on the shared mutex (feature existed in VxWorks but was off by default). Fix was uploaded to the rover via uplink.
Lesson: PI must be the default for any mutex shared between tasks of different priorities in safety-critical or inaccessible systems.
Priority Inheritance (PI)¶
When H blocks on a mutex held by L, the kernel temporarily boosts L to H’s priority until L releases the mutex. Transitive PI: if L is also blocked on another mutex held by X, the boost propagates to X so the whole chain can run and release.
BEFORE PI: AFTER PI:
H (90): BLOCKED H (90): BLOCKED
M (50): RUNNING M (50): BLOCKED (cannot preempt L)
L (10): PREEMPTED L (10→90): RUNNING → releases mutex → H runs
- Linux:
rtmuteximplements PI; PREEMPT_RT converts mostspinlock_ttortmutex→ PI system-wide (e.g. CAN, V4L2, GPU drivers) without driver changes. - Userspace:
pthread_mutexattr_t attr;
pthread_mutexattr_init(&attr);
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
pthread_mutex_init(&mutex, &attr);
Pitfall: PI is most effective with SCHED_FIFO/SCHED_RR. On SCHED_OTHER (CFS), “priority” is relative and the boost may not have the desired effect.
Priority Ceiling Protocol¶
Each mutex has a ceiling priority = highest priority of any task that will ever acquire it. Any task that acquires it immediately runs at the ceiling for the duration — so a high-priority waiter never has to block (L is already at ceiling). Prevents inversion without runtime discovery; requires static analysis (all acquirers and their priorities known at design time).
| PI | Priority ceiling | |
|---|---|---|
| When boost happens | When H blocks on L | On every acquisition |
| Overhead | Only when contention | Every acquisition |
| Suited to | Dynamic task sets, general RT | AUTOSAR, ARINC 653 (fixed task set) |
POSIX: PTHREAD_PRIO_PROTECT. When mandatory: AUTOSAR OS and ARINC 653 require formal bounded-inversion proofs; priority ceiling provides that statically.
lockdep — Live Deadlock Detector¶
CONFIG_PROVE_LOCKING=y, CONFIG_LOCK_STAT=y (optional, per-lock contention stats).
- Assigns each lock a lock class (from static address in kernel binary).
- Records every acquisition chain: “lock A was held when lock B was acquired.”
- Reports AB-BA cycles the first time a new ordering is seen — before an actual hang.
- Reports invalid context: e.g.
mutex_lock()from hardirq (sleeping lock in IRQ).
Example report:
WARNING: possible circular locking dependency detected
task/1234 is trying to acquire lock: (&lockB){...}, at: function_b+0x30
but task holds lock: (&lockA){...}, taken at: function_a+0x20
which lock already depends on the new lock.
- lockdep_assert_held(&lock): Documents and verifies that a lock is held at a given point.
- Enable in development/CI; disable in production (~10% overhead).
Watchdog Timers¶
If a process does not kick the watchdog within the timeout, the system resets or enters a safe state — defense-in-depth against deadlocks that slip through. On Mars Pathfinder the watchdog did detect bc_dist missing its deadline; the real fix was enabling PI so the deadline was not missed. Pitfall: Treating watchdog reset as an acceptable “recovery” for priority inversion is wrong — e.g. on an ADAS controller it forces disengagement; the correct fix is to eliminate the inversion.
Part 2 Summary Table¶
| Problem | Symptom | Detection | Solution |
|---|---|---|---|
| Deadlock | Hang; tasks blocked forever | Resource graph; lockdep | Lock ordering; acquire-all; trylock |
| Priority inversion | High-priority task stalls | Latency; watchdog | PI mutex; priority ceiling |
| Livelock | CPU busy, no progress | Profiling | Backoff; arbitration |
| Starvation | Low-priority never runs | Wait monitoring | Aging; FIFO; boost |
Part 2 — Conceptual review¶
- Why can priority inversion occur with “correct” code? It’s emergent: L holds the lock correctly, H waits correctly, M preempts correctly. The failure is in the system design (shared mutex across priority levels without PI).
- Mars Pathfinder lesson: PI existed in VxWorks but was off by default. Safety-critical systems must audit every mutex shared across priorities and enable PI.
- PI vs priority ceiling? PI boosts the holder only when a higher-priority waiter blocks (dynamic). Ceiling boosts the acquirer to ceiling on every acquisition (static). Ceiling has stronger predictability for fixed task sets (AUTOSAR, ARINC 653).
- What does lockdep not report? It does not report priority inversion — that needs latency analysis (e.g. cyclictest) or scheduling analysis.
- Why is transitive PI important? If H blocks on L and L blocks on X, X must also be boosted; otherwise L cannot run and cannot release the mutex for H.
AI Hardware Connection¶
- RCU: Live model config and LoRA swaps without pausing inference; writer publishes new config, old freed after grace period. Readers (inference threads) never block.
- SPSC ring + acquire/release: Zero-copy camera path (e.g. openpilot VisionIPC between camerad and modeld); no mutex on the 30 Hz frame path; throughput limited by memory bandwidth.
- rcu_nocbs=: On Jetson/inference-dedicated cores, offloads RCU callbacks to helper CPUs so isolated RT cores don’t see callback jitter.
- CAS-based queues: Used in VisionIPC-style pipelines for multi-producer sensor aggregation; each sensor enqueues without a mutex; inference thread drains once per cycle.
- Atomic refcounts (kref, std::atomic): Manage DMA-BUF buffer lifetime across V4L2 and GPU consumers; buffer freed when last consumer decrements to zero.
- PI mutex (rtmutex / PTHREAD_PRIO_INHERIT): Required for openpilot controlsd/plannerd and cereal shared state with plannerd (medium) and loggers (low); Mars Pathfinder is a standard reference in ASIL-B safety reviews.
- ROS2: rclcpp real-time executor docs explicitly require PTHREAD_PRIO_INHERIT so control callbacks are not starved by data-logging threads.
- lockdep: Standard in camera, ISP, and DMA driver bring-up on Jetson/i.MX before deployment; AB-BA cycles must be eliminated.
- Priority ceiling: Used in AUTOSAR ECUs with fixed task sets; stronger for ASIL-D certified functions.
- PREEMPT_RT: System-wide spinlock→rtmutex gives PI automatically on CAN, V4L2, and GPU drivers without driver code changes.
Lecture Note 05 — Combines L10 (Lock-Free: RCU, Atomics, Memory Ordering) and L11 (Deadlock, Priority Inversion, PI Mutexes).