Atomics are lock-free because they execute entirely in hardware. An atomic read-modify-write is performed by the CPU using instructions that guarantee indivisibility across cores without involving the operating system. No thread is put to sleep, no scheduler path is taken, and no context switch is required. That is the key distinction versus locks. The tradeoff is that atomics shift complexity away from the OS and into the memory model: you get low overhead and no blocking, but you must be explicit about visibility and ordering when you need cross-thread coordination.
The most common mistake in performance discussions is equating “cheap” with “no cost”. Atomics avoid kernel transitions, but they still interact with cache coherence. If many cores repeatedly write the same atomic variable, the cache line containing it will bounce between cores. That traffic can dominate runtime even though no thread ever blocks. In contrast, a Mutex can go from “fine” to “brutal” under contention because blocking involves the scheduler and context switching. This is why atomics are often faster for low-contention, short critical work and especially for hot counters, while mutexes are often simpler and better when you need to protect compound state with non-trivial invariants.
A basic starting point is the difference between an atomic counter and a mutex counter. The atomic version is a single RMW instruction (plus coherence), while the mutex version must acquire a lock, check poisoning, and release via guard drop:
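A minimal sketch of the two, with illustrative names:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

static ATOMIC_COUNT: AtomicU64 = AtomicU64::new(0);
static MUTEX_COUNT: Mutex<u64> = Mutex::new(0);

fn bump_atomic() {
    // One atomic RMW; no lock state, no poisoning check, no guard.
    ATOMIC_COUNT.fetch_add(1, Ordering::Relaxed);
}

fn bump_mutex() {
    // Acquire the lock (checking poisoning), increment, release on guard drop.
    *MUTEX_COUNT.lock().unwrap() += 1;
}
```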
This example is intentionally trivial, but it illustrates the practical reason atomics exist: they scale far better for small, frequent operations that do not require blocking. The memory ordering here is Relaxed because the counter does not need to synchronize with any other data. If you use stronger orderings “just to be safe”, you often pay for barriers you do not need, and you are not actually making the program “more correct” unless you are establishing a happens-before relationship.
Memory ordering is what turns atomics from “just an indivisible operation” into “a synchronization primitive”. A typical pattern is publishing initialized data by storing a readiness flag with Release and reading it with Acquire. This enforces the constraint that once the consumer sees the flag, it must also see the preceding writes:
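A sketch of that pattern, deliberately using `static mut` to make the visibility claim stand on its own (a safe version would publish through `Arc` or a dedicated container):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static mut DATA: u64 = 0;
static READY: AtomicBool = AtomicBool::new(false);

// Producer: write the data, then publish it with a Release store.
unsafe fn publish() {
    DATA = 42;
    READY.store(true, Ordering::Release);
}

// Consumer: if the Acquire load observes `true`, the write to DATA
// is guaranteed to be visible.
unsafe fn try_consume() -> Option<u64> {
    if READY.load(Ordering::Acquire) {
        Some(DATA)
    } else {
        None
    }
}
```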
This is the fundamental “publish / observe” mechanism that shows up everywhere from single-producer/single-consumer handshakes to building higher-level structures. The correctness claim here is not “no data races” (the static mut makes it unsafe); the correctness claim is about visibility ordering: Acquire/Release enforces a happens-before relationship between the write to DATA and the observation of READY == true. In safe Rust you’d publish an Arc<T>, store a pointer, or use a safe container designed for this, but the ordering principle remains the same.
A second common use case is a spin lock. A spin lock is still a lock, but it is implemented purely with atomics and never blocks in the kernel. This can be useful for extremely short critical sections where sleeping and waking would cost more than spinning. The correctness is simple: use an atomic flag and loop until you acquire it. The cost is also simple: under contention it can melt your CPU.
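A minimal sketch of such a lock (the type name is illustrative):

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        // Spin until we win the false -> true transition.
        while self
            .locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            hint::spin_loop();
        }
    }

    pub fn unlock(&self) {
        // Release makes the critical section's writes visible to the next owner.
        self.locked.store(false, Ordering::Release);
    }
}
```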
The compare_exchange is the key: it atomically tests and sets. If it sees false, it writes true and succeeds. The ordering is not arbitrary. Using Acquire on success ensures subsequent reads/writes in the critical section are not moved before the lock acquisition. The Release store in unlock ensures all writes in the critical section become visible before the lock is observed as free by another thread. The failure ordering is Relaxed because a failed attempt does not establish synchronization; it just means “try again”.
A more “Rust real-world” use case is a once-initialization pattern: initialize something exactly once and allow lock-free fast-path reads afterwards. If you’re building primitives, you may want a state machine with an atomic enum-like integer. The simplest version uses three states: 0 = uninitialized, 1 = initializing, 2 = initialized. Threads race to transition 0→1, one wins, does initialization, then stores 2 with Release. Readers check for 2 with Acquire:
use std::sync::atomic::{AtomicUsize, Ordering};
use std::hint;

const UNINIT: usize = 0;
const INITING: usize = 1;
const INIT: usize = 2;

static STATE: AtomicUsize = AtomicUsize::new(UNINIT);
static mut VALUE: u64 = 0;

unsafe fn get_or_init() -> u64 {
    if STATE.load(Ordering::Acquire) == INIT {
        return VALUE;
    }
    if STATE
        .compare_exchange(UNINIT, INITING, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
    {
        VALUE = 123456789;
        STATE.store(INIT, Ordering::Release);
        return VALUE;
    }
    while STATE.load(Ordering::Acquire) != INIT {
        hint::spin_loop();
    }
    VALUE
}
This is essentially the skeleton behind “once” primitives. In production you’d avoid static mut and would use OnceLock, OnceCell, or a safe equivalent, but if you are implementing infrastructure, the atomic state machine is the relevant part. The important idea is that the Release store of INIT publishes the initialized data, and the Acquire loads ensure consumers see it.
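For comparison, the safe standard-library equivalent collapses all of that machinery into `OnceLock`:

```rust
use std::sync::OnceLock;

static VALUE: OnceLock<u64> = OnceLock::new();

fn get_or_init() -> u64 {
    // The first caller runs the closure; every later caller gets the
    // cached value via a lock-free fast path.
    *VALUE.get_or_init(|| 123456789)
}
```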
Another frequent pattern is cancellation or shutdown signaling. This is a case where you want a cheap, lock-free flag that can be polled by many threads. Relaxed is often sufficient if no other data needs to be synchronized, because you only care about eventually observing “stop”. If stop must also “publish” other state, then it becomes an Acquire/Release problem.
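A minimal sketch of such a flag, with illustrative names:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Worker loop body: poll the shared flag and exit once it flips.
fn run_until_stopped(stop: Arc<AtomicBool>) -> u64 {
    let mut iterations = 0u64;
    // Relaxed is enough when the flag is a pure signal: we only need to
    // eventually observe `true`, not to synchronize any other memory.
    while !stop.load(Ordering::Relaxed) {
        iterations += 1; // stand-in for real work
    }
    iterations
}
```

A controller thread simply performs `stop.store(true, Ordering::Relaxed)` and joins the worker.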
This is appropriate when the stop flag is purely a signal and not part of a protocol that also publishes memory. If you do have additional data that must be visible once stop is observed, then you want store(Ordering::Release) and load(Ordering::Acquire) for the flag, and you want those additional writes to occur before the release store.
A more performance-oriented example is reducing contention by sharding a hot counter across cache lines. A single global AtomicU64 can become a coherence bottleneck under many writers. A common mitigation is to create per-thread or per-core counters and aggregate periodically. Here is a simple sharded counter using a fixed number of shards and a thread-local index. This reduces cache line bouncing because threads are more likely to hit different cache lines:
use std::sync::atomic::{AtomicU64, Ordering};

const SHARDS: usize = 16;

// Pad each shard to its own cache line so adjacent shards do not share one.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

const ZERO: PaddedCounter = PaddedCounter(AtomicU64::new(0));
static COUNTERS: [PaddedCounter; SHARDS] = [ZERO; SHARDS];

thread_local! {
    // Cheap pseudo-unique per-thread shard index derived from a stack address;
    // in real code you may assign indices explicitly (e.g. from a thread counter).
    static IDX: usize = {
        let probe = 0u8;
        (&probe as *const u8 as usize >> 4) % SHARDS
    };
}

fn increment() {
    IDX.with(|&i| COUNTERS[i].0.fetch_add(1, Ordering::Relaxed));
}

fn total() -> u64 {
    COUNTERS.iter().map(|c| c.0.load(Ordering::Relaxed)).sum()
}
This pattern is extremely common in telemetry systems, allocators, and high-throughput services. The ordering is `Relaxed` because the counter itself is not synchronizing access to other memory; it’s a statistic. The point here is not correctness complexity; it’s reducing coherence pressure.
Atomics also show up when you need a “fast path” that avoids locking most of the time, with a “slow path” that falls back to a mutex for complex updates. A basic example is a cache with a generation counter: readers check a version atomically, and only lock if the version indicates stale state. The version can be incremented with `Release` to publish changes, and readers can use `Acquire` to ensure they see the latest published state before returning. The concept is straightforward: atomics guard the cheap decision; locks protect the complex structure.
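A sketch of that shape, with illustrative names (the atomic guards the cheap staleness check; the `Mutex` protects the structure):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

struct VersionedConfig {
    version: AtomicU64,
    data: Mutex<String>,
}

impl VersionedConfig {
    fn new(initial: String) -> Self {
        Self { version: AtomicU64::new(0), data: Mutex::new(initial) }
    }

    fn update(&self, new: String) {
        let mut guard = self.data.lock().unwrap();
        *guard = new;
        // Publish the change: bump the version after mutating under the lock.
        self.version.fetch_add(1, Ordering::Release);
    }

    /// Fast path: if the version matches what the caller last saw, the cached
    /// copy is kept untouched. Slow path: take the lock and re-clone.
    fn refresh(&self, seen: &mut u64, local: &mut String) {
        let v = self.version.load(Ordering::Acquire);
        if v != *seen {
            *local = self.data.lock().unwrap().clone();
            *seen = v;
        }
    }
}
```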
Below is a larger cohesive example: a **bounded SPSC (single-producer / single-consumer) ring buffer** implemented with Rust atomics. The point is not “a queue exists” (you can use crates), but to show a correct, minimal design where **Acquire/Release is doing real work**: publishing initialized elements and observing them safely across threads, without locks and without the OS scheduler.
In an SPSC ring buffer, there is exactly one thread that pushes and exactly one thread that pops. That constraint matters because it lets us avoid CAS loops and avoid multi-writer coordination. The producer is the only writer of `tail`, and the consumer is the only writer of `head`. Both threads read the other index to decide whether the buffer is full or empty. The storage is a fixed-size circular array of slots. A push writes into the slot at `tail`, then advances `tail`. A pop reads from the slot at `head`, then advances `head`. The only hard part is the visibility rule: when the consumer sees that `tail` advanced, it must also see the element fully written into the slot; similarly, when the producer sees that `head` advanced, it must be allowed to reuse that slot because the consumer is done with it. That is exactly what `Release` and `Acquire` provide.
The implementation below uses `MaybeUninit<T>` for storage because the buffer holds “some initialized elements and some uninitialized slots” at all times. We also intentionally split the buffer into `Producer` and `Consumer` handles to make it structurally harder to misuse. You still must respect the SPSC constraint (one producer thread, one consumer thread). This code is lock-free in the practical sense: no thread ever blocks or sleeps, and operations are O(1) with bounded work.
```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
/// Cache-line padding to reduce false sharing between head and tail.
/// This is a best-effort micro-optimization; it avoids head/tail living on the same line.
#[repr(align(64))]
struct CachePadded<T>(T);
pub struct SpscRing<T, const N: usize> {
// Slots contain either an initialized T (between head..tail) or uninitialized memory.
buf: Box<[UnsafeCell<MaybeUninit<T>>; N]>,
// head: next slot to be read by the consumer
// tail: next slot to be written by the producer
//
// SPSC property:
// - Only consumer writes head.
// - Only producer writes tail.
// Both may read the other's index.
head: CachePadded<AtomicUsize>,
tail: CachePadded<AtomicUsize>,
}
// UnsafeCell is used for interior mutability of the slots.
// SPSC ensures we never concurrently read+write the same slot.
unsafe impl<T: Send, const N: usize> Send for SpscRing<T, N> {}
unsafe impl<T: Send, const N: usize> Sync for SpscRing<T, N> {}
pub struct Producer<T, const N: usize> {
ring: Arc<SpscRing<T, N>>,
}
pub struct Consumer<T, const N: usize> {
ring: Arc<SpscRing<T, N>>,
}
impl<T, const N: usize> SpscRing<T, N> {
pub fn new() -> (Producer<T, N>, Consumer<T, N>) {
assert!(N >= 2, "N must be >= 2 to distinguish full vs empty");
// SAFETY: An array of UnsafeCell<MaybeUninit<T>> is fine to allocate and leave uninitialized.
let buf = Box::new(std::array::from_fn(|_| UnsafeCell::new(MaybeUninit::uninit())));
let ring = Arc::new(Self {
buf,
head: CachePadded(AtomicUsize::new(0)),
tail: CachePadded(AtomicUsize::new(0)),
});
(
Producer { ring: ring.clone() },
Consumer { ring },
)
}
#[inline(always)]
fn mask(i: usize) -> usize {
// We use modulo, not bitmasking, to work for any N.
// If you require N to be power-of-two, you can replace with `i & (N - 1)` for speed.
i % N
}
#[inline(always)]
fn next(i: usize) -> usize {
(i + 1) % N
}
}
impl<T, const N: usize> Producer<T, N> {
/// Attempts to push `value` into the ring.
/// Returns Err(value) if the ring is full.
///
/// Memory ordering:
/// - We read `head` with Acquire to observe consumer progress before overwriting slots.
/// - We write the element into the slot (normal write).
/// - We publish the element by storing `tail` with Release.
pub fn try_push(&self, value: T) -> Result<(), T> {
let tail = self.ring.tail.0.load(Ordering::Relaxed);
let next_tail = SpscRing::<T, N>::next(tail);
// If advancing tail would collide with head, buffer is full.
// Acquire ensures we see the consumer's latest head updates.
let head = self.ring.head.0.load(Ordering::Acquire);
if next_tail == head {
return Err(value);
}
let idx = SpscRing::<T, N>::mask(tail);
// SAFETY: Producer owns writes to slot at `tail`. Consumer only reads slots < tail (modulo),
// and only after observing tail via Acquire.
unsafe {
(*self.ring.buf[idx].get()).write(value);
}
// Release publishes the slot write to the consumer that does an Acquire load on tail.
self.ring.tail.0.store(next_tail, Ordering::Release);
Ok(())
}
/// A convenience helper for busy-wait style producers.
/// In real systems you often pair spinning with backoff or a parking strategy.
pub fn push_spin(&self, mut value: T) {
loop {
match self.try_push(value) {
Ok(()) => return,
Err(v) => {
value = v;
std::hint::spin_loop();
}
}
}
}
}
impl<T, const N: usize> Consumer<T, N> {
/// Attempts to pop an element from the ring.
/// Returns None if the ring is empty.
///
/// Memory ordering:
/// - We read `tail` with Acquire to observe the producer's published writes to slots.
/// - We read the element from the slot (normal read).
/// - We release the slot by storing `head` with Release, allowing producer to reuse it.
pub fn try_pop(&self) -> Option<T> {
let head = self.ring.head.0.load(Ordering::Relaxed);
// Acquire ensures that if we observe an advanced tail, the slot contents are visible.
let tail = self.ring.tail.0.load(Ordering::Acquire);
if head == tail {
return None;
}
let idx = SpscRing::<T, N>::mask(head);
// SAFETY: Consumer owns reads from slot at `head`. Producer writes it before publishing `tail`
// with Release. We loaded tail with Acquire, so the write is visible.
let value = unsafe { (*self.ring.buf[idx].get()).assume_init_read() };
let next_head = SpscRing::<T, N>::next(head);
// Release communicates to producer (that reads head with Acquire) that the slot is free.
self.ring.head.0.store(next_head, Ordering::Release);
Some(value)
}
pub fn pop_spin(&self) -> T {
loop {
if let Some(v) = self.try_pop() {
return v;
}
std::hint::spin_loop();
}
}
}
impl<T, const N: usize> Drop for SpscRing<T, N> {
fn drop(&mut self) {
// On drop, we must drop any still-initialized elements in the buffer.
// This is only safe if no other threads are using it anymore (normal drop rules).
let mut head = self.head.0.load(Ordering::Relaxed);
let tail = self.tail.0.load(Ordering::Relaxed);
while head != tail {
let idx = head % N;
unsafe {
(*self.buf[idx].get()).assume_init_drop();
}
head = (head + 1) % N;
}
}
}
```
This code is short enough to audit and demonstrates the actual synchronization edges. When the producer stores the new tail with Release, that ordering is not chosen "just to be safe"; it is the publish step: it guarantees that the slot's initialization write becomes visible before the consumer is allowed to observe that the element exists. The consumer then loads tail with Acquire, which is the matching observe step: if it sees the updated tail, it is guaranteed to see the initialized element in the slot. Symmetrically, when the consumer stores head with Release, it is publishing that the slot is no longer in use; the producer loads head with Acquire before deciding a slot is reusable, ensuring it will not overwrite a slot that the consumer still needs. That is the whole correctness story in terms of ordering; everything else is mechanics.
Here is a minimal usage example showing one producer thread pushing integers and one consumer thread popping them. This is not meant to be a benchmark, just a concrete end-to-end scenario:
use std::thread;

fn main() {
    const N: usize = 1024;
    let (prod, cons) = SpscRing::<u64, N>::new();

    let producer = thread::spawn(move || {
        for i in 0..1_000_000u64 {
            // choose either try_push or push_spin depending on your policy
            prod.push_spin(i);
        }
    });

    let consumer = thread::spawn(move || {
        let mut sum = 0u64;
        for _ in 0..1_000_000 {
            let v = cons.pop_spin();
            sum = sum.wrapping_add(v);
        }
        sum
    });

    producer.join().unwrap();
    let sum = consumer.join().unwrap();
    println!("sum={sum}");
}
In real systems you typically do not want pure spinning forever, because it burns CPU under imbalance. SPSC queues are often used in low-latency pipelines where spinning is acceptable (or where you add a hybrid policy: spin for a short period, then park/yield). That policy decision is separate from the atomic correctness. The queue itself stays lock-free; you decide whether waiting is active (spin) or passive (parking) around it.
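The hybrid policy can be sketched as a small helper that is independent of the queue (the name and the spin budget of 64 are illustrative tuning choices, not recommendations):

```rust
use std::hint;
use std::thread;

/// Poll `ready` with a short active spin phase, then fall back to yielding.
fn wait_with_backoff(mut ready: impl FnMut() -> bool) {
    for _ in 0..64 {
        if ready() {
            return;
        }
        hint::spin_loop(); // active wait: stay on-CPU, hint the pipeline
    }
    while !ready() {
        thread::yield_now(); // passive wait: give the scheduler a chance
    }
}
```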
One last practical detail: this queue intentionally keeps one slot empty to distinguish full from empty using only head and tail. With N slots, usable capacity is N - 1. That design is common because it avoids an additional atomic “size” counter. If you need full capacity, you can track a separate state bit or use monotonic counters with wrapping arithmetic and compare distance, but the ordering story remains the same: publish element with Release on the producer index, observe with Acquire on the consumer side, and mirror that for slot reclamation.
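The head/tail arithmetic behind that rule can be modeled in isolation (a toy model of the index math, not the queue itself):

```rust
// Toy model of the "one empty slot" full/empty test with N = 4 slots.
const N: usize = 4;

fn is_full(head: usize, tail: usize) -> bool {
    (tail + 1) % N == head
}

fn is_empty(head: usize, tail: usize) -> bool {
    head == tail
}
```

With `N = 4` you can advance `tail` exactly three times before `is_full` fires, which is the `N - 1` usable capacity described above.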
If you want to extend this into something closer to production, the next logical additions are padding each slot to reduce false sharing when T is small and very hot, enforcing N as a power of two to replace modulo with bitmasking, and adding a bounded wait policy (spin-then-yield or spin-then-park) without changing the queue’s core Acquire/Release edges.
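The bitmask replacement is a one-line change because, for `N` a power of two, `i % N` and `i & (N - 1)` agree:

```rust
const N: usize = 8; // must be a power of two for the bitmask form

fn mask_mod(i: usize) -> usize {
    i % N
}

fn mask_and(i: usize) -> usize {
    i & (N - 1)
}
```

The compiler often performs this strength reduction on its own when `N` is a known power of two, but enforcing it structurally (for example with an assert that `N.is_power_of_two()`) keeps the guarantee explicit.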