Matching Engine — System Design

Prerequisites

order-books — CLOB structure, price-time priority, order types

trading-fundamentals — spreads, market makers, the timing problem

Interview context

This note is structured as a system design interview answer. It starts with requirements gathering, moves through high-level architecture, then dives into the matching engine core, scaling, and failure modes. The companion notebook (notebook) lets you run an actual matching engine and see how it behaves under load.

The problem

Design a system that accepts buy and sell orders for financial instruments and matches them fairly, deterministically, and fast. This is the core of every exchange — NYSE, NASDAQ, Binance, CME — and one of the most latency-sensitive systems in existence.

Scale context:

Training-data note

The scale numbers below are approximate and may be outdated. They were written from Claude’s training data, not from verified exchange documentation. Use them as order-of-magnitude guidance for a system design interview, not as authoritative benchmarks. In particular: Binance has claimed ~1.4M orders/sec capacity and lists 2,000+ trading pairs. NYSE latency varies by order path (electronic vs DMM).

Metric	NYSE	Binance	Target for design
Orders/second (peak)	~1M	~100K–1M+ (claimed)	100K–1M
Matching latency (p99)	<100μs	<1ms	<1ms
Instruments	~4,000	~2,000+ (claimed)	10,000
Concurrent connections	~1,000	~100K	10,000

Step 1: Requirements

Functional

Submit orders — limit, market, with optional execution policies (IOC = immediate-or-cancel, FOK = fill-or-kill, GTC = good-til-cancel, AON = all-or-none)
Cancel/modify orders — by order ID
Match orders — price-time priority (see price-time vs pro-rata)
Publish market data — real-time depth updates, trade reports, NBBO (National Best Bid and Offer — the best available buy and sell prices across all venues)
Support multiple instruments — each with its own independent book

Non-functional

Determinism: given the same sequence of inputs, the engine must produce the same sequence of outputs. This is critical for auditing, replay, and regulatory compliance.
Fairness: orders are processed in arrival order. No participant gets preferential treatment from the engine itself. (Note: this is engine-level fairness only — access fairness is a separate problem. See exchange-fairness-and-access.)
Latency: sub-millisecond matching. The engine must not be the bottleneck.
Availability: the exchange can halt trading (circuit breakers exist), but it must never lose orders or produce incorrect matches.
Consistency over availability (in CAP (Consistency, Availability, Partition tolerance — Brewer’s theorem: a distributed system can guarantee at most two of three) terms): a matched order must never be double-filled. Correctness beats uptime.

Step 2: High-level architecture

Logical component diagram — shows data flow between software components, NOT physical deployment. In LMAX’s architecture, the sequencer (input Disruptor) and matching engine (BLP) are threads in the same JVM on the same machine, connected by a shared-memory ring buffer. The gateway is a separate process (may be same or different machine). Market data, drop copy, and risk are downstream consumers of the output ring buffer.

Components

Gateway — the entry point for all client connections. Validates messages (correct instrument ID, valid price, sufficient permissions), rate-limits per client, normalizes wire format into internal messages. Stateless — can be horizontally scaled behind a load balancer. Typical protocols: FIX (Financial Information eXchange, the industry-standard messaging protocol for trading) over TCP, or a binary protocol for lower latency.

Sequencer — the single most critical component. Assigns a monotonically increasing sequence number to every incoming message. This creates a total order — a single, unambiguous timeline of events. Without this, you cannot guarantee determinism or fairness.

Why a single sequencer?

In distributed systems terms, this is a linearization point. Every message passes through a single thread/process that stamps it. This sounds like a bottleneck — and it is, by design. The alternative (distributed ordering) requires consensus protocols (Paxos, Raft) that add latency and complexity. At the volumes we’re targeting (~1M messages/second), a single well-optimized sequencer on modern hardware can handle the load. LMAX Exchange (a London-based forex exchange) demonstrated this architecture: a single-threaded Business Logic Processor (BLP) that processes ~6 million orders per second.

Disruptor vs BLP — what the "6 million" measures

The LMAX Disruptor ring buffer library benchmarks at 15–25 million operations/sec for pure inter-thread message passing (zero-allocation, zero-lock, zero-kernel-transition). The 6 million orders/sec figure refers to the Business Logic Processor — the single-threaded matching engine sitting downstream of the Disruptor — executing actual order book lookups, fills, and state mutations.

At 6M orders/sec on a 3 GHz CPU, that’s ~500 clock cycles per order. This is achievable because:

Cache residency: the working set (top price levels, deque heads, active order map) fits in L1/L2 cache (1–15 cycles per access vs 200–300 cycles for RAM)

Zero allocation: all event objects are pre-allocated in the ring buffer — no GC pauses

Branch prediction: the common case (hit top of book, fill one resting order) is highly predictable

Single writer, no synchronization: no locks, no CAS (compare-and-swap) operations, no memory barriers beyond sequence cursors

NUMA-local memory: exchange servers typically have 2–4 CPU sockets. Each socket has its own RAM banks (NUMA = Non-Uniform Memory Access — accessing RAM on your own socket is ~80ns, accessing the other socket’s RAM is ~130ns). The 1.28 GB ring buffer does NOT fit in cache — it’s a RAM structure. The BLP thread is pinned to a specific core, and the ring buffer is allocated on that core’s socket using numactl or mmap+mbind. Every ring buffer access hits local RAM (~80ns) rather than remote RAM (~130ns). The hot path data (top-of-book price levels, deque heads, active order map) fits in L1/L2 — that’s the cache residency point above. The ring buffer itself is a RAM optimization, not a cache optimization.

Matching engine — consumes the ordered stream and maintains the order book for each instrument. Produces output events: fills (trades), acknowledgements (order accepted), and rejections. This is the heart of the system and we’ll design it in detail below.

Market data publisher — converts internal events into public market data feeds. Sends depth snapshots, incremental updates, and trade reports. Typically multicast UDP (multicast = one sender broadcasting to many receivers simultaneously, over UDP for minimum latency) for co-located participants (co-location = placing trading servers physically inside the exchange’s data center to minimize network latency, typically <1 microsecond away), WebSocket for remote.

Drop copy — a per-trader stream of their own order and fill events. The trader’s private view of what happened.

Risk engine — checks position limits, margin requirements, and circuit breakers. Can be pre-trade (blocks the order before it reaches the engine) or post-trade (monitors after matching).

Pre-trade risk is a regulatory requirement

Since SEC Rule 15c3-5 (the Market Access Rule, adopted November 2010 after the Flash Crash), pre-trade financial risk controls are legally mandatory for all broker-dealers with direct market access. They must prevent orders exceeding capital thresholds, reject erroneous prices/quantities (fat-finger filters), and apply to all participants including sponsored access. The Knight Capital incident ($440M loss in 45 minutes) demonstrated the catastrophic cost of inadequate pre-trade controls — Knight violated multiple Rule 15c3-5 provisions and was fined $12M by the SEC.

Production systems do both: mandatory pre-trade checks in the gateway (position limits, order rate limits, capital thresholds, kill switches), and comprehensive post-trade checks asynchronously (portfolio-level risk, margin calculations, regulatory reporting).

How data flows between stages

Ring buffers connect each stage; each consumer tracks its own read cursor.

Gateway → Sequencer: The gateway forwards validated messages to the sequencer over a low-latency transport — typically a persistent TCP connection or, in co-located setups, shared memory (a memory-mapped ring buffer that both processes access).

Sequencer → Matching Engine: The matching engine reads from the sequencer’s ring buffer (or journal file) as a consumer, advancing its own cursor. In LMAX’s design, this is a busy-polling read — the engine spins on the ring buffer, checking for new entries without blocking.

Matching Engine → Market Data / Drop Copy / Risk: The matching engine writes output events (fills, acks, cancels) to an outbound ring buffer. The market data publisher, drop copy service, and risk engine each consume this buffer independently, each tracking its own read cursor.

Partitioner (shown in the scaling section below): The partitioner is a stateless router that hashes the instrument ID to determine which engine instance handles it — similar to a Kafka partition assignment.

Step 3: The matching engine core

Data structure: the order book

The order book data structure: a SortedDict of price levels, each containing a deque of orders, plus a hash map for O(1) cancel.

Each instrument gets an independent order book. The book has two sides:

@dataclass
class Order:
    id: int
    side: str          # "buy" or "sell"
    price: Decimal     # limit price (None for market orders)
    qty: int           # remaining quantity
    timestamp: int     # sequence number from sequencer
 
@dataclass
class PriceLevel:
    price: Decimal
    orders: deque[Order]  # FIFO (first-in, first-out) queue at this price
 
@dataclass
class OrderBook:
    bids: SortedDict[Decimal, PriceLevel]  # descending by price
    asks: SortedDict[Decimal, PriceLevel]  # ascending by price
    orders: dict[int, Order]               # order_id → Order (for fast cancel)

Why these data structures?

SortedDict (a balanced BST (binary search tree), typically a red-black tree — a self-balancing BST that guarantees O(log N) operations) gives O(log N) insertion/deletion and O(1) access to the best price (min/max). N is the number of price levels, not orders — typically a few hundred, not millions.

Why a BST and not a heap?

A min-heap gives O(1) peek at the best price, which sounds ideal. But a matching engine needs two operations a heap can’t do efficiently:

Ordered traversal across price levels. A large incoming order sweeps through multiple levels. With a BST, after filling the best level you walk to the next via in-order successor in O(1). With a heap, getting the second-best price requires popping the min — O(log N) and destructive.

Deletion by arbitrary key. When a cancel removes the last order at a price level, you delete that price from the container. A BST does this in O(log N) by key. A heap doesn’t support keyed deletion without a separate index, and even then requires O(log N) sift operations.

The BST is an ordered map (best price + traversal + keyed operations). A heap is a priority queue (best price only). The matching engine needs the former.

deque (double-ended queue — a data structure that supports O(1) insertion/removal at both ends) at each price level gives O(1) append (new order) and O(1) popleft (fill the oldest order). This is the FIFO queue that enforces time priority.
The orders dict provides O(1) lookup by order ID, which is essential for cancel/modify operations.

Engineering insight

The total number of price levels is small (typically <1,000 per side). The hot path — finding the best bid/ask and dequeuing the front order — touches at most 2-3 cache lines. This is why matching engines can operate at microsecond latency: the working set fits in L1/L2 cache.

The matching algorithm

def process_order(book: OrderBook, incoming: Order) -> list[Fill]:
    """Core matching loop — price-time priority."""
    fills = []
 
    if incoming.side == "buy":
        opposite = book.asks    # walk asks ascending
        is_match = lambda ask_price: incoming.price is None or incoming.price >= ask_price
    else:
        opposite = book.bids    # walk bids descending
        is_match = lambda bid_price: incoming.price is None or incoming.price <= bid_price
 
    while incoming.qty > 0 and opposite:
        best_price, level = opposite.peekitem(0)  # best available price
        if not is_match(best_price):
            break  # no more matchable prices
 
        while incoming.qty > 0 and level.orders:
            resting = level.orders[0]
            fill_qty = min(incoming.qty, resting.qty)
            fills.append(Fill(
                price=best_price,
                qty=fill_qty,
                maker_id=resting.id,   # maker = the resting order that provides liquidity
                taker_id=incoming.id,  # taker = the incoming order that consumes it
            ))
            incoming.qty -= fill_qty
            resting.qty -= fill_qty
            if resting.qty == 0:
                level.orders.popleft()
                del book.orders[resting.id]
 
        if not level.orders:
            del opposite[best_price]  # remove empty price level
 
    # If limit order has remaining qty, add to book
    if incoming.qty > 0 and incoming.price is not None:
        _add_to_book(book, incoming)
 
    return fills

Execution policies

Execution policies modify what happens when an incoming order is not fully filled:

Policy	Behavior if not fully filled	Implementation
GTC (good-til-cancel)	Rest on the book (default above)	Single-pass matching, add remainder to book
IOC (immediate-or-cancel)	Cancel the unfilled remainder. Partial fills allowed.	Single-pass matching, discard remainder
FOK (fill-or-kill)	Reject the entire order — no partial fills, no resting	Two-pass: peek scan first, then execute only if sufficient liquidity
AON (all-or-none)	Must fill entirely, but may rest on the book until fillable	Special handling: skip if incoming can’t satisfy; lower queue priority

FOK requires a peek pass before the matching loop — scan the book to verify enough liquidity exists at matchable prices, but don’t modify anything. Only if the check passes does the engine run the actual matching.

AON orders complicate the matching loop because the engine must check whether each resting AON order can be fully satisfied by the incoming order’s remaining quantity — if not, skip it and continue to the next resting order. Most implementations treat AON as a special case outside the main matching loop. AON is relatively rare on equity venues but more common in block trading.

Price-time vs pro-rata: incentive effects

The choice of matching algorithm fundamentally changes participant behavior. See Order Books for the mechanics and comparison table. The key behavioral insight not covered there:

Why CME interest rate futures (Eurodollar/SOFR, Treasuries) use pro-rata: These are large, institutional markets where participants trade in sizes of thousands of contracts. Pure FIFO would reward only the fastest participant at a given price, making it difficult for large block flow to get filled at the best price. Pro-rata distributes fills across all providers of liquidity at a price, enabling larger institutional orders to receive execution.

The tradeoff: on pro-rata books, participants inflate displayed size to increase their allocation percentage — creating “phantom liquidity” that will be cancelled if too much is filled. Displayed depth on pro-rata books is less informative as a signal of genuine supply/demand than on FIFO books.

Cancel and modify

def cancel_order(book: OrderBook, order_id: int) -> bool:
    if order_id not in book.orders:
        return False
    order = book.orders.pop(order_id)
    side = book.bids if order.side == "buy" else book.asks
    level = side[order.price]
    level.orders.remove(order)
    if not level.orders:
        del side[order.price]
    return True

Production vs. teaching code

The cancel code above uses deque.remove() which is O(N) per price level. Production engines use a doubly-linked list where each order holds a pointer to its own node, giving true O(1) removal. The complexity table below reflects the production design.

Modify is typically implemented as cancel + new order. This is simpler and safer than in-place modification — a price change must lose time priority (otherwise you could game the queue by submitting at a bad price and improving later), and implementing it as cancel-replace makes this automatic.

Complexity analysis

Operation	Time	Notes
Best bid/ask	O(1)	Peek at sorted container
Add order	O(log P)	P = number of price levels
Cancel order	O(1) amortized	Dict lookup + linked list removal
Match (per fill)	O(1)	Dequeue front of FIFO
Full match walk	O(F + log P)	F = number of fills

The dominant cost in practice is not algorithmic — it’s memory access patterns. A cache-friendly layout (price levels in contiguous memory, orders in a deque backed by a ring buffer) matters more than asymptotic complexity at these scales.

Step 4: The sequencer — making it deterministic

The sequencer creates a total order. Every downstream component — matching engine, risk engine, market data publisher — reads from this same ordered stream and produces deterministic output.

Implementation: a single-threaded process that reads from a network buffer and writes to a sequential log (an append-only file or ring buffer in shared memory). Each message gets a 64-bit sequence number.

Why single-threaded? Multithreading introduces non-determinism (thread scheduling varies across runs). A single thread guarantees that the sequence is identical across replays. LMAX (a London-based forex exchange) called this design principle mechanical sympathy (designing software that works with the hardware’s strengths — term coined by Martin Thompson of LMAX).

The LMAX thread pipeline

The input disruptor feeds three parallel consumers (unmarshaler, journal writer, replicator); the sequence barrier gates the BLP until all three complete; the output disruptor fans results to downstream consumers.

The actual LMAX architecture has distinct thread stages connected by ring buffers (Disruptors). The pipeline, as documented by Martin Fowler (2011):

Input side (Input Disruptor):

Network Receiver thread — reads raw bytes from TCP/shared memory. Its only job is I/O: pull bytes off the wire and write them into slots in the input ring buffer. Does not deserialize.
Three parallel consumers of the input ring buffer, each tracking their own sequence cursor:
- Unmarshaler — parses raw bytes into typed event objects (NewOrderEvent, CancelEvent). Enriches the pre-allocated slot in place (the slot is reserved by the producer; downstream stages fill in typed fields).
- Journal Writer — appends raw bytes to the durable log (memory-mapped append-only file).
- Replication Writer — sends raw bytes to the standby sequencer over kernel-bypass networking.
Sequence Barrier — the BLP has a dependency that says “do not process slot N until the unmarshaler, journal writer, AND replication writer have all finished slot N.” This is expressed via sequence cursors, not locks.
Business Logic Processor (BLP) — runs only after the barrier confirms all three upstream consumers are complete. Executes matching logic and writes output events to the Output Disruptor.

Output side (Output Disruptor):

The BLP writes fills, acks, and rejects to the output ring buffer. Downstream consumers (market data marshaler, drop copy marshaler, risk engine) each read independently via their own cursors.

Single writer enriching pre-allocated slots

A key nuance: the ring buffer has one producer (the network receiver) claiming slots. The unmarshaler doesn’t “write” to the ring buffer in the traditional sense — it mutates the same pre-allocated slot in place. This is not “two writers” — it’s one writer claiming, then downstream stages enriching. The BLP reads the fully-enriched form.

Ring buffer sizing

The LMAX input ring buffer was documented at 20 million slots (the output buffer at 4 million slots per topic). At 6M orders/sec, a 20M slot buffer fills in ~3.3 seconds if the BLP stops consuming entirely.

This is intentional — the ring buffer is a burst-smoothing buffer, not a deep queue:

Under normal operation, the BLP keeps up. The buffer fills and drains continuously; it is never near full.
The buffer absorbs variance (bursty inbound traffic from market opens, news events) while the BLP runs at a steady pace.
If the BLP falls behind for more than a few seconds, the producer blocks — the next() call spins waiting for the consumer’s sequence cursor to advance, freeing a slot. This is backpressure.
20M slots at ~64 bytes per slot ≈ 1.28 GB. Pre-allocated at startup (no page faults on the hot path). The size is rounded to a power of 2 (2^25 ≈ 33M) so the modulus is a single bitwise AND — one cycle.

Step 5: Scaling — one book per instrument

The sequencer fans out to a partitioner that routes by instrument ID to independent engine instances.

The matching engine is embarrassingly parallel across instruments. AAPL’s order book and TSLA’s order book share no state. The architecture:

Sequencer ──▶ Partitioner ──▶ Engine(AAPL)
                           ──▶ Engine(TSLA)
                           ──▶ Engine(GOOG)
                           ──▶ ...

The partitioner routes each message to the correct engine by instrument ID. Each engine is a single-threaded process owning one (or a few) instruments. With 10,000 instruments across 64 cores, each core handles ~150 books.

Partition topology in production

Each matching engine instance is a complete, independent system: its own sequencer (primary + standby), its own order book state, its own output ring buffers. CME Globex uses Market Segment Gateways (MSGWs): each MSGW handles a specific group of instruments (a “market segment” — e.g., all equity index futures, all interest rate futures). Traders connect either to a specific MSGW or to a Convenience Gateway (CGW) that routes transparently to the correct MSGW.

Partition assignment is static. Each instrument is assigned to a partition at configuration time — stored in a central reference data registry (instrument ID → partition ID → engine network address). The gateway reads this registry to route orders. There is no dynamic repartitioning under load — moving an instrument requires a maintenance window (transferring open orders, journal history, subscriber routing).

Multi-leader topology. With 64 engine instances, each has its own leader (primary sequencer) and standby. There is no global leader. Each partition fails over independently — if the AAPL/TSLA/GOOG engine fails, those instruments halt while their standby promotes, but AMZN/META/NVDA on a different partition continue trading. This is the key advantage: fault isolation at the instrument partition level.

Multi-leg and complex orders

A calendar spread order enters the Complex Order Book (COB), which can match against other complex orders or leg into individual books with atomic reservation.

Multi-leg orders — orders requiring simultaneous execution across two or more instruments — are common in derivatives markets:

Calendar spreads: buy March contract, sell June contract (same underlying)
Options spreads: buy call at strike K1, sell call at K2 (vertical spread)
Options combos: straddles, strangles, butterflies — up to 16 legs on CBOE

The central challenge is atomicity: either all legs fill or none fill. This is fundamentally a partitioning problem — and the solution depends on whether the legs live on the same engine partition.

Why co-location gives you atomicity for free

If all legs of a spread are on the same partition, the single-threaded BLP matches across multiple books in one pass — no distributed transaction needed. The sequencer provides the total order; single-threaded execution provides atomicity. No 2PC, no locks, no coordination protocol. This is why exchanges invest heavily in partition design: the partition boundary is the atomicity boundary.

Back-of-the-envelope: what fits on one partition?

Futures (calendar spreads). One underlying (e.g., E-mini S&P 500) has ~8 quarterly + 4 monthly expiries = 12 outright contracts. The number of 2-leg calendar spreads is $(2 12) = 66$ . Add butterflies (3-leg): $(3 12) = 220$ . Total books on the partition: 12 outrights + 66 calendars + 220 butterflies ≈ 300 books. Memory: each book at ~50 KB (500 price levels × ~100 bytes/level) = 15 MB. Fits trivially in L3 cache. One BLP can handle this at millions of events/sec.

Options (the hard case). One equity underlying (e.g., AAPL) with 20 strikes × 12 monthly expiries × 2 (call/put) = 480 outright instruments. The number of 2-leg option spreads is $(2 480) = 114, 960$ . Obviously you cannot register 115K spread instruments as first-class books. This is why options exchanges don’t use the “first-class spread” approach for the long tail — they use the COB approach (see below) or let the market handle it via client-side legging.

Partition memory budget. 500 books × 50 KB = 25 MB. 2,000 books × 50 KB = 100 MB. Both fit comfortably in a single NUMA node’s local RAM (typically 64–256 GB per socket). The constraint isn’t memory — it’s the throughput of a single BLP thread and the combinatorial explosion of spread types.

Throughput budget. At 167 ns/event (LMAX numbers), one BLP handles ~6M events/sec. A partition with 500 books receiving 200K events/sec total uses ~3% of capacity. Even at 10× peak load, a single BLP has headroom. Multi-leg matching is more expensive per event (touching 2–16 books instead of 1), but even at 10× the per-event cost, a 2M events/sec partition is well within budget.

The three production approaches

1. First-class spread instruments (CME approach). Register the most common spread types as standalone instruments with their own order books, co-located on the same partition as their legs. A calendar spread on E-mini S&P lives on the same MSGW as the outright E-mini contracts. The BLP treats the spread as a single instrument; when it fills, the engine algorithmically computes the individual leg fills (e.g., spread price S allocated across leg prices M and J such that S = M − J).

This works when the number of spread types is manageable — tens to low hundreds per underlying. The combinatorics above show why: 66 calendar spreads for 12 expiries is fine; 115K option combos is not.

No-arbitrage enforcement via the implied pricing engine. Because the spread book and all its outright leg books live on the same BLP thread, the engine can enforce no-arbitrage directly — no coordination protocol needed.

A single BLP pass: a new March bid triggers implied pricing, injects a synthetic spread bid, and atomically fills across three books — all in one function call.

The mechanism, step by step:

A new order arrives on the March outright book (bid at 100)
The BLP processes it: inserts into the March book
Immediately after, the BLP runs the implied pricing pass: for every spread that includes March as a leg, compute whether the updated March book creates a new implied price
March bid at 100 + June offer at 97 → implied bid on Mar-Jun calendar at 100 − 97 = 3. The BLP injects this synthetic order into the calendar spread book
If there’s a resting offer on the calendar spread at ≤ 3, it matches. The BLP atomically fills the spread order AND the outright legs (dequeuing from March bids and June offers) in the same function call

Steps 2–5 are one sequential execution on one thread touching three in-memory SortedDict structures. No network hop, no lock, no 2PC. The single-threaded BLP is both the matching engine and the arbitrage enforcer.

The implied pricing runs in both directions:

Implied-in (outrights → spreads): outright book changes generate synthetic orders on spread books (the example above)
Implied-out (spreads → outrights): a spread order can imply synthetic orders back into outright books. If someone bids 4 on the Mar-Jun calendar and June has an offer at 97, the BLP can imply a synthetic bid at 101 on the March outright

Second-order implications are possible: an implied-out order on March could trigger a new implied-in on a different spread (e.g., Mar-Sep). The BLP iterates until no new implied orders are generated (fixed-point convergence). This is computationally expensive — for 12 expiries with 66 calendar spreads, each outright update can trigger hundreds of implied price checks. It is tractable only because: (a) the number of registered spreads is bounded (≤ few hundred, not 115K), (b) the BLP does nothing else, and (c) most checks short-circuit early (no matchable price).

Arbitrageurs also enforce no-arbitrage passively by trading the spread against the legs when prices diverge. On liquid instruments this alone keeps prices in line. The implied engine is the belt; arbitrageurs are the suspenders.

The partition boundary = the atomicity boundary

Two independent partitions: spreads within each partition are atomic (same BLP); cross-partition spreads (ES-vs-SOFR) cannot be atomic and fall back to client-side legging.

Co-locating more instruments on one partition enables more atomic spreads. But there’s a hard limit — and it’s not memory.

Resource	Estimate	Binding?
Memory	2,000 books × 50 KB = 100 MB	No — NUMA node has 64–256 GB
Memory (extreme)	10,000 books × 50 KB = 500 MB	Still no
BLP throughput (simple)	6M events/sec at 167 ns/event	Plenty for single-instrument matching
BLP throughput (with implied pricing)	Each outright update triggers 66+ implied checks → ~100× overhead → 60K effective events/sec	Yes — this is the wall

The binding constraint is implied pricing fan-out. Each outright book update triggers a scan of every spread that includes that outright as a leg. With 12 expiries and 66 calendar spreads, each update checks ~11 spreads (one per other expiry). With second-order implications, the fan-out grows further. At peak load (market open, news events), a liquid product class like E-mini S&P can generate 50K+ outright events/sec. Multiply by the implied pricing overhead and the BLP approaches saturation.

What happens when you must split:

Instruments that need atomic spread matching go on the same partition (same underlying, all its expiries, all its registered spreads)
Instruments that don’t interact go on different partitions (E-mini S&P and Eurodollar futures have no spread relationship, so they sit on separate MSGWs)
Cross-partition spreads — e.g., a spread between two products on different partitions — cannot be atomic at the engine level. They fall back to client-side legging with leg risk, or the exchange simply doesn’t offer them as spread instruments

This is a genuine, unavoidable tradeoff. More co-location means more atomicity but slower per-event processing. Less co-location means faster matching but no cross-partition spreads. Exchange architects choose partition boundaries to maximize the atomic spread universe for the instruments that actually trade as spreads, while keeping the implied pricing overhead within the BLP’s throughput budget.

2. Dedicated Complex Order Book (COB). Used by options exchanges (CBOE, ISE) where the combinatorial explosion makes first-class spread instruments impractical. A single partition runs both the individual leg books AND a COB for that product class. The COB supports two matching paths:

COB-vs-COB: match a complex order against another complex order at the same net price. Example: a buy order for the AAPL Jan 150/160 call spread matches against a sell order for the same spread. Single book lookup, same as any other match.
COB-vs-legs: the engine checks whether the individual leg books have sufficient liquidity at prices that satisfy the complex order’s net price. If yes, it atomically lifts from each leg book in one pass.

Why this isn’t a 2PC. All the leg books and the COB are on the same partition, same BLP thread. “Atomically lifts each leg book” means the BLP reads multiple SortedDict data structures in a single function call and either fills all legs or none. There is no network hop, no lock, no commit protocol — it’s a single-threaded function that touches multiple in-memory data structures. The sequencer already serialized the input; the BLP’s single-threadedness provides atomicity.

3. Client-side legging. The broker sends individual leg orders sequentially. Not atomic — exposes the trader to leg risk (first leg fills, second doesn’t, leaving an unwanted position). Used when:

The legs trade on different exchanges (e.g., buy AAPL stock on NYSE, sell AAPL call on CBOE — a covered call that can never be atomic at the exchange level because the legs are on different matching engines in different data centers)
The spread type isn’t supported by the exchange’s COB
The trader is willing to accept leg risk in exchange for better execution on each individual leg

Confidence note

The partition sizing math (books per partition, memory budget, throughput budget) is engineering estimation from first principles. The architectural patterns (first-class spreads, COB, implied pricing) are well-documented in CME and CBOE technical specifications. The claim that COB-vs-legs atomicity comes from single-BLP co-location (not 2PC) is an engineering inference — it follows necessarily from the latency constraints, but specific exchange implementations are not public.

Step 6: Fault tolerance

The journal is the source of truth

Every message the sequencer stamps is written to a durable, append-only journal (similar to a WAL — write-ahead log — in databases). Recovery is simple:

Start a fresh matching engine
Replay the journal from the beginning (or from a snapshot)
The engine rebuilds its exact state deterministically

This is event sourcing — the journal of ordered events is the system of record, and the order book is a derived view. Benefits:

Replay: reproduce any bug by replaying the journal
Audit: regulators can verify every match from the log
Hot standby: a replica engine consumes the same journal in parallel, ready to take over if the primary fails

Failover

Primary and standby engines consume the same replicated journal; standby promotes on primary failure.

The standby reads the same journal and maintains the same state. If the primary fails, the standby promotes itself. Because the state is deterministic (same journal → same state), there is zero data loss provided replication is synchronous — meaning the primary does not process event N+1 until the standby confirms receipt of event N. With asynchronous replication, events processed between the last confirmed replication and the failure are lost.

Recovery time depends on journal size. Periodic snapshots — a serialized copy of the order book at a known sequence number — let the standby start from a recent snapshot instead of replaying from the beginning.

Synchronous replication: dual-gated Disruptors (LMAX)

The LMAX architecture solves the failover consistency problem through dual-gated synchronous replication — both the input and output Disruptors gate on standby acknowledgement before allowing downstream processing. This eliminates the “gap” problem entirely: the standby always has everything that clients have seen.

The invariant: No event is processed by the BLP until the standby has durably received it, AND no output event reaches clients until the standby has durably received that output too.

Both the input and output Disruptors gate on standby ACK. The red dashed lines are sequence barriers — the BLP and Network Publisher cannot advance until all upstream consumers (including replication) complete.

Why replicate output if the BLP is deterministic? Input replication alone guarantees the standby can reproduce any output — given enough time. But the standby’s BLP may lag behind the primary’s. If the primary sends a fill to a client and then dies, the standby hasn’t processed that input yet. The client has a confirmed fill; the standby doesn’t know about it. Output replication closes this timing gap: clients only see output that the standby already has. On promotion, the standby can immediately serve clients without waiting for its BLP to catch up.

Why this eliminates the STONITH gap: In a naive architecture with only input replication, the primary could process events N+1 through N+5 and send fills to clients before the standby receives them. On failover, clients have seen fills the standby doesn’t know about — irrecoverable divergence. With output gating, fills only reach clients after the standby has the corresponding output. If the primary dies mid-processing, the standby has either:

The complete input (replicated) but not yet processed → it processes on promotion, producing the same deterministic output
The complete output (replicated) but not yet sent to clients → it sends on promotion

Either way, no client ever sees an event the standby cannot reproduce.

Batching reconciles throughput with replication latency

A naïve implementation — one network round-trip per event — would cap throughput at 1/RTT events per second. With 10μs RTT to a co-located standby, that’s only 100K events/sec. LMAX achieves 6M events/sec through batching:

The replication writer accumulates events into batches
One ACK covers an entire batch
At 6M events/sec with 10μs RTT: ~60 events batch per round-trip
Amortized replication cost: 10μs / 60 ≈ 167 nanoseconds per event

The batch size is adaptive — larger under load (more events accumulate during the RTT), smaller when idle. This is the same principle as TCP Nagle’s algorithm: trade a tiny latency increase for massive throughput gain.

Three-network physical separation

Production LMAX-style deployments use three physically separate networks:

Network	Purpose	Traffic
Trading	Client orders and fills	High bandwidth, latency-critical
Replication	Primary ↔ standby state sync	High bandwidth, reliability-critical
Heartbeat/management	Failure detection, STONITH	Low bandwidth, must not share fate with trading

The heartbeat network must be physically separate so that a trading network failure (switch crash, cable cut) does not prevent failure detection. If heartbeat and trading shared a switch, a switch failure would kill both — the standby wouldn’t know the primary is unreachable and wouldn’t promote.

Standby failure handling

What happens when the standby dies (not the primary)?

The replication writer’s next() call blocks — it cannot advance because the standby never ACKs
The sequence barrier blocks the BLP — it cannot process new events
The primary deliberately halts all trading

This is an intentional design choice: halt over data loss. The alternative — switching to async replication when the standby dies — would create a window where events could be lost on subsequent primary failure.

The resolution is a circuit breaker with a configurable timeout:

If the standby doesn’t ACK within T milliseconds (e.g., 50–100ms), the primary enters degraded mode: continue processing without replication, accepting the risk of data loss if the primary also fails
Alert operators immediately — the system is running without a safety net
Once the standby recovers, it replays from the journal to catch up, then replication resumes

Preventing split-brain

The critical risk: if the heartbeat link fails but the primary is still alive, both nodes may believe they are the leader — catastrophic for determinism. Because replication is synchronous and dual-gated, the STONITH gap problem from async architectures doesn’t apply here. But split-brain prevention still requires explicit mechanisms:

STONITH (Shoot The Other Node In The Head): upon promoting, the standby sends a hardware-level command (via IPMI/iLO on the dedicated management network) to power off or isolate the primary. The separate heartbeat network ensures the STONITH command reaches the primary even if the trading network is down.
Lease-based leadership: the primary holds a time-limited lease. If it cannot renew (e.g., heartbeat network partitioned), it voluntarily halts before the lease expires. The standby waits for lease expiry before promoting. Lease duration is a lower bound on failover time.
External arbiter/witness: a third node (not in the data path) holds a tiebreaker quorum vote. With three nodes and majority requirement, split-brain requires two simultaneous network partitions.

Confidence note

The dual-gated replication and batching mechanisms are well-documented in LMAX’s public technical materials (Martin Fowler 2011, Martin Thompson presentations). The three-network separation and circuit breaker behavior are inferred from HA best practices and the logical requirements of the architecture — specific exchange implementations may vary in timeout values and degraded-mode policies.

Failover timing budget

Total failover duration: 50–500ms. Trading halts during this window.

Phase	Duration	Mechanism
Heartbeat detection	3–50 ms	Configurable interval, 3–5 missed beats trigger
STONITH / lease expiry	10–100 ms	Hardware command or timer-based
Standby promotion	~microseconds	Already running with full state (dual-gated)
Downstream reconnection	10–100 ms	Consumers re-subscribe to new primary’s output
Total	50–500 ms	Trading halts, then resumes

This is acceptable because: (1) exchanges have circuit breakers anyway, (2) a brief clean halt beats a split-brain double-fill scenario, and (3) all participants experience the same halt — no fairness violation.

Notifying downstream consumers

When the primary switches, output consumers (market data, drop copy, risk) must know their source changed. Production mechanisms:

ServiceAvailability messages on a dedicated control channel
Sequence number gaps or resets — feed handlers monitor for discontinuities as a failover signal
Virtual IP failover — the standby takes over the primary’s IP
Multicast group change — consumers monitor both groups, switch to whichever is actively publishing

The snapshot + incremental pattern in market data is partly designed with failover recovery in mind: a consumer that misses events during a failover gap can resync on the next snapshot without replaying history.

Timestamp determinism

A subtle requirement: timestamps must be part of the input event (assigned by the sequencer at ingestion time), not generated during BLP processing. If the BLP called System.nanoTime() during matching, the standby would produce different timestamps on replay — breaking determinism. The sequencer stamps each event with a monotonic sequence number and a wall-clock timestamp at ingestion; the BLP treats both as immutable fields of the event.

Step 7: Gateway and wire protocol

FIX protocol

FIX (Financial Information eXchange) is the dominant wire protocol for order routing. It’s a tag-value text protocol dating from 1992:

8=FIX.4.4|35=D|49=CLIENT1|56=EXCHANGE|11=ORD001|55=AAPL|
54=1|44=150.25|38=100|40=2|59=0|10=128|

Tag	Meaning	Value
35=D	Message type	New single order
54=1	Side	Buy
44=150.25	Price	150.25
38=100	Quantity	100 shares
40=2	Order type	Limit
59=0	Time-in-force	Day (cancel at market close)

FIX is verbose and slow to parse. Parsing a typical NewOrderSingle message (~200 bytes) takes 1–5 microseconds — sequential byte scanning for SOH (ASCII 01) delimiters, integer tag parsing (atoi), value dispatch by field type.

Binary protocols: SBE and the 100x speedup

High-frequency participants use binary protocols that encode the same fields in fixed-size binary frames. SBE (Simple Binary Encoding — developed by the FIX Trading Community, used by CME’s iLink 3) encodes fields at predetermined byte offsets. Parsing is a single memcpy or struct overlay — no delimiter scanning.

SBE benchmarks: ~30–50 nanoseconds per message for serialization/deserialization, over 30 million messages per second on a single thread. This is 20–100x faster than FIX parsing.

Protocol	Exchange/Venue	Encoding	Typical parse latency
FIX 4.x	Universal (legacy)	ASCII text (tag=value\|)	1–5 μs
FAST	Market data feeds	Variable-length binary	200–500 ns
SBE	CME iLink 3	Fixed-width binary	30–50 ns
OUCH	NASDAQ (order entry)	Fixed-width binary	~100 ns
ITCH	NASDAQ (market data)	Fixed-width binary	~100 ns

At 6M orders/sec, the difference between 1μs FIX parsing and 50ns SBE parsing is the difference between parsing consuming 6,000 cycles vs 150 cycles per message — significant when the entire matching operation budget is 500 cycles.

Gateway responsibilities

Authentication — verify client identity (API key, certificate)
Validation — correct instrument, valid price (within tick size), valid quantity (within lot size), order type allowed
Rate limiting — per-client message rate caps (prevents accidental floods and DoS)
Normalization — convert wire format to internal format
Pre-trade risk — position limits, margin checks, capital thresholds, fat-finger filters. Legally required by SEC Rule 15c3-5 (see warning callout in Step 2).
Kill switch — hardware-enforced ability to halt all outbound orders from a single component within microseconds. Post- Knight Capital, this is industry standard.

The gateway is stateless and horizontally scalable. A large exchange might run 50-100 gateway instances behind a TCP load balancer.

Step 8: Market data

L1 (top-of-book) vs L2 (aggregated depth) vs L3 (individual orders) feed hierarchy.

Two feeds, very different consumers:

L1: Top-of-book (NBBO)

{"instrument": "AAPL", "bid": 150.25, "bid_size": 300,
 "ask": 150.26, "ask_size": 500, "last": 150.25, "last_size": 100}

Small, high-frequency updates. Used by retail apps, data vendors, algorithmic traders for signal generation. Published via UDP multicast (co-located) or WebSocket (remote).

L2/L3: Depth of book

{"instrument": "AAPL", "bids": [
  {"price": 150.25, "size": 300, "count": 2},
  {"price": 150.24, "size": 800, "count": 5},
  ...
], "asks": [...]}

L2 shows aggregated size at each price level. L3 shows individual orders. Much larger messages, typically delta-encoded (send only what changed since the last snapshot). Used by market makers and sophisticated algorithmic strategies that need to see the full book shape.

Snapshot + incremental pattern: publish a full snapshot every N seconds, and incremental deltas in between. Clients that miss a delta can resync from the next snapshot.

Interview discussion points

“How would you handle a market order that would cause a flash crash?”

Circuit breakers. The matching engine checks LULD (Limit Up / Limit Down) bands on a per-fill basis before writing any fill to the output ring buffer:

A reference price is computed as the average transaction price over the preceding 5-minute window
Price bands: ±5% for Tier 1 securities (S&P 500, Russell 1000), ±10% for Tier 2 during normal hours
If a fill would occur outside the band → Limit State (15-second pause, limit orders only within the band)
If Limit State unresolved in 15 seconds → 5-minute trading halt

The band boundaries are updated every 30 seconds by a reference data process that writes to a shared memory segment. The matching engine reads this segment on the hot path — a single memory load per fill (trivial cost).

In addition, market-wide circuit breakers halt all trading:

Level	S&P 500 decline	Action
Level 1	7%	15-minute halt (before 3:25pm ET)
Level 2	13%	15-minute halt (before 3:25pm ET)
Level 3	20%	Halt for remainder of day

These were adopted after the Flash Crash of 2010 (~998-point Dow drop in 36 minutes). The previous thresholds (10%/20%/30% from the 1987 crash rules) were too high and too slow.

“What’s the biggest risk in this design?”

The sequencer is a single point of failure. Mitigation: replicate the journal synchronously to a standby sequencer using kernel bypass networking (DPDK (Data Plane Development Kit — an Intel framework that bypasses the OS kernel’s network stack for lower latency) or Solarflare OpenOnload (a kernel-bypass networking stack from Solarflare/AMD)) for <10μs replication latency. Solarflare has published benchmarks showing ~2.9μs round-trip over 10GbE. Failover adds 50–500ms of trading halt (see the failover timing budget above).

“Why not use a database?”

The matching engine is an in-memory state machine. A database adds:

Disk I/O latency (~1ms for SSD, vs <1μs for RAM)
Transaction overhead (locking, WAL writes)
Query parsing

The journal is the durable store. The order book is a derived in-memory cache. This is the same pattern as Redis with AOF (append-only file) persistence — memory for speed, journal for durability.

“What about the Facebook IPO failure?”

Auction mechanisms (IPO opening cross) and continuous matching have fundamentally different computational profiles. An opening cross over millions of accumulated orders is O(N) per recalculation — not O(log P) as in continuous matching. They require separate stress testing, separate resource allocation, and often separate code paths. NASDAQ’s 2012 failure (30-minute delay, double-fills, $62M compensation) demonstrated this.

“How does this compare to an AMM?”

Dimension	CLOB matching engine	AMM (Uniswap)
Price discovery	Emergent from orders	Deterministic from formula
Latency	<100μs	~12s (Ethereum block time)
Market maker role	Active (quotes, cancels)	Passive (deposits, waits)
Adverse selection	MM can update quotes	LP cannot — see impermanent-loss
Throughput	~1M orders/sec	~15 txns/sec (Ethereum L1)
Fairness	Arrival order (sequencer)	Block producer ordering — see blockchain-transaction-lifecycle
Transparency	Lit book is public	All state is on-chain (fully transparent)

This table connects directly to the DeFi learning path: the AMM exists because on-chain CLOBs can’t match TradFi latency, so they replaced the matching engine with a mathematical formula (the bonding curve). The tradeoff is that LPs lose the ability to update quotes — which is why impermanent loss is a structural cost, not a bug.

Engineering lessons for system design interviews

Single-writer principle: the sequencer’s single-threaded design is not a limitation — it’s a feature. Determinism is worth more than parallelism at this layer.
Event sourcing: the journal-as-truth pattern appears in exchanges, databases (WAL), distributed systems (Raft log), and blockchains (the chain itself is a sequenced journal of transactions).
Mechanical sympathy: knowing that L1 cache is 1ns, L2 is 3ns, RAM is 100ns, and SSD is 100μs explains why in-memory order books with cache-friendly layouts beat databases by 3-4 orders of magnitude.
Partition by instrument: embarrassingly parallel workloads should be partitioned, not distributed. No consensus protocol needed when instruments don’t share state.

Questions to sit with (answered)

1. Scaling beyond a single core without speeding it up

Question: If you needed 10× throughput but couldn’t speed up a single core, what architectural change would you make — and what property would you sacrifice?

Answer: Partition by instrument (the architecture already described in Step 5). Each instrument gets its own single-threaded sequencer and BLP on its own core. You sacrifice cross-instrument global ordering — two orders for different instruments no longer have a guaranteed relative sequence. This doesn’t affect matching correctness (each book is independent) but does affect:

Cross-instrument operations (spread orders need a coordinator)
Market-wide surveillance (reconstructing a global timeline requires merging per-instrument journals post-hoc)
Risk calculations spanning multiple instruments (portfolio-level risk must aggregate asynchronously)

Determinism and fairness within each instrument are fully preserved. The tradeoff is worthwhile because >95% of operations are single-instrument.

2. Consistency vs availability under extreme conditions

Question: How do production exchanges reconcile strict consistency (no double-fills) with near-continuous availability?

Answer: The journal/snapshot failover model (Step 6) provides the reconciliation. The timing budget is 50–500ms — brief enough that participants experience it as a routine trading halt, not a system failure. The mechanisms that make this work:

STONITH / lease-based leadership prevents split-brain (the only scenario that produces double-fills)
The standby already has full state (it’s been consuming the journal in real-time), so promotion is microseconds
Downstream consumers resync via the snapshot + incremental pattern

The model breaks under correlated failures — if both primary and standby fail simultaneously (e.g., power loss to the entire data center, or a software bug triggered by a specific input that crashes both instances). Mitigations: geographically separate standby sites (adds replication latency), diverse software versions on primary/standby (operational complexity), and the regulatory acceptance that a multi-minute halt is preferable to incorrect matching.

3. Why CLOB determinism doesn’t create sandwich MEV

Question: Both CLOBs and AMMs are deterministic, but only AMMs enable sandwich attacks. What structural property is different?

Answer: The structural difference is when ordering is finalized relative to when participants can observe pending transactions.

In a CLOB, the sequencer assigns sequence numbers before execution and before any participant can see the pending order. By the time the matching engine processes sequence number N, numbers N-1 and N-1000 are already finalized. No participant can insert a transaction between two already-sequenced messages. The only way to “front-run” is to arrive at the sequencer earlier — a latency race, not a mempool inspection race.

In an AMM, every pending transaction sits in the public mempool before inclusion in a block. Any participant can see a pending swap (“Alice is about to buy 100 ETH, which will move the price 0.5%”), submit a transaction with higher gas fee to be included first, profit from the price move Alice will cause, then let Alice execute at the worse price. The block producer chooses ordering within the block and can extract this value directly (MEV = Maximal Extractable Value).

The key structural property: in a CLOB, the ordering authority (sequencer) and the execution authority (matching engine) are tightly coupled in a single sequential pipeline — ordering is finalized before anyone can react. In a blockchain AMM, ordering authority (block producer) and execution logic (smart contract) are deliberately separated for decentralization — which is precisely what creates the MEV window.

Edmondo's Vault

Explorer

Matching Engine — System Design

Matching Engine — System Design

The problem

Step 1: Requirements

Functional

Non-functional

Step 2: High-level architecture

Components

How data flows between stages

Step 3: The matching engine core

Data structure: the order book

The matching algorithm

Execution policies

Price-time vs pro-rata: incentive effects

Cancel and modify

Complexity analysis

Step 4: The sequencer — making it deterministic

The LMAX thread pipeline

Ring buffer sizing

Step 5: Scaling — one book per instrument

Partition topology in production

Multi-leg and complex orders

Why co-location gives you atomicity for free

Back-of-the-envelope: what fits on one partition?

The three production approaches

The partition boundary = the atomicity boundary

Step 6: Fault tolerance

The journal is the source of truth

Failover

Synchronous replication: dual-gated Disruptors (LMAX)

Batching reconciles throughput with replication latency

Three-network physical separation

Standby failure handling

Preventing split-brain

Failover timing budget

Notifying downstream consumers

Timestamp determinism

Step 7: Gateway and wire protocol

FIX protocol

Binary protocols: SBE and the 100x speedup

Gateway responsibilities

Step 8: Market data

L1: Top-of-book (NBBO)

L2/L3: Depth of book

Interview discussion points

Engineering lessons for system design interviews

Questions to sit with (answered)

1. Scaling beyond a single core without speeding it up

2. Consistency vs availability under extreme conditions

3. Why CLOB determinism doesn’t create sandwich MEV

See also

Graph View

Table of Contents

Backlinks