Outbox Pattern vs Dual Writes: A Practical Reliability Guide

The core problem

Many services need to write to a database and publish an event as part of the same operation. The unsafe approach is dual writes: write to the database and then publish to the broker, or the reverse, as two independent steps. This looks fine in local tests but fails under network partitions, process crashes, and broker timeouts. If one write succeeds and the other fails, your system state and event stream diverge.

Why dual writes are fragile

Dual writes fail because there is no single atomic boundary across two independent systems.

DB commit succeeds, broker publish fails
Broker publish succeeds, DB rollback happens
Retry logic publishes duplicate events
Partial failures create hard-to-reconcile drift

Even if each failure is rare, every one leaves a mismatch that nothing repairs automatically, so the drift accumulates at production volume.
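The first failure mode in the list above can be reproduced with a small in-memory simulation. This is only an illustrative sketch: <code>FlakyBroker</code> and <code>dual_write</code> are hypothetical names, the "database" is a plain list, and the broker is scripted to time out on its second call.

```python
class FlakyBroker:
    """Hypothetical in-memory broker that times out on one chosen call."""

    def __init__(self, fail_on_call: int) -> None:
        self.fail_on_call = fail_on_call
        self.calls = 0
        self.published: list[dict] = []

    def publish(self, event: dict) -> None:
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise TimeoutError("broker timeout")
        self.published.append(event)


def dual_write(db_rows: list[str], broker: FlakyBroker, order_id: str) -> None:
    db_rows.append(order_id)                # write 1: stands in for the DB commit
    broker.publish({"order_id": order_id})  # write 2: can fail after write 1


db_rows: list[str] = []
broker = FlakyBroker(fail_on_call=2)
for order_id in ["o1", "o2", "o3"]:
    try:
        dual_write(db_rows, broker, order_id)
    except TimeoutError:
        pass  # the request fails, but the DB row is already committed

# Orders that exist in the database but never reached the event stream.
missing = [o for o in db_rows if {"order_id": o} not in broker.published]
print(missing)  # prints ['o2']
```

The order <code>o2</code> is committed but never published, and nothing in the dual-write code path records that the event is still owed.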

What the outbox pattern changes

The outbox pattern stores the domain data and the event record in the same database transaction.

Step 1: Business write and outbox row are committed atomically
Step 2: A background relay reads pending outbox rows
Step 3: Relay publishes events to the broker and marks rows as sent

This converts a distributed atomicity problem into a local transaction plus asynchronous delivery.
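Step 1 can be sketched with Python's built-in <code>sqlite3</code> module. The <code>orders</code> table, the <code>order_created</code> event type, and the function names are assumptions made for illustration; a production system would use its own database and schema.

```python
import json
import sqlite3
import uuid
from typing import Optional


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_id TEXT NOT NULL UNIQUE,"
        " event_type TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def create_order(
    conn: sqlite3.Connection,
    order_id: str,
    total_cents: int,
    event_id: Optional[str] = None,
) -> str:
    """Insert the business row and its outbox row in one local transaction."""
    event_id = event_id or str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so both rows become visible together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (event_id, "order_created", payload),
        )
    return event_id
```

No broker call appears in this function. Publishing is left entirely to the relay, which is what removes the second, non-transactional write from the request path.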

Design details that matter

Use explicit status and retry metadata in outbox rows.

<code>event_id</code> for idempotent consumption
<code>aggregate_id</code> and <code>aggregate_type</code> for routing
<code>payload</code> as immutable JSON
<code>published_at</code>, <code>retry_count</code>, and <code>next_retry_at</code> for delivery tracking and retry scheduling

Operationally, the relay should support batch publishing and backoff between retries.
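One way to derive <code>next_retry_at</code> from <code>retry_count</code> is capped exponential backoff. The base delay of 1 second and the cap of 300 seconds below are assumed values, not recommendations, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def backoff_delay(
    retry_count: int, base_seconds: float = 1.0, cap_seconds: float = 300.0
) -> float:
    """Delay doubles with each retry and never exceeds the cap."""
    return min(cap_seconds, base_seconds * (2 ** retry_count))


def next_retry_at(now: datetime, retry_count: int) -> datetime:
    """Timestamp to store in the outbox row after a failed publish."""
    return now + timedelta(seconds=backoff_delay(retry_count))


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(backoff_delay(0), backoff_delay(3), backoff_delay(20))  # prints 1.0 8.0 300.0
print(next_retry_at(now, 3).isoformat())  # prints 2024-01-01T00:00:08+00:00
```

Many implementations also add random jitter to the delay so that rows that failed together do not all retry at the same instant. It is omitted here to keep the output deterministic.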

Ordering and exactly-once concerns

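The outbox gives at-least-once delivery from the relay to the broker: if the relay crashes after the broker acknowledges an event but before the row is marked as sent, the event is published again on restart. Consumers therefore still need idempotency.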

Preserve ordering per aggregate where required
Use consumer dedup keys (<code>event_id</code>)
Avoid assuming exactly-once semantics end-to-end

Outbox improves correctness, but it does not remove consumer-side reliability design.
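Consumer-side dedup on <code>event_id</code> might look like the following sketch. It assumes the consumer has its own database and records processed event IDs in the same transaction as the side effect. The <code>processed_events</code> and <code>balances</code> tables, and the event fields other than <code>event_id</code>, are hypothetical.

```python
import sqlite3


def open_consumer_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE balances (account_id TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
    )
    return conn


def handle_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Apply the event at most once. Returns False for a duplicate delivery."""
    with conn:  # the dedup record and the side effect commit together
        inserted = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event["event_id"],),
        )
        if inserted.rowcount == 0:
            return False  # event_id already recorded, skip the side effect
        updated = conn.execute(
            "UPDATE balances SET amount = amount + ? WHERE account_id = ?",
            (event["amount"], event["account_id"]),
        )
        if updated.rowcount == 0:
            conn.execute(
                "INSERT INTO balances (account_id, amount) VALUES (?, ?)",
                (event["account_id"], event["amount"]),
            )
    return True
```

Because the dedup record and the balance change share one transaction, a redelivered event cannot be applied twice, and a crash midway leaves neither change behind.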

Migration strategy from dual writes

Add outbox table and write path first
Publish from outbox in shadow mode
Compare stream parity with existing producer
Cut over consumers gradually
Decommission dual-write publisher

Running both paths in parallel until the streams match reduces risk during the transition.
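The parity comparison in the third step can start as a set difference over event identifiers. This sketch assumes both the existing producer and the shadow outbox relay emit the same <code>event_id</code> for the same change. If they do not, a business key would have to be compared instead. The function name and report keys are hypothetical.

```python
from typing import Iterable


def parity_report(legacy_ids: Iterable[str], outbox_ids: Iterable[str]) -> dict:
    """Compare event IDs seen from the old producer and the shadow outbox relay."""
    legacy, outbox = set(legacy_ids), set(outbox_ids)
    return {
        "matched": len(legacy & outbox),
        "missing_from_outbox": sorted(legacy - outbox),
        "extra_in_outbox": sorted(outbox - legacy),
    }


report = parity_report(["e1", "e2", "e3"], ["e2", "e3", "e4"])
print(report)
# prints {'matched': 2, 'missing_from_outbox': ['e1'], 'extra_in_outbox': ['e4']}
```

Events that appear only in the outbox stream are not necessarily errors: they may be exactly the events the dual-write path dropped.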

Suggested outbox schema

A minimal schema keeps relay logic predictable and observable.

<code>id</code> (monotonic primary key for batching)
<code>event_id</code> (globally unique dedup key)
<code>aggregate_type</code> and <code>aggregate_id</code>
<code>event_type</code> and <code>payload</code>
<code>status</code> (<code>pending</code>, <code>published</code>, <code>failed</code>)
<code>created_at</code>, <code>published_at</code>, <code>last_error</code>

Index by <code>status</code> and <code>created_at</code> for efficient relay scans.
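The columns and index above could be declared as follows. SQLite is used here only so that the sketch runs with Python's standard library. The column types, the <code>CHECK</code> constraint, and the index name are assumptions, and other databases would use their own types (for example a native JSON or timestamp type).

```python
import sqlite3

OUTBOX_DDL = """
CREATE TABLE IF NOT EXISTS outbox (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic, used for batching
    event_id       TEXT NOT NULL UNIQUE,               -- globally unique dedup key
    aggregate_type TEXT NOT NULL,
    aggregate_id   TEXT NOT NULL,
    event_type     TEXT NOT NULL,
    payload        TEXT NOT NULL,                      -- JSON stored as text
    status         TEXT NOT NULL DEFAULT 'pending'
                   CHECK (status IN ('pending', 'published', 'failed')),
    created_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    published_at   TEXT,
    last_error     TEXT
);
CREATE INDEX IF NOT EXISTS idx_outbox_status_created
    ON outbox (status, created_at);
"""


def create_outbox(conn: sqlite3.Connection) -> None:
    """Create the outbox table and its relay-scan index if they do not exist."""
    conn.executescript(OUTBOX_DDL)
```

The <code>UNIQUE</code> constraint on <code>event_id</code> guards against the write path inserting the same event twice, and the <code>CHECK</code> constraint keeps <code>status</code> within the three listed values.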

Relay worker design

The relay is where most real-world issues surface. Keep it resilient.

Poll in small batches with lease/lock semantics
Publish with retry and exponential backoff
Mark success only after broker acknowledgment
Send repeated failures to dead-letter workflow

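The relay should be horizontally scalable while preserving per-aggregate ordering where required, for example by ensuring that rows sharing an <code>aggregate_id</code> are handled by one worker at a time.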
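A single-worker relay pass might look like the sketch below. The <code>publish</code> callable stands in for a real broker client and is assumed to raise unless the broker acknowledged the event. <code>MAX_ATTEMPTS</code>, the table layout, and the function names are illustrative assumptions. Lease/lock semantics are not shown: with several workers, rows would need to be claimed first, for instance with <code>SELECT ... FOR UPDATE SKIP LOCKED</code> on PostgreSQL.

```python
import sqlite3
from typing import Callable

MAX_ATTEMPTS = 3  # assumed threshold before a row is handed to dead-letter handling


def open_outbox() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_id TEXT NOT NULL UNIQUE,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " retry_count INTEGER NOT NULL DEFAULT 0,"
        " published_at TEXT,"
        " last_error TEXT)"
    )
    return conn


def relay_once(
    conn: sqlite3.Connection,
    publish: Callable[[str, str], None],
    batch_size: int = 50,
) -> int:
    """Publish one batch of pending rows in id order. Returns the number sent."""
    rows = conn.execute(
        "SELECT id, event_id, payload, retry_count FROM outbox"
        " WHERE status = 'pending' ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    published = 0
    for row_id, event_id, payload, retry_count in rows:
        try:
            publish(event_id, payload)  # must raise unless the broker acked
        except Exception as exc:
            attempts = retry_count + 1
            status = "failed" if attempts >= MAX_ATTEMPTS else "pending"
            with conn:
                conn.execute(
                    "UPDATE outbox SET retry_count = ?, status = ?, last_error = ?"
                    " WHERE id = ?",
                    (attempts, status, str(exc), row_id),
                )
            break  # stop the batch so later rows are not sent out of order
        with conn:  # mark success only after the acknowledgment
            conn.execute(
                "UPDATE outbox SET status = 'published',"
                " published_at = CURRENT_TIMESTAMP WHERE id = ?",
                (row_id,),
            )
        published += 1
    return published
```

Stopping the batch at the first failure preserves order across all rows at the cost of head-of-line blocking. A relay that only needs per-aggregate ordering could instead skip just the rows of the failing aggregate. Backoff between passes is also left out here.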

Common anti-patterns

Deleting rows immediately after publish (loses audit trail)
No dedup key in payload contract
Unbounded retries without poison event handling
Running the relay in the same request path as the user API call (couples request latency and failures to the broker)

Avoiding these mistakes makes outbox behavior stable under incidents.

What teams should monitor

Outbox systems become much easier to trust when the right metrics are visible.

Pending outbox row count
Oldest unsent event age
Relay publish failure rate
Retry distribution by event type

These signals help teams detect delivery drift before it becomes a business issue.
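The first two signals can be read straight from the outbox table. This sketch assumes an <code>outbox</code> table with a <code>status</code> column and a <code>created_at</code> column stored as Unix epoch seconds. The function name and metric keys are hypothetical, and exporting the values to a metrics system is left out.

```python
import sqlite3
import time
from typing import Optional


def outbox_metrics(conn: sqlite3.Connection, now: Optional[float] = None) -> dict:
    """Return backlog size, age of the oldest pending row, and failed-row count."""
    now = time.time() if now is None else now
    pending, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM outbox WHERE status = 'pending'"
    ).fetchone()
    failed = conn.execute(
        "SELECT COUNT(*) FROM outbox WHERE status = 'failed'"
    ).fetchone()[0]
    return {
        "pending_count": pending,
        # MIN() is NULL when nothing is pending, so report an age of zero.
        "oldest_pending_age_seconds": 0.0 if oldest is None else now - oldest,
        "failed_count": failed,
    }
```

The age of the oldest pending row is often the more useful alert than the raw count: a large backlog that drains quickly is healthy, while a single row that has been stuck for an hour is not.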

Where outbox gives the biggest value

The outbox pattern is especially useful when the database write is the source of truth and other systems depend on accurate follow-up events. Examples include order creation, payment state changes, user signup flows, and inventory updates.

Final takeaway

Dual writes trade short-term simplicity for long-term incidents. The outbox pattern adds controlled complexity but creates a much safer reliability boundary for event-driven systems.