What is the CAP theorem?

Architecture~6 min read

CAP theorem states that a distributed system replicating data across multiple nodes cannot simultaneously guarantee Consistency, Availability, and Partition tolerance when a network partition occurs.

Theory

TL;DR

The trade-off activates only during partitions; in normal operation, target all three properties
Partition tolerance is non-negotiable in real networks, so the real choice is CP or AP
CP (ZooKeeper, CockroachDB): blocks or rejects writes during partition to stay consistent
AP (Cassandra, DynamoDB): always responds, serves stale local data during partition
CA is a single-node illusion, not a real distributed option

Quick Example

javascript

class CAPDemo {
  constructor() {
    this.nodes = [{ balance: 100 }, { balance: 100 }];
  }

  // CP: rejects write if replicas have diverged (partition simulation)
  cpWrite(amount) {
    if (this.nodes[0].balance !== this.nodes[1].balance) {
      throw new Error('Partition detected: rejecting write to stay consistent');
    }
    this.nodes.forEach(n => (n.balance += amount));
    return this.nodes[0].balance;
  }

  // AP: writes to local node, responds regardless of replica state
  apWrite(amount, nodeId) {
    this.nodes[nodeId].balance += amount;
    return this.nodes[nodeId].balance;
  }
}

const demo = new CAPDemo();
console.log(demo.cpWrite(50));    // 150 - both nodes updated
demo.nodes[1].balance = 200;      // simulate partition: node 1 drifts
console.log(demo.apWrite(10, 0)); // 160 - node 0 responds, but inconsistent with node 1

CP refuses to write when replicas disagree. AP writes locally and returns fast. Both are correct behaviors for their respective trade-offs.

CP vs AP in Practice

Network partitions are not edge cases. AWS has documented outages, switches drop packets, cross-datacenter links fail. Since partitions happen, you take Partition tolerance as a given and choose what to give up.

CP systems block or reject operations during a partition rather than risk inconsistency. ZooKeeper stops accepting writes if it loses quorum. CockroachDB uses Raft: no quorum, no commit. This is exactly what distributed locks and financial ledgers need.

AP systems keep responding. Cassandra accepts writes to the local node and propagates them asynchronously. Netflix runs Cassandra for over 100 PB of user event data. A stale recommendation is a minor UX issue. A service timeout is user-facing downtime.

When to Use

Bank transfers, account balances: CP. A rejected write beats a double-spend.
Leader election, distributed locks: CP. Kafka uses ZooKeeper, Kubernetes uses etcd, both built on Raft quorums.
Social feeds, timelines, notifications: AP. Posts appearing two seconds late is acceptable.
Analytics dashboards, event tracking: AP. Approximate freshness beats downtime.
Shopping carts with conflict resolution: AP with CRDTs or last-write-wins, which is how Amazon's original Dynamo paper designed it.

Comparison Table

	CP	AP	CA
Consistency	Yes, waits for quorum	Eventual, stale reads possible	Yes, but assumes no partition
Availability	Rejects during partition	Always responds locally	Yes, single-node assumption
Partition tolerance	Yes	Yes	No
Use case	Locks, ledgers, configs	Feeds, caches, high-volume events	Single-node dev/test only
Examples	ZooKeeper, CockroachDB, etcd	Cassandra, DynamoDB (default)	Unpartitioned single Postgres

How It Works Internally

CP protocols (Raft, Paxos) require a majority quorum before committing any write. The leader sends a proposal to followers, waits for acknowledgment from a majority, then commits. If a partition isolates nodes below majority count, those nodes stop accepting writes. The math: N=3 nodes, quorum=2. One node isolated means two nodes can still commit. Two nodes isolated means the group halts.

AP protocols use gossip-based replication. Each node accepts writes locally and shares state with peers periodically. Version tracking via vector clocks or timestamps handles conflicts after the partition heals. Cassandra's consistency level is tunable: CONSISTENCY QUORUM with RF=3, R=2, W=2 gives you W+R > N, a CP-like guarantee while keeping AP flexibility for lower-priority reads.

I've watched teams deploy Cassandra as the default "scalable database" without thinking through the CAP choice, then spend sprint after sprint debugging stale data in payment flows. Choosing AP is not wrong. Choosing it by accident is.

Common Mistakes

Mistake: Treating CAP as a constant constraint

CAP forces a trade-off only during a partition. Outside of that, run with consistency and availability both. The theorem does not require you to sacrifice consistency on every read.

javascript

// Wrong: always block until all replicas confirm
await Promise.all(nodes.map(n => n.write(value))); // one slow node kills latency

// Better: quorum write - majority is enough
const quorum = Math.ceil(nodes.length / 2) + 1;
await Promise.all(nodes.slice(0, quorum).map(n => n.write(value)));

Mistake: Selecting CA as a real system design option

CA is theoretical. Real networks partition. MongoDB was labeled CA for years while users observed partition-triggered availability failures. Every production distributed database is CP or AP.

Mistake: Assuming AP requires no conflict handling

AP does not mean "write anywhere, read anywhere, done." Two nodes accepting conflicting writes during a partition produce diverged state. Without last-write-wins, CRDTs, or application-level merge logic, you get data loss or double-spend bugs in fintech.

Mistake: Equating CAP consistency with ACID consistency

CAP's C is linearizability: reads always reflect the latest committed write, as if the system were a single machine. ACID's C means transaction constraints remain valid after a commit. These are different properties. A system can be CAP-consistent without full ACID serializability.

Mistake: Scaling replicas without deciding on CP or AP first

More replicas without a deliberate choice defaults to whatever the database ships with. Cassandra ships as AP. MongoDB replica sets default to CP on primary reads. Know which one you need before you scale.

Real-World Usage

Cassandra (AP): Netflix event pipeline, 100+ PB. CONSISTENCY QUORUM for sensitive reads, ONE for high-volume writes
ZooKeeper (CP): Kafka broker coordination, distributed lock service
CockroachDB (CP): serializable isolation for banking clients
DynamoDB (AP default): AWS e-commerce. ConsistentRead=true shifts to CP-like behavior at a latency cost
etcd (CP): Kubernetes cluster state, Raft-based quorum
MongoDB replica sets (CP primary): Uber trip history; secondary reads trade freshness for throughput

Follow-up Questions

Q: How does PACELC extend CAP?
A: PACELC adds the latency dimension. Even with no partition (E = else), there is a trade-off between latency (L) and consistency (C). DynamoDB optimizes for L during normal operation, not just during failures.

Q: What is the difference between linearizability and eventual consistency?
A: Linearizability (CAP's C) means any read returns the result of the latest write, globally ordered. Eventual consistency means all replicas converge over time, but reads during propagation may return stale data. The gap between them is the AP/CP decision.

Q: How do W, R, and N relate to CAP in quorum systems?
A: With N replicas, setting W+R > N guarantees any read set overlaps with any write set, which gives linearizable reads. Cassandra with RF=3, W=2, R=2: W+R=4 > N=3. Setting W=1, R=1 is pure AP: fast, but stale reads are possible.

Q: How would you design a system that needs both 99.999% availability and consistent money transfers?
A: Use AP as the base layer with async multi-region replication for availability. For the transfer code path only, apply a synchronous quorum write (CP subset). The rest of the system stays available while consistency is enforced exactly where data loss is unacceptable. The Calvin protocol and dependent transactions with prepare-commit phases are production patterns for this.

Q: How do you test CAP properties in CI?
A: Jepsen is the standard tool. It injects network partitions and asserts consistency properties on the final state. It has found real bugs in etcd, Riak, and CockroachDB. For lighter integration testing, Toxiproxy simulates partition and latency between services.

Examples

Basic: CP vs AP during a simulated partition

javascript

class ReplicaSet {
  constructor() {
    this.nodes = [
      { id: 0, value: 0, version: 0 },
      { id: 1, value: 0, version: 0 },
      { id: 2, value: 0, version: 0 },
    ];
    this.partition = false;
  }

  cpWrite(value) {
    if (this.partition) {
      throw new Error('CP: partition active, write rejected');
    }
    this.nodes.forEach(n => { n.value = value; n.version++; });
    return value;
  }

  apWrite(value, nodeId = 0) {
    const node = this.nodes[nodeId];
    node.value = value;
    node.version++;
    // propagate to others only when partition is healed
    if (!this.partition) {
      this.nodes.filter(n => n.id !== nodeId).forEach(n => {
        n.value = value;
        n.version = node.version;
      });
    }
    return { value, nodeId };
  }

  read(nodeId = 0) {
    return this.nodes[nodeId].value;
  }
}

const rs = new ReplicaSet();
rs.cpWrite(100);          // all nodes: 100
rs.partition = true;

try {
  rs.cpWrite(200);        // throws
} catch (e) {
  console.log(e.message); // CP: partition active, write rejected
}

rs.apWrite(200, 0);
console.log(rs.read(0)); // 200
console.log(rs.read(1)); // 100 - stale, but system stayed available

After the partition heals, AP systems need a reconciliation step. Cassandra uses last-write-wins by timestamp. More complex cases use CRDTs for conflict-free merges.

Intermediate: Tunable quorum (Cassandra-style)

javascript

class TunableStore {
  constructor(rf = 3) {
    this.replicas = new Array(rf).fill(null).map((_, i) => ({
      id: i,
      data: {},
      online: true,
    }));
  }

  async write(key, value, w = 2) {
    const available = this.replicas.filter(r => r.online);
    available.forEach(r => { r.data[key] = { value, ts: Date.now() }; });
    if (available.length < w) {
      throw new Error(`Write quorum not met: ${available.length} of ${w} required`);
    }
    return value;
  }

  async read(key, r = 2) {
    const available = this.replicas.filter(n => n.online && n.data[key]);
    if (available.length < r) {
      throw new Error(`Read quorum not met: ${available.length} of ${r} required`);
    }
    return available
      .map(n => n.data[key])
      .sort((a, b) => b.ts - a.ts)[0].value;
  }
}

const store = new TunableStore(3);
await store.write('balance', 500, 2);
store.replicas[2].online = false;     // one node goes down

console.log(await store.read('balance', 2)); // 500 - quorum of 2 still works

store.replicas[1].online = false;     // second node goes down

try {
  await store.read('balance', 2);
} catch (e) {
  console.log(e.message); // Read quorum not met: 1 of 2 required
}

This is the W+R > N guarantee in action. RF=3, W=2, R=2: any read set overlaps with any write set. Drop to W=1, R=1 and you get pure AP with stale read risk. Choose based on what is worse for your use case: a failed read or a stale one.

Advanced: Partition handling with compensation (fintech pattern)

javascript

// Pattern: AP base with CP enforcement for critical operations only
class HybridStore {
  constructor() {
    this.regions = {
      us: { balance: 1000, version: 1, online: true },
      eu: { balance: 1000, version: 1, online: true },
    };
    this.pendingTransfers = [];
  }

  // AP read: local region, fast, may be slightly behind
  read(region) {
    return this.regions[region];
  }

  // CP write for transfers: require all regions, reject if any unreachable
  transfer(fromRegion, amount) {
    const allOnline = Object.values(this.regions).every(r => r.online);
    if (!allOnline) {
      this.pendingTransfers.push({ fromRegion, amount, ts: Date.now() });
      throw new Error('Partition active: transfer queued for post-heal commit');
    }
    if (this.regions[fromRegion].balance < amount) {
      throw new Error('Insufficient funds');
    }
    Object.values(this.regions).forEach(r => { r.balance -= amount; r.version++; });
    return this.regions[fromRegion].balance;
  }

  // Heal partition and replay pending transfers
  heal() {
    Object.values(this.regions).forEach(r => (r.online = true));
    const pending = [...this.pendingTransfers];
    this.pendingTransfers = [];
    return pending.map(t => {
      try {
        return { ...t, result: this.transfer(t.fromRegion, t.amount) };
      } catch (e) {
        return { ...t, error: e.message };
      }
    });
  }
}

const store = new HybridStore();
store.regions.eu.online = false; // EU region partitioned

try {
  store.transfer('us', 200);
} catch (e) {
  console.log(e.message); // Partition active: transfer queued for post-heal commit
}

console.log(store.read('us').balance); // 1000 - AP read still works

const resolved = store.heal();
console.log(resolved);                 // transfer replayed after partition healed
console.log(store.read('us').balance); // 800

This is the pattern behind multi-region banking: AP for reads everywhere, CP enforcement only on the transfer code path. The pending queue with post-heal replay keeps the system available without risking financial data loss during a partition.

Short Answer

Interview ready

Premium

A concise answer to help you respond confidently on this topic during an interview.

Finished reading?