What is the CAP theorem?
CAP theorem states that a distributed system replicating data across multiple nodes cannot simultaneously guarantee Consistency, Availability, and Partition tolerance when a network partition occurs.
Theory
TL;DR
- The trade-off activates only during partitions; in normal operation, target all three properties
- Partition tolerance is non-negotiable in real networks, so the real choice is CP or AP
- CP (ZooKeeper, CockroachDB): blocks or rejects writes during partition to stay consistent
- AP (Cassandra, DynamoDB): always responds, serves stale local data during partition
- CA is a single-node illusion, not a real distributed option
Quick Example
class CAPDemo {
constructor() {
this.nodes = [{ balance: 100 }, { balance: 100 }];
}
// CP: rejects write if replicas have diverged (partition simulation)
cpWrite(amount) {
if (this.nodes[0].balance !== this.nodes[1].balance) {
throw new Error('Partition detected: rejecting write to stay consistent');
}
this.nodes.forEach(n => (n.balance += amount));
return this.nodes[0].balance;
}
// AP: writes to local node, responds regardless of replica state
apWrite(amount, nodeId) {
this.nodes[nodeId].balance += amount;
return this.nodes[nodeId].balance;
}
}
const demo = new CAPDemo();
console.log(demo.cpWrite(50)); // 150 - both nodes updated
demo.nodes[1].balance = 200; // simulate partition: node 1 drifts
console.log(demo.apWrite(10, 0)); // 160 - node 0 responds, but inconsistent with node 1CP refuses to write when replicas disagree. AP writes locally and returns fast. Both are correct behaviors for their respective trade-offs.
CP vs AP in Practice
Network partitions are not edge cases. AWS has documented outages, switches drop packets, cross-datacenter links fail. Since partitions happen, you take Partition tolerance as a given and choose what to give up.
CP systems block or reject operations during a partition rather than risk inconsistency. ZooKeeper stops accepting writes if it loses quorum. CockroachDB uses Raft: no quorum, no commit. This is exactly what distributed locks and financial ledgers need.
AP systems keep responding. Cassandra accepts writes to the local node and propagates them asynchronously. Netflix runs Cassandra for over 100 PB of user event data. A stale recommendation is a minor UX issue. A service timeout is user-facing downtime.
When to Use
- Bank transfers, account balances: CP. A rejected write beats a double-spend.
- Leader election, distributed locks: CP. Kafka uses ZooKeeper, Kubernetes uses etcd, both built on Raft quorums.
- Social feeds, timelines, notifications: AP. Posts appearing two seconds late is acceptable.
- Analytics dashboards, event tracking: AP. Approximate freshness beats downtime.
- Shopping carts with conflict resolution: AP with CRDTs or last-write-wins, which is how Amazon's original Dynamo paper designed it.
Comparison Table
| CP | AP | CA | |
|---|---|---|---|
| Consistency | Yes, waits for quorum | Eventual, stale reads possible | Yes, but assumes no partition |
| Availability | Rejects during partition | Always responds locally | Yes, single-node assumption |
| Partition tolerance | Yes | Yes | No |
| Use case | Locks, ledgers, configs | Feeds, caches, high-volume events | Single-node dev/test only |
| Examples | ZooKeeper, CockroachDB, etcd | Cassandra, DynamoDB (default) | Unpartitioned single Postgres |
How It Works Internally
CP protocols (Raft, Paxos) require a majority quorum before committing any write. The leader sends a proposal to followers, waits for acknowledgment from a majority, then commits. If a partition isolates nodes below majority count, those nodes stop accepting writes. The math: N=3 nodes, quorum=2. One node isolated means two nodes can still commit. Two nodes isolated means the group halts.
AP protocols use gossip-based replication. Each node accepts writes locally and shares state with peers periodically. Version tracking via vector clocks or timestamps handles conflicts after the partition heals. Cassandra's consistency level is tunable: CONSISTENCY QUORUM with RF=3, R=2, W=2 gives you W+R > N, a CP-like guarantee while keeping AP flexibility for lower-priority reads.
I've watched teams deploy Cassandra as the default "scalable database" without thinking through the CAP choice, then spend sprint after sprint debugging stale data in payment flows. Choosing AP is not wrong. Choosing it by accident is.
Common Mistakes
Mistake: Treating CAP as a constant constraint
CAP forces a trade-off only during a partition. Outside of that, run with consistency and availability both. The theorem does not require you to sacrifice consistency on every read.
// Wrong: always block until all replicas confirm
await Promise.all(nodes.map(n => n.write(value))); // one slow node kills latency
// Better: quorum write - majority is enough
const quorum = Math.ceil(nodes.length / 2) + 1;
await Promise.all(nodes.slice(0, quorum).map(n => n.write(value)));Mistake: Selecting CA as a real system design option
CA is theoretical. Real networks partition. MongoDB was labeled CA for years while users observed partition-triggered availability failures. Every production distributed database is CP or AP.
Mistake: Assuming AP requires no conflict handling
AP does not mean "write anywhere, read anywhere, done." Two nodes accepting conflicting writes during a partition produce diverged state. Without last-write-wins, CRDTs, or application-level merge logic, you get data loss or double-spend bugs in fintech.
Mistake: Equating CAP consistency with ACID consistency
CAP's C is linearizability: reads always reflect the latest committed write, as if the system were a single machine. ACID's C means transaction constraints remain valid after a commit. These are different properties. A system can be CAP-consistent without full ACID serializability.
Mistake: Scaling replicas without deciding on CP or AP first
More replicas without a deliberate choice defaults to whatever the database ships with. Cassandra ships as AP. MongoDB replica sets default to CP on primary reads. Know which one you need before you scale.
Real-World Usage
- Cassandra (AP): Netflix event pipeline, 100+ PB.
CONSISTENCY QUORUMfor sensitive reads,ONEfor high-volume writes - ZooKeeper (CP): Kafka broker coordination, distributed lock service
- CockroachDB (CP): serializable isolation for banking clients
- DynamoDB (AP default): AWS e-commerce.
ConsistentRead=trueshifts to CP-like behavior at a latency cost - etcd (CP): Kubernetes cluster state, Raft-based quorum
- MongoDB replica sets (CP primary): Uber trip history; secondary reads trade freshness for throughput
Follow-up Questions
Q: How does PACELC extend CAP?
A: PACELC adds the latency dimension. Even with no partition (E = else), there is a trade-off between latency (L) and consistency (C). DynamoDB optimizes for L during normal operation, not just during failures.
Q: What is the difference between linearizability and eventual consistency?
A: Linearizability (CAP's C) means any read returns the result of the latest write, globally ordered. Eventual consistency means all replicas converge over time, but reads during propagation may return stale data. The gap between them is the AP/CP decision.
Q: How do W, R, and N relate to CAP in quorum systems?
A: With N replicas, setting W+R > N guarantees any read set overlaps with any write set, which gives linearizable reads. Cassandra with RF=3, W=2, R=2: W+R=4 > N=3. Setting W=1, R=1 is pure AP: fast, but stale reads are possible.
Q: How would you design a system that needs both 99.999% availability and consistent money transfers?
A: Use AP as the base layer with async multi-region replication for availability. For the transfer code path only, apply a synchronous quorum write (CP subset). The rest of the system stays available while consistency is enforced exactly where data loss is unacceptable. The Calvin protocol and dependent transactions with prepare-commit phases are production patterns for this.
Q: How do you test CAP properties in CI?
A: Jepsen is the standard tool. It injects network partitions and asserts consistency properties on the final state. It has found real bugs in etcd, Riak, and CockroachDB. For lighter integration testing, Toxiproxy simulates partition and latency between services.
Examples
Basic: CP vs AP during a simulated partition
class ReplicaSet {
constructor() {
this.nodes = [
{ id: 0, value: 0, version: 0 },
{ id: 1, value: 0, version: 0 },
{ id: 2, value: 0, version: 0 },
];
this.partition = false;
}
cpWrite(value) {
if (this.partition) {
throw new Error('CP: partition active, write rejected');
}
this.nodes.forEach(n => { n.value = value; n.version++; });
return value;
}
apWrite(value, nodeId = 0) {
const node = this.nodes[nodeId];
node.value = value;
node.version++;
// propagate to others only when partition is healed
if (!this.partition) {
this.nodes.filter(n => n.id !== nodeId).forEach(n => {
n.value = value;
n.version = node.version;
});
}
return { value, nodeId };
}
read(nodeId = 0) {
return this.nodes[nodeId].value;
}
}
const rs = new ReplicaSet();
rs.cpWrite(100); // all nodes: 100
rs.partition = true;
try {
rs.cpWrite(200); // throws
} catch (e) {
console.log(e.message); // CP: partition active, write rejected
}
rs.apWrite(200, 0);
console.log(rs.read(0)); // 200
console.log(rs.read(1)); // 100 - stale, but system stayed availableAfter the partition heals, AP systems need a reconciliation step. Cassandra uses last-write-wins by timestamp. More complex cases use CRDTs for conflict-free merges.
Intermediate: Tunable quorum (Cassandra-style)
class TunableStore {
constructor(rf = 3) {
this.replicas = new Array(rf).fill(null).map((_, i) => ({
id: i,
data: {},
online: true,
}));
}
async write(key, value, w = 2) {
const available = this.replicas.filter(r => r.online);
available.forEach(r => { r.data[key] = { value, ts: Date.now() }; });
if (available.length < w) {
throw new Error(`Write quorum not met: ${available.length} of ${w} required`);
}
return value;
}
async read(key, r = 2) {
const available = this.replicas.filter(n => n.online && n.data[key]);
if (available.length < r) {
throw new Error(`Read quorum not met: ${available.length} of ${r} required`);
}
return available
.map(n => n.data[key])
.sort((a, b) => b.ts - a.ts)[0].value;
}
}
const store = new TunableStore(3);
await store.write('balance', 500, 2);
store.replicas[2].online = false; // one node goes down
console.log(await store.read('balance', 2)); // 500 - quorum of 2 still works
store.replicas[1].online = false; // second node goes down
try {
await store.read('balance', 2);
} catch (e) {
console.log(e.message); // Read quorum not met: 1 of 2 required
}This is the W+R > N guarantee in action. RF=3, W=2, R=2: any read set overlaps with any write set. Drop to W=1, R=1 and you get pure AP with stale read risk. Choose based on what is worse for your use case: a failed read or a stale one.
Advanced: Partition handling with compensation (fintech pattern)
// Pattern: AP base with CP enforcement for critical operations only
class HybridStore {
constructor() {
this.regions = {
us: { balance: 1000, version: 1, online: true },
eu: { balance: 1000, version: 1, online: true },
};
this.pendingTransfers = [];
}
// AP read: local region, fast, may be slightly behind
read(region) {
return this.regions[region];
}
// CP write for transfers: require all regions, reject if any unreachable
transfer(fromRegion, amount) {
const allOnline = Object.values(this.regions).every(r => r.online);
if (!allOnline) {
this.pendingTransfers.push({ fromRegion, amount, ts: Date.now() });
throw new Error('Partition active: transfer queued for post-heal commit');
}
if (this.regions[fromRegion].balance < amount) {
throw new Error('Insufficient funds');
}
Object.values(this.regions).forEach(r => { r.balance -= amount; r.version++; });
return this.regions[fromRegion].balance;
}
// Heal partition and replay pending transfers
heal() {
Object.values(this.regions).forEach(r => (r.online = true));
const pending = [...this.pendingTransfers];
this.pendingTransfers = [];
return pending.map(t => {
try {
return { ...t, result: this.transfer(t.fromRegion, t.amount) };
} catch (e) {
return { ...t, error: e.message };
}
});
}
}
const store = new HybridStore();
store.regions.eu.online = false; // EU region partitioned
try {
store.transfer('us', 200);
} catch (e) {
console.log(e.message); // Partition active: transfer queued for post-heal commit
}
console.log(store.read('us').balance); // 1000 - AP read still works
const resolved = store.heal();
console.log(resolved); // transfer replayed after partition healed
console.log(store.read('us').balance); // 800This is the pattern behind multi-region banking: AP for reads everywhere, CP enforcement only on the transfer code path. The pending queue with post-heal replay keeps the system available without risking financial data loss during a partition.
Short Answer
Interview readyA concise answer to help you respond confidently on this topic during an interview.