# What is a dead letter queue?

**Dead letter queue (DLQ)** - a separate queue that stores messages a consumer could not process after exhausting all retry attempts.

## Theory

### TL;DR

- DLQ captures failed messages so the main queue keeps moving
- It does not fix the root cause - it holds messages for inspection and reprocessing
- A message lands in DLQ after hitting the retry limit (typically 3-5 attempts with exponential backoff)
- Sending messages from DLQ back to the original queue is called **re-drive**
- Common failure triggers: malformed payload, downstream service outage, schema mismatch, missing field

### Quick Example

Two microservices communicate through a queue. `Payments` publishes an event after a successful charge. `Subscriptions` listens, creates a subscription record, and sends a welcome email via a third-party provider.

```text
// Normal flow
PaymentSucceeded { paymentId, userId, planId, amount, occurredAt }
  -> Subscriptions service: creates subscription + sends welcome email

// Email provider returns 502
Subscriptions service:
  attempt 1 -> 502, wait 30s
  attempt 2 -> 502, wait 60s
  attempt 3 -> 502, wait 120s
  -> max retries exceeded -> message moves to DLQ

// Provider recovers
DLQ re-drive -> message returns to main queue -> processed successfully
```

The message is not lost. It parks in DLQ until someone, or an automated process, handles it.

### Why Not Retry Forever

Infinite retries block the queue. If a message keeps failing, every consumer slot stays occupied on it, and new messages pile up behind. DLQ solves this by moving the broken message aside after N attempts.

There is also the poison pill problem. A message that looks valid but keeps crashing your consumer can take down the entire processing pipeline. Without DLQ, that one message stalls everything. With DLQ, it gets isolated after the retry limit and the rest of the queue moves on.

### How the Retry-to-DLQ Flow Works

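The consumer receives a message and tries to process it. On failure, it nacks (negatively acknowledges) the message, or on SQS simply does not delete it before the visibility timeout expires. The broker then makes the message available for delivery again and counts the attempt. Once the configured retry limit is reached, the broker moves the message to the DLQ instead of delivering it again.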
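
The flow above can be sketched without a real broker. This is a minimal in-memory simulation, not any broker's actual implementation: `drainQueue`, `MAX_RECEIVE_COUNT`, and the message shape are names made up for illustration.

```javascript
// In-memory sketch of the retry-to-DLQ flow (no real broker involved).
// MAX_RECEIVE_COUNT plays the role of the broker's configured retry limit.
const MAX_RECEIVE_COUNT = 3;

function drainQueue(bodies, handler, maxReceiveCount = MAX_RECEIVE_COUNT) {
  const mainQueue = bodies.map((body) => ({ body, receiveCount: 0 }));
  const deadLetterQueue = [];
  const processed = [];

  while (mainQueue.length > 0) {
    const message = mainQueue.shift();
    message.receiveCount += 1;
    try {
      handler(message.body);
      processed.push(message.body); // ack: the message leaves the queue
    } catch (error) {
      if (message.receiveCount >= maxReceiveCount) {
        // Retry limit reached: park the message instead of re-queuing it
        deadLetterQueue.push({ ...message, lastError: error.message });
      } else {
        mainQueue.push(message); // nack: back onto the queue for another attempt
      }
    }
  }
  return { processed, deadLetterQueue };
}

// Usage: one message always fails, the others succeed.
const result = drainQueue(['ok-1', 'poison', 'ok-2'], (body) => {
  if (body === 'poison') throw new Error('cannot parse payload');
});
console.log(result.processed);
console.log(result.deadLetterQueue);
```

The failing message is attempted three times and then isolated, while the healthy messages around it are processed normally.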

The configuration varies by platform:

- **AWS SQS**: set `RedrivePolicy` with `maxReceiveCount` and point `deadLetterTargetArn` at your DLQ
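- **RabbitMQ**: set `x-dead-letter-exchange` on the source queue; messages are dead-lettered when they are rejected or nacked with `requeue=false`, when their TTL expires, or when the queue overflows. Quorum queues can also cap redeliveries with `x-delivery-limit`
- **Apache Kafka**: brokers have no built-in DLQ, so the consumer publishes failed messages to a dead-letter topic itself, commonly named `<topic>.DLT` (Spring Kafka uses this convention; Kafka Connect sink connectors have their own dead-letter topic setting)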
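
For the SQS case, the one detail that trips people up is that `RedrivePolicy` is passed as a JSON string, not a nested object. The sketch below only builds the attribute map. `buildRedriveAttributes` is a hypothetical helper and the ARN is a placeholder.

```javascript
// Builds the Attributes map for an SQS redrive policy.
// RedrivePolicy must be a JSON *string*, not a nested object.
function buildRedriveAttributes(deadLetterTargetArn, maxReceiveCount) {
  if (!Number.isInteger(maxReceiveCount) || maxReceiveCount < 1) {
    throw new RangeError('maxReceiveCount must be a positive integer');
  }
  return {
    RedrivePolicy: JSON.stringify({ deadLetterTargetArn, maxReceiveCount }),
  };
}

// Placeholder ARN, for illustration only.
const attributes = buildRedriveAttributes(
  'arn:aws:sqs:us-east-1:123456789012:payment-events-dlq',
  3,
);
console.log(attributes);

// With the AWS SDK v3, the map would be applied to the source queue roughly like this:
//   await client.send(new SetQueueAttributesCommand({
//     QueueUrl: process.env.MAIN_QUEUE_URL,
//     Attributes: attributes,
//   }));
```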

### DLQ Re-drive

Re-drive means sending messages from the DLQ back to the original queue for reprocessing. You do this after fixing the bug that caused failures in the first place.

AWS SQS Console has a built-in re-drive button. For RabbitMQ and Kafka you typically write a small script or use a management plugin.

Before you re-drive, inspect the message payload. Sometimes the failure is a bad message (wrong schema, missing field) and those will not succeed on retry no matter how many times you try. Fix the message or discard it.
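
One way to do that inspection is a small triage pass over the DLQ contents before anything is re-driven. This is a sketch under assumptions: `triageDeadLetters` is a hypothetical helper, and the required fields are taken from the `PaymentSucceeded` event in the Quick Example.

```javascript
// Fields every PaymentSucceeded payload must carry to be processable.
const REQUIRED_FIELDS = ['paymentId', 'userId', 'planId'];

// Splits raw DLQ message bodies into ones worth re-driving and ones
// that would fail again no matter how often they are retried.
function triageDeadLetters(bodies) {
  const redrivable = [];
  const needsAttention = [];

  for (const body of bodies) {
    let payload;
    try {
      payload = JSON.parse(body);
    } catch {
      needsAttention.push({ body, reason: 'invalid JSON' });
      continue;
    }
    const missing = REQUIRED_FIELDS.filter((field) => payload?.[field] == null);
    if (missing.length > 0) {
      needsAttention.push({ body, reason: `missing field(s): ${missing.join(', ')}` });
    } else {
      redrivable.push(body);
    }
  }
  return { redrivable, needsAttention };
}

const report = triageDeadLetters([
  '{"paymentId":"p1","userId":"u1","planId":"pro"}',
  '{"paymentId":"p2","userId":"u2"}',
  '{not json',
]);
console.log(report);
```

Only the `redrivable` bodies go back to the main queue. The rest are fixed by hand or discarded.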

### Common Mistakes

**Setting maxReceiveCount too low.** If you set it to 1, a transient network hiccup sends the message straight to DLQ. Start at 3-5 retries with exponential backoff.

**Not monitoring DLQ size.** A DLQ that silently fills up is the same as losing messages. I've seen teams discover weeks of failed orders sitting in a DLQ only after a customer complaint. Set an alarm on DLQ depth. If it goes above zero, someone should know.

**Re-driving without fixing the bug first.** Re-driving messages before the consumer is actually fixed just moves them back to DLQ again. Confirm the fix, then re-drive.

**One DLQ for everything.** In larger systems, mixing messages from unrelated services in one DLQ makes debugging harder. Each service, or at least each queue, should have its own DLQ.

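**Ignoring message ordering.** A message that detours through the DLQ comes back out of order: by the time you re-drive it, messages published after it have already been processed. On SQS the DLQ of a FIFO queue must itself be FIFO, but that does not put a re-driven message back in its original position. This matters for financial or audit workflows.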
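
The exponential backoff recommended in the first mistake can be sketched as a small delay function. The 30-second base matches the Quick Example. `backoffDelayMs` and the 15-minute cap are assumptions chosen for illustration.

```javascript
// Delay before the next attempt: base * 2^(attempt - 1), capped at maxMs.
// attempt is 1-based, so attempt 1 waits baseMs.
function backoffDelayMs(attempt, baseMs = 30000, maxMs = 15 * 60 * 1000) {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError('attempt must be a positive integer');
  }
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

const delaysInSeconds = [1, 2, 3].map((attempt) => backoffDelayMs(attempt) / 1000);
console.log(delaysInSeconds); // [ 30, 60, 120 ]
```

The cap keeps late attempts from waiting unreasonably long when the retry limit is set high.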

### Where You'll See This

- AWS SQS with Lambda or ECS consumers
- RabbitMQ in Node.js services (amqplib, NestJS queues)
- Apache Kafka with kafkajs or Spring Boot
- Google Cloud Pub/Sub (called "dead letter topic" there)
- Azure Service Bus (has a built-in dead-letter subqueue)

### Follow-up Questions

**Q:** What's the difference between a DLQ and a retry queue?
**A:** A retry queue is temporary - it holds a message while waiting to attempt processing again, usually with a delay. A DLQ is the final stop after all retries fail. Some systems combine both: retry queue first, DLQ after all retries are exhausted.

**Q:** How do you decide the right maxReceiveCount?
**A:** It depends on what kind of failures you expect. Transient issues like network blips usually resolve in 1-2 retries. A downstream outage might need 5+. Most teams start at 3-5 and tune based on DLQ metrics in production.

**Q:** Can a DLQ have its own DLQ?
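**A:** On most brokers a DLQ is an ordinary queue, so it is often technically possible (RabbitMQ, for example, lets a dead-letter queue declare its own dead-letter exchange). It is rarely a good idea: chaining DLQs only hides failures one level deeper. Treat the DLQ as the terminal stop.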

**Q:** What happens if your DLQ is full and a new failed message arrives?
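**A:** Depends on the broker. SQS queues have no practical limit on stored messages, so the DLQ does not fill up; the risk there is the retention period (14 days at most), after which SQS deletes the message. RabbitMQ queues can be given a length limit (`x-max-length`), and once it is reached the broker drops or rejects messages according to the queue's overflow setting. Either way, the message is lost - another reason to monitor DLQ depth proactively.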

**Q:** Is a DLQ the same as the poison pill pattern?
**A:** Related but different. A poison pill is a specific message that repeatedly crashes a consumer. DLQ is the infrastructure that catches those messages after retries are exhausted. The poison pill is the problem; DLQ is part of the solution.

## Examples

### Payments and Subscriptions Service

This is the scenario you will explain in most interviews about DLQs. The SQS queue has `RedrivePolicy` set to `maxReceiveCount: 3`. If the email provider call throws three times, SQS automatically moves the message to the DLQ - no consumer code change needed.

```javascript
// subscriptions-consumer.js
const { SQSClient, DeleteMessageCommand } = require('@aws-sdk/client-sqs');

const client = new SQSClient({ region: 'us-east-1' });

// `db` and `emailProvider` stand in for your own application modules.
async function processPaymentEvent(message) {
  const { paymentId, userId, planId } = JSON.parse(message.Body);

  // Create subscription record in DB
  await db.subscriptions.create({ userId, planId, paymentId });

  // Send welcome email - if this throws, the message is not deleted and
  // SQS redelivers it once the visibility timeout expires.
  // After maxReceiveCount failed receives, SQS moves the message to the DLQ.
  await emailProvider.sendWelcomeEmail({ userId, planId });

  // Only delete the message after full success
  await client.send(new DeleteMessageCommand({
    QueueUrl: process.env.MAIN_QUEUE_URL,
    ReceiptHandle: message.ReceiptHandle,
  }));
}
```

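Notice that the delete only happens at the end. If `emailProvider.sendWelcomeEmail` throws, the message is not deleted, so it becomes visible again after the visibility timeout and its receive count goes up on the next delivery. Because every redelivery reruns the whole handler, the subscription insert should be idempotent (for example, keyed on `paymentId`), or a retry will create a duplicate record.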

### Manual DLQ in Kafka

Kafka has no native DLQ, so you publish failed messages to a separate topic yourself. The convention is to append `.DLT` to the original topic name.

```javascript
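// kafka-consumer.js
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'subscriptions-group' });
const producer = kafka.producer();

// `processMessage` stands in for your own handler; the topic name is illustrative.
async function main() {
  // Both clients must be connected, and the consumer subscribed, before run()
  await producer.connect();
  await consumer.connect();
  await consumer.subscribe({ topics: ['payment-events'] });

  await consumer.run({
    eachMessage: async ({ topic, message }) => {
      try {
        await processMessage(JSON.parse(message.value.toString()));
      } catch (error) {
        // Publish to dead letter topic - original-topic-name.DLT by convention
        await producer.send({
          topic: `${topic}.DLT`,
          messages: [{
            value: message.value,
            headers: {
              'x-original-topic': topic,
              'x-error-message': error.message,
              'x-failed-at': Date.now().toString(),
            },
          }],
        });
        // Returning normally lets the consumer commit the offset and move on
      }
    },
  });
}

main().catch((error) => {
  console.error(error);
  process.exit(1);
});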
```

The headers are the important part here. When you inspect the DLT later, you know exactly which topic the message came from and why it failed. That context saves a lot of debugging time.
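
When reading the DLT back, keep in mind that kafkajs delivers header values as Buffers, so they need decoding before they are readable. The sketch below assumes the three header names used in the consumer above. `describeDeadLetter` and the sample message are hypothetical.

```javascript
// Turns a consumed DLT message into a plain, readable summary.
function describeDeadLetter(message) {
  const header = (name) => {
    const value = message.headers?.[name];
    return value == null ? undefined : value.toString();
  };
  const failedAt = header('x-failed-at');

  return {
    originalTopic: header('x-original-topic'),
    error: header('x-error-message'),
    failedAt: failedAt ? new Date(Number(failedAt)).toISOString() : undefined,
    payload: message.value.toString(),
  };
}

// Simulated message, shaped like what kafkajs passes to eachMessage.
const sample = {
  value: Buffer.from('{"paymentId":"p1"}'),
  headers: {
    'x-original-topic': Buffer.from('payment-events'),
    'x-error-message': Buffer.from('email provider returned 502'),
    'x-failed-at': Buffer.from('1700000000000'),
  },
};
const summary = describeDeadLetter(sample);
console.log(summary);
```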

### Monitoring DLQ Depth with CloudWatch

A DLQ is only useful if someone notices when it has messages. Here is a minimal CloudWatch alarm in CDK:

```typescript
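import { Stack, StackProps } from 'aws-cdk-lib';
import { Alarm } from 'aws-cdk-lib/aws-cloudwatch';
import { Queue } from 'aws-cdk-lib/aws-sqs';
import { Construct } from 'constructs';

// The stack name is illustrative; constructs need a scope, hence the Stack class
export class PaymentEventsStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const dlq = new Queue(this, 'PaymentEventsDLQ');

    // Attach an alarm action (for example an SNS topic) so someone is notified
    new Alarm(this, 'DLQNotEmpty', {
      metric: dlq.metricApproximateNumberOfMessagesVisible(),
      threshold: 1,          // alert on any single message
      evaluationPeriods: 1,
      alarmDescription: 'Payment events DLQ has messages - check consumer logs',
    });
  }
}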
```

Set this up before going to production. A DLQ without an alarm is infrastructure theater.
