# What is a dead letter queue?

**Dead letter queue (DLQ)** - a separate queue that stores messages a consumer failed to process after all retry attempts. Instead of blocking the main queue or losing data, failed messages move to the DLQ for inspection and reprocessing (re-drive).

**Key:** A DLQ does not fix the root cause. It isolates broken messages so the main flow keeps running.

## Theory

### TL;DR

- A DLQ captures failed messages so the main queue keeps moving
- It does not fix the root cause - it holds messages for inspection and reprocessing
- A message lands in the DLQ after hitting the retry limit (typically 3-5 attempts with exponential backoff)
- Sending messages from the DLQ back to the original queue is called **re-drive**
- Common failure triggers: malformed payload, downstream service outage, schema mismatch, missing field

### Quick Example

Two microservices communicate through a queue. `Payments` publishes an event after a successful charge. `Subscriptions` listens, creates a subscription record, and sends a welcome email via a third-party provider.

```text
// Normal flow
PaymentSucceeded { paymentId, userId, planId, amount, occurredAt }
  -> Subscriptions service: creates subscription + sends welcome email

// Email provider returns 502
Subscriptions service:
  attempt 1 -> 502, wait 30s
  attempt 2 -> 502, wait 60s
  attempt 3 -> 502, wait 120s
  -> max retries exceeded
  -> message moves to DLQ

// Provider recovers
DLQ re-drive -> message returns to main queue -> processed successfully
```

The message is not lost. It parks in the DLQ until someone, or an automated process, handles it.
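The retry trace above can be sketched in a few lines of code. This is a minimal in-memory sketch, not how any particular broker implements retries: `processWithRetries`, `backoffDelayMs`, and the `sleep` and `deadLetter` callbacks are hypothetical names introduced for illustration, and a real broker usually does the counting for you.

```javascript
// Hypothetical helper: the wait doubles after every failed attempt.
// attempt 1 -> base, attempt 2 -> 2x base, attempt 3 -> 4x base
function backoffDelayMs(attempt, baseDelayMs) {
  return baseDelayMs * 2 ** (attempt - 1);
}

// Runs the handler up to maxAttempts times. After the last failure the
// message is handed to deadLetter instead of being retried again.
// sleep and deadLetter are injected so the sketch stays broker-agnostic.
async function processWithRetries(message, handler, options) {
  const { maxAttempts, baseDelayMs, sleep, deadLetter } = options;
  let lastError;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await handler(message);
      return { status: 'processed', attempts: attempt };
    } catch (error) {
      lastError = error;
      // Wait before the next attempt (or before giving up, as in the trace)
      await sleep(backoffDelayMs(attempt, baseDelayMs));
    }
  }

  await deadLetter(message, lastError);
  return { status: 'dead-lettered', attempts: maxAttempts };
}
```

With `maxAttempts: 3` and `baseDelayMs: 30000`, a handler that always fails waits 30s, 60s, and 120s before the message is dead-lettered, which matches the trace.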
### Why Not Retry Forever

Infinite retries block the queue. If a message keeps failing, consumers stay occupied retrying it, and new messages pile up behind it. A DLQ solves this by moving the broken message aside after N attempts.

There is also the poison pill problem. A message that looks valid but keeps crashing your consumer can take down the entire processing pipeline. Without a DLQ, that one message stalls everything. With a DLQ, it gets isolated after the retry limit and the rest of the queue moves on.

### How the Retry-to-DLQ Flow Works

The consumer receives a message and tries to process it. On failure, it nacks (negative acknowledgment) the message, or, in SQS, simply does not delete it, so the message becomes visible again after the visibility timeout. The broker redelivers it. After the configured retry limit, the broker moves the message to the DLQ instead of redelivering it again.

The configuration varies by platform:

- **AWS SQS**: set `RedrivePolicy` on the source queue with `maxReceiveCount` and point `deadLetterTargetArn` at your DLQ
- **RabbitMQ**: set `x-dead-letter-exchange` on the source queue; messages are dead-lettered when they are rejected or nacked with `requeue=false`, when they expire, or when a quorum queue's `x-delivery-limit` is exceeded
- **Apache Kafka**: no broker-level DLQ, so you publish failed messages to a `*.DLT` topic yourself (Spring Kafka follows this convention)

### DLQ Re-drive

Re-drive means sending messages from the DLQ back to the original queue for reprocessing. You do this after fixing the bug that caused the failures in the first place.

The AWS SQS console has a built-in re-drive button. For RabbitMQ and Kafka you typically write a small script or use a management plugin.

Before you re-drive, inspect the message payload. Sometimes the failure is a bad message (wrong schema, missing field), and such messages will not succeed no matter how many times you retry. Fix the message or discard it.

### Common Mistakes

**Setting `maxReceiveCount` too low.** If you set it to 1, a transient network hiccup sends the message straight to the DLQ. Start at 3-5 retries with exponential backoff.

**Not monitoring DLQ size.** A DLQ that silently fills up is the same as losing messages.
I've seen teams discover weeks of failed orders sitting in a DLQ only after a customer complaint. Set an alarm on DLQ depth. If it goes above zero, someone should know.

**Re-driving without fixing the bug first.** Re-driving messages before the consumer is actually fixed just moves them back to the DLQ again. Confirm the fix, then re-drive.

**One DLQ for everything.** In larger systems, mixing messages from unrelated services in one DLQ makes debugging harder. Each service, or at least each queue, should have its own DLQ.

**Ignoring message ordering.** A message that detours through the DLQ is processed after messages that were originally behind it, so re-driven messages do not arrive in the original order. This holds even when both queues are FIFO (in SQS, the DLQ of a FIFO queue must itself be FIFO). This matters for financial or audit workflows.

### Where You'll See This

- AWS SQS with Lambda or ECS consumers
- RabbitMQ in Node.js services (amqplib, NestJS queues)
- Apache Kafka with kafkajs or Spring Boot
- Google Cloud Pub/Sub (called "dead letter topic" there)
- Azure Service Bus (has a built-in dead-letter subqueue)

### Follow-up Questions

**Q:** What's the difference between a DLQ and a retry queue?

**A:** A retry queue is temporary - it holds a message while waiting to attempt processing again, usually with a delay. A DLQ is the final stop after all retries fail. Some systems combine both: retry queue first, DLQ after all retries are exhausted.

**Q:** How do you decide the right `maxReceiveCount`?

**A:** It depends on what kind of failures you expect. Transient issues like network blips usually resolve in 1-2 retries. A downstream outage might need 5+. Most teams start at 3-5 and tune based on DLQ metrics in production.

**Q:** Can a DLQ have its own DLQ?

**A:** In practice, no - you don't want that recursion. Treat the DLQ as the terminal stop.

**Q:** What happens if your DLQ is full and a new failed message arrives?

**A:** Depends on the broker. SQS queues have no practical limit on stored messages, but a message is deleted once its retention period expires (14 days at most). RabbitMQ can drop messages when a queue length limit is reached, depending on the queue's overflow settings.
Either way, the message is lost - another reason to monitor DLQ depth proactively.

**Q:** Is a DLQ the same as the poison pill pattern?

**A:** Related but different. A poison pill is a specific message that repeatedly crashes a consumer. A DLQ is the infrastructure that catches those messages after retries are exhausted. The poison pill is the problem; the DLQ is part of the solution.

## Examples

### Payments and Subscriptions Service

This is the scenario you will explain in most interviews about DLQs. The SQS queue has `RedrivePolicy` set to `maxReceiveCount: 3`. If the email provider call throws three times, SQS automatically moves the message to the DLQ - no consumer code change needed.

```javascript
// subscriptions-consumer.js
const { SQSClient, DeleteMessageCommand } = require('@aws-sdk/client-sqs');
// Assumed application modules (not part of the AWS SDK)
const db = require('./db');
const emailProvider = require('./email-provider');

const client = new SQSClient({ region: 'us-east-1' });

async function processPaymentEvent(message) {
  const { paymentId, userId, planId } = JSON.parse(message.Body);

  // Create subscription record in DB
  // A redelivery re-runs this step, so the write should be idempotent (e.g. keyed on paymentId)
  await db.subscriptions.create({ userId, planId, paymentId });

  // Send welcome email - if this throws, SQS will redeliver the message
  // After maxReceiveCount failures, SQS moves the message to the DLQ
  await emailProvider.sendWelcomeEmail({ userId, planId });

  // Only delete the message after full success
  await client.send(new DeleteMessageCommand({
    QueueUrl: process.env.MAIN_QUEUE_URL,
    ReceiptHandle: message.ReceiptHandle,
  }));
}

module.exports = { processPaymentEvent };
```

Notice that the delete only happens at the end. If `emailProvider.sendWelcomeEmail` throws, the message is not deleted. It becomes visible again after the visibility timeout, and SQS counts the next receive toward `maxReceiveCount`.

### Manual DLQ in Kafka

Kafka has no broker-level DLQ, so you publish failed messages to a separate topic yourself. The convention is to append `.DLT` to the original topic name.
```javascript
// kafka-consumer.js
const { Kafka } = require('kafkajs');
// Assumed application module that exports the message handler
const { processMessage } = require('./process-message');

const kafka = new Kafka({ brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'subscriptions-group' });
const producer = kafka.producer();

async function main() {
  // Both clients must be connected before use
  await producer.connect();
  await consumer.connect();
  // Topic name is an assumption for this example (kafkajs 2.x `topics` option)
  await consumer.subscribe({ topics: ['payment-events'] });

  await consumer.run({
    eachMessage: async ({ topic, message }) => {
      try {
        await processMessage(JSON.parse(message.value.toString()));
      } catch (error) {
        // Publish to dead letter topic - original-topic-name.DLT by convention
        await producer.send({
          topic: `${topic}.DLT`,
          messages: [{
            value: message.value,
            headers: {
              'x-original-topic': topic,
              'x-error-message': error.message,
              'x-failed-at': Date.now().toString(),
            },
          }],
        });
      }
    },
  });
}

main().catch(console.error);
```

The headers are the important part here. When you inspect the DLT later, you know exactly which topic the message came from and why it failed. That context saves a lot of debugging time.

### Monitoring DLQ Depth with CloudWatch

A DLQ is only useful if someone notices when it has messages. Here is a minimal CloudWatch alarm in CDK:

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { Alarm } from 'aws-cdk-lib/aws-cloudwatch';
import { Queue } from 'aws-cdk-lib/aws-sqs';
import { Construct } from 'constructs';

// Stack name is an example; the queue and alarm need a construct scope (`this`)
export class PaymentEventsStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const dlq = new Queue(this, 'PaymentEventsDLQ');

    new Alarm(this, 'DLQNotEmpty', {
      metric: dlq.metricApproximateNumberOfMessagesVisible(),
      threshold: 1, // alert on any single message (default comparison is >= threshold)
      evaluationPeriods: 1,
      alarmDescription: 'Payment events DLQ has messages - check consumer logs',
    });
  }
}
```

Set this up before going to production. A DLQ without an alarm is infrastructure theater.
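### Triage Before Re-drive

The re-drive advice from the theory section (inspect the payload first, because a malformed message never succeeds on retry) can be sketched as a small triage step. This is an illustrative sketch under assumptions: `triageDlqMessages` and `REQUIRED_FIELDS` are hypothetical names, the required fields are borrowed from the `PaymentSucceeded` event, and the function only sorts message bodies. It does not talk to any broker.

```javascript
// Hypothetical list of fields the consumer cannot work without
const REQUIRED_FIELDS = ['paymentId', 'userId', 'planId'];

// Splits raw DLQ message bodies into ones worth re-driving and ones that
// can never succeed as-is (unparseable JSON, non-object payloads, or
// payloads with missing fields).
function triageDlqMessages(bodies) {
  const redrive = [];
  const rejected = [];

  for (const body of bodies) {
    let payload;
    try {
      payload = JSON.parse(body);
    } catch (error) {
      rejected.push({ body, reason: 'invalid JSON' });
      continue;
    }

    if (payload === null || typeof payload !== 'object') {
      rejected.push({ body, reason: 'not an object' });
      continue;
    }

    const missing = REQUIRED_FIELDS.filter((field) => payload[field] == null);
    if (missing.length > 0) {
      rejected.push({ body, reason: `missing: ${missing.join(', ')}` });
    } else {
      redrive.push(body);
    }
  }

  return { redrive, rejected };
}
```

Only the `redrive` list goes back to the main queue. The `rejected` list, with its reasons, is what you fix by hand or discard.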