What are noisy tenants?
Noisy tenants are customers in a multi-tenant system who consume a disproportionate share of shared resources - CPU, memory, network, or database connections - causing performance problems for everyone else on the same infrastructure.
Theory
TL;DR
- Analogy: one apartment tenant blasting music and holding the elevator all day, so everyone else waits longer
- Main difference: shared infrastructure amplifies the impact; single-tenant setups isolate it per customer
- If one tenant uses more than 50% of shared resources, you have a noisy tenant problem
- Fix options: per-tenant rate limits, resource quotas, or dedicated infrastructure for the heaviest tenants
- Rule of thumb: accept shared risk above 50 tenants with similar usage; isolate when one tenant exceeds 2x the average
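The two thresholds above (one tenant holding more than 50% of a shared resource, or running above 2x the average) can be sketched as a simple check. This is a minimal illustration, not a production detector. The function name `find_noisy_tenants` and the usage-map input format are assumptions made for the example.

```python
from statistics import mean

def find_noisy_tenants(usage, share_limit=0.5, avg_multiple=2.0):
    """Return tenants that break either rule of thumb.

    usage: dict of tenant id -> resource units consumed (hypothetical input format).
    """
    total = sum(usage.values())
    if total == 0:
        return {}
    average = mean(usage.values())
    flagged = {}
    for tenant, used in usage.items():
        reasons = []
        if used / total > share_limit:
            reasons.append(f"over {share_limit:.0%} of shared resources")
        if used > avg_multiple * average:
            reasons.append(f"over {avg_multiple:g}x the average tenant")
        if reasons:
            flagged[tenant] = reasons
    return flagged

usage = {"tenant_a": 700, "tenant_b": 100, "tenant_c": 120, "tenant_d": 80}
noisy = find_noisy_tenants(usage)
print(noisy)
```

Note that the average here includes the noisy tenant itself, which raises the bar it has to clear. With only a handful of tenants, comparing against the median or against the average of the other tenants may be a better fit.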
Quick example
# AWS Lambda: shared concurrency pool (1000 units account-wide by default)
# Noisy tenant floods with requests, exhausts the pool, others get throttled
import boto3

lambda_client = boto3.client('lambda')

def handle_request(tenant_id, payload):
    # tenant_A sends 1000 concurrent requests
    # all 1000 concurrency units get consumed by tenant_A
    response = lambda_client.invoke(
        FunctionName='processor',
        Payload=payload
    )
    return response

# tenant_A: processes 1000 requests fine
# tenant_B: throttled with 429 TooManyRequestsException, retries for 30+ seconds
# Fix (with one function per tenant):
#   aws lambda put-function-concurrency --function-name tenant-a-handler --reserved-concurrent-executions 100
Tenant A exhausts the shared concurrency pool. Tenant B's requests are throttled and wait. No code in tenant B's path changed. The problem lives entirely at the infrastructure level.
What makes it a system design problem
In single-tenant setups, one customer's spike hits their own dedicated hardware. Done. In multi-tenant SaaS on AWS or Kubernetes, resources pool across all customers. One tenant spiking to 80% CPU throttles every other tenant through contention.
The tricky part: the noisy tenant usually has no idea they're causing problems. Their requests succeed. The victim tenants see slow responses or errors and file support tickets blaming your product. You're debugging the wrong end of the system.
When to use each strategy
- More than 50 tenants with similar usage patterns: accept the shared risk, add monitoring with Prometheus
- Variable workloads, high-value tenants: per-tenant quotas via API Gateway or a rate limiter in front of your service
- Finance or healthcare with regulated data: strict namespace isolation regardless of usage patterns
- One tenant consistently above 2x average: migrate them to dedicated infrastructure or charge for it
- Startup MVP: shared infrastructure first, then migrate after the first complaints arrive
How it works at the infrastructure level
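Kubernetes uses cgroups to limit CPU and memory per pod. A noisy pod that hits its CPU limit gets throttled automatically, and a container that exceeds its memory limit gets OOMKilled. Separately, the kubelet evicts pods when a node runs low on memory or disk, and the scheduler can preempt lower-priority pods to make room for higher-priority ones. That is where cascade failures come from. A noisy pod without a memory limit can push the whole node into memory pressure, and the resulting eviction hits pods from other tenants sharing the same node.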
AWS Lambda works differently. The default account-wide concurrency limit is 1000 units. Any function can consume the unreserved part of that pool. Reserved concurrency is set per function, so with one function per tenant it caps how many units that tenant can hold, protecting the pool for everyone else.
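A minimal sketch of how per-tenant reservations might be planned and applied, assuming one Lambda function per tenant. The tier names, unit values, and helper names (`plan_reserved_concurrency`, `apply_plan`, `RecordingClient`) are hypothetical. The only AWS call used is boto3's `put_function_concurrency`, and the sketch runs against a stand-in client so it needs no AWS access.

```python
def plan_reserved_concurrency(tier_by_function, units_by_tier,
                              account_limit=1000, unreserved_floor=100):
    """Map each per-tenant function to a reserved concurrency value.

    Raises ValueError if the plan would leave fewer than `unreserved_floor`
    units unreserved (Lambda keeps at least 100 units unreserved).
    """
    plan = {fn: units_by_tier[tier] for fn, tier in tier_by_function.items()}
    if sum(plan.values()) > account_limit - unreserved_floor:
        raise ValueError("plan reserves more concurrency than the account allows")
    return plan

def apply_plan(lambda_client, plan):
    """Apply the plan, one put_function_concurrency call per function."""
    for function_name, units in plan.items():
        lambda_client.put_function_concurrency(
            FunctionName=function_name,
            ReservedConcurrentExecutions=units,
        )

class RecordingClient:
    """Stand-in for boto3.client('lambda') so the sketch runs offline."""
    def __init__(self):
        self.calls = []

    def put_function_concurrency(self, FunctionName, ReservedConcurrentExecutions):
        self.calls.append((FunctionName, ReservedConcurrentExecutions))

plan = plan_reserved_concurrency(
    {"tenant-a-handler": "enterprise", "tenant-b-handler": "standard"},
    {"enterprise": 200, "standard": 50},  # hypothetical tiers and sizes
)
client = RecordingClient()
apply_plan(client, plan)
print(client.calls)
```

In real use, `client` would be `boto3.client('lambda')`. Validating the whole plan before applying any of it avoids ending up with a half-applied set of reservations.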
For databases, the issue shows up in connection pools. Without per-tenant limits, one tenant can open 900 of your 1000 PostgreSQL connections. Other tenants time out trying to connect. PgBouncer with a pool size per tenant database or role fixes this.
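The same idea can be sketched in application code: a per-tenant cap in front of a shared pool. This is an illustrative in-process limiter, not a replacement for PgBouncer. The class name `TenantConnectionLimiter` and its fail-fast behavior are assumptions for the example.

```python
import threading
from contextlib import contextmanager

class TenantConnectionLimiter:
    """Caps concurrent connections per tenant in front of a shared pool."""

    def __init__(self, per_tenant_limit):
        self._limit = per_tenant_limit
        self._lock = threading.Lock()
        self._slots = {}  # tenant id -> BoundedSemaphore

    def _semaphore(self, tenant_id):
        # Create each tenant's semaphore lazily, under a lock so it happens once.
        with self._lock:
            if tenant_id not in self._slots:
                self._slots[tenant_id] = threading.BoundedSemaphore(self._limit)
            return self._slots[tenant_id]

    @contextmanager
    def connection(self, tenant_id):
        sem = self._semaphore(tenant_id)
        # Fail fast instead of waiting, so one tenant cannot pile up blocked callers.
        if not sem.acquire(blocking=False):
            raise RuntimeError(f"tenant {tenant_id} is at its connection limit")
        try:
            yield  # a real implementation would hand out a pooled connection here
        finally:
            sem.release()

limiter = TenantConnectionLimiter(per_tenant_limit=2)
results = []
with limiter.connection("tenant_a"), limiter.connection("tenant_a"):
    try:
        with limiter.connection("tenant_a"):
            results.append("tenant_a third connection allowed")
    except RuntimeError:
        results.append("tenant_a rejected")
    with limiter.connection("tenant_b"):
        results.append("tenant_b connected")
print(results)
```

Tenant A is refused its third connection while tenant B connects normally, which is the isolation property the per-tenant pool size is meant to provide.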
Common mistakes
No per-tenant quotas in a shared DB connection pool
-- Wrong: one global limit, no per-tenant enforcement
ALTER SYSTEM SET max_connections = 1000;  -- takes effect after a restart
-- Noisy tenant opens 900 connections during a batch job
-- Everyone else: "sorry, too many clients already" error
-- Fix: PgBouncer with a per-tenant pool size, e.g. in the [databases] section:
-- tenant_a = dbname=tenant_a pool_size=100
One tenant running a bulk export job at midnight can silently starve every other tenant until the job finishes.
Shared Redis without namespace isolation
// Wrong: no tenant prefix, noisy tenant runs KEYS *
redis.set('user:123', data);
// KEYS * from one tenant blocks Redis for 10+ seconds
// All other tenants: timeouts on every Redis call
// Correct: prefix by tenant, use SCAN not KEYS
redis.set(`tenant:${tenantId}:user:123`, data);
redis.scan(0, 'MATCH', `tenant:${tenantId}:*`, 'COUNT', 100);
Auto-scaling on global CPU metrics
Scaling on cluster-wide CPU means one noisy tenant's spike doubles your infrastructure cost for everyone. Custom per-tenant metrics let you throttle the right tenant instead:
aws cloudwatch put-metric-data \
  --namespace Tenants \
  --metric-name CPU.TenantA \
  --value 85
No reserved concurrency in Lambda
Default Lambda setup has no concurrency reservation. One tenant floods with requests, hits the 1000-unit limit, queues every other tenant for 30+ seconds. One command per function prevents this:
aws lambda put-function-concurrency \
  --function-name tenant-a-handler \
  --reserved-concurrent-executions 100
Ignoring bursty tenants during peak traffic
During Black Friday or batch jobs, a tenant can hit 10x their normal usage in seconds. Hard limits via cgroups are reliable but abrupt: CPU is throttled at the cap and a container over its memory limit is killed, which spikes latency for that tenant. Soft limits via request throttling are smoother, but the noisy tenant keeps degrading its neighbors until the limit catches up. Neither is perfect. The choice depends on your SLO and whether you prefer a sharp cliff or a slow slope.
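One common way to implement the soft-limit side is a token bucket, which tolerates a short burst and then throttles to a steady rate. The sketch below is a minimal single-tenant version under stated assumptions: the class name and the explicit `now` argument are choices made to keep the example deterministic. A real limiter would keep one bucket per tenant and read the time from `time.monotonic()`.

```python
class TokenBucket:
    """Soft limit: allows bursts up to `capacity`, then throttles to `rate` per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
burst = [bucket.allow(now=0.0) for _ in range(15)]  # 15 requests at the same instant
later = bucket.allow(now=1.0)                       # one more request a second later
print(burst.count(True), later)
```

The first 10 requests of the burst pass, the next 5 are rejected, and a request one second later passes again after the refill. The `capacity` value sets how steep the cliff is: a small bucket behaves almost like a hard limit, a large one like a slow slope.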
Real-world usage
- Kubernetes: namespace ResourceQuota, resources.limits.cpu: 2 per tenant namespace
- AWS RDS: parameter groups and RDS Proxy to cap connections per tenant
- Salesforce: governor limits, 100 SOQL queries per transaction per tenant
- RabbitMQ: virtual hosts with queue TTL per tenant
- API Gateway: usage plans with per-tenant throttling limits
Follow-up questions
Q: How do you detect noisy tenants in production?
A: Prometheus query: sum(rate(container_cpu_usage_seconds_total{tenant=~".+"}[5m])) by (tenant). Alert on the top 5% by usage. p99 latency above 500ms per tenant is another reliable signal.
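The p99 signal can be sketched offline as well. The example below assumes latencies have already been collected per tenant. The function names and the input format are hypothetical, and the percentile uses the simple nearest-rank method rather than whatever a metrics backend would compute.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]

def slow_tenants(latencies_ms, threshold_ms=500):
    """latencies_ms: dict of tenant id -> list of request latencies in ms."""
    flagged = {}
    for tenant, samples in latencies_ms.items():
        if not samples:
            continue  # no traffic, nothing to judge
        value = p99(samples)
        if value > threshold_ms:
            flagged[tenant] = value
    return flagged

latencies = {
    "tenant_a": [40] * 99 + [45],       # healthy
    "tenant_b": [60] * 95 + [900] * 5,  # 5% of requests are very slow
}
print(slow_tenants(latencies))
```

Keep in mind that a high p99 usually marks a victim rather than the culprit, so this signal works best alongside the per-tenant usage query.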
Q: Can you mitigate noisy tenants without changing application code?
A: Yes. Kubernetes ResourceQuotas and LimitRanges handle it at the platform level, and the kubelet evicts pods when a node runs low on memory or disk (CPU is throttled rather than evicted). No application code needs to change.
Q: What is the tradeoff between hard and soft limits?
A: Hard limits via cgroups cut a workload off at the cap: CPU is throttled and a container over its memory limit is killed. That is reliable but causes latency spikes. Soft limits throttle requests gradually, which is smoother for the noisy tenant, but neighbors can stay affected for longer before the limit kicks in.
Q: How do you handle a noisy tenant during a traffic spike like Black Friday?
A: Per-tenant auto-scaling groups and circuit breakers. If a tenant exceeds 2x their average baseline, fall back to dedicated infrastructure for that tenant temporarily.
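The 2x-baseline fallback can be sketched as a small breaker-style switch. This is a simplified illustration, not a full circuit breaker with a half-open state. The class name, the pool names "shared" and "dedicated", and the 30-second cooldown are assumptions for the example.

```python
class TenantOverflowSwitch:
    """Routes a tenant to dedicated capacity while it runs above a multiple of its baseline."""

    def __init__(self, baseline_rps, multiple=2.0, cooldown_s=30.0):
        self.baseline_rps = baseline_rps  # tenant id -> normal requests per second
        self.multiple = multiple
        self.cooldown_s = cooldown_s
        self.open_until = {}              # tenant id -> time the switch resets

    def record(self, tenant_id, current_rps, now):
        """Feed in the latest measured rate and trip the switch if it is too high."""
        if current_rps > self.multiple * self.baseline_rps[tenant_id]:
            self.open_until[tenant_id] = now + self.cooldown_s

    def route(self, tenant_id, now):
        """Return which pool should serve the tenant right now."""
        if now < self.open_until.get(tenant_id, 0.0):
            return "dedicated"
        return "shared"

switch = TenantOverflowSwitch({"tenant_a": 100, "tenant_b": 100})
switch.record("tenant_a", current_rps=350, now=0.0)  # 3.5x baseline, trips
switch.record("tenant_b", current_rps=120, now=0.0)  # within 2x, stays shared
routes = [
    switch.route("tenant_a", now=10.0),
    switch.route("tenant_b", now=10.0),
    switch.route("tenant_a", now=31.0),  # cooldown has passed
]
print(routes)
```

The cooldown keeps a tenant from flapping between pools on every measurement. A spike that continues simply re-trips the switch on the next `record` call.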
Q: (Senior) Design a quota system for 10,000 tenants with a 99.99% SLO. What does the architecture look like?
A: Envoy proxy sidecar per tenant with dynamic quota filters. Kafka partitioned by tenant for event isolation. Anomaly detection on per-tenant metrics to catch spikes before they cascade. Separate concurrency pools per pricing tier, not a flat account-wide limit.
Q: How does the noisy tenant problem change in multi-region setups?
A: Global services like DynamoDB propagate the issue through replication lag of 1-5 seconds. Regional queues per tenant contain the blast radius to one region and prevent a noisy tenant in us-east-1 from degrading users in eu-west-1.
Examples
Basic: per-tenant rate limiting in Express
const express = require('express');
const Redis = require('ioredis');
const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');

const redis = new Redis();
const app = express();

const tenantLimiter = rateLimit({
  // `client` is the rate-limit-redis v2 option; v3+ takes a `sendCommand` function instead
  store: new RedisStore({ client: redis }),
  // isolate by tenant; fall back to the IP so requests without the header do not share one counter
  keyGenerator: (req) => req.headers['x-tenant-id'] || req.ip,
  windowMs: 60 * 1000, // 1 minute window
  max: 100, // 100 requests per tenant per minute
  message: 'Tenant quota exceeded'
});

app.use('/api/', tenantLimiter);
// Noisy tenant: 429 after request 100
// Normal tenant: 200 always, unaffected by the noisy neighbor
Each tenant gets their own counter in Redis. The noisy tenant hits 429 while every other tenant sees 200. This is the minimal version of noisy tenant protection. It is cheap to add, and it catches the obvious flood cases before they reach your database.
Kubernetes: namespace quotas and OOM cascade
# ResourceQuota per tenant namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
---
# Noisy pod that outgrows its memory limit gets OOMKilled
apiVersion: v1
kind: Pod
metadata:
  name: noisy-job
  namespace: tenant-a
spec:
  containers:
    - name: app
      image: busybox
      resources:
        limits:
          cpu: "250m"      # the quota above rejects pods that omit CPU or memory limits
          memory: "64Mi"
      # tail buffers /dev/zero while looking for a newline, so memory keeps growing
      command: ["/bin/sh", "-c", "tail /dev/zero"]
# hits memory limit, gets OOMKilled
# without limits it could push the node into memory pressure and get other tenants' pods evicted
The namespace quota caps what tenant-a can consume cluster-wide and forces every pod in the namespace to declare requests and limits. Without those limits, a runaway pod can push the node into memory pressure, and the resulting eviction reaches pods from other tenants on the same node. That cascade is surprisingly hard to debug at 2am when three different tenants are filing incidents simultaneously.