What are noisy tenants?

Noisy tenants are customers in a multi-tenant system who consume a disproportionate share of shared resources - CPU, memory, network, or database connections - causing performance problems for everyone else on the same infrastructure.

Theory

TL;DR

Analogy: one apartment tenant blasting music and holding the elevator all day, everyone else waits longer
Main difference: shared infrastructure amplifies the impact; single-tenant setups isolate it per customer
Heuristic: if one tenant consumes more than 50% of a shared resource, you have a noisy tenant problem
Fix options: per-tenant rate limits, resource quotas, or dedicated infrastructure for the heaviest tenants
Rule of thumb: accept shared risk above 50 tenants with similar usage; isolate when one tenant exceeds 2x the average
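
The two thresholds in the list above can be checked mechanically. Below is a minimal sketch, assuming you already collect per-tenant usage as a single number (CPU-seconds, requests, connections). The function name `find_noisy_tenants` and the sample figures are hypothetical, and the 50% and 2x cut-offs are this article's heuristics, not universal constants.

```python
from statistics import mean


def find_noisy_tenants(usage, share_limit=0.5, avg_multiple=2.0):
    """Return tenants that exceed either heuristic threshold.

    usage: dict mapping tenant id -> resource consumption, all in one unit.
    Result maps each flagged tenant to the list of rules it broke.
    """
    total = sum(usage.values())
    if total == 0:
        return {}
    average = mean(usage.values())
    flagged = {}
    for tenant, used in usage.items():
        reasons = []
        if used / total > share_limit:
            reasons.append("share")      # more than 50% of the shared total
        if used > avg_multiple * average:
            reasons.append("average")    # more than 2x the mean tenant
        if reasons:
            flagged[tenant] = reasons
    return flagged


# Hypothetical usage figures for four tenants
sample = {"tenant_a": 620, "tenant_b": 90, "tenant_c": 110, "tenant_d": 80}
print(find_noisy_tenants(sample))  # {'tenant_a': ['share', 'average']}
```

With only a handful of tenants the 2x-average rule is weak, because the heavy tenant drags the mean up with it; a median baseline may be a better fit in that case.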

Quick example

python

# AWS Lambda: shared concurrency pool (default quota: 1,000 per account per Region)
# A noisy tenant floods the function, exhausts the pool, and other callers are throttled

import boto3

lambda_client = boto3.client('lambda')

def handle_request(tenant_id, payload):
    # payload: JSON-encoded bytes or str
    # If tenant_A already has 1,000 invocations in flight, the pool is empty and
    # this synchronous call fails with TooManyRequestsException (HTTP 429)
    response = lambda_client.invoke(
        FunctionName='processor',
        Payload=payload
    )
    return response

# tenant_A: its 1,000 concurrent requests run fine
# tenant_B: throttled with 429 until capacity frees up
#           (asynchronous invocations are queued and retried instead)
# Fix: give the heavy tenant its own function and cap it:
#   aws lambda put-function-concurrency --function-name tenant-a-handler \
#     --reserved-concurrent-executions 100

Tenant A exhausts the shared concurrency pool. Tenant B's requests are throttled. No code in tenant B's path changed. The problem lives entirely at the infrastructure level.

What makes it a system design problem

In single-tenant setups, one customer's spike hits their own dedicated hardware. Done. In multi-tenant SaaS on AWS or Kubernetes, resources pool across all customers. One tenant spiking to 80% of a shared node's CPU slows every other tenant on that node through contention.

The tricky part: the noisy tenant usually has no idea they're causing problems. Their requests succeed. The victim tenants see slow responses or errors and file support tickets blaming your product. You're debugging the wrong end of the system.

When to use each strategy

More than 50 tenants with similar usage patterns: accept the shared risk, add monitoring with Prometheus
Variable workloads, high-value tenants: per-tenant quotas via API Gateway or a rate limiter in front of your service
Finance or healthcare with regulated data: strict namespace isolation regardless of usage patterns
One tenant consistently above 2x average: migrate them to dedicated infrastructure or charge for it
Startup MVP: shared infrastructure first, then migrate after the first complaints arrive
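
The rules above can be written down as a small decision function. This is only a sketch: the `TenantProfile` fields and the order in which the rules are checked (regulation first, then heavy usage, then workload shape) are assumptions made for illustration, not something the list itself prescribes.

```python
from dataclasses import dataclass


@dataclass
class TenantProfile:
    tenant_count: int          # tenants sharing the platform
    regulated: bool            # finance or healthcare style data
    usage_vs_average: float    # this tenant's usage divided by the mean
    variable_workload: bool    # bursty or unpredictable traffic


def pick_strategy(p: TenantProfile) -> str:
    # Compliance requirements override usage-based reasoning
    if p.regulated:
        return "strict namespace isolation"
    # Consistently above 2x the average: isolate or charge for it
    if p.usage_vs_average > 2.0:
        return "dedicated infrastructure"
    if p.variable_workload:
        return "per-tenant quotas"
    if p.tenant_count > 50:
        return "shared infrastructure with monitoring"
    # Small, early-stage platform: start shared and revisit later
    return "shared infrastructure, revisit after first complaints"


print(pick_strategy(TenantProfile(200, False, 3.5, False)))  # dedicated infrastructure
```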

How it works at the infrastructure level

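Kubernetes uses cgroups to enforce CPU and memory limits per container. A container that reaches its CPU limit is throttled; one that exceeds its memory limit is OOM-killed. Both failures stay inside the offending pod. Cascades start when pods run without limits or the node is overcommitted: node memory runs low, the kubelet begins node-pressure eviction, and that eviction can remove pods belonging to other tenants on the same node. Separately, the scheduler can preempt lower-priority pods to make room for higher-priority ones.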

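AWS Lambda works differently. The default concurrency quota is 1,000 concurrent executions per account per Region, and any function can consume all of it. Reserved concurrency is set per function, so it protects tenants only when each tenant (or tier) has its own function. The setting both guarantees that function its units and caps it at that number, protecting the rest of the pool for everyone else.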
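
To make the effect of a per-tenant cap concrete, here is a toy admission model in plain Python. It is not the Lambda API: `admit` is a hypothetical function, every request is assumed to be in flight at the same moment, and only the capping half of reserved concurrency is modeled (real reserved concurrency also sets capacity aside for the function).

```python
def admit(requests, pool_size, reserved=None):
    """Admit simultaneous requests against one shared concurrency pool.

    requests: tenant ids in arrival order, all in flight at once.
    reserved: optional dict tenant -> cap on that tenant's in-flight requests.
    Returns (admitted_counts, throttled_counts), both keyed by tenant.
    """
    reserved = reserved or {}
    in_flight = {}
    throttled = {}
    used = 0
    for tenant in requests:
        cap = reserved.get(tenant)
        over_cap = cap is not None and in_flight.get(tenant, 0) >= cap
        if used >= pool_size or over_cap:
            throttled[tenant] = throttled.get(tenant, 0) + 1
            continue
        in_flight[tenant] = in_flight.get(tenant, 0) + 1
        used += 1
    return in_flight, throttled


# tenant_a floods first, tenant_b arrives right behind it
burst = ["tenant_a"] * 1000 + ["tenant_b"] * 50

print(admit(burst, pool_size=1000))
# ({'tenant_a': 1000}, {'tenant_b': 50})

print(admit(burst, pool_size=1000, reserved={"tenant_a": 100}))
# ({'tenant_a': 100, 'tenant_b': 50}, {'tenant_a': 900})
```

Without a cap the victim absorbs all the throttling; with a cap the throttling lands on the tenant that caused it.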

For databases, the issue shows up in connection pools. Without per-tenant limits, one tenant can hold 900 of your 1000 PostgreSQL connections while other tenants fail to connect. PgBouncer fixes this when each tenant maps to its own database or user, because its pool sizes are set per database/user pair.
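
If the pooler cannot separate tenants, the same cap can be enforced inside the application before a connection is ever requested. The sketch below uses one semaphore per tenant. `TenantConnectionLimiter` and `connection_slot` are hypothetical names, and the actual database connection is left out so the example stays self-contained.

```python
import threading
from contextlib import contextmanager


class TenantConnectionLimiter:
    """Caps how many connection slots each tenant may hold at once."""

    def __init__(self, per_tenant_limit):
        self._limit = per_tenant_limit
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, tenant_id):
        # Create each tenant's semaphore lazily, under a lock
        with self._lock:
            if tenant_id not in self._semaphores:
                self._semaphores[tenant_id] = threading.BoundedSemaphore(self._limit)
            return self._semaphores[tenant_id]

    @contextmanager
    def connection_slot(self, tenant_id):
        sem = self._semaphore_for(tenant_id)
        # Fail fast instead of waiting: the noisy tenant gets the error,
        # other tenants never notice
        if not sem.acquire(blocking=False):
            raise RuntimeError(f"connection quota exhausted for {tenant_id}")
        try:
            yield  # a real implementation would hand out a DB connection here
        finally:
            sem.release()


limiter = TenantConnectionLimiter(per_tenant_limit=2)
with limiter.connection_slot("tenant_a"), limiter.connection_slot("tenant_a"):
    try:
        with limiter.connection_slot("tenant_a"):
            pass
    except RuntimeError as exc:
        print(exc)  # connection quota exhausted for tenant_a
    with limiter.connection_slot("tenant_b"):
        print("tenant_b still connects")
```

Failing fast is a design choice. Blocking with a timeout is gentler on the heavy tenant, but it ties up request threads while they wait.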

Common mistakes

No per-tenant quotas in a shared DB connection pool

sql

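-- Wrong: one global limit, no per-tenant enforcement
ALTER SYSTEM SET max_connections = 1000;  -- server-wide; applies after a restart
-- Noisy tenant opens 900 connections during a batch job
-- Everyone else: "FATAL: sorry, too many clients already"

-- Fix (assuming each tenant connects as its own role, e.g. a hypothetical tenant_a):
ALTER ROLE tenant_a CONNECTION LIMIT 100;
-- Or pool through PgBouncer and cap each tenant's entry in [databases]
-- with pool_size / max_db_connections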

One tenant running a bulk export job at midnight can silently starve every other tenant until the job finishes.

Shared Redis without namespace isolation

javascript

// Wrong: no tenant prefix, and a tenant-triggered code path runs KEYS *
redis.set('user:123', data);
// KEYS * walks the whole keyspace on Redis's single command thread;
// on a large dataset every other tenant's call stalls until it finishes

// Correct: prefix by tenant, use SCAN not KEYS
redis.set(`tenant:${tenantId}:user:123`, data);
// SCAN is incremental: repeat with the returned cursor until it comes back as '0'
redis.scan(0, 'MATCH', `tenant:${tenantId}:*`, 'COUNT', 100);

Auto-scaling on global CPU metrics

Scaling on cluster-wide CPU means one noisy tenant's spike doubles your infrastructure cost for everyone. Custom per-tenant metrics let you throttle the right tenant instead:

bash

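# One metric name, with the tenant as a dimension, so tenants can be compared and alarmed on
aws cloudwatch put-metric-data \
  --namespace Tenants \
  --metric-name CPUUtilization \
  --dimensions TenantId=tenant-a \
  --unit Percent \
  --value 85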

No reserved concurrency in Lambda

Default Lambda setup has no concurrency reservation. One tenant floods with requests, hits the 1000-unit limit, and every other tenant's invocations are throttled until capacity frees up. One command per function prevents this:

bash

aws lambda put-function-concurrency \
  --function-name tenant-a-handler \
  --reserved-concurrent-executions 100

Ignoring bursty tenants during peak traffic

During Black Friday or batch jobs, a tenant can hit 10x their normal usage in seconds. Hard limits enforced by cgroups are reliable but abrupt: memory overuse gets the process killed, and CPU overuse gets it throttled, which shows up as latency spikes. Soft limits via request throttling are smoother, but the noisy tenant keeps degrading its neighbors for longer before the limit bites. Neither is perfect. The choice depends on your SLO and whether you prefer a sharp cliff or a slow slope.
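
One common way to build the soft, throttling-style limit is a token bucket per tenant: short bursts are absorbed up to a ceiling, and sustained traffic is held to a steady rate. The sketch below is a generic token bucket, not taken from any specific library. The rate and burst values are made up, and the clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Allows bursts up to `burst` requests, then `rate_per_sec` sustained."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)   # start full so an initial burst is allowed
        self.updated = 0.0

    def allow(self, now):
        # Refill in proportion to elapsed time, never above capacity
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_sec=10, burst=20)

# 100 requests arrive at t=0: only the burst allowance gets through
print(sum(bucket.allow(0.0) for _ in range(100)))  # 20

# One second later the bucket has refilled by 10 tokens
print(sum(bucket.allow(1.0) for _ in range(100)))  # 10
```

In a real service you would keep one bucket per tenant id and pass `time.monotonic()` as `now`.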

Real-world usage

Kubernetes: one ResourceQuota per tenant namespace, e.g. limits.cpu: "2"
AWS RDS: parameter groups and RDS Proxy to cap connections per tenant
Salesforce: governor limits, e.g. 100 SOQL queries per synchronous Apex transaction
RabbitMQ: virtual hosts with queue TTL per tenant
API Gateway: usage plans with per-tenant throttling limits

Follow-up questions

Q: How do you detect noisy tenants in production?
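A: Prometheus query: sum(rate(container_cpu_usage_seconds_total{tenant=~".+"}[5m])) by (tenant). This assumes the metric carries a tenant label; with one namespace per tenant, group by namespace instead. Alert on the top 5% by usage. p99 latency above 500ms per tenant is another reliable signal.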
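
The latency signal can be computed from raw request samples when no metrics backend is at hand. Below is a minimal sketch using a nearest-rank percentile. The function names and sample data are hypothetical, and the 500 ms threshold is the figure from the answer above.

```python
import math


def p99_by_tenant(samples):
    """Nearest-rank 99th percentile of latency per tenant.

    samples: iterable of (tenant_id, latency_ms) pairs.
    """
    by_tenant = {}
    for tenant, latency_ms in samples:
        by_tenant.setdefault(tenant, []).append(latency_ms)
    result = {}
    for tenant, values in by_tenant.items():
        values.sort()
        rank = math.ceil(len(values) * 99 / 100)  # 1-based nearest rank
        result[tenant] = values[rank - 1]
    return result


def tenants_over_latency_threshold(samples, threshold_ms=500):
    """Tenant ids whose p99 latency exceeds the threshold, sorted."""
    return sorted(
        tenant
        for tenant, p99 in p99_by_tenant(samples).items()
        if p99 > threshold_ms
    )


# Hypothetical samples: tenant_a has a slow tail, tenant_b is steady
samples = (
    [("tenant_a", 100)] * 98 + [("tenant_a", 900)] * 2
    + [("tenant_b", 120)] * 100
)
print(tenants_over_latency_threshold(samples))  # ['tenant_a']
```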

Q: Can you mitigate noisy tenants without changing application code?
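A: Yes. Kubernetes ResourceQuotas and LimitRanges handle it at the platform level. CPU overuse is throttled at the container's limit, and the kubelet evicts pods when a node runs low on memory, disk, or PIDs; CPU pressure alone does not trigger eviction. No application code needs to change.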

Q: What is the tradeoff between hard and soft limits?
A: Hard limits enforced by cgroups are reliable but abrupt: a process over its memory cap is killed and one over its CPU cap is throttled, both of which cause latency spikes. Soft limits throttle requests gradually. That is smoother for the noisy tenant, but neighbors can still be affected for longer before the limit kicks in.

Q: How do you handle a noisy tenant during a traffic spike like Black Friday?
A: Per-tenant auto-scaling groups and circuit breakers. If a tenant exceeds 2x their average baseline, fall back to dedicated infrastructure for that tenant temporarily.

Q: (Senior) Design a quota system for 10,000 tenants with a 99.99% SLO. What does the architecture look like?
A: Envoy proxy sidecar per tenant with dynamic quota filters. Kafka partitioned by tenant for event isolation. Anomaly detection on per-tenant metrics to catch spikes before they cascade. Separate concurrency pools per pricing tier, not a flat account-wide limit.

Q: How does the noisy tenant problem change in multi-region setups?
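A: Globally replicated services such as DynamoDB global tables can carry the problem across Regions: a write-heavy tenant in one Region generates replicated writes that consume capacity in every replica Region and can add to replication lag. Regional queues per tenant contain the blast radius to one Region and prevent a noisy tenant in us-east-1 from degrading users in eu-west-1.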

Examples

Basic: per-tenant rate limiting in Express

javascript

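// Assumes express-rate-limit v6+ and rate-limit-redis v3+ (sendCommand-based store)
const express = require('express');
const Redis = require('ioredis');
const { rateLimit } = require('express-rate-limit');
const { RedisStore } = require('rate-limit-redis');

const redis = new Redis();
const app = express();

const tenantLimiter = rateLimit({
  store: new RedisStore({
    // rate-limit-redis talks to ioredis through its generic call() method
    sendCommand: (command, ...args) => redis.call(command, ...args),
  }),
  // isolate by tenant; requests without the header share one bucket
  // (in production, derive the tenant from the authenticated identity)
  keyGenerator: (req) => req.get('x-tenant-id') || 'unknown-tenant',
  windowMs: 60 * 1000, // 1 minute window
  max: 100,            // 100 requests per tenant per minute
  message: 'Tenant quota exceeded'
});

app.use('/api/', tenantLimiter);

// Noisy tenant: 429 from request 101 until the window resets
// Normal tenant: unaffected by the noisy neighbor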

Each tenant gets their own counter in Redis. The noisy tenant hits 429, every other tenant sees 200. This is the minimal version of noisy tenant protection. Cheap to add, and it catches the obvious flood cases before they reach your database.

Kubernetes: namespace quotas and OOM cascade

yaml

# ResourceQuota per tenant namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
---
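# Runaway pod: exceeds its memory limit and gets OOMKilled
apiVersion: v1
kind: Pod
metadata:
  name: noisy-job
  namespace: tenant-a
spec:
  containers:
  - name: app
    image: polinux/stress   # stress-test image used in the Kubernetes docs
    resources:
      # the quota above rejects pods that omit CPU/memory requests or limits
      requests:
        cpu: "250m"
        memory: "64Mi"
      limits:
        cpu: "500m"
        memory: "64Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
    # allocates 250M against a 64Mi limit, so the container is OOMKilled
    # with no memory limit it could grow until node memory pressure
    # evicts pods from other tenants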

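The namespace quota caps what tenant-a can consume cluster-wide and forces every pod in the namespace to declare requests and limits. With the limit in place, the runaway container is OOMKilled on its own and nothing else is touched. Without limits, it keeps growing until the node itself runs short of memory and the kubelet's node-pressure eviction reaches pods from other tenants on the same node. That cascade is surprisingly hard to debug at 2am when three different tenants are filing incidents simultaneously.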
