**Noisy tenants** are customers in a multi-tenant system who consume a disproportionate share of shared resources - CPU, memory, network, or database connections - causing performance problems for everyone else on the same infrastructure.

## Theory

### TL;DR

- Analogy: one apartment tenant blasting music and holding the elevator all day while everyone else waits longer
- Main difference: shared infrastructure amplifies the impact; single-tenant setups isolate it per customer
- Rough signal: if one tenant uses more than 50% of a shared resource, you have a noisy tenant problem
- Fix options: per-tenant rate limits, resource quotas, or dedicated infrastructure for the heaviest tenants
- Rule of thumb: accept shared risk above 50 tenants with similar usage; isolate when one tenant exceeds 2x the average
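The two thresholds above (a 50% share and 2x the average) can be checked mechanically. The following is a minimal sketch, assuming you already collect per-tenant usage for a common time window; the function name `find_noisy_tenants` and the sample numbers are hypothetical.

```python
from statistics import mean


def find_noisy_tenants(usage, share_threshold=0.5, ratio_threshold=2.0):
    """Return tenants that breach either rule of thumb.

    usage maps tenant id -> resource units consumed in the same window.
    """
    total = sum(usage.values())
    if total == 0:
        return {}
    average = mean(usage.values())
    noisy = {}
    for tenant, used in usage.items():
        share = used / total    # fraction of the shared resource
        ratio = used / average  # multiple of the per-tenant average
        if share > share_threshold or ratio > ratio_threshold:
            noisy[tenant] = {"share": round(share, 2), "ratio": round(ratio, 2)}
    return noisy


# Hypothetical CPU-seconds per tenant over the last five minutes
usage = {"tenant_a": 700, "tenant_b": 100, "tenant_c": 120, "tenant_d": 80}
print(find_noisy_tenants(usage))
# {'tenant_a': {'share': 0.7, 'ratio': 2.8}}
```

Both thresholds are parameters because the right values depend on tenant count: with only two tenants, the 2x-average rule can never fire.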

### Quick example

```python
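# AWS Lambda: shared concurrency pool (default quota: 1000 per account per Region)
# A noisy tenant floods the function, exhausts the pool, and others get throttled

import json

import boto3
from botocore.exceptions import ClientError

lambda_client = boto3.client('lambda')

def handle_request(tenant_id, payload):
    # tenant_A sends 1000 concurrent requests
    # all 1000 concurrency units get consumed by tenant_A
    try:
        response = lambda_client.invoke(
            FunctionName='processor',
            Payload=json.dumps({'tenant_id': tenant_id, 'payload': payload}).encode(),
        )
    except ClientError as err:
        # tenant_B lands here while the pool is exhausted
        if err.response['Error']['Code'] == 'TooManyRequestsException':
            return {'statusCode': 429, 'body': 'throttled'}
        raise
    return json.load(response['Payload'])

# tenant_A: processes 1000 requests fine
# tenant_B: TooManyRequestsException (HTTP 429) until capacity frees up
# Fix (assuming one function per tenant):
#   aws lambda put-function-concurrency --function-name tenant-a-handler \
#     --reserved-concurrent-executions 100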
```

Tenant A exhausts the shared concurrency pool. Tenant B's synchronous invocations are throttled with 429 errors, and its asynchronous invocations are retried later. No code in tenant B's path changed. The problem lives entirely at the infrastructure level.

### What makes it a system design problem

In single-tenant setups, one customer's spike hits their own dedicated hardware. Done. In multi-tenant SaaS on AWS or Kubernetes, resources pool across all customers. One tenant spiking to 80% of a shared node's CPU slows every other tenant on that node through contention.

The tricky part: the noisy tenant usually has no idea they're causing problems. Their requests succeed. The victim tenants see slow responses or errors and file support tickets blaming your product. You're debugging the wrong end of the system.

### When to use each strategy

- More than 50 tenants with similar usage patterns: accept the shared risk, add monitoring with Prometheus
- Variable workloads, high-value tenants: per-tenant quotas via API Gateway or a rate limiter in front of your service
- Finance or healthcare with regulated data: strict namespace isolation regardless of usage patterns
- One tenant consistently above 2x average: migrate them to dedicated infrastructure or charge for it
- Startup MVP: shared infrastructure first, then migrate after the first complaints arrive

### How it works at the infrastructure level

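Kubernetes uses cgroups to limit CPU and memory per container. A noisy container that reaches its CPU limit gets throttled automatically, and one that exceeds its memory limit gets OOMKilled. The kubelet also evicts pods when node pressure builds (memory, disk, or PIDs), starting with pods that use more than they requested. That is where cascade failures come from. A noisy pod with no memory limit, or with limits that overcommit the node, pushes the node into memory pressure, and the resulting evictions can hit pods from other tenants sharing the same node.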

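AWS Lambda works differently. The default concurrency quota is 1000 units per account per Region. Any function can consume all of the unreserved units. Reserved concurrency is set per function, not per tenant, so it works as a tenant cap only when each tenant has its own function. Setting it caps how many units that function can hold and removes that amount from the shared pool, protecting the rest for everyone else.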
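To make the pool arithmetic concrete, here is a toy model of that behavior in plain Python. It is a sketch of the accounting only, not the Lambda API: the `ConcurrencyPool` class and the function names are hypothetical, and real Lambda adds details this model skips.

```python
class ConcurrencyPool:
    """Toy model: an account-wide pool with optional per-function reservations."""

    def __init__(self, total, reserved=None):
        self.total = total
        self.reserved = dict(reserved or {})  # function name -> reserved cap
        self.in_use = {}                      # function name -> units held

    def try_acquire(self, function):
        used = self.in_use.get(function, 0)
        if function in self.reserved:
            # A reserved function competes only with itself, up to its cap.
            if used >= self.reserved[function]:
                return False
        else:
            # Everyone else shares what is left after reservations.
            unreserved_total = self.total - sum(self.reserved.values())
            unreserved_used = sum(
                n for f, n in self.in_use.items() if f not in self.reserved
            )
            if unreserved_used >= unreserved_total:
                return False
        self.in_use[function] = used + 1
        return True


def flood(pool, function, attempts):
    """Count how many of `attempts` concurrent requests get a unit."""
    return sum(pool.try_acquire(function) for _ in range(attempts))


shared = ConcurrencyPool(total=1000)
print(flood(shared, "tenant-a-handler", 1000))  # 1000
print(shared.try_acquire("tenant-b-handler"))   # False: pool exhausted

capped = ConcurrencyPool(total=1000, reserved={"tenant-a-handler": 100})
print(flood(capped, "tenant-a-handler", 1000))  # 100
print(capped.try_acquire("tenant-b-handler"))   # True: 900 units still shared
```

The second case shows the trade-off: the cap protects tenant B, but tenant A now sees 900 of its own requests rejected.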

For databases, the issue shows up in connection pools. Without per-tenant limits, one tenant can open 900 of your 1000 PostgreSQL connections. Other tenants time out trying to connect. PgBouncer can cap pool sizes per database or per user, so it fixes this when each tenant maps to its own database or role.
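The same cap can also be enforced in the application, before a connection is ever requested. The sketch below assumes a single application process and uses one semaphore per tenant; `TenantConnectionLimiter` and `TenantLimitExceeded` are hypothetical names, and the `yield` is where a real pool checkout would go.

```python
import threading
from contextlib import ExitStack, contextmanager


class TenantLimitExceeded(Exception):
    """Raised when a tenant already holds its full connection budget."""


class TenantConnectionLimiter:
    def __init__(self, per_tenant_limit):
        self._limit = per_tenant_limit
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, tenant_id):
        # Create each tenant's semaphore lazily, guarded against races.
        with self._lock:
            if tenant_id not in self._semaphores:
                self._semaphores[tenant_id] = threading.BoundedSemaphore(self._limit)
            return self._semaphores[tenant_id]

    @contextmanager
    def connection(self, tenant_id):
        sem = self._semaphore_for(tenant_id)
        if not sem.acquire(blocking=False):
            raise TenantLimitExceeded(tenant_id)
        try:
            yield  # a real implementation would hand out a DB connection here
        finally:
            sem.release()


limiter = TenantConnectionLimiter(per_tenant_limit=2)
with ExitStack() as held:
    held.enter_context(limiter.connection("tenant_a"))
    held.enter_context(limiter.connection("tenant_a"))
    try:
        held.enter_context(limiter.connection("tenant_a"))
        third_allowed = True
    except TenantLimitExceeded:
        third_allowed = False
    with limiter.connection("tenant_b"):
        other_tenant_allowed = True

print(third_allowed, other_tenant_allowed)  # False True
```

Failing fast rather than blocking is a design choice here: the noisy tenant gets an immediate error it can retry, instead of tying up worker threads while it waits.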

### Common mistakes

**No per-tenant quotas in a shared DB connection pool**

```sql
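-- Wrong: one server-wide limit, no per-tenant enforcement
-- (max_connections cannot be changed with a plain SET; it needs a restart)
ALTER SYSTEM SET max_connections = 1000;
-- Noisy tenant opens 900 connections during a batch job
-- Everyone else: "sorry, too many clients already"

-- Fix, assuming one database role per tenant (tenant_a is a placeholder):
ALTER ROLE tenant_a CONNECTION LIMIT 100;
-- Or pool through PgBouncer with a per-database or per-user pool_size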
```

One tenant running a bulk export job at midnight can silently starve every other tenant until the job finishes.

**Shared Redis without namespace isolation**

```javascript
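// Wrong: no tenant prefix, noisy tenant runs KEYS *
redis.set('user:123', data);
// KEYS * is O(N) and blocks single-threaded Redis while it walks the keyspace
// All other tenants: timeouts on every Redis call

// Better: prefix by tenant, iterate with SCAN not KEYS
redis.set(`tenant:${tenantId}:user:123`, data);
const stream = redis.scanStream({ match: `tenant:${tenantId}:*`, count: 100 });
stream.on('data', (keys) => { /* process one batch of this tenant's keys */ });
// Prefixes only separate key names; they do not cap memory or CPU per tenant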
```

**Auto-scaling on global CPU metrics**

Scaling on cluster-wide CPU means one noisy tenant's spike doubles your infrastructure cost for everyone. Custom per-tenant metrics let you throttle the right tenant instead:

```bash
aws cloudwatch put-metric-data \
  --namespace Tenants \
  --metric-name CPUUtilization \
  --dimensions TenantId=tenant-a \
  --unit Percent \
  --value 85
```

**No reserved concurrency in Lambda**

A default Lambda setup has no concurrency reservation. One tenant floods the account with requests, hits the 1000-unit limit, and every other tenant's invocations get throttled until capacity frees up. With one function per tenant, one command per function prevents this:

```bash
aws lambda put-function-concurrency \
  --function-name tenant-a-handler \
  --reserved-concurrent-executions 100
```

**Ignoring bursty tenants during peak traffic**

During Black Friday or batch jobs, a tenant can hit 10x their normal usage in seconds. Hard limits via cgroups are reliable but abrupt: a container over its memory limit is killed, and one at its CPU limit is throttled, so latency spikes. Soft limits via request throttling are smoother, but the noisy tenant degrades its neighbors gradually before the limit bites. Neither is perfect. The choice depends on your SLO and whether you prefer a sharp cliff or a slow slope.
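One common middle ground between the cliff and the slope is a token bucket, which is not named above but fits the bursty case: it allows a short burst up to a fixed size, then holds the tenant to a steady rate. A minimal sketch, assuming one bucket per tenant and a clock value passed in by the caller so the behavior is deterministic:

```python
class TokenBucket:
    """Allow bursts up to `burst` requests, then `rate` requests per second."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate        # tokens added per second
        self.burst = burst      # maximum tokens the bucket can hold
        self.tokens = float(burst)
        self.updated = now

    def allow(self, now, cost=1.0):
        # Refill for the time elapsed since the last call, capped at `burst`.
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=10, burst=20)

# Burst at t=0: 30 requests arrive at once, only the burst allowance passes.
print(sum(bucket.allow(now=0.0) for _ in range(30)))  # 20

# One second later, 10 tokens have refilled.
print(sum(bucket.allow(now=1.0) for _ in range(15)))  # 10
```

The `burst` parameter is the knob between the two extremes: a small value behaves like a hard cap, a large one tolerates spikes at the cost of more short-term contention for neighbors.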

### Real-world usage

- **Kubernetes**: namespace ResourceQuota, for example `limits.cpu: "2"` per tenant namespace
- **AWS RDS**: RDS Proxy to pool and cap connections, plus per-role `CONNECTION LIMIT` on PostgreSQL for per-tenant caps
- **Salesforce**: governor limits, 100 SOQL queries per synchronous Apex transaction
- **RabbitMQ**: one virtual host per tenant with per-vhost limits such as `max-connections` and `max-queues`
- **API Gateway**: usage plans with per-tenant throttling limits

### Follow-up questions

**Q:** How do you detect noisy tenants in production?
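**A:** Prometheus query: `sum by (tenant) (rate(container_cpu_usage_seconds_total{tenant=~".+"}[5m]))`. This assumes a `tenant` label is attached through relabeling; cAdvisor metrics do not carry one by default, so with one namespace per tenant you can group by `namespace` instead. Review the top 5% by usage and alert when a tenant's share of the total stays above your threshold. p99 latency above 500ms per tenant is another reliable signal.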
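The p99 signal can be sketched the same way. The example below assumes request logs reduced to `(tenant, latency_ms)` pairs and uses the nearest-rank percentile method; the function names and sample data are hypothetical. Note that a tenant flagged by latency is often a victim, not the noisy tenant, so this complements the usage query and does not replace it.

```python
from collections import defaultdict


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def slow_tenants(requests, threshold_ms=500):
    """Return {tenant: p99 latency} for tenants above the threshold."""
    by_tenant = defaultdict(list)
    for tenant, latency_ms in requests:
        by_tenant[tenant].append(latency_ms)
    result = {}
    for tenant, latencies in by_tenant.items():
        value = p99(latencies)
        if value > threshold_ms:
            result[tenant] = value
    return result


# Hypothetical window: tenant_b has a slow tail, tenant_c is healthy.
requests = (
    [("tenant_b", 120)] * 97
    + [("tenant_b", 900)] * 3
    + [("tenant_c", 80)] * 100
)
print(slow_tenants(requests))  # {'tenant_b': 900}
```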

**Q:** Can you mitigate noisy tenants without changing application code?
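**A:** Yes. Kubernetes ResourceQuotas and LimitRanges handle it at the platform level. CPU limits throttle a container at its cap, and the kubelet evicts pods when the node runs short of memory, disk, or PIDs; CPU pressure alone does not trigger eviction. No application code needs to change.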

**Q:** What is the tradeoff between hard and soft limits?
**A:** Hard limits enforce a fixed cap through cgroups: a container over its memory limit is OOM-killed, and one at its CPU limit is throttled. That is reliable but causes restarts and latency spikes. Soft limits throttle requests gradually, which is smoother for the noisy tenant but lets it affect neighbors for longer before the limit kicks in.

**Q:** How do you handle a noisy tenant during a traffic spike like Black Friday?
**A:** Per-tenant auto-scaling groups and circuit breakers. If a tenant exceeds 2x their average baseline, fall back to dedicated infrastructure for that tenant temporarily.

**Q:** (Senior) Design a quota system for 10,000 tenants with a 99.99% SLO. What does the architecture look like?
**A:** Envoy proxy sidecar per tenant with dynamic quota filters. Kafka partitioned by tenant for event isolation. Anomaly detection on per-tenant metrics to catch spikes before they cascade. Separate concurrency pools per pricing tier, not a flat account-wide limit.

**Q:** How does the noisy tenant problem change in multi-region setups?
**A:** Global services like DynamoDB global tables replicate a noisy tenant's writes to every replica Region, so a write burst in one Region also consumes write capacity in the others. Regional queues per tenant contain the blast radius to one region and prevent a noisy tenant in us-east-1 from degrading users in eu-west-1.

## Examples

### Basic: per-tenant rate limiting in Express

```javascript
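const express = require('express');
const Redis = require('ioredis');
const { rateLimit } = require('express-rate-limit');
// rate-limit-redis v3+ API; v2 used `new RedisStore({ client: redis })`
const { RedisStore } = require('rate-limit-redis');

const redis = new Redis();
const app = express();

const tenantLimiter = rateLimit({
  store: new RedisStore({
    sendCommand: (command, ...args) => redis.call(command, ...args),
  }),
  // isolate by tenant; in production derive the tenant from the authenticated
  // identity, because a client can spoof this header
  keyGenerator: (req) => req.headers['x-tenant-id'] || 'unknown-tenant',
  windowMs: 60 * 1000, // 1 minute window
  max: 100,            // 100 requests per tenant per minute (`limit` in v7+)
  message: 'Tenant quota exceeded'
});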

app.use('/api/', tenantLimiter);

// Noisy tenant: 429 after request 100
// Normal tenant: 200 always, unaffected by the noisy neighbor
```

Each tenant gets their own counter in Redis. The noisy tenant hits 429 while every other tenant still sees 200. This is the minimal version of noisy tenant protection. It is cheap to add, and it catches the obvious flood cases before they reach your database.

### Kubernetes: namespace quotas and OOM cascade

```yaml
# ResourceQuota per tenant namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
---
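# Noisy pod that exceeds its memory limit gets OOMKilled
apiVersion: v1
kind: Pod
metadata:
  name: noisy-job
  namespace: tenant-a
spec:
  containers:
  - name: app
    image: busybox
    resources:
      # the quota above rejects pods that omit CPU or memory requests/limits
      requests:
        cpu: "100m"
        memory: "64Mi"
      limits:
        cpu: "500m"
        memory: "64Mi"
    # doubles a shell string until memory runs out
    # (dd from /dev/zero to stdout only streams data; it does not grow memory)
    command: ["/bin/sh", "-c", "x=a; while true; do x=\"$x$x\"; done"]
    # hits the 64Mi limit and gets OOMKilled
    # without a limit it could push the node into memory pressure and get other tenants' pods evicted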
```

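The namespace quota caps the total requests and limits tenant-a can claim across the cluster, and the container limit confines the runaway process to its own pod. Without limits, the noisy pod keeps growing until the node comes under memory pressure, and the kubelet's node-pressure eviction can reach pods from other tenants on the same node. That cascade is surprisingly hard to debug at 2am when three different tenants are filing incidents simultaneously.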
