What is PM2 and how to manage Node.js processes in production?
PM2 is a production process manager for Node.js that automatically restarts crashed apps, distributes load across CPU cores via clustering, and persists logs to files.
Theory
TL;DR
- PM2 is like a restaurant manager who replaces any waiter who quits mid-shift, opens more stations during rush hour, and logs every order without closing the dining room.
- Main difference: node server.js dies on crash and uses one CPU core. PM2 restarts automatically, clusters across all cores, and monitors metrics in real time.
- Use PM2 when deploying to a VPS or bare server. For local dev, use nodemon. For serverless (Lambda, Vercel), the platform manages processes itself.
- pm2 reload and pm2 restart are not the same thing: reload is graceful, restart causes brief downtime.
Quick example
# Without PM2 - one crash kills everything
node server.js # Unhandled error → process dies → manual restart required
# With PM2 - automatic recovery
npm install -g pm2
pm2 start server.js --name api -i max # cluster mode across all CPU cores
pm2 list # online | uptime | restarts: 0
# Simulate a crash: kill the worker process
# PM2 detects the exit signal, restarts within 1 second
# pm2 list now shows: restarts: 1
pm2 stop api
pm2 delete api
One command replaces an entire startup script plus manual monitoring.
Key difference
Running node server.js ties your app to a single OS process. An unhandled exception kills it until someone restarts it by hand, all traffic goes through one CPU core, and stdout output is lost unless you redirect it somewhere. PM2 wraps that process in a supervisor: it detects the exit, spawns a replacement almost immediately, and in cluster mode spreads traffic across multiple worker instances using Node's built-in cluster module. The app becomes a service, not a script.
When to use
- Single server, Express or Fastify API: pm2 start server.js -i max adds clustering immediately.
- Self-hosted Next.js: pm2 start npm --name "next" -- start with a custom server.
- NestJS or compiled TypeScript backends: ecosystem file pointing to dist/server.js.
- High-traffic app behind Nginx: PM2 handles process supervision, Nginx handles routing.
- Local dev: use nodemon instead - it handles hot reload better for development.
- Serverless (Lambda, Vercel, Fly.io): the platform manages processes, PM2 adds nothing useful.
Comparison table
| Feature | node app.js | PM2 | nodemon | forever |
|---|---|---|---|---|
| Auto-restart on crash | No | Yes | Yes (dev only) | Yes |
| CPU clustering | Manual cluster module | Built-in (-i max) | No | No |
| Log persistence | stdout, lost on restart | Files in ~/.pm2/logs/ (rotation via pm2-logrotate) | Console | Files |
| Zero-downtime reload | Manual | pm2 reload | No | No |
| Monitoring | None | pm2 monit + cloud dashboard | None | Basic |
| When to use | Scripts, local dev | Production Node servers | Dev hot-reload | Simple restarts (legacy) |
How PM2 works internally
PM2 runs as a background daemon, itself a Node.js process, that launches and supervises your app processes. In fork mode it starts them through Node's child_process module. In cluster mode it delegates to Node's built-in cluster module, and -i max starts one worker per CPU core, a count taken from os.cpus().length. The daemon listens for each child's exit event and, with the default settings, starts a replacement almost immediately, whether the process died from an uncaught exception or was killed by a signal.
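The supervision loop described above can be approximated with Node's cluster module alone. The following is a minimal sketch, not PM2's source: resolveInstances and startCluster are hypothetical names, port 3000 is an assumption, and the mapping of 'max', 0, and negative values follows PM2's documented -i semantics.

```javascript
// Sketch only: approximates cluster-mode supervision, not PM2's implementation.
const cluster = require('node:cluster');
const http = require('node:http');
const os = require('node:os');

// Hypothetical helper: map a PM2-style `instances` value to a worker count.
// 'max' or 0 means one worker per core; a negative value means "cores minus N".
function resolveInstances(instances, cpuCount = os.cpus().length) {
  if (instances === 'max' || instances === 0) return cpuCount;
  if (instances < 0) return Math.max(1, cpuCount + instances);
  return instances;
}

// Hypothetical entry point. The primary forks the workers and replaces any
// worker that exits; each worker serves HTTP on the same shared port.
function startCluster(instances = 'max', port = 3000) {
  if (cluster.isPrimary) {
    const count = resolveInstances(instances);
    for (let i = 0; i < count; i++) cluster.fork();
    cluster.on('exit', (worker, code, signal) => {
      console.log(`worker ${worker.process.pid} exited (${signal || code}), forking a replacement`);
      cluster.fork();
    });
  } else {
    http
      .createServer((req, res) => res.end(`handled by pid ${process.pid}\n`))
      .listen(port);
  }
}
```

Calling startCluster() from an entry file gives roughly what pm2 start server.js -i max provides for restarts and clustering, but without log files, monitoring, or crash-loop limits.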
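The pm2 reload command replaces cluster workers one at a time. It starts a new worker, waits for it to emit the "listening" event (meaning the HTTP server is ready), and only then signals the old worker to shut down, by default with SIGINT. The old worker has until kill_timeout to finish in-flight requests and exit before PM2 sends SIGKILL. Because a ready worker is always available, that sequence is what makes zero-downtime actually work.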
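A reload only drains cleanly if the app itself stops accepting connections when it is signalled. Below is a minimal sketch of such a hook, assuming a standard http.Server (or an Express app's underlying server). createShutdownHandler is a hypothetical helper, and the 4000 ms default is an assumption chosen to sit below a 5000 ms kill_timeout.

```javascript
// Sketch only: a shutdown hook an app might install so reloads drain cleanly.
// `exit` is injectable so the logic can be exercised without ending the process.
function createShutdownHandler(server, { timeoutMs = 4000, exit = process.exit } = {}) {
  let shuttingDown = false;
  return function shutdown() {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;

    // Fallback: force a non-zero exit if connections have not drained in time.
    const timer = setTimeout(() => exit(1), timeoutMs);
    timer.unref();

    // Stop accepting new connections; the callback fires once open ones finish.
    server.close(() => {
      clearTimeout(timer);
      exit(0);
    });
  };
}

// Usage (not executed here):
//   const server = require('node:http').createServer(handler).listen(3000);
//   const shutdown = createShutdownHandler(server);
//   process.on('SIGINT', shutdown);
//   process.on('SIGTERM', shutdown);
```

Registering the handler for both signals keeps the app well behaved whichever one the process manager or container runtime sends.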
Ecosystem file
For anything beyond a quick start, use an ecosystem file:
// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'api',
    script: 'dist/server.js',   // compiled TypeScript output
    instances: 'max',           // one instance per CPU core
    exec_mode: 'cluster',       // set explicitly so instances run as cluster workers
    max_memory_restart: '1G',   // restart worker if it exceeds 1GB RAM
    max_restarts: 5,            // stop retrying after 5 consecutive unstable restarts
    kill_timeout: 5000,         // give workers 5s to drain before SIGKILL
    env_production: {
      NODE_ENV: 'production',
      PORT: 3000
    }
  }]
};
pm2 start ecosystem.config.js --env production
pm2 reload api   # new workers start, old ones finish their requests, then exit
pm2 save         # persist process list across server reboots
pm2 startup      # generate systemd unit file for auto-start on boot
Common mistakes
Starting without a name
pm2 start app.js without --name names the entry after the script file, for example "app" or "server". With multiple services, pm2 list fills up with identically named entries that you can only tell apart by numeric id, and a name-based stop or reload hits all of them. Always add --name myapp.
Forgetting exec_mode: 'cluster' in the ecosystem file
Setting instances: 'max' without exec_mode: 'cluster' can leave the app in fork mode, where processes do not share a port through the cluster module. The usual symptom is an 8-core server doing all its work on one Node.js thread. This is a recurring source of PM2 performance complaints on Stack Overflow and Reddit, so set exec_mode explicitly whenever you set instances.
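A small pre-deploy check can catch a multi-instance setting that lacks an explicit exec_mode. This is a sketch under assumptions: findClusterIssues is a hypothetical helper, not part of PM2, and it treats any instances value other than 1 as a request for multiple processes.

```javascript
// Sketch only: flag apps that ask for several instances without exec_mode: 'cluster'.
function findClusterIssues(config) {
  const issues = [];
  for (const app of config.apps ?? []) {
    // 'max', 0, negative numbers, and numbers above 1 all ask for more than one process.
    const wantsMany = app.instances !== undefined && app.instances !== 1;
    if (wantsMany && app.exec_mode !== 'cluster') {
      const label = app.name ?? app.script;
      issues.push(`${label}: instances=${app.instances} but exec_mode is ${app.exec_mode ?? 'not set'}`);
    }
  }
  return issues;
}

// Usage (not executed here):
//   const issues = findClusterIssues(require('./ecosystem.config.js'));
//   if (issues.length > 0) { console.error(issues.join('\n')); process.exit(1); }
```

Run as a CI step before pm2 start, it turns a silent misconfiguration into a failed build.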
Using pm2 restart in production deploys
pm2 restart kills all workers at once. Active connections drop and return 5xx errors. pm2 reload replaces workers one by one, waiting for each to drain. Always use pm2 reload in CI/CD pipelines.
Running PM2 as root
Child processes inherit root permissions. If your app ever runs shell commands, that is a real attack surface. Use a non-root system user and let pm2 startup generate the correct systemd configuration for boot persistence.
Skipping log rotation
I have seen this take down a production server at 3am - logs grow to 100GB and fill the disk. Install pm2-logrotate on day one: pm2 install pm2-logrotate. It rotates at 10MB by default.
Real-world usage
- Ghost blog, Strapi CMS: pm2 start ecosystem.config.js for clustered API routes.
- Self-hosted Next.js: pm2 start npm --name "next" -- start.
- NestJS backends: ecosystem file with max_memory_restart: '1G' and compiled dist output.
- Feathers.js real-time apps: -i max for Socket.io worker scaling across cores.
- PM2 inside Docker: use pm2-runtime as the entrypoint to handle PID 1 correctly and avoid zombie process accumulation.
Follow-up questions
Q: How does PM2 implement zero-downtime reload exactly?
A: It replaces cluster workers one at a time. For each one it spawns a new worker, waits for it to emit the "listening" event (HTTP server ready to accept connections), then signals the old worker to shut down (SIGINT by default) and gives it up to kill_timeout to finish open requests before sending SIGKILL.
Q: What is the difference between pm2 start -i max and writing the cluster module yourself?
A: PM2 adds automatic per-worker restart, log persistence, and a monitoring layer on top of Node's cluster. If one worker crashes, PM2 restarts that specific worker without touching the others.
Q: What happens when a worker exceeds the memory limit?
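A: PM2 periodically samples each process's memory usage and compares it against max_memory_restart. The check runs on an interval (about every 30 seconds by default), so a worker can briefly exceed the limit before it is restarted. Only that worker is restarted while the others keep serving traffic.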
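The comparison behind a memory-based restart is simple to sketch. The helpers below are hypothetical and not PM2's source. They assume limits written as a number followed by K, M, or G, as in the '1G' used in this article, and the usage comment uses process.memoryUsage().rss as a stand-in for whatever figure the supervisor samples.

```javascript
// Sketch only: the limit check behind a memory-based restart.
const UNITS = { K: 1024, M: 1024 ** 2, G: 1024 ** 3 };

// Convert a limit such as '1G' or '300M' to bytes. Plain numbers pass through.
function parseMemoryLimit(limit) {
  if (typeof limit === 'number') return limit;
  const match = /^(\d+(?:\.\d+)?)\s*([KMG])B?$/i.exec(String(limit).trim());
  if (!match) throw new Error(`Unrecognised memory limit: ${limit}`);
  return Math.round(Number(match[1]) * UNITS[match[2].toUpperCase()]);
}

// True when the sampled usage is above the configured limit.
function exceedsLimit(usedBytes, limit) {
  return usedBytes > parseMemoryLimit(limit);
}

// Usage (not executed here): poll the current process every 30 seconds.
//   setInterval(() => {
//     if (exceedsLimit(process.memoryUsage().rss, '1G')) process.exit(1);
//   }, 30_000).unref();
```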
Q: What is the correct way to run PM2 inside Docker?
A: Use pm2-runtime instead of plain pm2 start. It handles PID 1 signal forwarding correctly and prevents zombie process accumulation that plain PM2 misses in a container context.
Q: Senior-level: how does PM2 distinguish a crash from a graceful stop?
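A: It listens for each child's exit event and tracks whether PM2 itself requested that exit. If the process exits after pm2 stop, or during a restart or reload that PM2 initiated, it is a graceful stop. An exit PM2 did not ask for, typically a non-zero exit code from an uncaught exception or a kill by signal, counts as a crash and triggers a restart. Once the number of consecutive unstable restarts reaches max_restarts, the app moves to "errored" state and PM2 stops retrying.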
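The decision can be sketched as a small pure function. This is an illustration of the logic, not PM2's code: classifyExit and its field names are hypothetical, and the stopExitCodes parameter is an assumption that leaves the treatment of specific exit codes (for example 0) up to configuration.

```javascript
// Sketch only: decide whether an exit was a requested stop or a crash.
function classifyExit({ exitCode, stopRequested = false, stopExitCodes = [] }) {
  // The manager asked for this exit (stop, restart, or reload).
  if (stopRequested) return 'stopped';
  // The exit code was configured as an intentional, non-restarting exit.
  if (stopExitCodes.includes(exitCode)) return 'stopped';
  // Anything else was not asked for, so treat it as a crash.
  return 'crashed';
}

// Usage (not executed here), inside a supervisor's exit listener:
//   child.on('exit', (code) => {
//     if (classifyExit({ exitCode: code, stopRequested }) === 'crashed') respawn();
//   });
```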
Examples
Basic: Express API with auto-restart
// server.js
const express = require('express');
const app = express();
app.get('/', (req, res) => res.send('Hello from PM2'));
app.listen(3000, () => console.log('Running on port 3000'));
pm2 start server.js --name basic-api -i 2
pm2 list
# basic-api | cluster | 2 instances | online | restarts: 0
Kill one of the worker processes manually. PM2 detects the exit and spawns a replacement. The other worker continues serving requests during the recovery window.
Intermediate: Production ecosystem file (NestJS / TypeScript)
// ecosystem.config.js - used in NestJS and Strapi production deploys
module.exports = {
  apps: [{
    name: 'api',
    script: 'dist/main.js',
    instances: 'max',
    exec_mode: 'cluster',
    max_memory_restart: '1G',
    max_restarts: 5,
    kill_timeout: 5000,
    env_production: {
      NODE_ENV: 'production',
      PORT: 3000
    }
  }]
};
pm2 start ecosystem.config.js --env production
pm2 reload api
# During reload: no 5xx errors - new workers accept before old ones exit
pm2 logs api
pm2 save && pm2 startup
The pm2 save and pm2 startup combination persists the process list across server reboots, so nothing needs to be restarted manually after a machine restart.
Advanced: Crash loop protection
Without limits, a bug that crashes the app immediately after startup causes PM2 to restart it in an infinite loop, burning CPU and flooding logs.
// Add to the apps[] entry in ecosystem.config.js
{
  max_restarts: 5,    // give up after 5 crashes
  min_uptime: '10s',  // app must stay up 10s to count as a successful start
  kill_timeout: 5000  // 5s grace period before SIGKILL
}
pm2 start ecosystem.config.js
# App crashes 5 times, each time under 10s uptime
pm2 list
# Status: errored - PM2 stopped retrying
pm2 logs api --lines 50 # inspect the crash reason
After 5 restarts, PM2 marks the app as "errored" and stops. Fix the bug, run pm2 restart api, and the counter resets. No more CPU spikes from infinite restart loops at 3am.
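How max_restarts and min_uptime interact is easier to see as a counter. The class below is a sketch of that bookkeeping, not PM2's implementation. CrashLoopGuard is a hypothetical name, and it assumes that a run lasting at least min_uptime resets the count and that the app is marked errored on the first unstable exit after max_restarts restarts have been used up.

```javascript
// Sketch only: count consecutive unstable exits and decide when to give up.
class CrashLoopGuard {
  constructor({ maxRestarts = 5, minUptimeMs = 10_000 } = {}) {
    this.maxRestarts = maxRestarts;
    this.minUptimeMs = minUptimeMs;
    this.unstable = 0; // consecutive exits that happened before minUptimeMs
  }

  // Record one exit. Returns 'restart' to try again or 'errored' to stop.
  onExit(uptimeMs) {
    if (uptimeMs >= this.minUptimeMs) {
      // The app stayed up long enough to count as a successful start.
      this.unstable = 0;
      return 'restart';
    }
    this.unstable += 1;
    return this.unstable > this.maxRestarts ? 'errored' : 'restart';
  }
}

// Usage (not executed here):
//   const guard = new CrashLoopGuard({ maxRestarts: 5, minUptimeMs: 10_000 });
//   child.on('exit', () => {
//     if (guard.onExit(Date.now() - startedAt) === 'restart') respawn();
//   });
```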