What is PM2, and how do you manage Node.js processes in production?


PM2 is a production process manager for Node.js that automatically restarts crashed apps, distributes load across CPU cores via clustering, and persists logs to files.

Theory

TL;DR

PM2 is like a restaurant manager who replaces any waiter who quits mid-shift, opens more stations during rush hour, and logs every order without closing the dining room.
Main difference: node server.js dies on crash and uses one CPU core. PM2 restarts automatically, clusters across all cores, and monitors metrics in real time.
Use PM2 when deploying to a VPS or bare server. For local dev, use nodemon. For serverless (Lambda, Vercel), the platform manages processes itself.
pm2 reload and pm2 restart are not the same thing: reload replaces workers gracefully, while restart stops and starts them and causes brief downtime.

Quick example

bash

# Without PM2 - one crash kills everything
node server.js  # Unhandled error → process dies → manual restart required

# With PM2 - automatic recovery
npm install -g pm2
pm2 start server.js --name api -i max  # cluster mode across all CPU cores
pm2 list                               # online | uptime | restarts: 0
# Simulate a crash: kill the worker process
# PM2 detects the exit signal, restarts within 1 second
# pm2 list now shows: restarts: 1
pm2 stop api
pm2 delete api

One command replaces an entire startup script plus manual monitoring.

Running node server.js ties your app to a single OS process. Any unhandled exception kills it permanently, all traffic goes through one CPU core, and stdout logs disappear on restart. PM2 wraps that process in a supervisor: it catches the exit signal, spawns a replacement within milliseconds, and routes traffic across multiple worker instances using Node's built-in cluster module. The app becomes a service, not a script.
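The supervision idea can be sketched in a few lines. This is a deliberately simplified, synchronous illustration and not PM2's code: superviseSync is a hypothetical helper that runs one Node child at a time and starts it again each time it exits, until a restart budget is used up. A real supervisor is event-driven and watches many children at once.

```javascript
const { spawnSync } = require('node:child_process');

// Hypothetical helper, not part of PM2: run a Node child and start it again
// every time it exits, until maxRestarts replacements have been used.
function superviseSync(nodeArgs, maxRestarts) {
  let restarts = 0;
  let lastStatus = null;
  for (;;) {
    // spawnSync blocks until the child exits, then reports its exit code
    // (status is null if the child was killed by a signal).
    const result = spawnSync(process.execPath, nodeArgs, { stdio: 'inherit' });
    lastStatus = result.status;
    if (restarts >= maxRestarts) break; // restart budget exhausted, give up
    restarts += 1;
  }
  return { restarts, lastStatus };
}
```

Calling superviseSync(['server.js'], 5) would keep server.js running through up to five exits. The restart budget plays the same role as a restart limit in a process manager: without it, a child that always crashes would loop forever.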

When to use

Single server, Express or Fastify API: pm2 start server.js -i max adds clustering immediately.
Self-hosted Next.js: pm2 start npm --name "next" -- start runs the npm start script (next start or a custom server) under PM2.
NestJS or compiled TypeScript backends: ecosystem file pointing to dist/server.js.
High-traffic app behind Nginx: PM2 handles process supervision, Nginx handles routing.
Local dev: use nodemon instead - it handles hot reload better for development.
Serverless and managed platforms (Lambda, Vercel, Fly.io): the platform supervises processes itself, so PM2 adds little.

Comparison table

Feature	`node app.js`	PM2	nodemon	forever
Auto-restart on crash	No	Yes	No, waits for a file change (dev only)	Yes
CPU clustering	Manual `cluster` module	Built-in (`-i max`)	No	No
Log persistence	stdout, lost on restart	Files in `~/.pm2/logs/` (rotation via `pm2-logrotate`)	Console	Files
Zero-downtime reload	Manual	`pm2 reload`	No	No
Monitoring	None	`pm2 monit` + cloud dashboard	None	Basic
When to use	Scripts, local dev	Production Node servers	Dev hot-reload	Simple restarts (legacy)

How PM2 works internally

PM2 runs as a long-lived Node.js daemon that starts and tracks your app processes. In fork mode it launches each app as a child process through Node's child_process module, which starts a fresh process rather than duplicating the daemon with an OS-level fork(). The daemon listens for each child's exit event and, by default, starts a replacement whenever a process exits without PM2 having been asked to stop it, whatever the exit code. Clustering delegates to Node's built-in cluster module; -i max starts one worker per CPU core, as reported by os.cpus().length.

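The pm2 reload command performs a rolling replacement of cluster workers. For each one it starts a new worker, waits for it to emit the "listening" event (meaning the HTTP server is ready), and only then signals the old worker to stop (SIGINT by default) and lets its open connections close, force-killing it with SIGKILL if it is still running after kill_timeout. Because a replacement is already accepting connections before each old worker exits, there is no gap in service.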
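Draining only works if the app cooperates when it is asked to stop. Below is a minimal sketch of that app-side half, assuming the stop signal is SIGINT (which PM2's documentation describes as its default; pass another signal name if your setup differs). installGracefulShutdown and its options are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: close the HTTP server when the stop signal arrives.
// Assumes timeoutMs is shorter than PM2's kill_timeout, so the process
// exits on its own before it would be force-killed.
function installGracefulShutdown(server, options = {}) {
  const { signal = 'SIGINT', timeoutMs = 4000, exit = process.exit } = options;

  const onSignal = () => {
    // Safety net: exit with an error code if draining takes too long.
    const timer = setTimeout(() => exit(1), timeoutMs);
    timer.unref();
    // Stop accepting new connections; the callback runs once the
    // existing connections have ended.
    server.close(() => {
      clearTimeout(timer);
      exit(0);
    });
  };

  process.once(signal, onSignal);
  return onSignal; // returned so the handler can also be triggered directly
}

// Usage (in server.js), with app being an Express application:
//   const server = app.listen(3000);
//   installGracefulShutdown(server);
```

The 4000 ms default is picked to sit below the kill_timeout of 5000 used in this article's ecosystem file; the two values should be tuned together.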

Ecosystem file

For anything beyond a quick start, use an ecosystem file:

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'api',
    script: 'dist/server.js',   // compiled TypeScript output
    instances: 'max',           // one instance per CPU core
    exec_mode: 'cluster',       // set explicitly so instances share the port via cluster
    max_memory_restart: '1G',   // restart worker if it exceeds 1GB RAM
    max_restarts: 5,            // give up after 5 consecutive unstable restarts
    kill_timeout: 5000,         // give workers 5s to drain before SIGKILL
    env_production: {
      NODE_ENV: 'production',
      PORT: 3000
    }
  }]
};

bash

pm2 start ecosystem.config.js --env production
pm2 reload api  # new workers start, old ones finish their requests, then exit
pm2 save        # persist process list across server reboots
pm2 startup     # set up the init script (e.g. systemd) that starts PM2 on boot

Common mistakes

Starting without a name: pm2 start app.js without --name creates an entry named after the script file, such as "app" or "server". With multiple services, pm2 list fills up with identically named entries that you can only target individually by numeric id. Always add --name myapp.

Forgetting exec_mode: 'cluster' in the ecosystem file: instances: 'max' only spreads load when the processes run in cluster mode, where workers share one listening port through Node's cluster module. In fork mode that sharing does not happen, so an 8-core server can end up with a single process doing all the work. Declare exec_mode: 'cluster' explicitly rather than relying on the default, and confirm it in the mode column of pm2 list. This is a common root cause behind PM2 performance complaints.

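Using pm2 restart in production deploys: pm2 restart stops the workers and starts them again, so in-flight connections are dropped and clients see errors (typically 502s when a reverse proxy sits in front). pm2 reload performs a rolling replacement in cluster mode, starting each new worker before the old one drains and exits. Always use pm2 reload in CI/CD pipelines.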

Running PM2 as root: Child processes inherit root permissions. If your app ever runs shell commands, that is a real attack surface. Use a non-root system user and let pm2 startup generate the correct systemd configuration for boot persistence.

Skipping log rotation: I have seen this take down a production server at 3am, when logs grew to 100GB and filled the disk. Install pm2-logrotate on day one: pm2 install pm2-logrotate. It rotates at 10MB by default.

Real-world usage

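Ghost blog, Strapi CMS: pm2 start ecosystem.config.js for supervised, auto-restarting processes (confirm in each project's documentation that it supports cluster mode before enabling it).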
Self-hosted Next.js: pm2 start npm --name "next" -- start.
NestJS backends: ecosystem file with max_memory_restart: '1G' and compiled dist output.
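Feathers.js real-time apps: -i max for Socket.io worker scaling across cores (Socket.io typically also needs sticky sessions and a shared adapter once there is more than one worker).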
PM2 inside Docker: use pm2-runtime as the entrypoint to handle PID 1 correctly and avoid zombie process accumulation.

Follow-up questions

Q: How does PM2 implement zero-downtime reload exactly?
A: It spawns new cluster workers, waits for each to emit the "listening" event (HTTP server ready to accept connections), then signals the old workers to stop (SIGINT by default) and waits for open connections to close. Any old worker still running after kill_timeout is force-killed with SIGKILL.

Q: What is the difference between pm2 start -i max and writing the cluster module yourself?
A: PM2 adds automatic per-worker restart, log persistence, and a monitoring layer on top of Node's cluster. If one worker crashes, PM2 restarts that specific worker without touching the others.
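For comparison, here is roughly what the hand-written version looks like. It is a sketch rather than an equivalent of PM2: resolveInstances and runClustered are hypothetical names, and the mapping of 'max', 0 and negative values follows the meaning PM2's documentation gives to its instances option.

```javascript
const cluster = require('node:cluster');
const os = require('node:os');

// Translate an instances-style value into a worker count:
// 'max' or 0 -> every core, negative n -> cores minus |n| (at least 1),
// positive n -> exactly n.
function resolveInstances(value, cpuCount = os.cpus().length) {
  if (value === 'max' || value === 0) return cpuCount;
  if (value < 0) return Math.max(1, cpuCount + value);
  return value;
}

// Hypothetical hand-rolled counterpart of `pm2 start server.js -i max`.
// startServer is whatever function creates the HTTP server and listens.
function runClustered(startServer, instances = 'max') {
  if (cluster.isPrimary) {
    const count = resolveInstances(instances);
    for (let i = 0; i < count; i += 1) cluster.fork();
    // Replace only the worker that died; the others keep serving.
    cluster.on('exit', (worker, code, signal) => {
      console.error(`worker ${worker.process.pid} exited (${signal || code}), forking a replacement`);
      cluster.fork();
    });
  } else {
    startServer(); // workers share the listening port through cluster
  }
}
```

Even this short version leaves out log files, restart limits, memory checks and zero-downtime reload, which is the gap PM2 fills.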

Q: What happens when a worker exceeds the memory limit?
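A: PM2's internal worker periodically samples each process's memory usage (roughly every 30 seconds, according to the PM2 docs) and compares it against max_memory_restart. When a process is over the limit, PM2 restarts that specific worker while the others keep serving traffic. A brief spike between two checks can therefore go unnoticed.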
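The comparison itself is simple arithmetic. The sketch below shows one way a limit such as '1G' could be turned into bytes and checked against a process's resident memory. parseMemoryLimit and exceedsLimit are hypothetical helpers, and the use of binary multiples (1K = 1024 bytes) is an assumption; PM2's own parsing may differ.

```javascript
// Hypothetical parser for limits such as '1G', '300M' or '512K'.
// Assumes binary multiples (1K = 1024 bytes).
function parseMemoryLimit(limit) {
  if (typeof limit === 'number') return limit; // already in bytes
  const match = /^(\d+(?:\.\d+)?)\s*([KMG])B?$/i.exec(String(limit).trim());
  if (!match) throw new Error(`Unrecognised memory limit: ${limit}`);
  const units = { K: 1024, M: 1024 ** 2, G: 1024 ** 3 };
  return Math.round(Number(match[1]) * units[match[2].toUpperCase()]);
}

// True when the sampled resident set size is over the configured limit.
function exceedsLimit(rssBytes, limit) {
  return rssBytes > parseMemoryLimit(limit);
}

// Usage inside a periodic check for the current process:
//   if (exceedsLimit(process.memoryUsage().rss, '1G')) { /* restart */ }
```

A supervisor would run this check on a timer for every child and restart only the one that crosses the threshold.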

Q: What is the correct way to run PM2 inside Docker?
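A: Use pm2-runtime instead of plain pm2 start. Plain pm2 start hands the app to a background daemon and returns, which leaves the container without a foreground process. pm2-runtime stays in the foreground as PID 1, forwards signals to the app, and prevents the zombie process accumulation that plain PM2 misses in a container context.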

Q: Senior-level: how does PM2 distinguish a crash from a graceful stop?
A: It listens on child.on('exit') and keeps track of whether PM2 itself requested the stop (for example via pm2 stop, which sends SIGINT by default and SIGKILL after kill_timeout). An exit that PM2 did not ask for is treated as a crash and triggers a restart, by default regardless of the exit code. After max_restarts consecutive unstable restarts (exits before min_uptime), the app moves to the "errored" state and PM2 stops retrying.
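That decision can be written down as a small state function. The following is a simplified model and not PM2's source: onChildExit and its field names are hypothetical, and the model ignores the exit code, looking only at whether a stop was requested and how long the process stayed up.

```javascript
// Simplified model of a supervisor's decision when a child exits.
// state:  { unstableRestarts }        restarts used in the current crash streak
// event:  { stopRequested, uptimeMs } what is known about this exit
// config: { minUptimeMs, maxRestarts }
function onChildExit(state, event, config = {}) {
  const { minUptimeMs = 1000, maxRestarts = 5 } = config;

  // The manager asked for this stop: not a crash, nothing to restart.
  if (event.stopRequested) {
    return { status: 'stopped', unstableRestarts: 0 };
  }
  // The process ran long enough: restart it and reset the streak.
  if (event.uptimeMs >= minUptimeMs) {
    return { status: 'restarting', unstableRestarts: 0 };
  }
  // Another quick crash with the budget already spent: give up.
  if (state.unstableRestarts >= maxRestarts) {
    return { status: 'errored', unstableRestarts: state.unstableRestarts };
  }
  // Quick crash with budget left: restart and count it.
  return { status: 'restarting', unstableRestarts: state.unstableRestarts + 1 };
}
```

With maxRestarts set to 5 and a process that always dies within its first second, the model performs five restarts and reports "errored" on the sixth exit.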

Examples

Basic: Express API with auto-restart

javascript

// server.js
const express = require('express');
const app = express();

app.get('/', (req, res) => res.send('Hello from PM2'));

app.listen(3000, () => console.log('Running on port 3000'));

bash

pm2 start server.js --name basic-api -i 2
pm2 list
# basic-api | cluster | 2 instances | online | restarts: 0

Kill one of the worker processes manually. PM2 detects the exit and spawns a replacement. The other worker continues serving requests during the recovery window.

Intermediate: Production ecosystem file (NestJS / TypeScript)

// ecosystem.config.js - used in NestJS and Strapi production deploys
module.exports = {
  apps: [{
    name: 'api',
    script: 'dist/main.js',
    instances: 'max',
    exec_mode: 'cluster',
    max_memory_restart: '1G',
    max_restarts: 5,
    kill_timeout: 5000,
    env_production: {
      NODE_ENV: 'production',
      PORT: 3000
    }
  }]
};

bash

pm2 start ecosystem.config.js --env production
pm2 reload api
# During reload: no 5xx errors - new workers accept before old ones exit
pm2 logs api
pm2 save && pm2 startup

The pm2 save and pm2 startup combination persists the process list across server reboots, so nothing needs to be restarted manually after a machine restart.

Advanced: Crash loop protection

Without limits, a bug that crashes the app immediately after startup causes PM2 to restart it in an infinite loop, burning CPU and flooding logs.

// Add to the apps[] entry in ecosystem.config.js
{
  max_restarts: 5,    // give up after 5 crashes
  min_uptime: '10s',  // app must stay up 10s to count as a successful start
  kill_timeout: 5000  // 5s grace period before SIGKILL
}

bash

pm2 start ecosystem.config.js
# App crashes 5 times, each time under 10s uptime
pm2 list
# Status: errored - PM2 stopped retrying
pm2 logs api --lines 50  # inspect the crash reason

After 5 restarts, PM2 marks the app as "errored" and stops. Fix the bug, run pm2 restart api, and the counter resets. No more CPU spikes from infinite restart loops at 3am.
