Self-Hosting n8n at Scale: My Production Setup | The Workflow Engineer

This is not a scale problem. It is an assembly problem.

Framework · The four-component stack · n8n + Postgres + Redis + proxy

Production n8n is not a single container. It is a system of four components that must be tuned as one unit. Remove one, or treat any of them as an afterthought, and you are not running production infrastructure. You are running a prototype that happens to be customer-facing.

The Foundation: Docker Compose with Real Health Checks

I start every production deployment with a pinned docker-compose.yml. Not n8nio/n8n:latest. Never. Pinning is non-negotiable because latest is a moving target that can introduce schema migrations, node behavior changes, or breaking API shifts without warning. I pin to a specific version — say 1.94.1 — and promote upgrades through a staging environment after testing critical workflows.
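One way to keep that rule from eroding is a small check that can run in CI or a pre-commit hook. The sketch below is an illustration, not part of n8n or Docker: the function name is_pinned_image is hypothetical, and it only looks at the shape of the image reference.

```shell
# Return 0 if an image reference carries an explicit, non-"latest" tag or a digest.
# is_pinned_image is a hypothetical helper name used for illustration.
is_pinned_image() {
    ref=$1
    case "$ref" in
        *@sha256:*) return 0 ;;   # digest pins are the strictest form
    esac
    # Drop registry and namespace so a registry port (host:5000/...) is not
    # mistaken for a tag.
    name_and_tag=${ref##*/}
    case "$name_and_tag" in
        *:latest) return 1 ;;
        *:?*)     return 0 ;;
        *)        return 1 ;;     # no tag at all resolves to "latest"
    esac
}

# Usage sketch: flag unpinned images in a Compose file.
#   awk '$1 == "image:" {print $2}' docker-compose.yml | while read -r img; do
#       is_pinned_image "$img" || echo "unpinned image: $img"
#   done
```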

Losing the encryption key is terminal

Every stored credential — every OAuth token, every database password, every API key — is encrypted with N8N_ENCRYPTION_KEY. If the container volume is wiped and the key is gone, those credentials do not decrypt. They turn into garbage. Generate it once with openssl rand -hex 32, store it in .env, and back it up in two places outside the host.
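A minimal sketch of the "generate it once" step, assuming the key lives in a .env file next to the Compose file. The helper name ensure_encryption_key is hypothetical, and the /dev/urandom fallback is my assumption for hosts without openssl. The important property is that it refuses to overwrite an existing key.

```shell
# Append N8N_ENCRYPTION_KEY to an env file only if no key is present yet.
# Regenerating an existing key would make stored credentials undecryptable.
ensure_encryption_key() {
    env_file=$1
    if [ -f "$env_file" ] && grep -q '^N8N_ENCRYPTION_KEY=' "$env_file"; then
        echo "key already present in $env_file; leaving it untouched" >&2
        return 0
    fi
    # Preferred: openssl rand -hex 32. Fallback (assumption): read 32 bytes
    # from /dev/urandom and hex-encode them.
    key=$(openssl rand -hex 32 2>/dev/null) ||
        key=$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')
    [ "${#key}" -eq 64 ] || return 1
    # umask only affects the file if this call creates it.
    ( umask 077; printf 'N8N_ENCRYPTION_KEY=%s\n' "$key" >> "$env_file" )
}

# Usage sketch:
#   ensure_encryption_key .env
```

Running it a second time is a no-op, which is the behaviour you want from anything that touches this key.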

Log rotation is the next silent killer. Docker's json-file driver has no size limit by default. On a busy instance, n8n can write tens of gigabytes of execution logs in weeks. When the disk fills, Postgres dies first — it needs free space for WAL files — and then n8n follows. I cap every container at three rotated files, fifty megabytes each.
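The cap turns an unbounded quantity into simple arithmetic. As a rough illustration, assuming four containers (one per component of the stack), the worst case looks like this:

```shell
# Worst-case disk used by json-file logs:
# containers x rotated files per container x size per file (MB).
max_log_mb() {
    echo $(( $1 * $2 * $3 ))
}

max_log_mb 4 3 50   # four containers, three files, 50 MB each: prints 600
```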

The most important part of the Compose foundation, though, is health checks. Docker marks a container "running" the moment its process starts, so dependent services and the proxy will happily send traffic to an n8n instance that is still booting, or connect to a Postgres that is still initialising its data directory.

Key takeaway

A real health check for Postgres uses pg_isready with a generous 30-second start_period. For n8n, hit /healthz with a 60-second start period. The depends_on condition uses service_healthy — n8n doesn't start until Postgres is actually accepting connections.

services:
  postgres:
    image: postgres:16-alpine
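    restart: unless-stopped
    environment:
      # The official image refuses to initialise without POSTGRES_PASSWORD
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "3"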

  n8n:
    image: n8nio/n8n:1.94.1
    restart: unless-stopped
    environment:
      DB_TYPE: postgresdb
      DB_POSTGRESDB_HOST: postgres
      DB_POSTGRESDB_PORT: 5432
      DB_POSTGRESDB_DATABASE: ${POSTGRES_DB}
      DB_POSTGRESDB_USER: ${POSTGRES_USER}
      DB_POSTGRESDB_PASSWORD: ${POSTGRES_PASSWORD}
      N8N_HOST: ${N8N_HOST}
      N8N_PROTOCOL: https
      WEBHOOK_URL: https://${N8N_HOST}/
      N8N_ENCRYPTION_KEY: ${N8N_ENCRYPTION_KEY}
      GENERIC_TIMEZONE: ${GENERIC_TIMEZONE}
      EXECUTIONS_DATA_PRUNE: "true"
      EXECUTIONS_DATA_MAX_AGE: 168
    volumes:
      - n8n_data:/home/node/.n8n
    depends_on:
      postgres:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:5678/healthz || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "3"

# Named volumes must be declared at the top level or Compose rejects the file
volumes:
  postgres_data:
  n8n_data:

Reverse Proxy and TLS Termination

n8n does not serve TLS out of the box. Running it exposed means credentials, webhook payloads, and OAuth tokens travel in plaintext. A reverse proxy is not a nice-to-have; it is the security boundary. I also use it to enforce request size limits, proxy WebSocket connections for the editor, and apply rate limiting before traffic ever reaches n8n.

I default to Caddy for new deployments. It obtains and renews Let's Encrypt certificates automatically, proxies WebSocket upgrades without extra configuration, and handles HTTP/2, all with a Caddyfile of a few lines.

n8n.example.com {
    reverse_proxy n8n:5678 {
        header_up X-Forwarded-Proto {scheme}
    }
}

The corresponding n8n environment variables must tell the application it lives behind HTTPS:

N8N_HOST=n8n.example.com
N8N_PROTOCOL=https
WEBHOOK_URL=https://n8n.example.com/
N8N_EDITOR_BASE_URL=https://n8n.example.com/

If you skip WEBHOOK_URL, n8n builds webhook URLs from its own protocol, host, and port settings (http://localhost:5678 with the defaults), and external services cannot reach them. I have debugged this exact misconfiguration more times than I can count.
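Because this mistake is so common, a preflight check before starting the stack can pay for itself. This is a sketch under the assumption that webhooks are served from the root of the public hostname, as in the variables above. The function name check_public_url is hypothetical, and a deployment that serves n8n under a path prefix would need a looser rule.

```shell
# Verify that a public URL setting is exactly https://<host>/.
# $1 = expected public hostname, $2 = value of WEBHOOK_URL (or N8N_EDITOR_BASE_URL)
check_public_url() {
    host=$1
    url=$2
    case "$url" in
        "https://$host/") return 0 ;;
        *)
            echo "expected https://$host/ but got '$url'" >&2
            return 1
            ;;
    esac
}

# Usage sketch, with the values loaded from .env:
#   check_public_url "$N8N_HOST" "$WEBHOOK_URL" || exit 1
```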

At the proxy layer, I also cap the request body size to match N8N_PAYLOAD_SIZE_MAX (commonly 256 MB for file-processing workflows) and apply rate limiting. In Nginx the body limit is client_max_body_size; in Caddy the equivalent is max_size inside the request_body directive. Webhook endpoints get a stricter limit than the editor UI. A misconfigured third-party integration can easily send thousands of duplicate events in a minute; I cap webhook paths at thirty requests per minute with a small burst buffer.

PostgreSQL: The Default Settings Are Wrong

SQLite is fine for a laptop. In production, I use PostgreSQL exclusively. But the Postgres container fresh from Docker Hub is tuned for a development machine with minimal memory and conservative connection limits. Under real load — hundreds of concurrent executions, large binary data, or long-running transactions — those defaults become a bottleneck.

I tune memory and connection settings for the host size. shared_buffers, work_mem, and max_connections all need adjustment. The exact values depend on whether the database shares the host with n8n or runs on its own box, but the principle is consistent: the defaults will suffocate you quietly.
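As a starting point only, the sketch below derives the three settings from the RAM you are willing to give Postgres. The 25% figure for shared_buffers is the commonly cited rule of thumb; the work_mem formula and the helper name pg_tuning_flags are my assumptions, not values from a specific deployment, and real numbers should come from measuring your own workload.

```shell
# Print postgres -c flags for a given RAM budget.
# $1 = RAM available to Postgres in MB, $2 = max_connections
pg_tuning_flags() {
    ram_mb=$1
    max_conn=$2
    # Rule of thumb: about a quarter of the RAM budget for shared_buffers.
    shared_buffers_mb=$(( ram_mb / 4 ))
    # work_mem applies per sort or hash operation, so stay conservative:
    # spread half of the budget across all connections.
    work_mem_mb=$(( ram_mb / 2 / max_conn ))
    # 4MB is the stock Postgres default; do not go below it.
    if [ "$work_mem_mb" -lt 4 ]; then
        work_mem_mb=4
    fi
    printf '%s\n' "-c shared_buffers=${shared_buffers_mb}MB -c work_mem=${work_mem_mb}MB -c max_connections=${max_conn}"
}

# Usage sketch: paste the output into the postgres service, for example
#   command: postgres -c shared_buffers=1024MB -c work_mem=20MB -c max_connections=100
```

For a 4096 MB budget and 100 connections, this prints -c shared_buffers=1024MB -c work_mem=20MB -c max_connections=100.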

Beyond memory and connection tuning, I aggressively prune execution data. n8n can accumulate massive execution histories. I set EXECUTIONS_DATA_PRUNE=true and EXECUTIONS_DATA_MAX_AGE=168 (the value is in hours, so seven days) so the database does not grow without bound.

The database is also where your disaster recovery story lives or dies. I run a daily pg_dump at 02:00 UTC, pipe it through gzip, and push it to an S3-compatible bucket with a thirty-day retention policy. The script verifies the backup file is not empty before uploading, and I test the restore process quarterly.
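A sketch of the verification step, with one caveat worth spelling out: an empty stream piped through gzip still produces a small non-empty file, so a size check alone can pass on a hollow backup. The version below also checks gzip integrity and that the archive decompresses to at least some content. PG_URL, BACKUP_DIR, and BACKUP_BUCKET are hypothetical variable names, and run_backup writes the dump to a file before compressing (rather than piping) so that a pg_dump failure is not masked by gzip succeeding.

```shell
# Return 0 only if the file exists, is non-empty, is intact gzip,
# and decompresses to at least one non-empty line.
backup_is_valid() {
    file=$1
    [ -s "$file" ] || { echo "empty or missing: $file" >&2; return 1; }
    gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
    gzip -dc "$file" 2>/dev/null | grep -q . ||
        { echo "archive has no content: $file" >&2; return 1; }
}

# Dump, compress, verify, upload. Not invoked here.
run_backup() {
    dump="${BACKUP_DIR:-/tmp}/n8n-$(date -u +%Y%m%dT%H%M%SZ).sql"
    pg_dump --dbname="$PG_URL" --file="$dump" || return 1
    gzip "$dump" || return 1
    backup_is_valid "$dump.gz" || return 1
    aws s3 cp "$dump.gz" "s3://$BACKUP_BUCKET/"
}

# Cron sketch for 02:00 UTC (assumes the host clock is UTC):
#   0 2 * * * . /opt/n8n/backup.sh && run_backup
```

None of these checks replaces a restore test; they only stop obviously broken files from reaching the bucket.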

A backup that has never been restored is a fantasy, not a backup.

Queue Mode and Redis: The Scaling Boundary

The difference between a hobby deployment and a production system that handles real throughput is EXECUTIONS_MODE=queue. In regular mode, workflow executions run inside the main n8n process. You are bound to a single CPU core and one event loop. The moment webhook volume spikes or a long-running workflow blocks, everything else queues behind it.

Queue mode introduces Redis as a message broker between the main n8n instance and worker processes. The main instance accepts webhooks and API calls, pushes jobs into Redis, and workers pull them off. You can scale workers horizontally to match load.

EXECUTIONS_MODE=queue
QUEUE_BULL_REDIS_HOST=redis
QUEUE_BULL_REDIS_PORT=6379
QUEUE_BULL_REDIS_PASSWORD=${REDIS_PASSWORD}
QUEUE_HEALTH_CHECK_ACTIVE=true

The gotchas are operational:

Workers must share the same database and encryption key as the main instance.
When multiple instances sit behind a load balancer, only one should serve the web UI and ingest webhooks while the others run as workers. If you deviate from that, assign each instance's role explicitly.
Database migrations run automatically on startup — in a multi-instance deployment, this can create races if every container tries to migrate simultaneously. Start one instance first, confirm health, then scale workers.
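The "start one instance first, confirm health, then scale" sequence can be sketched with a small retry helper. The helper names are hypothetical, and the service names n8n and n8n-worker are assumptions about how the Compose file is laid out.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" && return 0
        i=$(( i + 1 ))
        sleep "$delay"
    done
    return 1
}

# True once Docker reports the main n8n container as healthy.
main_is_healthy() {
    cid=$(docker compose ps -q n8n) || return 1
    [ "$(docker inspect --format '{{.State.Health.Status}}' "$cid" 2>/dev/null)" = "healthy" ]
}

# Usage sketch (not run here): bring up the main instance so it runs the
# migrations alone, then add workers.
#   docker compose up -d postgres redis n8n
#   wait_until 36 5 main_is_healthy &&
#       docker compose up -d --scale n8n-worker=3 n8n-worker
```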

I also enable QUEUE_HEALTH_CHECK_ACTIVE so worker health is visible through the same monitoring endpoints. If a worker dies silently, Redis still holds the jobs, but nothing processes them.

Hardening the Surface

Security in a self-hosted n8n deployment is not a separate workstream; it is part of the assembly. I block Code node access to environment variables with N8N_BLOCK_ENV_ACCESS_IN_NODE=true. Any workflow — especially community templates or imports — can otherwise read the entire process environment, which includes database passwords, encryption keys, and API secrets.

For webhooks, I never rely on obscurity. Public webhook URLs are discoverable. I layer:

IP allowlisting at the reverse proxy.
Header authentication inside the Webhook node.
HMAC signature verification in a Code node before processing the payload (Stripe, GitHub, Shopify).
Rate limiting at the proxy as the final guard against accidental floods or deliberate abuse.
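To exercise the HMAC layer from outside, it helps to be able to compute a signature on the command line and attach it to a test request. The sketch below assumes the sender signs the raw body with HMAC-SHA256 and hex-encodes the result; header names and encodings differ between providers, so check each one's documentation. The function names are hypothetical, and the string comparison here is not constant-time, which is acceptable for a test tool but not for the verification inside the workflow.

```shell
# Compute a hex HMAC-SHA256 of a request body.
# $1 = shared secret, $2 = raw request body
hmac_sha256_hex() {
    # openssl prints "<label>= <hex>"; keep only the last field.
    printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

# Compare a supplied signature against the expected one.
# $1 = secret, $2 = body, $3 = hex signature from the sender
signature_matches() {
    expected=$(hmac_sha256_hex "$1" "$2")
    [ -n "$expected" ] && [ "$expected" = "$3" ]
}

# Usage sketch: send a signed test event. The header name is a placeholder.
#   body='{"event":"ping"}'
#   sig=$(hmac_sha256_hex "$WEBHOOK_SECRET" "$body")
#   curl -sS -X POST -H "X-Signature: $sig" -d "$body" "$WEBHOOK_TEST_URL"
```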

The editor itself is an administrative interface with full code execution capability. Exposing it to the public internet is unnecessary risk. Where possible, I place n8n behind Tailscale or WireGuard so team members reach the editor through a mesh VPN, while webhooks remain publicly accessible through a carefully restricted path. If the editor must face the internet, I pair it with fail2ban on the reverse proxy's access logs to block brute-force attempts against /rest/login after five failures.

The Observability Triangle

Framework · The observability triangle · logs + metrics + synthetic traces

Run two of the three and you have a blind spot. Logs tell you what happened after the fact. Metrics tell you when something is degrading before it breaks. Synthetic traces confirm that the system actually works from the user's perspective.

Logs start with Docker's json-file driver, but they cannot stay on the host. I cap them with rotation and ship them to a centralised system. n8n's execution logs are verbose; without centralisation, debugging a failure means SSHing into the host and grepping a file that might have already rotated away.

Metrics cover disk space, memory, CPU, Postgres connection count, and Redis queue depth. Uptime Kuma is a decent starting point for external monitoring, but it only checks reachability. I configure three monitors: the /healthz endpoint, the editor page for keyword presence, and a test webhook that exercises an actual end-to-end workflow.

/healthz lies

/healthz returns 200 even if the database connection is dead. A synthetic workflow monitor is the only way to catch the "running but unable to work" failure mode.

Traces, in this context, are the synthetic end-to-end tests. A scheduled workflow that pings a health-check endpoint, queries the database, and posts to a private Slack channel every five minutes is a living proof that the system is not just up, but functional. When that workflow fails, I know the problem is real before any user reports it.
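The same idea works from outside the instance: call a test webhook and require proof in the response that the workflow actually ran. This is a sketch under assumptions. The probe workflow is assumed to respond with a body containing a known marker only after it has touched the database, and the function names, URL, and marker are hypothetical.

```shell
# Decide whether a probe response proves the workflow ran end to end.
# $1 = HTTP status, $2 = response body, $3 = marker the workflow must return
probe_passed() {
    status=$1
    body=$2
    marker=$3
    [ "$status" = "200" ] || return 1
    case "$body" in
        *"$marker"*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Call the probe webhook and evaluate the response. Not invoked here.
run_probe() {
    url=$1
    marker=$2
    # -w appends the status code on its own line after the body.
    response=$(curl -sS --max-time 15 -w '\n%{http_code}' "$url") || return 1
    status=$(printf '%s\n' "$response" | tail -n 1)
    body=$(printf '%s\n' "$response" | sed '$d')
    probe_passed "$status" "$body" "$marker"
}

# Usage sketch:
#   run_probe "https://n8n.example.com/webhook/synthetic-probe" '"db":"ok"' ||
#       echo "synthetic probe failed" >&2
```

A 200 with the wrong body counts as a failure, which is exactly the "running but unable to work" case.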

Disaster Recovery and Zero-Downtime Upgrades

Backups are two-layered. The database gets daily pg_dump exports with off-site retention. The encryption key is backed up separately from the database: if both live in the same bucket, losing that bucket loses everything, and compromising it hands an attacker both the encrypted credentials and the key to read them. I also version-control the docker-compose.yml and Caddyfile, but never the .env or secrets directory.

For upgrades, I avoid the standard docker compose pull && docker compose up -d dance in production. That creates a 30–120 second window where webhooks bounce and active executions terminate. Instead, I use a blue-green pattern: run the new n8n version alongside the old one, verify health, switch the reverse proxy upstream, and only then stop the old container.

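# Abort on the first failure so a failed health wait never reaches the proxy switch
set -euo pipefail

# Start new version
docker compose -f docker-compose.yml up -d n8n-green

# Wait for health. Resolve the container ID rather than guessing the generated name,
# and match the whole line: a plain grep for "healthy" also matches "unhealthy".
green_id=$(docker compose ps -q n8n-green)
timeout 180 bash -c "until docker inspect --format='{{.State.Health.Status}}' $green_id | grep -qx healthy; do sleep 5; done"

# Switch proxy upstream and reload (assumes the proxy service is named caddy)
sed -i 's/n8n-blue:5678/n8n-green:5678/' Caddyfile
docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile

# Stop old version
docker compose stop n8n-blue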

Migrations are one-way

Both versions share the same Postgres database, and n8n applies schema migrations automatically on startup. Once the green instance starts and mutates the schema, rolling back means restoring the database, so take a fresh pg_dump before starting green.

What to Change on Monday Morning

If you are running n8n in production today, here is your checklist for next week.

Pin your image tag

Replace latest with the specific version you are currently running. Commit the file.

Add real health checks to your Compose file

Postgres gets pg_isready with a 30-second start period. n8n gets a wget against /healthz with a 60-second start period. Use depends_on with condition: service_healthy.

Cap execution retention

Set EXECUTIONS_DATA_PRUNE=true and limit retention to seven days unless compliance requires more. Database bloat is silent until catastrophic.

Generate and back up the encryption key

Generate a real N8N_ENCRYPTION_KEY if you don't have one, back it up in two locations outside the host, and verify it loads from .env — not from an ephemeral container volume.

Switch to queue mode

Add Redis to your four-component stack. Even a single worker process buys you isolation between the webhook receiver and the execution engine.

Put a reverse proxy in front of n8n

Caddy takes ten minutes to configure and removes the certificate management burden forever. Set WEBHOOK_URL to match the public hostname.

Schedule a daily pg_dump to external storage

Test the restore. Quarterly.

Add one synthetic monitor that exercises a real workflow

Checking /healthz is not enough. You need to know that n8n can receive a webhook, read from Postgres, and complete an execution.

That single synthetic monitor will catch more failures than every resource graph combined.