Part 5 | Production Engineering | Chapter 22

Monitoring and Alerting

If you can't see your Pi failing, you won't know it failed — and 'it was fine last time I checked' is not a monitoring strategy.
Reading time: 12 mins

The most expensive failure I've seen on a Pi wasn't a crash. It was a slow death. A temperature sensor on a production board started returning stale readings after a firmware update — the process was alive, the systemd service showed green, the device responded to pings. But the actual sensor data hadn't updated in four days. Nobody noticed until a downstream system flagged impossible values. Four days of missing data, unrecoverable, because the operator's monitoring strategy was "SSH in and check if it's running."

"If the Pi isn't complaining, it's working fine" is the trap. Pis don't complain. They run silently until the SD card fills up, the CPU thermally throttles, a memory leak consumes every available byte, or a process hangs on a socket it will never hear back from. Without monitoring, you find out about these failures from your users or, worse, from the absence of data you needed last week.


The Three Pillars

Framework · The Dashboard Axiom · DA

Every production Pi needs three things: a health endpoint that reports application-level status, a metrics collector that tracks system vitals over time, and an alert channel that notifies you before failures compound. Miss any one of the three and you're flying blind in exactly the dimension that will eventually fail.

A health endpoint tells you if the application is working right now. Metrics collection tells you how the system has been performing over hours, days, and weeks — the trend data that reveals slow leaks and approaching limits. Alerting tells you when something crosses a threshold that demands attention. Each pillar answers a different question, and no two pillars are substitutes for each other.

Pillar 1: The Health Endpoint

Every service running on a Pi should expose an HTTP endpoint at /health that returns a structured response. This isn't a suggestion — it's the foundation that the other two pillars build on.

# health.py: add this to any Flask application
import time

import psutil
from flask import jsonify
from gpiozero import CPUTemperature

# Severity order, so a later check can never downgrade an earlier one
SEVERITY = {"ok": 0, "degraded": 1, "critical": 2}

def register_health(app):
    @app.route("/health")
    def health():
        temp = CPUTemperature().temperature  # read once, reuse below
        disk = psutil.disk_usage("/")
        memory = psutil.virtual_memory()

        status = "ok"
        issues = []

        def flag(level, message):
            """Record an issue and raise the status if this level is worse."""
            nonlocal status
            if SEVERITY[level] > SEVERITY[status]:
                status = level
            issues.append(message)

        if temp > 80:
            flag("degraded", f"CPU temp {temp:.1f}°C, throttling likely")

        if disk.percent > 90:
            flag("critical", f"Disk {disk.percent}% full")

        if memory.percent > 85:
            flag("degraded", f"Memory {memory.percent}% used")

        return jsonify({
            "status": status,
            # /proc/uptime holds a float such as "284712.53", which int()
            # cannot parse directly, so derive uptime from the boot time
            "uptime_seconds": int(time.time() - psutil.boot_time()),
            "cpu_temp_c": round(temp, 1),
            # interval=1 blocks this request for one second while sampling
            "cpu_percent": psutil.cpu_percent(interval=1),
            "memory_percent": round(memory.percent, 1),
            "memory_available_mb": round(memory.available / 1024 / 1024),
            "disk_percent": round(disk.percent, 1),
            "disk_free_gb": round(disk.free / 1024 / 1024 / 1024, 2),
            "issues": issues,
        })

Call it:

curl http://localhost:5000/health
{
  "status": "ok",
  "uptime_seconds": 284712,
  "cpu_temp_c": 52.1,
  "cpu_percent": 12.3,
  "memory_percent": 41.2,
  "memory_available_mb": 2356,
  "disk_percent": 34.7,
  "disk_free_gb": 19.42,
  "issues": []
}

The status field uses three values: ok, degraded, and critical. Any external monitoring system — whether it's a Grafana dashboard, a cron-based checker on another machine, or a cloud uptime service — can poll this endpoint and act on the status. A degraded status means "look at this when you get a chance." A critical status means "act now."
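
As one way to picture the cron-based checker mentioned above, the sketch below polls the endpoint from another machine and turns the status field into an exit code. The hostname, the function names, and the exit-code convention (0, 1, 2) are assumptions chosen for illustration; an unreachable Pi is treated as critical.

```python
"""check_health.py: poll a Pi's /health endpoint from another machine (sketch)."""
import json
import urllib.request

# Hypothetical address; replace with your Pi's hostname or IP.
HEALTH_URL = "http://raspberrypi.local:5000/health"


def fetch_health(url, timeout=5):
    """Return the parsed JSON body, or a synthetic critical payload on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError) as exc:
        # OSError covers refused connections, DNS failures, and timeouts;
        # ValueError covers a body that is not valid JSON.
        return {"status": "critical", "issues": [f"health check failed: {exc}"]}


def exit_code_for(payload):
    """Map status to an exit code: 0 ok, 1 degraded, 2 critical or unknown."""
    return {"ok": 0, "degraded": 1}.get(payload.get("status"), 2)
```

Run from cron, a script could end with sys.exit(exit_code_for(fetch_health(HEALTH_URL))) and let a non-zero exit code trigger whatever notification the checking machine already supports.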

Key takeaway

A /health endpoint that returns structured JSON with status, temperature, memory, and disk usage is the minimum viable monitoring for any Pi service. It takes a few dozen lines of Python and answers the question "is this Pi actually working?" from anywhere on the network.

For applications with custom metrics — frames processed, messages relayed, detections per minute — add them to the health response:

# Application-specific metrics
app_metrics = {
    "frames_processed_total": frame_counter.value,
    "detections_last_hour": detection_tracker.count_since(hours=1),
    "mqtt_messages_sent": mqtt_counter.value,
    "last_sensor_reading_age_sec": time.time() - last_reading_timestamp,
}

That last_sensor_reading_age_sec field is exactly what would have caught the stale-sensor failure I described in the opening. If the age exceeds thirty seconds, the health endpoint reports degraded. If it exceeds five minutes, critical. The sensor process is alive but not delivering data — and now you know.
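
The thirty-second and five-minute thresholds could be wired into the status roughly as follows. This is a minimal sketch: the constant and function names are hypothetical, and the severity ordering is an assumption about how the staleness result should combine with the system checks.

```python
import time

# Thresholds taken from the text: 30 seconds and 5 minutes.
STALE_DEGRADED_SEC = 30
STALE_CRITICAL_SEC = 300

# Severity order, used to keep the worst status across several checks.
SEVERITY = {"ok": 0, "degraded": 1, "critical": 2}


def reading_age(last_reading_timestamp, now=None):
    """Seconds since the last sensor reading (now defaults to time.time())."""
    now = time.time() if now is None else now
    return now - last_reading_timestamp


def staleness_status(age_sec):
    """Translate a reading age into ok, degraded, or critical."""
    if age_sec > STALE_CRITICAL_SEC:
        return "critical"
    if age_sec > STALE_DEGRADED_SEC:
        return "degraded"
    return "ok"


def worst(*statuses):
    """Return the most severe of the given statuses."""
    return max(statuses, key=SEVERITY.__getitem__)
```

Inside the health handler, the final status would then be something like worst(system_status, staleness_status(reading_age(last_reading_timestamp))), so a healthy board with a silent sensor still reports a problem.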

Pillar 2: Metrics Collection with Prometheus

A health endpoint tells you the current state. Metrics collection tells you the trend. The temperature is 72 degrees right now — is that normal, or was it 55 degrees yesterday and climbing? You can only answer that question with historical data.

Prometheus is the standard tool for this. It's a time-series database that scrapes HTTP endpoints at regular intervals and stores the results. On a Pi, you use node_exporter to expose system metrics, and optionally a custom exporter for application metrics.

Installing Node Exporter

Node exporter runs on the Pi and exposes system-level metrics (CPU, memory, disk, network, temperature) in Prometheus format:

sudo apt install -y prometheus-node-exporter
sudo systemctl enable prometheus-node-exporter
sudo systemctl start prometheus-node-exporter

Verify it's running:

curl http://localhost:9100/metrics | head -20

You'll see hundreds of metrics in Prometheus format:

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 289412.47
node_cpu_seconds_total{cpu="0",mode="system"} 3241.89
...
node_hwmon_temp_celsius{chip="cpu_thermal",sensor="temp0"} 52.1
...
node_filesystem_avail_bytes{device="/dev/mmcblk0p2",mountpoint="/"} 20861624320
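
The format is simple enough to read by hand: comment lines start with #, and every other line is a series name (with optional labels) followed by a value. The sketch below parses that shape into a dictionary. It is a simplified illustration, not a full parser: it assumes no trailing timestamps, and parse_metrics is a hypothetical helper name.

```python
SAMPLE = '''# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 289412.47
...
node_hwmon_temp_celsius{chip="cpu_thermal",sensor="temp0"} 52.1
node_filesystem_avail_bytes{device="/dev/mmcblk0p2",mountpoint="/"} 20861624320
'''


def parse_metrics(text):
    """Parse Prometheus text-format lines into {series: value}.

    Simplified: skips HELP/TYPE comments and assumes the value is the
    last whitespace-separated token (no trailing timestamp).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # not a sample line, e.g. the "..." placeholder
    return samples


metrics = parse_metrics(SAMPLE)
cpu_temp = metrics['node_hwmon_temp_celsius{chip="cpu_thermal",sensor="temp0"}']
```

In practice Prometheus does this parsing for you; the point is only that the endpoint is plain text you can inspect or script against when debugging.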

Running Prometheus on the Pi (or remotely)

For a single Pi, Prometheus can run on the same board. It uses roughly 100-150 MB of RAM with a two-week retention period:

sudo apt install -y prometheus

Edit the Prometheus config to scrape node_exporter:

sudo nano /etc/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

  - job_name: "gpio-api"
    metrics_path: "/metrics"
    static_configs:
      - targets: ["localhost:5000"]

sudo systemctl restart prometheus

Remote Prometheus for multiple Pis

If you're running more than two or three Pis, run Prometheus on a separate machine (another Pi, a NAS, or a cloud VM) and point it at all your Pi node_exporters. This keeps the monitoring infrastructure separate from the monitored devices — so when a Pi goes down, you still have its historical metrics.

Adding Custom Application Metrics

For Python applications, the prometheus_client library exposes custom metrics in the same format that Prometheus scrapes:

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Define metrics
FRAMES_PROCESSED = Counter(
    "frames_processed_total",
    "Total number of camera frames processed",
)
DETECTION_LATENCY = Histogram(
    "detection_latency_seconds",
    "Time to process one frame through the detection model",
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5],
)
SENSOR_VALUE = Gauge(
    "sensor_temperature_celsius",
    "Current temperature from the DHT22 sensor",
)

# Start a metrics endpoint on port 8000
start_http_server(8000)

# In your application code:
def process_frame(frame):
    with DETECTION_LATENCY.time():
        result = model.detect(frame)
    FRAMES_PROCESSED.inc()
    return result

def read_sensor():
    temp = dht_sensor.temperature
    SENSOR_VALUE.set(temp)
    return temp

Add the metrics endpoint to your Prometheus config:

  - job_name: "app-metrics"
    static_configs:
      - targets: ["localhost:8000"]

Now Prometheus scrapes both system metrics (from node_exporter on port 9100) and application metrics (from your app on port 8000) every fifteen seconds.
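
Once the data is in Prometheus, the "is it climbing?" question becomes a query. The sketch below uses the Prometheus HTTP API's query_range endpoint to pull a series and fit a straight line through it. The helper names and the 24-hour window are assumptions for illustration, and the URL assumes Prometheus on its default port on the same host.

```python
import json
import time
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumed: default port, same host


def range_query_url(expr, hours, step="5m", now=None):
    """Build a /api/v1/query_range URL covering the last `hours` hours."""
    end = time.time() if now is None else now
    params = urllib.parse.urlencode({
        "query": expr,
        "start": end - hours * 3600,
        "end": end,
        "step": step,
    })
    return f"{PROMETHEUS_URL}/api/v1/query_range?{params}"


def first_series(response):
    """Extract [(timestamp, value), ...] from the first series in a matrix result."""
    result = response["data"]["result"]
    if not result:
        return []
    return [(float(ts), float(value)) for ts, value in result[0]["values"]]


def slope_per_hour(samples):
    """Least-squares slope of value over time, in units per hour."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return 0.0 if den == 0 else num / den * 3600


def fetch_trend(expr, hours=24):
    """Query Prometheus and return the fitted slope per hour for `expr`."""
    with urllib.request.urlopen(range_query_url(expr, hours), timeout=10) as resp:
        return slope_per_hour(first_series(json.load(resp)))
```

A call such as fetch_trend("node_hwmon_temp_celsius") would return degrees per hour over the last day. PromQL can also do this server-side with functions like deriv() and predict_linear(), which is usually the better choice inside alert rules.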

Without the trend, you can't distinguish "this is normal" from "this is about to fail."

Pillar 3: Alerting

Metrics without alerts are a history lesson. They tell you what happened after you noticed the problem. Alerts tell you the problem is happening right now, before it compounds.

Grafana Dashboards and Alerts

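Grafana connects to Prometheus and renders dashboards. On a Pi 4 or 5, Grafana runs comfortably alongside Prometheus and node_exporter. The grafana package is not in the default Raspberry Pi OS repositories, so add Grafana's APT repository first, then install: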

sudo apt install -y grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Open http://<pi-ip>:3000 in a browser (default credentials: admin/admin — change immediately). Add Prometheus as a data source (URL: http://localhost:9090), then create dashboards.

The four panels every Pi dashboard needs:

  1. CPU temperature over time: node_hwmon_temp_celsius, with an alert at 80 degrees (on most models the firmware begins throttling at around 80 and throttles harder at 85)
  2. Disk usage percentage: 100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100), with an alert at 85%
  3. Memory usage: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100, with an alert at 90%
  4. Service uptime: your application's custom uptime metric, or just the health endpoint status

Grafana supports alert channels for email, Slack, webhooks, PagerDuty, and dozens of other notification targets. Configure at least one. A dashboard that nobody looks at is decoration, not monitoring.

Running Grafana remotely

If memory is tight on your Pi (2 GB model), run Grafana on a separate machine and point it at the Pi's Prometheus instance. Grafana is a visualization layer — it doesn't need to run on the same device it's monitoring.

Alert Rules in Prometheus

For simpler setups, Prometheus can evaluate alert rules directly and send notifications through Alertmanager:

# /etc/prometheus/alert.rules.yml
groups:
  - name: pi-alerts
    rules:
      - alert: HighCPUTemperature
        expr: node_hwmon_temp_celsius{sensor="temp0"} > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Pi CPU temperature above 80°C for 2 minutes"
          
      - alert: DiskAlmostFull
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage above 85%"
          
      - alert: ServiceDown
        expr: up{job="gpio-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPIO API is not responding to scrapes"
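
The for: clause in each rule is what keeps a brief spike from paging you: the condition has to hold across consecutive evaluations for the whole duration before the alert fires. The sketch below models that behaviour in plain Python. It illustrates the semantics only and is not how Prometheus is implemented; PendingRule is a hypothetical name.

```python
class PendingRule:
    """Fires only after the condition has held continuously for hold_sec."""

    def __init__(self, hold_sec):
        self.hold_sec = hold_sec
        self.active_since = None  # time the condition first became true

    def evaluate(self, condition, now):
        """Return 'inactive', 'pending', or 'firing' for this evaluation."""
        if not condition:
            self.active_since = None  # any false evaluation resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_sec:
            return "firing"
        return "pending"


# Example: a 2-minute hold, evaluated at arbitrary timestamps (in seconds).
rule = PendingRule(hold_sec=120)
states = [
    rule.evaluate(True, 0),     # condition just became true
    rule.evaluate(True, 60),    # still true, but not for long enough
    rule.evaluate(True, 120),   # held for the full two minutes
    rule.evaluate(False, 135),  # condition cleared, timer resets
    rule.evaluate(True, 150),   # starts over from pending
]
```

The trade-off is latency: a longer for: duration filters more noise but delays the notification by the same amount.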

Add the rules file to prometheus.yml:

rule_files:
  - "alert.rules.yml"

Key takeaway

Metrics without alerts are a history lesson. Configure at least one alert channel — Slack, email, webhook — so your Pi tells you it's failing instead of waiting for you to notice.

The Lightweight Alternative: No Prometheus Required

Prometheus and Grafana are the professional answer. But for a single Pi running one or two services, they might be more infrastructure than you need. Here is a self-contained Python script that monitors system vitals and sends alerts through a Slack webhook or email, with no third-party dependencies beyond psutil, requests, and gpiozero:

#!/usr/bin/env python3
"""pi-watchdog.py — lightweight monitoring for a single Pi."""

import time
import json
import smtplib
import requests
import psutil
from email.mime.text import MIMEText
from gpiozero import CPUTemperature
from pathlib import Path

# Configuration
CHECK_INTERVAL = 60  # seconds
SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
ALERT_EMAIL = "you@example.com"
SMTP_HOST = "smtp.gmail.com"
SMTP_PORT = 587
SMTP_USER = "alerts@example.com"
SMTP_PASS = "app-password-here"

THRESHOLDS = {
    "cpu_temp_c": 80,
    "disk_percent": 85,
    "memory_percent": 90,
}

STATE_FILE = Path("/tmp/pi-watchdog-state.json")

def get_vitals():
    cpu = CPUTemperature()
    disk = psutil.disk_usage("/")
    mem = psutil.virtual_memory()
    return {
        "cpu_temp_c": round(cpu.temperature, 1),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "disk_percent": round(disk.percent, 1),
        "disk_free_gb": round(disk.free / 1024**3, 2),
        "memory_percent": round(mem.percent, 1),
        "memory_available_mb": round(mem.available / 1024**2),
    }

def check_thresholds(vitals):
    alerts = []
    for key, limit in THRESHOLDS.items():
        if vitals.get(key, 0) > limit:
            alerts.append(f"{key} = {vitals[key]} (threshold: {limit})")
    return alerts

def send_slack(message):
    if not SLACK_WEBHOOK.startswith("https://hooks.slack.com"):
        return
    requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=10)

def send_email(subject, body):
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = SMTP_USER
    msg["To"] = ALERT_EMAIL
    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
        server.starttls()
        server.login(SMTP_USER, SMTP_PASS)
        server.send_message(msg)

def load_state():
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"last_alert_time": 0}

def save_state(state):
    STATE_FILE.write_text(json.dumps(state))

def main():
    import socket
    hostname = socket.gethostname()
    
    while True:
        vitals = get_vitals()
        alerts = check_thresholds(vitals)
        state = load_state()
        
        if alerts:
            # Rate-limit: don't alert more than once per 30 minutes
            if time.time() - state["last_alert_time"] > 1800:
                message = (
                    f"[{hostname}] Pi alert:\n"
                    + "\n".join(f"  - {a}" for a in alerts)
                    + f"\n\nFull vitals: {json.dumps(vitals, indent=2)}"
                )
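                try:
                    send_slack(message)
                except requests.RequestException:
                    pass  # Slack is best-effort; a failed post must not skip the email
                try:
                    send_email(f"Pi Alert: {hostname}", message)
                except Exception:
                    pass  # Email is best-effort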
                
                state["last_alert_time"] = time.time()
                save_state(state)
        
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()

Run it as a systemd service (you know how to do this from the previous chapter):

[Unit]
Description=Pi Watchdog — lightweight monitoring
After=network-online.target

[Service]
Type=simple
User=pi
ExecStart=/home/pi/monitoring/venv/bin/python pi-watchdog.py
WorkingDirectory=/home/pi/monitoring
Restart=always
RestartSec=10
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target

This gives you temperature, disk, and memory alerting with zero infrastructure beyond the script itself. It won't win any observability awards, but it catches the failures that matter — and it runs on a Pi Zero with resources to spare.

✕ Full stack (Prometheus + Grafana)
  • Historical metrics with retention
  • Rich dashboards and visualization
  • Complex alert rules with durations
  • ~250 MB RAM on the Pi
  • Right for multi-Pi deployments
✓ Lightweight (Python watchdog)
  • Current-state checking only
  • No dashboards — alerts only
  • Simple threshold checking
  • ~15 MB RAM
  • Right for single-Pi deployments

Custom Application Metrics That Actually Matter

System metrics (CPU, memory, disk) tell you if the Pi is healthy. Application metrics tell you if the work is getting done. The distinction matters: a Pi can show 20% CPU usage, 40% memory, and plenty of disk space while the application is stuck in a retry loop processing zero frames.

The metrics that matter depend on your application, but here are the categories that cover most Pi deployments:

  • Throughput metrics: frames processed per minute, messages relayed per second, sensor readings per cycle. If this number drops to zero, your application is dead even if the process is alive.
  • Latency metrics: time to process one frame, time to complete one MQTT publish, time to read one sensor. If this number climbs, you're approaching a resource wall.
  • Error metrics: failed sensor reads, dropped MQTT connections, HTTP 500 responses. A non-zero error rate is normal. A climbing error rate is a warning.
  • Staleness metrics: seconds since the last successful sensor reading, seconds since the last successful API call. This is the metric that catches the "process is alive but not doing anything" failure mode.
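
Throughput is usually derived from a counter rather than measured directly: sample the counter twice and divide the difference by the elapsed time. The sketch below shows that arithmetic, including the case where the counter restarts from zero. The function name is hypothetical, and treating any decrease as a restart is an assumption.

```python
def per_minute_rate(prev_count, prev_time, curr_count, curr_time):
    """Events per minute between two samples of a monotonically increasing counter."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        return 0.0
    delta = curr_count - prev_count
    if delta < 0:
        # The counter went backwards, so assume the process restarted
        # and began counting again from zero.
        delta = curr_count
    return delta / elapsed * 60
```

A rate of zero across several consecutive samples is the "process alive, work stopped" signal described in the throughput bullet.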

The staleness check

If you instrument only one custom metric, make it a staleness gauge — the number of seconds since your application last completed its primary task. A staleness value that exceeds two or three cycle times is the earliest possible signal that something is wrong. I've seen this pattern where every other metric looks green but the application stopped processing ten minutes ago. The staleness gauge is the only metric that catches this.
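
A staleness gauge needs very little code. The sketch below records when the primary task last succeeded and reports whether the age exceeds a multiple of the cycle time. The class name, the default tolerance of three cycles, and the injectable clock are assumptions for illustration. It uses time.monotonic rather than wall-clock time because the wall clock on a Pi can jump when NTP corrects it after boot.

```python
import time


class StalenessTracker:
    """Tracks seconds since the application last completed its primary task."""

    def __init__(self, cycle_sec, tolerance=3, clock=time.monotonic):
        self.cycle_sec = cycle_sec    # expected time between successes
        self.tolerance = tolerance    # how many missed cycles count as stale
        self._clock = clock           # injectable so tests can control time
        self._last_success = clock()

    def mark_success(self):
        """Call this each time the primary task completes successfully."""
        self._last_success = self._clock()

    def age_seconds(self):
        """Seconds elapsed since the last recorded success."""
        return self._clock() - self._last_success

    def is_stale(self):
        """True once the age exceeds tolerance times the cycle time."""
        return self.age_seconds() > self.cycle_sec * self.tolerance
```

The application would call mark_success() after each successful sensor read or processed frame, and the value of age_seconds() is what gets exported, whether through the health response or a Prometheus gauge.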

Monitoring Docker Containers

If your Pi runs Docker (from Chapter 19), monitoring the containers adds another dimension. Docker exposes a metrics endpoint that Prometheus can scrape:

# Enable Docker metrics
sudo nano /etc/docker/daemon.json

{
  "metrics-addr": "127.0.0.1:9323",
  "experimental": true
}

sudo systemctl restart docker

Add it to Prometheus:

  - job_name: "docker"
    static_configs:
      - targets: ["localhost:9323"]

For per-container metrics (CPU, memory, network per container), cAdvisor is the standard tool — but it's heavy for a Pi. A lighter alternative is parsing docker stats output:

import subprocess
import json

def get_container_stats():
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format",
         '{"name":"{{.Name}}","cpu":"{{.CPUPerc}}","mem":"{{.MemUsage}}","net":"{{.NetIO}}"}'],
        capture_output=True, text=True,
    )
    stats = []
    for line in result.stdout.strip().split("\n"):
        if line:
            stats.append(json.loads(line))
    return stats

This gives you per-container resource usage without running a full cAdvisor instance.
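
The values that come back are display strings such as "12.50%" or "45.3MiB / 1.9GiB", so they need converting before you can compare them against a threshold. The sketch below handles that conversion. The unit table reflects the suffixes docker stats commonly prints and is an assumption; extend it if your Docker version shows others.

```python
import re

# Assumed unit suffixes: binary units for memory, decimal units for network I/O.
_UNITS = {
    "B": 1,
    "kB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4,
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4,
}


def parse_percent(text):
    """Convert a string like '12.50%' to the float 12.5."""
    return float(text.strip().rstrip("%"))


def parse_size(text):
    """Convert a string like '45.3MiB' to a number of bytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([A-Za-z]+)\s*", text)
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def parse_mem_usage(text):
    """Convert '45.3MiB / 1.9GiB' to a (used_bytes, limit_bytes) tuple."""
    used, _, limit = text.partition("/")
    return parse_size(used), parse_size(limit)
```

With these helpers, each entry from get_container_stats() can be reduced to numbers, for example parse_percent(entry["cpu"]) and parse_mem_usage(entry["mem"]).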

Key takeaway

System metrics tell you if the Pi is healthy. Application metrics tell you if the work is getting done. Monitor both — a healthy Pi running a dead application is worse than a Pi that's obviously crashed, because it takes longer to notice.

What to Do Monday Morning

Add a /health endpoint to your primary Pi service

Return JSON with status, cpu_temp_c, memory_percent, disk_percent, and at least one application-specific metric (like last_reading_age_sec). Test it with curl. This endpoint is the foundation — everything else builds on it.

Install node_exporter

Run sudo apt install prometheus-node-exporter and confirm it's serving metrics at http://localhost:9100/metrics. Even if you don't set up Prometheus immediately, node_exporter is ready when you need it.

Choose your monitoring stack

Running multiple Pis or need historical dashboards? Install Prometheus and Grafana. Running one Pi and just need alerts? Deploy the Python watchdog script from this chapter as a systemd service. Either path gives you monitoring — the worst choice is neither.

Configure at least one alert channel

A Slack webhook takes five minutes to set up. An email alert takes ten. Pick one and configure it. Then trigger a test alert by temporarily lowering a threshold. An alert channel you've never seen fire is an alert channel you can't trust.

Add a staleness metric to your application

Track the time since your application last completed its primary task. Alert if that time exceeds 2-3x the normal cycle time. This one metric catches the failure mode that every other metric misses: a process that's alive but not working.

Test your monitoring by causing a failure

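Fill the disk past your alert threshold with a temporary file (for example dd if=/dev/zero of=/tmp/fill bs=1M count=5000, adjusting count to the free space on your card and checking that /tmp lives on the root filesystem rather than in RAM). Watch the alert fire. Delete the file. Confirm the alert resolves. If you've never seen your alerting work, you don't know if it works.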
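
How large the temporary file needs to be depends on the size of the card and how full it already is. The sketch below estimates the number of 1 MiB blocks to pass as count. The function names are hypothetical, and the result is approximate because filesystems reserve some blocks that shutil.disk_usage does not count as free.

```python
import shutil


def mb_to_reach(target_percent, total_bytes, used_bytes):
    """MiB of extra data needed to bring usage up to target_percent."""
    target_used = total_bytes * target_percent / 100
    extra = max(0, target_used - used_bytes)
    return int(extra // (1024 * 1024))


def plan_fill(path="/", target_percent=90):
    """Estimate the dd count (with bs=1M) needed to reach target_percent on path."""
    usage = shutil.disk_usage(path)
    return mb_to_reach(target_percent, usage.total, usage.used)
```

Calling plan_fill("/", 90) returns the value to use for count. A result of 0 means the filesystem is already at or above the target.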

The trap is assuming that silence means health. A Pi that isn't reporting its status is a Pi you know nothing about. Monitoring is not overhead — it's the difference between discovering a failure in minutes and discovering it in weeks. Every production Pi needs a health endpoint, a metrics collector, and an alert channel. Without all three, you're guessing. And guessing isn't engineering.
