Part 5 | Production Engineering | Chapter 21

Systemd and Reliability

Every Pi project that runs with nohup or a cron @reboot entry is one unhandled exception away from silent death.

I can predict exactly when a Pi project dies. It's not when the power supply fails or the SD card corrupts. It's three weeks after deployment, when a Python exception crashes the main process at 2 AM, and nobody notices because the script was started with nohup python app.py & and there's nothing watching it. The Pi is still powered on. The LED is still blinking. SSH still works. But the application — the reason the Pi exists — has been dead for days. Maybe weeks. Nobody knows until someone checks manually, which is exactly the check nobody scheduled.

This failure mode is so common it should have a name. I'll give it one: the silent-death problem. And the solution has shipped with nearly every mainstream Linux distribution, Raspberry Pi OS included, for over a decade. It's called systemd.


Why nohup and cron @reboot Aren't Enough

Here is the typical progression of a Pi project's deployment strategy:

  1. Development: python app.py in a terminal window. Works until you close the SSH session.
  2. First fix: nohup python app.py &. Works until the script crashes.
  3. Second fix: cron @reboot entry. Starts the script on boot, but if it crashes after boot, it stays dead until the next reboot.
  4. Third fix: A bash while true loop that restarts the script. Works until the loop itself crashes, or until you need to check logs and realize they went to /dev/null.
  5. Giving up: "It mostly works. I just restart the Pi every few days."

Each "fix" adds complexity while solving only one failure mode. Systemd solves all of them — automatic start on boot, automatic restart on crash, structured logging, dependency ordering, resource limits, and watchdog supervision — with a single configuration file.

Framework · The Systemd Contract · SC

If your script isn't a systemd service, it's not production. Systemd gives you automatic restart on crash, dependency ordering, resource limits, and structured logging. A cron job gives you a prayer.

Anatomy of a Service Unit File

A systemd service is defined by a .service file in /etc/systemd/system/. Here is a complete, production-grade unit file for the Flask GPIO API from the Docker chapter — but this time running directly on the host:

[Unit]
Description=GPIO Flask API
Documentation=https://github.com/youruser/pi-gpio-api
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=pi
Group=pi
WorkingDirectory=/home/pi/gpio-api
Environment=FLASK_ENV=production
Environment=PYTHONUNBUFFERED=1
ExecStart=/home/pi/gpio-api/venv/bin/python app.py

# Restart policy
Restart=always
RestartSec=5

# Resource limits
MemoryMax=256M
CPUQuota=50%

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/home/pi/gpio-api/data

# Watchdog (the application must send WATCHDOG=1 pings; see the Watchdog Timer section)
WatchdogSec=30

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=gpio-api

[Install]
WantedBy=multi-user.target

Every directive earns its place. Here is what each section does and why it matters.

The [Unit] Section

[Unit]
Description=GPIO Flask API
After=network-online.target
Wants=network-online.target

After=network-online.target tells systemd to start this service only after the network is fully up — not just after the network interface exists, but after it has an IP address. This matters for any service that binds to a network port or connects to a remote resource. Without it, your Flask app tries to bind before the network is ready and fails on roughly 30% of boots.

Wants=network-online.target is the soft version of Requires= — if the network target fails, systemd still attempts to start your service. Use Requires= instead if your service genuinely cannot function without network.

The [Service] Section

[Service]
Type=simple
User=pi
WorkingDirectory=/home/pi/gpio-api
ExecStart=/home/pi/gpio-api/venv/bin/python app.py

Type=simple means systemd considers the service started as soon as it has forked off the process that runs ExecStart; it does not wait for the application to finish initializing. This is correct for most Python applications that run a main loop. Use Type=notify if your application uses the systemd notification protocol (more on this below).

User=pi runs the service as the pi user, not root. Never run application services as root unless they require privileged hardware access that can't be granted through device groups.

WorkingDirectory sets the current directory before executing the command. Without this, relative file paths in your Python code resolve against /, not your project directory — a bug that's invisible in development and catastrophic in production.

ExecStart uses the full path to the Python binary inside the virtualenv. This is critical. If you write ExecStart=python app.py, older systemd versions reject the unit because the command is not an absolute path, and newer ones look python up in a fixed search path and find the system Python, which doesn't have your project's dependencies installed. Always use the absolute path to the venv interpreter.
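One way to confirm that the unit really launches the virtualenv interpreter from the intended directory is to log both at startup. The following is a minimal sketch; the function name runtime_summary is illustrative and not part of the original project.

```python
import os
import sys


def runtime_summary():
    """Collect the interpreter path, venv status, and working directory."""
    return {
        "executable": sys.executable,                     # should point into venv/bin
        "in_virtualenv": sys.prefix != sys.base_prefix,   # True inside a venv
        "cwd": os.getcwd(),                               # should match WorkingDirectory=
    }


# Print once at startup so the values land in the journal.
for key, value in runtime_summary().items():
    print(f"startup {key}={value}", flush=True)
```

If the journal shows an executable outside the venv or a cwd of /, the ExecStart or WorkingDirectory line is the first place to look.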

PYTHONUNBUFFERED=1

Setting PYTHONUNBUFFERED=1 in the Environment directive forces Python to flush stdout and stderr immediately instead of buffering. Without it, your logs appear in journalctl in irregular bursts rather than in real time — which makes debugging a running service significantly harder.
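Where setting the environment variable is not an option, a similar effect can be approximated in code by flushing each write explicitly. A small sketch; the helper name log_line is hypothetical.

```python
import sys


def log_line(message, stream=None):
    """Write one line and flush immediately so it reaches the journal without delay."""
    if stream is None:
        stream = sys.stdout
    print(message, file=stream, flush=True)


log_line("sensor loop started")
```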

Restart Policy

Restart=always
RestartSec=5

Restart=always means systemd restarts the process no matter how it exits — clean exit, crash, signal, out-of-memory kill. This is the single most important directive for production reliability. A Python exception that takes down your process is no longer a permanent failure; it's a five-second interruption.

RestartSec=5 adds a five-second delay between crash and restart. This prevents a crash loop from consuming all CPU time. Systemd also rate-limits starts: by default a unit may start no more than 5 times in 10 seconds before it is marked as failed. With a five-second delay that default budget is never exhausted, so a crash-looping service would retry indefinitely; the full service file later in this chapter therefore widens the window with StartLimitIntervalSec=60 and StartLimitBurst=5. Once the limit trips, the failure needs human attention, not infinite retries.
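How RestartSec interacts with the start limit can be checked with a few lines of arithmetic. The sketch below is a simplified model of the rate limiter (a counting window that opens at the first start), not systemd's actual code, and it assumes a service that crashes immediately after every start. The function name is illustrative.

```python
def starts_until_limited(restart_sec, interval_sec, burst, max_starts=1000):
    """Simulate a service that crashes immediately after every start.

    Simplified model (an assumption, not systemd's actual code): a counting
    window opens at the first start, allows up to `burst` starts, and resets
    once more than `interval_sec` seconds have passed since it opened.
    Returns the number of starts allowed before one is refused, or None if
    the limit is never reached within `max_starts` attempts.
    """
    window_begin = None
    count = 0
    now = 0.0
    for allowed in range(max_starts):
        if window_begin is None or now - window_begin > interval_sec:
            window_begin, count = now, 1   # open a fresh window
        elif count < burst:
            count += 1                     # still inside the budget
        else:
            return allowed                 # this start would be refused
        now += restart_sec                 # crash, then wait RestartSec
    return None


# Default budget (5 starts per 10 s) with RestartSec=5: never exhausted.
print(starts_until_limited(restart_sec=5, interval_sec=10, burst=5))
# Wider window (5 starts per 60 s) with RestartSec=5: refused after 5 starts.
print(starts_until_limited(restart_sec=5, interval_sec=60, burst=5))
```

Under this model, a five-second restart delay only trips the limit when the interval is long enough to hold more than StartLimitBurst restarts.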


Key takeaway

Restart=always with RestartSec=5 is the most important reliability feature you can add to a Pi deployment. It transforms every crash from a permanent failure into a brief interruption that you can investigate later.

Resource Limits

MemoryMax=256M
CPUQuota=50%

These directives prevent a misbehaving service from taking down the entire Pi. MemoryMax=256M tells systemd to kill the process if it exceeds 256 MB of memory — protecting the system from memory leaks. CPUQuota=50% limits the service to half of one CPU core (on a quad-core Pi, that's 12.5% of total compute).

I've seen this pattern where a Python application with a slow memory leak runs fine for weeks, then one morning the Pi is completely unresponsive because the application consumed all available RAM. The kernel's OOM killer might save the system — or it might kill SSH instead, making the Pi unreachable until someone physically power-cycles it. MemoryMax prevents this by killing the offending service and letting Restart=always bring it back with fresh memory.
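The arithmetic behind these two limits is easy to get wrong, so here is a small sketch that spells it out. It assumes base-1024 size suffixes as systemd uses for MemoryMax, and borrows the rule of thumb from later in this chapter (set the limit to twice the observed usage). All function names are illustrative.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(value):
    """Convert a size such as '256M' to bytes (suffixes use base 1024)."""
    value = value.strip()
    if value and value[-1].upper() in UNITS:
        return int(value[:-1]) * UNITS[value[-1].upper()]
    return int(value)


def share_of_total_cpu(quota_percent, cores):
    """CPUQuota is relative to one core; express it as a share of all cores."""
    return quota_percent / cores


def suggested_memory_max(observed_bytes, factor=2):
    """Rule of thumb: allow a multiple of the observed steady-state usage."""
    return observed_bytes * factor


print(parse_size("256M"))          # bytes allowed by MemoryMax=256M
print(share_of_total_cpu(50, 4))   # percent of a quad-core Pi's total compute
```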

Watchdog Timer

WatchdogSec=30

The watchdog goes beyond simple "is the process alive?" checking. It verifies that your application is actually working, not just hanging. To use it, your application must periodically notify systemd that it's healthy:

import time

import sdnotify


def process_data():
    """Placeholder for the application's real work."""


notifier = sdnotify.SystemdNotifier()
notifier.notify("READY=1")  # Tell systemd the service is ready

# In your main loop or a background thread:
while True:
    # Do your work
    process_data()

    # Signal the watchdog (well inside the 30-second WatchdogSec window)
    notifier.notify("WATCHDOG=1")
    time.sleep(10)

If systemd doesn't receive a WATCHDOG=1 notification within 30 seconds, it considers the service hung and restarts it. This catches the failure mode that Restart=always misses: a process that's technically alive but stuck in a deadlock, an infinite loop, or waiting on a network call that will never return.

Install the Python notifier:

pip install sdnotify

And change the service type to notify:

Type=notify

Watchdog without code changes

If you can't modify your application to send watchdog notifications, systemd can still check if the process is alive using the default behavior of Restart=always. The watchdog adds value only when your application cooperates. Don't add WatchdogSec without the corresponding code — systemd will kill your perfectly healthy service every 30 seconds.
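If adding a dependency is the obstacle, the notification protocol itself is small enough to sketch with the standard library: systemd exports a Unix datagram socket path in NOTIFY_SOCKET and the watchdog timeout in WATCHDOG_USEC. The helper names below are illustrative, and this is a minimal sketch rather than a replacement for a maintained library.

```python
import os
import socket


def sd_notify(state):
    """Send one notification string to systemd; return False when not run under it."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def watchdog_interval(default=10.0):
    """Ping at half the configured timeout; systemd exports it as WATCHDOG_USEC."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return default
    return int(usec) / 1_000_000 / 2
```

In a main loop this would be used as sd_notify("READY=1") once at startup, then sd_notify("WATCHDOG=1") followed by time.sleep(watchdog_interval()) on each pass. Outside systemd both calls are harmless no-ops, which keeps the script runnable in a terminal.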

Managing the Service

Enable the service to start on boot:

sudo systemctl enable gpio-api.service

Start it now:

sudo systemctl start gpio-api.service

Check its status:

sudo systemctl status gpio-api.service

The output tells you whether the service is running, how long it's been running, its PID, its memory usage, and the last few log lines. This is more information than you'll get from any other deployment method with a single command.

Here is what the output looks like for a healthy service:

● gpio-api.service - GPIO Flask API — production
     Loaded: loaded (/etc/systemd/system/gpio-api.service; enabled)
     Active: active (running) since Mon 2026-05-26 08:12:33 UTC; 3 days ago
   Main PID: 1247 (python)
      Tasks: 4 (limit: 4582)
     Memory: 87.3M (max: 256.0M)
        CPU: 12min 34.567s
     CGroup: /system.slice/gpio-api.service
             └─1247 /home/pi/gpio-api/venv/bin/python app.py

That Memory: 87.3M (max: 256.0M) line tells you instantly how close the service is to its resource limit, and Active: active (running) since ... 3 days ago confirms uptime at a glance.
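The same information is available in machine-readable form through systemctl show, which prints Key=Value lines. The sketch below assumes the properties ActiveState, SubState, MainPID, NRestarts, and MemoryCurrent are available on your systemd version; the function names are illustrative.

```python
import subprocess

PROPERTIES = ["ActiveState", "SubState", "MainPID", "NRestarts", "MemoryCurrent"]


def parse_show_output(text):
    """Turn the Key=Value lines printed by `systemctl show` into a dict."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            result[key] = value
    return result


def service_state(unit):
    """Query systemd for a few properties of one unit (requires systemctl)."""
    cmd = ["systemctl", "show", unit, "--property=" + ",".join(PROPERTIES)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_show_output(out)


def is_healthy(state):
    """Treat a unit as healthy when it is active and running."""
    return state.get("ActiveState") == "active" and state.get("SubState") == "running"
```

Calling is_healthy(service_state("gpio-api.service")) from a cron job or a monitoring script gives a yes/no answer without scraping the human-oriented status output.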

Journalctl: Your Pi's Black Box Recorder

View the full logs:

# All retained logs for this service
journalctl -u gpio-api.service

# Follow logs in real time
journalctl -u gpio-api.service -f

# Logs since last boot
journalctl -u gpio-api.service -b

# Logs from the last hour
journalctl -u gpio-api.service --since "1 hour ago"

# Output as JSON for programmatic parsing
journalctl -u gpio-api.service -o json --since "1 hour ago"

Journalctl is the single biggest upgrade over traditional log files. With nohup, your logs go to nohup.out, a file that grows until it fills the SD card, has no timestamps unless your application adds them, and gets truncated or deleted when you don't know what else to do. With journalctl, logs are structured, timestamped, rotated automatically, and queryable by time range, boot session, or priority level. With persistent storage enabled (Storage=persistent in /etc/systemd/journald.conf, or an existing /var/log/journal directory), the journal also survives reboots, so when your Pi crashes at 3 AM and restarts, the logs from before the crash are still there waiting for you. Check this setting on your image, because some configurations keep the journal only in memory.
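The JSON output mode makes the journal easy to process from a script. Each line is one JSON object; the sketch below reads the MESSAGE, PRIORITY, and __REALTIME_TIMESTAMP fields (microseconds since the epoch, encoded as a string). The function name is illustrative, and non-text messages are simply skipped.

```python
import json
from datetime import datetime, timezone


def parse_journal_json(lines):
    """Yield (timestamp, priority, message) from `journalctl -o json` output lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        message = entry.get("MESSAGE")
        if not isinstance(message, str):
            continue  # binary messages are encoded as arrays; skipped here
        micros = int(entry["__REALTIME_TIMESTAMP"])  # microseconds since the epoch
        stamp = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
        yield stamp, int(entry.get("PRIORITY", 6)), message
```

Piping journalctl -u gpio-api.service -o json --since "1 hour ago" into a script that passes sys.stdin to parse_journal_json yields one tuple per log line, ready for filtering or counting.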

One more pattern worth mentioning: log priorities. If your application prefixes lines written to stdout or stderr with syslog-style priority markers, systemd respects them:

import sys
print("<3>Critical error: sensor read failed", file=sys.stderr)   # Error
print("<4>Warning: temperature approaching limit", file=sys.stderr) # Warning
print("<6>Info: processed 1000 frames", file=sys.stdout)            # Info

Query by priority:

# Show only errors and above
journalctl -u gpio-api.service -p err

This turns your Pi's journal into a queryable diagnostic record. When something goes wrong at 3 AM, you don't need to have been watching — the journal was.
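If the application already uses Python's logging module, the priority prefix can be added in one place instead of at every print call. This is a minimal sketch; the class and function names are illustrative, and only the first line of a multi-line message (such as a traceback) receives the prefix.

```python
import logging
import sys

# Map Python logging levels to syslog priority numbers.
SYSLOG_PRIORITY = {
    logging.CRITICAL: 2,
    logging.ERROR: 3,
    logging.WARNING: 4,
    logging.INFO: 6,
    logging.DEBUG: 7,
}


class JournalPrefixFormatter(logging.Formatter):
    """Prefix each record with <N> so journald can assign its priority."""

    def format(self, record):
        priority = SYSLOG_PRIORITY.get(record.levelno, 6)
        return f"<{priority}>{super().format(record)}"


def make_logger(name, stream=None):
    """Build a logger whose output carries journald priority prefixes."""
    if stream is None:
        stream = sys.stderr
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JournalPrefixFormatter("%(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

With this in place, make_logger("gpio-api").error("sensor read failed") shows up under journalctl -p err, while info-level messages stay out of that view.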

✕ nohup / cron @reboot
  • No automatic restart on crash
  • Logs go to nohup.out or /dev/null
  • No memory or CPU limits
  • No dependency ordering
  • No watchdog supervision
  • Status check: is the PID still running?
✓ systemd service
  • Restart=always with configurable delay
  • Structured logging via journalctl
  • MemoryMax and CPUQuota enforcement
  • After= and Wants= for boot ordering
  • WatchdogSec catches hung processes
  • systemctl status with memory, uptime, and logs

Service Dependencies

Production Pi deployments often involve multiple services that depend on each other. Your Flask API needs MQTT. Your data processor needs the database. Systemd's dependency system handles this:

[Unit]
Description=Data processor
After=mosquitto.service postgresql.service
Requires=mosquitto.service
Wants=postgresql.service

  • After= controls start order: this service starts after Mosquitto and PostgreSQL
  • Requires= creates a hard dependency: if Mosquitto cannot be started, this service is not started, and if Mosquitto is explicitly stopped, this service is stopped too
  • Wants= creates a soft dependency: if PostgreSQL isn't available, this service still attempts to start
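The effect of After= can be pictured as a topological sort over the units. The sketch below uses Python's graphlib (Python 3.9 or later) on a hypothetical stack; the After= entries for Mosquitto and PostgreSQL are assumptions added for illustration, and systemd's real scheduler starts independent units in parallel rather than in a single list.

```python
from graphlib import TopologicalSorter

# Each unit maps to the units listed in its After= line (hypothetical stack).
after = {
    "data-processor.service": {"mosquitto.service", "postgresql.service"},
    "mosquitto.service": {"network-online.target"},
    "postgresql.service": {"network-online.target"},
}


def start_order(after_map):
    """Return one valid start order: every unit appears after its predecessors."""
    return list(TopologicalSorter(after_map).static_order())


order = start_order(after)
print(order)
```

A cycle in the After= relationships makes TopologicalSorter raise CycleError, which mirrors the ordering-cycle errors systemd reports at boot.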

For the common pattern of a Pi running multiple services (sensor reader, API, dashboard), create a target that groups them:

# /etc/systemd/system/pi-stack.target
[Unit]
Description=Pi Application Stack
Requires=gpio-api.service sensor-reader.service
After=gpio-api.service sensor-reader.service

[Install]
WantedBy=multi-user.target

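Now sudo systemctl start pi-stack.target brings up both services; their relative start order comes from each service's own After= lines. Stopping is not symmetric: sudo systemctl stop pi-stack.target stops the member services only if each of them declares PartOf=pi-stack.target in its [Unit] section.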

Security Hardening in the Service File

The service file from earlier included several security directives that deserve explanation:

NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/home/pi/gpio-api/data

NoNewPrivileges=true prevents the service (and any child process) from gaining additional privileges through setuid/setgid binaries or file capabilities. Even if an attacker compromises your Flask app, that route to root is closed; it does not protect against kernel vulnerabilities.

ProtectSystem=strict makes the entire filesystem read-only from the service's perspective, apart from /dev, /proc, and /sys. The service can't modify /usr, /boot, /etc, or any system directory. Combined with ReadWritePaths, you grant write access only to the specific directory your application needs, the data folder in this case. That directory must exist before the service starts, or the unit fails to launch.

ProtectHome=read-only prevents the service from writing to any home directory except what ReadWritePaths explicitly allows. This contains the blast radius: if the service is compromised, the attacker can't modify other users' files or plant backdoors in home directories.

These directives cost nothing in performance and significantly reduce the damage a compromised service can cause. Use them on every service file.
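Because these directives turn a forgotten path into a runtime write error, it can help to have the application verify its data directory at startup and fail with a clear message. A minimal sketch; the function names are hypothetical, and the <3> prefix is there so the message is recorded at error priority in the journal.

```python
import tempfile


def check_writable(directory):
    """Return True if a file can actually be created inside `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def require_writable(directory):
    """Exit with an error-priority message if the directory is not writable."""
    if not check_writable(directory):
        raise SystemExit(f"<3>data directory not writable: {directory}")
```

Calling require_writable("/home/pi/gpio-api/data") as the first step of app.py would surface a missing ReadWritePaths entry in the first log line rather than hours later on the first write.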

Practical Example: Full Service File for the Flask GPIO API

Here is the complete, copy-pasteable service file. Save it as /etc/systemd/system/gpio-api.service:

[Unit]
Description=GPIO Flask API — production
Documentation=man:gpio-api
After=network-online.target
Wants=network-online.target
# Crash-loop boundary (systemd reads these two directives from [Unit])
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
Type=simple
User=pi
Group=pi
WorkingDirectory=/home/pi/gpio-api
Environment=FLASK_ENV=production
Environment=PYTHONUNBUFFERED=1
ExecStart=/home/pi/gpio-api/venv/bin/python app.py

Restart=always
RestartSec=5

MemoryMax=256M
CPUQuota=50%

NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/home/pi/gpio-api/data

StandardOutput=journal
StandardError=journal
SyslogIdentifier=gpio-api

[Install]
WantedBy=multi-user.target

Deploy it:

sudo cp gpio-api.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable gpio-api.service
sudo systemctl start gpio-api.service
sudo systemctl status gpio-api.service

The StartLimitIntervalSec=60 and StartLimitBurst=5 directives define the crash-loop boundary: if the service restarts five times within sixty seconds, systemd stops trying and marks it as failed. This prevents a fundamentally broken service from consuming system resources in an infinite restart loop. When this happens, investigate — don't just increase the limit.
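Before copying a unit file into /etc/systemd/system/, it can help to check that none of the directives discussed in this chapter went missing. The sketch below uses Python's configparser as a rough stand-in for systemd's own parser, an assumption that holds for simple files like this one but not for every unit-file feature. The RECOMMENDED table and the function name are illustrative.

```python
import configparser

# Directives this chapter treats as the baseline for a production unit.
RECOMMENDED = {
    "Unit": ["After"],
    "Service": ["User", "WorkingDirectory", "ExecStart", "Restart", "RestartSec", "MemoryMax"],
    "Install": ["WantedBy"],
}


def missing_directives(unit_text):
    """List 'Section.Key' entries from RECOMMENDED that the unit file lacks."""
    parser = configparser.ConfigParser(
        strict=False,          # unit files may repeat keys such as Environment=
        interpolation=None,    # values like CPUQuota=50% contain a literal %
        delimiters=("=",),
    )
    parser.optionxform = str   # keep directive names case-sensitive
    parser.read_string(unit_text)
    missing = []
    for section, keys in RECOMMENDED.items():
        for key in keys:
            if not parser.has_option(section, key):
                missing.append(f"{section}.{key}")
    return missing
```

An empty list means the baseline is covered. For syntax checking by systemd itself, systemd-analyze verify gpio-api.service is the authoritative tool; this sketch only checks for omissions.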

Key takeaway

A systemd service file is a contract between your application and the operating system. It specifies what to run, when to restart, how much resources to allow, and what to do when things go wrong. No other deployment method on a Pi gives you all of this in a single, declarative file.

What to Do Monday Morning

Pick one script running via nohup or cron and convert it

Find a Python script on your Pi that's running in a tmux session, a screen window, or a cron @reboot entry. Write a .service file for it using the template in this chapter. Enable it, start it, verify it with systemctl status. One conversion convinces you more than any amount of documentation.

Set Restart=always and test it

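After your service is running, kill it manually: sudo systemctl kill your-service (this signals only the service's own processes, whereas sudo kill $(pidof python) would hit every Python process on the Pi). Watch systemd restart it within five seconds. Then repeat the kill until the number of starts exceeds StartLimitBurst within the StartLimitIntervalSec window, and observe the start-limit behavior. Understanding how systemd handles crashes is understanding how your Pi handles 2 AM.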

Add MemoryMax and CPUQuota

Set MemoryMax to a value that's reasonable for your application — monitor with systemctl status to see current usage, then set the limit to 2x that value. Set CPUQuota to prevent any single service from starving SSH and system processes. These two lines prevent the class of failure where a misbehaving application takes down the entire Pi.

Switch from log files to journalctl

Set StandardOutput=journal and PYTHONUNBUFFERED=1 in your service file. Replace any log-file writes with print() statements. Now journalctl -u your-service -f shows real-time logs with timestamps, and journalctl -u your-service --since "1 hour ago" answers "what happened at 3 AM?" without grepping through log files.

Test a reboot

Run sudo reboot. Wait sixty seconds. SSH back in. Run systemctl status your-service. If the service is active and running, you have a production deployment. If it's not, read the journal: journalctl -u your-service -b. The answer is always in the logs.

The trap is that nohup feels simpler. It is simpler, in the same way that not wearing a seatbelt is simpler. A systemd unit is a short configuration file, under thirty lines in this chapter, that gives your application automatic restart, structured logging, resource limits, and boot-time dependency management. Nohup is a command that detaches a process from your terminal and hopes for the best. Production deserves better than hope.
