Drawing and Annotation | The Raspberry Pi Masterclass

The trap is treating annotation as an afterthought — something you bolt on at the end to make a demo look good. In practice, annotations are the interface between your computer vision system and the humans who depend on it. A security camera that detects motion but doesn't draw a bounding box around the intruder is useless to the guard watching the feed. A quality control system that flags defects but doesn't highlight where on the product the defect exists creates more work than it saves.

Think of it this way: your detection algorithm sees the world in arrays of numbers. Humans see the world in shapes and colors. The annotation layer is the translation between those two representations. A bounding box at coordinates (120, 80, 200, 160) means nothing to a human looking at a video feed. A green rectangle with a label reading "Person: 94%" means everything. The quality of that translation (its clarity, its consistency, its information density) determines whether your system is useful or merely functional.

A detection system without clear annotation is a black box. The annotation layer is what makes the system's decisions auditable by humans.

Annotation is engineering, not decoration. And the first rule of that engineering is one that most tutorials skip entirely.

Never Draw on Your Source Data

This is the most consequential drawing rule in computer vision, and I see it violated in nearly every beginner tutorial.

Framework · The Annotation Layer · Draw on a copy, never the original

Keep your annotations separate from your source data. Draw on a copy of the frame, never the original. When you overwrite the source frame with rectangles, text, and circles, you corrupt every downstream operation that needs the clean pixel data. Detection, classification, saving, streaming — all of them now include your annotations as part of the image.

import cv2

frame = cv2.imread('/home/pi/photo.jpg')

# WRONG: drawing directly on the source
# cv2.rectangle(frame, (100, 50), (300, 200), (0, 255, 0), 2)
# Now 'frame' is permanently modified — detection code sees the rectangle

# RIGHT: draw on a copy
display = frame.copy()
cv2.rectangle(display, (100, 50), (300, 200), (0, 255, 0), 2)
# 'frame' is still clean for processing; 'display' has the annotation

The cost of .copy() on a 640x480 frame is negligible — about 0.3 ms on a Pi 4. The cost of debugging why your face detector suddenly finds rectangular faces is an afternoon you don't get back. I treat this the same way I treat immutability in functional programming — the original data is sacred. You can derive new views, transformations, and annotations from it, but the source stays untouched. On a Pi, where you can't afford to re-capture a frame you've already processed, this discipline is even more important than on a desktop where you can just re-run the script.

I've seen this pattern where a team builds a real-time detection pipeline, draws bounding boxes on the frame, and then feeds that annotated frame back into the detector on the next iteration. The detector starts finding the bounding boxes themselves as objects. The system hallucinates detections of its own annotations, which generates more annotations, which generates more hallucinated detections. It's a feedback loop, and it's surprisingly hard to diagnose because the symptoms look like a sensitivity problem, not a data corruption problem.

Key takeaway

Always maintain two references: one for processing (clean, unmodified) and one for display (annotated with results). The cost of a copy is trivial. The cost of contaminated data is catastrophic.

Rectangles: The Bounding Box Workhorse

Bounding boxes are the most common annotation in computer vision. Object detection, face detection, motion zones — they all output rectangles.

import cv2

img = cv2.imread('/home/pi/photo.jpg')
display = img.copy()

# Basic rectangle: top-left corner, bottom-right corner, color (BGR), thickness
cv2.rectangle(display, (100, 50), (300, 200), (0, 255, 0), 2)

# Filled rectangle (thickness = -1)
cv2.rectangle(display, (350, 50), (500, 100), (0, 255, 0), -1)

# Common pattern: semi-transparent overlay for a label background
overlay = display.copy()
cv2.rectangle(overlay, (100, 30), (300, 55), (0, 255, 0), -1)
cv2.addWeighted(overlay, 0.6, display, 0.4, 0, display)

The color tuple is BGR — Blue, Green, Red. Common colors you'll reach for:

GREEN  = (0, 255, 0)    # Positive detection
RED    = (0, 0, 255)    # Warning / negative
BLUE   = (255, 0, 0)    # Neutral / informational
YELLOW = (0, 255, 255)  # Caution
WHITE  = (255, 255, 255)
BLACK  = (0, 0, 0)

Color-code your annotations semantically

Green for confirmed detections, red for alerts or errors, yellow for uncertain results. Consistent color coding across your application lets operators glance at a feed and immediately understand the system's state without reading labels.

Circles, Lines, and Shapes

Beyond rectangles, you'll need circles for keypoints, lines for connections and boundaries, and occasionally polygons for irregular regions.

import cv2
import numpy as np

img = cv2.imread('/home/pi/photo.jpg')
display = img.copy()

# Circle: center (x, y), radius, color, thickness
cv2.circle(display, (320, 240), 50, (0, 0, 255), 2)

# Filled circle (for keypoints / landmarks)
cv2.circle(display, (320, 240), 5, (0, 255, 0), -1)

# Line: start point, end point, color, thickness
cv2.line(display, (100, 100), (400, 300), (255, 0, 0), 2)

# Anti-aliased line (smoother, slightly slower)
cv2.line(display, (100, 300), (400, 100), (255, 0, 0), 2, cv2.LINE_AA)

# Polyline: draw a polygon outline
pts = np.array([[200, 100], [350, 100], [400, 250], [150, 250]], np.int32)
pts = pts.reshape((-1, 1, 2))
cv2.polylines(display, [pts], isClosed=True, color=(0, 255, 255), thickness=2)

LINE_AA for final output, LINE_8 for real-time

Anti-aliased drawing (cv2.LINE_AA) produces smoother edges but costs more CPU time. For real-time video pipelines on a Pi, use the default cv2.LINE_8 (aliased) during processing and only switch to LINE_AA if you're saving annotated frames for reports or presentations.

Text Rendering: Building Information Displays

cv2.putText() is the function you'll call more than any other drawing function. Labels, FPS counters, timestamps, detection confidence — all of them are text overlays.

import cv2

img = cv2.imread('/home/pi/photo.jpg')
display = img.copy()

# Basic text: image, text, origin (bottom-left of text), font, scale, color, thickness
cv2.putText(
    display,
    "Detected: Face",
    (110, 45),                      # Position (x, y) — bottom-left of text
    cv2.FONT_HERSHEY_SIMPLEX,       # Font face
    0.7,                            # Font scale
    (0, 255, 0),                    # Color (BGR)
    2,                              # Thickness
    cv2.LINE_AA                     # Anti-aliased
)

The font faces you'll actually use:

FONT_HERSHEY_SIMPLEX — clean sans-serif, my default for labels
FONT_HERSHEY_DUPLEX — slightly heavier version for headers
FONT_HERSHEY_PLAIN — smaller, lighter, good for dense information
FONT_HERSHEY_COMPLEX — serif font, rarely useful on small frames

The position coordinate in putText is the bottom-left corner of the text, not the top-left. Every developer gets this wrong the first time and places text one line height too high, often clipped by the top edge of the frame.

Font scale is relative, not in pixels. A scale of 1.0 at FONT_HERSHEY_SIMPLEX produces text roughly 22 pixels tall. At 0.5, roughly 11 pixels. That pixel height stays the same whatever the frame size, so you'll adjust scale based on your frame resolution: text that's readable on a 480p frame looks tiny on a 1080p frame.

This resolution dependency is a practical headache. If your pipeline processes at 480p but saves annotated frames at 1080p, text that looked fine during development at 480p looks tiny on the final output. I handle this by computing font scale relative to the frame height:

# Scale text relative to frame height
h = frame.shape[0]
font_scale = h / 800  # Produces 0.6 at 480p, 1.35 at 1080p
thickness = max(1, int(h / 400))

This formula isn't magic. It's just a pair of ratios that produce readable text at common resolutions. Adjust the divisors (800 for scale, 400 for thickness) once for your preferred text size, then forget about them. The text scales automatically when you change resolution.

To position text precisely, measure it first:

text = "Person: 94%"
font = cv2.FONT_HERSHEY_SIMPLEX
scale = 0.6
thickness = 2

# Get the text size (width, height) and baseline
(text_w, text_h), baseline = cv2.getTextSize(text, font, scale, thickness)

# Draw a background rectangle behind the text
x, y = 100, 50
cv2.rectangle(display, (x, y - text_h - baseline), (x + text_w, y + baseline), (0, 0, 0), -1)
cv2.putText(display, text, (x, y), font, scale, (0, 255, 0), thickness, cv2.LINE_AA)

This technique — measuring the text, drawing a filled rectangle behind it, then drawing the text on top — is how you create readable labels that work against any background. Without the background rectangle, white text disappears against bright regions and dark text vanishes against shadows.

Building a HUD Overlay

A heads-up display (HUD) overlay combines multiple text elements into a persistent information panel on your video feed. This is what turns a raw camera stream into a monitoring system.

import cv2
import time
from datetime import datetime

class HUD:
    def __init__(self):
        self.font = cv2.FONT_HERSHEY_SIMPLEX
        self.small_font = cv2.FONT_HERSHEY_PLAIN
        self.fps_timestamps = []

    def update_fps(self):
        now = time.perf_counter()
        self.fps_timestamps.append(now)
        self.fps_timestamps = self.fps_timestamps[-30:]
        if len(self.fps_timestamps) < 2:
            return 0.0
        elapsed = self.fps_timestamps[-1] - self.fps_timestamps[0]
        return (len(self.fps_timestamps) - 1) / elapsed if elapsed > 0 else 0.0

    def draw(self, frame, detections=None, status="MONITORING"):
        display = frame.copy()
        h, w = display.shape[:2]

        fps = self.update_fps()

        # Top bar: timestamp + FPS
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        cv2.rectangle(display, (0, 0), (w, 30), (0, 0, 0), -1)
        cv2.putText(display, timestamp, (10, 22),
                    self.small_font, 1.2, (200, 200, 200), 1, cv2.LINE_AA)
        cv2.putText(display, f"FPS: {fps:.1f}", (w - 120, 22),
                    self.small_font, 1.2, (0, 255, 0), 1, cv2.LINE_AA)

        # Bottom bar: status + detection count
        det_count = len(detections) if detections else 0
        cv2.rectangle(display, (0, h - 30), (w, h), (0, 0, 0), -1)

        status_color = (0, 255, 0) if status == "MONITORING" else (0, 0, 255)
        cv2.putText(display, status, (10, h - 8),
                    self.small_font, 1.2, status_color, 1, cv2.LINE_AA)
        cv2.putText(display, f"Detections: {det_count}", (w - 160, h - 8),
                    self.small_font, 1.2, (200, 200, 200), 1, cv2.LINE_AA)

        # Draw detection bounding boxes
        if detections:
            for det in detections:
                x, y, bw, bh = det['box']
                label = f"{det['label']}: {det['confidence']:.0%}"
                color = (0, 255, 0) if det['confidence'] > 0.7 else (0, 255, 255)

                cv2.rectangle(display, (x, y), (x + bw, y + bh), color, 2)

                (tw, th), _ = cv2.getTextSize(label, self.small_font, 1.0, 1)
                cv2.rectangle(display, (x, y - th - 6), (x + tw + 4, y), color, -1)
                cv2.putText(display, label, (x + 2, y - 4),
                            self.small_font, 1.0, (0, 0, 0), 1, cv2.LINE_AA)

        return display

Use it in a capture loop:

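import cv2
from picamera2 import Picamera2

# Assumed camera setup (not shown in the original snippet). Picamera2's
# "BGR888" format yields arrays in R, G, B order, which is what the
# COLOR_RGB2BGR conversion below expects.
picam2 = Picamera2()
picam2.configure(picam2.create_preview_configuration(
    main={"format": "BGR888", "size": (640, 480)}))
picam2.start()

hud = HUD()

while True:
    frame = picam2.capture_array()
    bgr = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)

    # Your detection code returns a list of dicts
    detections = [
        {'box': (150, 100, 120, 160), 'label': 'Face', 'confidence': 0.92},
    ]

    annotated = hud.draw(bgr, detections=detections)
    cv2.imwrite('/home/pi/latest_frame.jpg', annotated)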
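The FPS arithmetic inside HUD.update_fps can be checked without a camera. The sketch below is a variant rather than the class above: it swaps list slicing for collections.deque with maxlen, and it takes the timestamp as an argument so fixed numbers can be fed in. RollingFPS is a hypothetical name.

```python
from collections import deque

class RollingFPS:
    """Frames per second averaged over the most recent `window` timestamps."""

    def __init__(self, window=30):
        # deque(maxlen=...) discards the oldest timestamp automatically.
        self.stamps = deque(maxlen=window)

    def tick(self, now):
        """Record one frame at time `now` (seconds) and return the current FPS."""
        self.stamps.append(now)
        if len(self.stamps) < 2:
            return 0.0
        elapsed = self.stamps[-1] - self.stamps[0]
        # N timestamps span N - 1 frame intervals.
        return (len(self.stamps) - 1) / elapsed if elapsed > 0 else 0.0

# Simulated frames arriving every 0.25 s, i.e. 4 frames per second.
fps = RollingFPS(window=30)
for i in range(40):
    reading = fps.tick(i * 0.25)
print(reading)  # 4.0
```

In a live loop you would call fps.tick(time.perf_counter()) once per frame, which matches what update_fps does internally.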

Key takeaway

Annotations aren't cosmetic — they're the interface layer that makes a detection system useful to humans. Build them as carefully as you build the detection code itself.

Annotating Detected Regions: The Full Pattern

Here's the complete annotation pattern I use for every detection pipeline. It combines bounding boxes, labels with confidence scores, and a background rectangle for readability:

import cv2

def annotate_detections(frame, detections):
    """
    Annotate a frame with detection results.

    Args:
        frame: BGR image (NumPy array)
        detections: list of dicts with 'box' (x, y, w, h),
                    'label' (str), 'confidence' (float 0-1)

    Returns:
        Annotated copy of the frame
    """
    display = frame.copy()
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 0.5
    thickness = 1

    for det in detections:
        x, y, w, h = det['box']
        label = det['label']
        conf = det['confidence']
        text = f"{label}: {conf:.0%}"

        # Color based on confidence
        if conf >= 0.8:
            color = (0, 255, 0)   # Green: high confidence
        elif conf >= 0.5:
            color = (0, 255, 255) # Yellow: medium confidence
        else:
            color = (0, 0, 255)   # Red: low confidence

        # Bounding box
        cv2.rectangle(display, (x, y), (x + w, y + h), color, 2)

        # Label background
        (tw, th), baseline = cv2.getTextSize(text, font, font_scale, thickness)
        cv2.rectangle(display, (x, y - th - baseline - 4), (x + tw + 4, y), color, -1)

        # Label text (black on colored background for readability)
        cv2.putText(display, text, (x + 2, y - baseline - 2),
                    font, font_scale, (0, 0, 0), thickness, cv2.LINE_AA)

    return display

This function takes clean detection results and produces a clean annotated frame. The detection code never sees the annotations. The annotation code never modifies the source. Two concerns, cleanly separated.
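One practical wrinkle: annotate_detections assumes integer (x, y, w, h) boxes that lie inside the frame, while real detectors often return floats or boxes that spill past an edge. A small conversion step, sketched here with the hypothetical name box_to_corners, could sit between the detector and the drawing calls.

```python
def box_to_corners(box, frame_shape):
    """Convert an (x, y, w, h) box to clamped integer corner points.

    Returns ((x1, y1), (x2, y2)), with every coordinate limited to the
    valid pixel range of a frame whose shape is `frame_shape`.
    """
    x, y, w, h = box
    frame_h, frame_w = frame_shape[:2]
    x1 = int(min(max(x, 0), frame_w - 1))
    y1 = int(min(max(y, 0), frame_h - 1))
    x2 = int(min(max(x + w, 0), frame_w - 1))
    y2 = int(min(max(y + h, 0), frame_h - 1))
    return (x1, y1), (x2, y2)

shape = (480, 640, 3)  # 640x480 BGR frame
print(box_to_corners((150, 100, 120, 160), shape))      # ((150, 100), (270, 260))
print(box_to_corners((600.4, 400.9, 100, 100), shape))  # ((600, 400), (639, 479))
print(box_to_corners((-20, -10, 50, 50), shape))        # ((0, 0), (30, 40))
```

The returned pairs can be passed directly as the two corner arguments of cv2.rectangle.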

The confidence-based color coding is more than decoration — it's a visual triage system. An operator watching a security feed can immediately distinguish high-confidence detections (green, probably real) from uncertain ones (yellow, worth a second look) from low-confidence triggers (red, likely noise). Without this coding, every detection looks equally important, which means none of them are actionable at a glance.

I've seen this pattern where teams build a detection system that draws identical green boxes around every detection regardless of confidence. The operator has to read the label text on each box to decide which are real. At 5 detections per frame, that's manageable. At 50, it's impossible. Color-coded confidence turns the annotation layer into a filtering system that works at the speed of human vision — you see the colors before you read the text.
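For the triage to stay consistent, the cutoffs should live in one place. Note that the HUD class earlier uses a single 0.7 cutoff, while annotate_detections uses 0.8 and 0.5. The sketch below centralizes the latter; TIERS, confidence_tier, and tier_counts are hypothetical names, and the sample detections are mock data.

```python
from collections import Counter

# (minimum confidence, tier name, BGR color), checked from highest to lowest.
# Cutoffs follow annotate_detections; adjust them to your detector.
TIERS = (
    (0.8, "high", (0, 255, 0)),      # green
    (0.5, "medium", (0, 255, 255)),  # yellow
    (0.0, "low", (0, 0, 255)),       # red
)

def confidence_tier(conf):
    """Return (tier_name, bgr_color) for a confidence in [0, 1]."""
    for minimum, name, color in TIERS:
        if conf >= minimum:
            return name, color
    # Out-of-range (negative) input: treat as the lowest tier.
    return TIERS[-1][1], TIERS[-1][2]

def tier_counts(detections):
    """Count detections per tier, e.g. for a HUD status line."""
    return Counter(confidence_tier(d['confidence'])[0] for d in detections)

# Mock detections (assumption), in the dict format used in this chapter.
detections = [
    {'box': (150, 100, 120, 160), 'label': 'Face', 'confidence': 0.92},
    {'box': (300, 120, 80, 90), 'label': 'Face', 'confidence': 0.61},
    {'box': (20, 40, 60, 60), 'label': 'Face', 'confidence': 0.31},
]
print(dict(tier_counts(detections)))  # {'high': 1, 'medium': 1, 'low': 1}
```

Both the box color and any summary count can then be derived from the same function, so the display never disagrees with itself.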

What to Do Monday Morning

Draw every shape on a test image

Load any image and draw a rectangle, circle, line, and text on it using the functions from this chapter. Save the result. Pay attention to the coordinate system — position each element deliberately, not randomly. This builds the muscle memory for annotation code.

Build the label-with-background pattern

Write a function that takes an image, a position, and a label string, then draws the text with a filled rectangle behind it. Use cv2.getTextSize() to measure the text first. This three-line pattern (measure, draw background, draw text) is the most reused code in annotation.

Implement the HUD class

Copy the HUD class from this chapter into a file called hud.py. Connect it to a camera feed (USB or Pi Camera) and run it for five minutes. Watch the FPS counter and confirm the timestamp updates every second. This is the skeleton of every monitoring system you'll build.

Prove the copy rule to yourself

Write a script that draws a bright green rectangle on a frame, then runs edge detection on the same frame. Look at the edge map — you'll see the rectangle's edges as detected features. Then modify the script to draw on a .copy() and run edge detection on the original. Compare the two edge maps. The difference is the entire argument for this chapter.

Build the annotate_detections function

Implement the full annotation function with color-coded confidence levels. Feed it mock detection data (hardcoded bounding boxes) and verify the output looks correct. This function will be the display layer for every detection pipeline in the next chapters.