Homelab Monitoring
Why monitor a homelab
Without monitoring, you discover problems only when services break — a disk fills up, a container runs out of memory, or a drive fails silently. By then, the damage is done and recovery is reactive.
Monitoring shifts this from reactive to proactive:
- Early warning — detect problems before they cause outages: disk filling up, CPU thermal throttling (the processor reducing its clock speed to avoid overheating), memory leaks in long-running services
- Historical data — when something does break, time-series history lets you correlate the failure with what changed: a spike in disk I/O, a gradual memory climb, a sudden network throughput drop
- Capacity planning — track resource usage trends over weeks and months to know when you need more RAM, another drive, or a second node
The metrics pipeline
Monitoring data flows through a pipeline from the machines being observed to a human-readable dashboard:
Targets (nodes, containers, apps)
| expose /metrics endpoint (HTTP)
v
Prometheus (scrapes targets every 15-60s)
| stores time-series data in its TSDB
v
Grafana (queries Prometheus via PromQL)
| renders dashboards, evaluates alert rules
v
Alertmanager / Notifications (email, Slack, PagerDuty)
Each layer has a distinct responsibility. Targets expose raw numbers. Prometheus collects and stores them. Grafana visualizes and alerts. Alertmanager routes and deduplicates notifications. This separation means you can swap a layer without rebuilding the whole stack: for example, replace Prometheus with a PromQL-compatible store such as VictoriaMetrics, or build dashboards in a different tool that can query the same data source.
Prometheus
Prometheus is an open-source monitoring system and TSDB (Time-Series Database — a database optimized for storing and querying timestamped numerical data points). Originally built at SoundCloud, it is now a graduated project of the CNCF (Cloud Native Computing Foundation).
Pull-based scraping model
Most monitoring systems use one of two models: push (agents send data to a central collector) or pull (the collector fetches data from agents). Prometheus uses pull. At configured intervals (typically 15 to 60 seconds), Prometheus sends an HTTP GET request to each target’s /metrics endpoint and parses the response.
The scraping cycle works as follows:
- Prometheus reads its configuration file (prometheus.yml), which lists scrape targets: either static IP:port pairs or dynamic service discovery (DNS-SD, Consul, Kubernetes API, etc.)
- On each scrape interval, Prometheus opens an HTTP connection to http://<target>/metrics
- The target responds with a plain-text body in Prometheus exposition format, one metric per line:
    node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
    node_memory_MemAvailable_bytes 8376492032
    node_filesystem_avail_bytes{mountpoint="/",fstype="ext4"} 53687091200
- Prometheus parses these lines, attaches a timestamp, and appends them to its local TSDB
- The TSDB stores data as time-series, each identified by a metric name plus a set of key-value labels (e.g., node_cpu_seconds_total{cpu="0",mode="idle"})
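The parsing step of the scrape cycle can be illustrated with a short Python sketch. It is not how Prometheus itself is implemented; the function names parse_sample and parse_body are made up for this example, and the parser only covers the simple line shapes shown above (optional labels, a value, an optional trailing timestamp).

```python
import re

# name{label="value",...} value [timestamp]   (labels and timestamp are optional)
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one exposition-format line into (name, labels, value)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


def parse_body(body):
    """Parse a scrape response, skipping blank lines and '#' comment lines."""
    samples = []
    for line in body.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            samples.append(parse_sample(stripped))
    return samples
```

Each returned tuple corresponds to one time-series sample: the metric name and label set identify the series, and the scraper would attach the scrape timestamp before storing the value.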
Storage
Prometheus compresses time-series data into 2-hour blocks on disk, achieving roughly 1-2 bytes per sample. By default it retains 15 days of data, configurable via --storage.tsdb.retention.time. For longer retention, Prometheus supports remote write to external stores like Thanos or Cortex.
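To get a feel for what these numbers mean for a homelab, the disk footprint can be estimated as retention time multiplied by ingested samples per second multiplied by bytes per sample. The sketch below is a back-of-the-envelope calculation only; the function name and the example figure of 1,000 series per host are assumptions for illustration.

```python
def tsdb_disk_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough TSDB size: retention seconds * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return samples_per_second * retention_seconds * bytes_per_sample


# Assumed example: 5 hosts with about 1,000 series each, scraped every 15 s,
# kept for the default 15 days at the pessimistic 2 bytes per sample.
estimate = tsdb_disk_bytes(active_series=5 * 1000, scrape_interval_s=15, retention_days=15)
print(f"{estimate / 1e9:.2f} GB")
```

Under those assumptions the estimate comes out below 1 GB, which is why a small Prometheus instance fits comfortably on modest homelab hardware.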
PromQL
PromQL (Prometheus Query Language) is the query language used to select, filter, and aggregate time-series data. Examples:
- node_cpu_seconds_total{mode="idle"}: select all idle CPU time-series
- rate(node_cpu_seconds_total{mode="idle"}[5m]): per-second rate of change over a 5-minute window
- 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100): CPU usage percentage per host
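The arithmetic behind the rate and CPU usage queries can be sketched in Python. This is a deliberately simplified model: the real rate() function also handles counter resets and extrapolates to the window boundaries, which this sketch ignores. The function names are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplification: no counter-reset handling, no extrapolation.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def cpu_usage_percent(idle_rates):
    """100 minus the average idle fraction across cores, as a percentage.

    idle_rates: per-core idle rate, i.e. idle seconds accumulated per second (0..1).
    """
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100
```

For instance, an idle counter that grows by 270 seconds over a 300-second window gives an idle rate of 0.9, meaning that core was busy about 10% of the time.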
Prometheus does not provide dashboards. It ships only a basic expression browser for ad-hoc queries, so dashboards require a separate tool, typically Grafana.
Node Exporter
Node Exporter is a Prometheus exporter (an agent that translates system metrics into Prometheus exposition format) that runs on each Linux machine you want to monitor. It reads from /proc, /sys, and other kernel interfaces to expose hardware and OS-level metrics:
| Metric category | What it measures | Example metric name |
|---|---|---|
| CPU | Per-core usage broken down by mode (user, system, idle, iowait) | node_cpu_seconds_total |
| Memory | Total, available, buffered, cached RAM | node_memory_MemAvailable_bytes |
| Disk I/O | Read/write bytes, operations, latency per device | node_disk_read_bytes_total |
| Filesystem | Total/available/used space per mount point | node_filesystem_avail_bytes |
| Network | Bytes/packets transmitted and received per interface | node_network_receive_bytes_total |
| Temperature | CPU and other sensor temperatures via hwmon | node_hwmon_temp_celsius |
You install Node Exporter as a systemd service on each host. It listens on port 9100 by default. Prometheus is then configured to scrape http://<host>:9100/metrics.
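To make the translation step concrete, here is a small Python sketch of the idea: read kernel-provided text (here, the contents of /proc/meminfo) and turn it into exposition-format lines. It is a simplified illustration, not Node Exporter's actual implementation (which is written in Go), and the function name is made up.

```python
import re


def meminfo_to_metrics(meminfo_text):
    """Convert /proc/meminfo-style text into Prometheus exposition lines."""
    lines = []
    for raw in meminfo_text.splitlines():
        match = re.match(r'^(\S+):\s+(\d+)(?:\s+(kB))?$', raw.strip())
        if match is None:
            continue  # skip anything that does not look like "Key: 123 kB"
        key, number, unit = match.groups()
        # Metric names may not contain characters such as parentheses.
        name = re.sub(r'[^a-zA-Z0-9_]', '_', key).rstrip('_')
        value = int(number)
        if unit == "kB":
            value *= 1024  # expose base units (bytes), not kilobytes
            name += "_bytes"
        lines.append(f"node_memory_{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:  # Linux only
        print(meminfo_to_metrics(handle.read()), end="")
```

Serving text like this over HTTP at /metrics is essentially what makes a process scrapeable by Prometheus.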
Grafana
Grafana is an open-source visualization and observability platform. It does not store data itself — it connects to data sources (Prometheus, InfluxDB, Loki, Elasticsearch, PostgreSQL, and many others) and renders dashboards from their data.
How it works
- You add a data source (e.g., your Prometheus instance at http://prometheus:9090)
- You create a dashboard with panels. Each panel contains a query in the data source’s native language (PromQL for Prometheus, LogQL for Loki, etc.)
- Grafana executes the query against the data source and renders the result as a graph, table, gauge, heatmap, or other visualization
- Dashboards auto-refresh at configurable intervals (e.g., every 30 seconds)
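For a Prometheus data source, a graph panel ultimately boils down to a range query against the Prometheus HTTP API. The sketch below builds such a request URL for the /api/v1/query_range endpoint. It only approximates what happens behind a panel (Grafana adds its own proxying, authentication, and step calculation), and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def query_range_url(base_url, promql, start, end, step_seconds):
    """Build a Prometheus range-query URL.

    start and end are Unix timestamps in seconds; step_seconds is the
    resolution of the returned series.
    """
    params = urlencode({
        "query": promql,
        "start": start,
        "end": end,
        "step": step_seconds,
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# One hour of per-second idle CPU rate at 30-second resolution.
print(query_range_url(
    "http://prometheus:9090",
    'rate(node_cpu_seconds_total{mode="idle"}[5m])',
    start=1700000000,
    end=1700003600,
    step_seconds=30,
))
```

Fetching that URL returns JSON containing one list of (timestamp, value) pairs per matching series, which is the data a graph panel plots.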
Community dashboards
Grafana maintains a public dashboard repository at grafana.com/grafana/dashboards. You can import pre-built dashboards by ID. For example, dashboard 1860 (“Node Exporter Full”) gives comprehensive host metrics out of the box — just point it at your Prometheus data source.
Alerting
Grafana supports alert rules defined against data sources that support alerting, Prometheus included. When a condition is met (e.g., CPU usage above 90% for 5 minutes), Grafana sends notifications through configured contact points: email, Slack, PagerDuty, Telegram, webhooks, and others.
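The "above 90% for 5 minutes" part matters: a rule with a hold duration should ignore short spikes and fire only when the condition has held continuously. The following Python sketch models that logic in isolation. It is an illustration of the idea, not Grafana's evaluation engine, and the function name is an assumption.

```python
def should_fire(samples, threshold, hold_seconds):
    """Return True if the value has stayed above threshold for hold_seconds.

    samples: list of (timestamp_seconds, value), oldest first.
    Any sample at or below the threshold resets the timer.
    """
    if not samples:
        return False
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # condition just became true
        else:
            breach_start = None  # condition broken, start over
    latest = samples[-1][0]
    return breach_start is not None and latest - breach_start >= hold_seconds
```

With one sample per minute, five minutes of readings at 95% would fire a 90% rule with a 300-second hold, while the same series with a single dip to 50% in the middle would not.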
Uptime Kuma
Uptime Kuma is a lightweight, self-hosted uptime monitoring tool. Where Prometheus + Grafana provides deep metrics, Uptime Kuma answers a simpler question: “is this service reachable?”
It supports multiple check types:
- HTTP(S) — sends a request to a URL, checks for expected status code and optionally a keyword in the response body
- TCP — attempts a TCP connection to a host:port (useful for databases, SMTP, custom services)
- DNS — resolves a domain name and checks the result matches expectations
- Ping — sends ICMP (Internet Control Message Protocol) echo requests to check host reachability
For each monitored target, Uptime Kuma tracks:
- Current status (up/down)
- Uptime percentage over configurable windows (24h, 7d, 30d)
- Response time history with graphs
- Incident history (when it went down, when it recovered)
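Two of these ideas are simple enough to sketch with the Python standard library: a TCP reachability check and an uptime percentage computed from stored check results. This is a minimal illustration of the concepts, not Uptime Kuma's code; the function names and the 3-second timeout are assumptions.

```python
import socket


def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure, ...
        return False


def uptime_percent(results):
    """Share of successful checks, as a percentage.

    results: list of booleans, one per check in the window (e.g. the last 24h).
    Returns None when there are no checks to summarize.
    """
    if not results:
        return None
    return 100.0 * sum(results) / len(results)
```

A scheduler that calls tcp_check every minute and appends the result to a list per service would give uptime_percent everything it needs for the 24h, 7d, and 30d windows.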
Uptime Kuma runs as a single Node.js process with an embedded SQLite database. It provides a clean web UI and supports notifications via Slack, Telegram, Discord, email, and dozens of other channels. It is much simpler to set up and operate than a full Prometheus + Grafana stack, making it a good starting point for homelab monitoring.
Loki
Loki is a log aggregation system built by Grafana Labs. It fills the gap that Prometheus leaves: Prometheus handles metrics (numerical time-series), while Loki handles logs (textual event streams).
Why not just use Prometheus for logs?
Logs are unstructured or semi-structured text with highly variable cardinality. Indexing the full content of every log line (the way Elasticsearch does) requires significant storage and compute. Loki takes a different approach: it indexes only the metadata labels (e.g., {job="nginx", host="web01"}) and stores the log content as compressed chunks. When you query, Loki filters by labels first (fast index lookup), then scans only the matching chunks for your search string.
This design makes Loki much cheaper to operate than a full-text search engine while still being effective for troubleshooting.
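The label-first, scan-second approach can be modeled in a few lines of Python. The class below is a toy: it keeps everything in memory and does not compress anything, so it only illustrates the query path (match label sets, then search the matching chunks), not Loki's storage engine. All names are hypothetical.

```python
class TinyLogStore:
    """Toy log store: label sets are indexed, log content is not."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines ("chunk")
        self.chunks = {}

    def push(self, labels, line):
        key = frozenset(labels.items())
        self.chunks.setdefault(key, []).append(line)

    def query(self, selector, needle):
        """Find lines containing needle in streams whose labels match selector."""
        wanted = set(selector.items())
        hits = []
        for key, lines in self.chunks.items():
            if wanted <= key:  # cheap check against the label index
                # brute-force scan, but only inside matching chunks
                hits.extend(line for line in lines if needle in line)
        return hits
```

The cost of a query therefore depends on how many chunks the label selector matches, which is why narrow selectors such as {job="nginx", host="web01"} keep searches fast.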
How logs get into Loki
Loki does not pull logs the way Prometheus pulls metrics. Instead, agents push logs to Loki:
- Promtail: the default Loki agent. Runs on each host, tails log files (e.g., /var/log/syslog, container logs), attaches labels, and pushes to Loki’s HTTP API
- Grafana Alloy: the newer, unified collector from Grafana Labs that replaces Promtail, Node Exporter, and other agents with a single binary
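What an agent pushes is a JSON document sent with an HTTP POST to Loki's /loki/api/v1/push endpoint: a list of streams, each with a label set and a list of (nanosecond timestamp, line) pairs. The sketch below only builds that payload; the helper name is made up, and a real agent also handles batching, retries, and file position tracking.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a Loki push payload for one stream.

    labels: dict of label names to values, e.g. {"job": "nginx"}.
    lines: log lines, oldest first.
    Timestamps are strings holding Unix time in nanoseconds; consecutive
    lines get increasing timestamps so they stay ordered.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    values = [[str(now_ns + offset), line] for offset, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


payload = build_push_payload({"job": "nginx", "host": "web01"}, ["GET / 200"])
print(json.dumps(payload, indent=2))
```

Sending the serialized payload with a Content-Type of application/json to a Loki instance would create or append to the stream identified by those labels.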
Grafana queries Loki using LogQL (Log Query Language), which has a syntax similar to PromQL. You can view logs alongside metrics in the same Grafana dashboard.
Proxmox-specific monitoring
Proxmox VE has built-in monitoring in its web UI, but it is limited. Several tools extend it.
Built-in Proxmox monitoring
The Proxmox web interface shows real-time and short-history graphs for each node, VM, and container:
- CPU usage (percentage)
- RAM usage (used / total)
- Disk read/write throughput
- Network traffic (in/out per interface)
This is useful for quick checks but lacks long-term history, custom dashboards, and alerting.
ProxMenux
ProxMenux is a community-built tool that adds hardware-level monitoring to Proxmox that the default UI does not expose:
- CPU and chipset temperatures from hardware sensors
- SMART (Self-Monitoring, Analysis and Reporting Technology) disk health data — predicting drive failures before they happen
- Docker container health status for Docker workloads running inside Proxmox VMs or LXC containers
Pulse
Pulse is a lightweight monitoring dashboard purpose-built for Proxmox environments. It provides detailed metrics for the host and all VMs/containers with built-in alert thresholds. Pulse is simpler to deploy than a full Prometheus + Grafana stack and serves as a middle ground between the basic Proxmox UI and a full observability platform.
Hardware monitoring essentials
Regardless of which monitoring stack you choose, these are the critical hardware metrics to track in a homelab:
| Metric | Why it matters | How to collect |
|---|---|---|
| CPU temperature | Detect thermal throttling — when the CPU gets too hot, it reduces clock speed, causing performance degradation without any obvious error | Node Exporter (node_hwmon_temp_celsius), lm-sensors |
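| SMART disk data | SMART (Self-Monitoring, Analysis and Reporting Technology) attributes like reallocated sectors, pending sectors, and wear leveling count can signal drive failures days or weeks before they happen | smartmontools (smartctl); to get the data into Prometheus, use a SMART exporter such as smartctl_exporter or Node Exporter's textfile collector (the diskstats collector reports I/O counters, not SMART attributes) |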
| RAM usage per VM/container | Identify memory leaks (gradual increase over days) and over-provisioning (VMs allocated 8 GB but using 2 GB) | Proxmox API, Node Exporter |
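| Disk I/O latency | High latency (as a rough rule of thumb, above ~10ms for SSDs or ~20ms for HDDs) indicates storage bottlenecks that degrade all workloads on that disk | Node Exporter (node_disk_read_time_seconds_total divided by node_disk_reads_completed_total, plus the write equivalents; node_disk_io_time_seconds_total measures device busy time rather than latency), iostat |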
| Network bandwidth utilization | Saturated links cause packet drops and retransmissions — especially important if your homelab serves media (Plex/Jellyfin) or backups across the network | Node Exporter (node_network_receive_bytes_total) |
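Disk latency is not exposed as a single gauge; it is usually derived from two Node Exporter counters, node_disk_read_time_seconds_total and node_disk_reads_completed_total, by dividing the increase in time spent by the increase in completed operations. The Python sketch below shows that arithmetic on two counter snapshots; the function name and dictionary keys are assumptions for illustration.

```python
def avg_read_latency_ms(prev, curr):
    """Average read latency in milliseconds between two counter snapshots.

    prev, curr: dicts with 'read_time_seconds' and 'reads_completed' keys,
    holding the counter values at the start and end of the window.
    Returns None if no reads completed in the window.
    """
    reads = curr["reads_completed"] - prev["reads_completed"]
    if reads <= 0:
        return None  # idle disk (or counter reset): latency is undefined
    seconds = curr["read_time_seconds"] - prev["read_time_seconds"]
    return 1000.0 * seconds / reads
```

The same calculation applies to writes using the corresponding write counters, and the result can be compared against the rough thresholds in the table.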
Recommended progression
Not everything needs to be set up on day one. A practical progression for homelab monitoring:
Start (minimal effort, covers the basics)
- Proxmox built-in monitoring for quick VM/container health checks
- Uptime Kuma for “is it up?” checks on all your services
- Time investment: ~30 minutes
Intermediate (per-host metrics and dashboards)
- Prometheus as the central metrics store
- Node Exporter on every host for hardware/OS metrics
- Grafana with community dashboards (e.g., dashboard 1860 for Node Exporter)
- Time investment: ~2-3 hours
Advanced (full observability)
- Loki + Promtail/Alloy for centralized log aggregation
- Alertmanager for routing alerts to Slack, email, or other channels with deduplication and silencing
- SNMP (Simple Network Management Protocol) monitoring for network switches, routers, and access points — these devices cannot run Node Exporter, so Prometheus uses an SNMP exporter that translates SNMP OIDs (Object Identifiers — the hierarchical naming scheme SNMP uses to identify metrics on network devices) into Prometheus metrics
- Time investment: ~1 day