Homelab Monitoring

Why monitor a homelab

Without monitoring, you discover problems only when services break — a disk fills up, a container runs out of memory, or a drive fails silently. By then, the damage is done and recovery is reactive.

Monitoring shifts this from reactive to proactive:

Early warning — detect problems before they cause outages: disk filling up, CPU thermal throttling (the processor reducing its clock speed to avoid overheating), memory leaks in long-running services
Historical data — when something does break, time-series history lets you correlate the failure with what changed: a spike in disk I/O, a gradual memory climb, a sudden network throughput drop
Capacity planning — track resource usage trends over weeks and months to know when you need more RAM, another drive, or a second node

The metrics pipeline

Monitoring data flows through a pipeline from the machines being observed to a human-readable dashboard:

Targets (nodes, containers, apps)
  | expose /metrics endpoint (HTTP)
  v
Prometheus (scrapes targets every 15-60s)
  | stores time-series data in its TSDB
  v
Grafana (queries Prometheus via PromQL)
  | renders dashboards, evaluates alert rules
  v
Alertmanager / Notifications (email, Slack, PagerDuty)

Each layer has a distinct responsibility. Targets expose raw numbers. Prometheus collects and stores them. Grafana visualizes and alerts. Alertmanager routes and deduplicates notifications. This separation means you can swap a layer independently: for example, replace Prometheus with VictoriaMetrics, which also accepts PromQL queries, without rebuilding the whole stack.
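To make the first layer concrete, here is a minimal sketch of a target that exposes a /metrics endpoint using only the Python standard library. The metric names (demo_up, demo_temp_celsius), the collect() helper, and port 8000 are made up for illustration; real targets usually use an existing exporter or a client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {(name, labels): value} as Prometheus text exposition lines.

    labels is a tuple of (key, value) pairs, possibly empty.
    """
    lines = []
    for (name, labels), value in samples.items():
        if labels:
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Hypothetical data source; a real target would read live values here."""
    return {
        ("demo_up", ()): 1,
        ("demo_temp_celsius", (("sensor", "cpu"),)): 42.5,
    }


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() and adding <host>:8000 as a scrape target would be enough for Prometheus to start collecting these two series.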

Prometheus

Prometheus is an open-source monitoring system and TSDB (Time-Series Database — a database optimized for storing and querying timestamped numerical data points). Originally built at SoundCloud, it is now a graduated project of the CNCF (Cloud Native Computing Foundation).

Pull-based scraping model

Most monitoring systems use one of two models: push (agents send data to a central collector) or pull (the collector fetches data from agents). Prometheus uses pull. At configured intervals (typically 15 to 60 seconds), Prometheus sends an HTTP GET request to each target’s /metrics endpoint and parses the response.

The scraping cycle works as follows:

Prometheus reads its configuration file (prometheus.yml), which lists scrape targets — either static IP:port pairs or dynamic service discovery (DNS-SD, Consul, Kubernetes API, etc.)
On each scrape interval, Prometheus opens an HTTP connection to http://<target>/metrics

The target responds with a plain-text body in Prometheus exposition format — one metric per line:

node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_memory_MemAvailable_bytes 8376492032
node_filesystem_avail_bytes{mountpoint="/",fstype="ext4"} 53687091200

Prometheus parses these lines, attaches a timestamp, and appends them to its local TSDB
The TSDB stores data as time-series, each identified by a metric name plus a set of key-value labels (e.g., node_cpu_seconds_total{cpu="0",mode="idle"})
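The parsing step can be sketched in a few lines of Python. This is a simplified illustration, not Prometheus's actual parser: it assumes no escaped quotes inside label values and no trailing timestamp on the line, and it skips comment lines (those starting with #).

```python
import re

# metric name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>[^"]*)"')


def parse_line(line):
    """Return (name, labels, value) for one sample line, or None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"unparseable sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))
```

The metric name together with the labels dictionary is what identifies a time-series; the float is the sample that gets a timestamp and is appended to it.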

Storage

Prometheus compresses time-series data into 2-hour blocks on disk, achieving roughly 1-2 bytes per sample. By default it retains 15 days of data, configurable via --storage.tsdb.retention.time. For longer retention, Prometheus supports remote write to external stores like Thanos or Cortex.
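The bytes-per-sample figure makes a rough disk estimate possible. The sketch below is back-of-the-envelope arithmetic only; the series count is an assumption you would replace with your own numbers, and it ignores index and write-ahead-log overhead.

```python
def estimate_tsdb_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=1.5):
    """Approximate on-disk size: series x samples kept per series x bytes per sample."""
    samples_per_series = retention_days * 86_400 / scrape_interval_s
    return active_series * samples_per_series * bytes_per_sample


# Assumed example: 5,000 active series, 15 s scrape interval, default 15-day retention.
size = estimate_tsdb_bytes(active_series=5_000, scrape_interval_s=15, retention_days=15)
print(f"{size / 1e9:.2f} GB")
```

Under these assumptions the estimate comes out well under 1 GB, which is why a small homelab rarely needs to worry about Prometheus disk usage at default retention.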

PromQL

PromQL (Prometheus Query Language) is the query language used to select, filter, and aggregate time-series data. Examples:

node_cpu_seconds_total{mode="idle"} — select all idle CPU time-series
rate(node_cpu_seconds_total{mode="idle"}[5m]) — per-second rate of change over a 5-minute window
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) — CPU usage percentage per host
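To see what rate() and the CPU usage expression compute, here is a simplified Python equivalent. It is an approximation for intuition only: it handles counter resets, but it does not extrapolate to the window boundaries the way PromQL's rate() does, so real results can differ slightly.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window of (timestamp, value) pairs."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. after a reboot): count from zero again.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def cpu_usage_percent(idle_rates_per_core):
    """Mirror of 100 - (avg(rate(idle)) * 100) for the cores of one host."""
    avg_idle = sum(idle_rates_per_core) / len(idle_rates_per_core)
    return 100 - avg_idle * 100
```

An idle counter that grows by 0.9 seconds per second means that core was idle 90% of the time; averaging the idle fraction across cores and subtracting from 100 gives the usage percentage.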

Prometheus does not provide visualization. It has a basic expression browser for ad-hoc queries, but dashboards require a separate tool — typically Grafana.

Node Exporter

Node Exporter is a Prometheus exporter (an agent that translates system metrics into Prometheus exposition format) that runs on each Linux machine you want to monitor. It reads from /proc, /sys, and other kernel interfaces to expose hardware and OS-level metrics:

Metric category	What it measures	Example metric name
CPU	Per-core usage broken down by mode (user, system, idle, iowait)	`node_cpu_seconds_total`
Memory	Total, available, buffered, cached RAM	`node_memory_MemAvailable_bytes`
Disk I/O	Read/write bytes, operations, latency per device	`node_disk_read_bytes_total`
Filesystem	Total/available/used space per mount point	`node_filesystem_avail_bytes`
Network	Bytes/packets transmitted and received per interface	`node_network_receive_bytes_total`
Temperature	CPU and other sensor temperatures via hwmon	`node_hwmon_temp_celsius`

You install Node Exporter as a systemd service on each host. It listens on port 9100 by default. Prometheus is then configured to scrape http://<host>:9100/metrics.
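If you have more than a couple of hosts, one option is to generate the target list rather than type it by hand. The sketch below assumes Prometheus's file-based service discovery (file_sd_configs), which reads JSON files containing target groups; the hostnames and the env label are placeholders.

```python
import json

NODE_EXPORTER_PORT = 9100  # Node Exporter's default listen port


def node_exporter_targets(hosts, labels=None):
    """Return a file_sd-style JSON document listing host:9100 for each host."""
    group = {"targets": [f"{host}:{NODE_EXPORTER_PORT}" for host in hosts]}
    if labels:
        group["labels"] = dict(labels)
    return json.dumps([group], indent=2)


# Hypothetical hostnames; write the result to a file that a file_sd_configs entry points at.
print(node_exporter_targets(["pve1", "nas"], labels={"env": "homelab"}))
```

Prometheus appends the default /metrics path to each target, so each entry corresponds to http://<host>:9100/metrics.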

Grafana

Grafana is an open-source visualization and observability platform. It does not store data itself — it connects to data sources (Prometheus, InfluxDB, Loki, Elasticsearch, PostgreSQL, and many others) and renders dashboards from their data.

How it works

You add a data source (e.g., your Prometheus instance at http://prometheus:9090)
You create a dashboard with panels — each panel contains a query in the data source’s native language (PromQL for Prometheus, LogQL for Loki, etc.)
Grafana executes the query against the data source and renders the result as a graph, table, gauge, heatmap, or other visualization
Dashboards auto-refresh at configurable intervals (e.g., every 30 seconds)
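Under the hood, a graph panel backed by Prometheus boils down to a range query against the Prometheus HTTP API. The sketch below only builds the request URL for the /api/v1/query_range endpoint; it is an illustration of the kind of request involved, not Grafana's code, and the base URL is the example address from the list above.

```python
from urllib.parse import urlencode


def query_range_url(base_url, promql, start, end, step_s):
    """Build a Prometheus range-query URL; start and end are Unix timestamps."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step_s})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


url = query_range_url(
    "http://prometheus:9090",
    'rate(node_cpu_seconds_total{mode="idle"}[5m])',
    start=1700000000,
    end=1700003600,
    step_s=30,
)
print(url)
```

Each auto-refresh repeats a request like this with a shifted time window, and the returned series become the lines on the graph.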

Community dashboards

Grafana maintains a public dashboard repository at grafana.com/grafana/dashboards. You can import pre-built dashboards by ID. For example, dashboard 1860 (“Node Exporter Full”) gives comprehensive host metrics out of the box — just point it at your Prometheus data source.

Alerting

Grafana supports alert rules defined against any data source. When a condition is met (e.g., CPU usage above 90% for 5 minutes), Grafana sends notifications through configured contact points: email, Slack, PagerDuty, Telegram, webhooks, and others.
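The "above 90% for 5 minutes" condition is a small state machine: a breach starts a timer, and the alert only fires once the breach has lasted the full duration. The following is a simplified sketch of that idea, not Grafana's implementation; the state names and the sample data are illustrative.

```python
def alert_states(samples, threshold, for_seconds):
    """Classify each (timestamp, value) sample as normal, pending, or firing."""
    states = []
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample over the threshold
            held_long_enough = ts - breach_start >= for_seconds
            states.append("firing" if held_long_enough else "pending")
        else:
            breach_start = None  # any recovery resets the timer
            states.append("normal")
    return states


# One CPU usage sample per minute; threshold 90%, must hold for 300 seconds.
cpu = [(0, 50), (60, 95), (120, 96), (180, 97), (240, 98), (300, 99), (360, 99), (420, 40)]
print(alert_states(cpu, threshold=90, for_seconds=300))
```

The waiting period is what keeps a short spike from paging you: only a sustained breach reaches the firing state and triggers a notification.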

Uptime Kuma

Uptime Kuma is a lightweight, self-hosted uptime monitoring tool. Where Prometheus + Grafana provides deep metrics, Uptime Kuma answers a simpler question: “is this service reachable?”

It supports multiple check types:

HTTP(S) — sends a request to a URL, checks for expected status code and optionally a keyword in the response body
TCP — attempts a TCP connection to a host:port (useful for databases, SMTP, custom services)
DNS — resolves a domain name and checks the result matches expectations
Ping — sends ICMP (Internet Control Message Protocol) echo requests to check host reachability

For each monitored target, Uptime Kuma tracks:

Current status (up/down)
Uptime percentage over configurable windows (24h, 7d, 30d)
Response time history with graphs
Incident history (when it went down, when it recovered)
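The tracked values above all derive from a simple series of check results. As a rough sketch (this is not Uptime Kuma's code, and the data shape is an assumption), uptime percentage and incident history can be computed from a list of (timestamp, is_up) pairs like this:

```python
def uptime_percent(checks):
    """Share of successful checks, as a percentage; None if there are no checks."""
    if not checks:
        return None
    up = sum(1 for _, ok in checks if ok)
    return 100.0 * up / len(checks)


def incidents(checks):
    """Return (down_at, recovered_at) pairs; recovered_at is None if still down."""
    result = []
    down_at = None
    for ts, ok in checks:
        if not ok and down_at is None:
            down_at = ts  # transition up -> down opens an incident
        elif ok and down_at is not None:
            result.append((down_at, ts))  # transition down -> up closes it
            down_at = None
    if down_at is not None:
        result.append((down_at, None))
    return result


checks = [(0, True), (60, True), (120, False), (180, False), (240, True), (300, False)]
print(uptime_percent(checks), incidents(checks))
```

Restricting the input list to the last 24 hours, 7 days, or 30 days gives the windowed uptime figures.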

Uptime Kuma runs as a single Node.js process with an embedded SQLite database. It provides a clean web UI and supports notifications via Slack, Telegram, Discord, email, and dozens of other channels. It is much simpler to set up and operate than a full Prometheus + Grafana stack, making it a good starting point for homelab monitoring.

Loki

Loki is a log aggregation system built by Grafana Labs. It fills the gap that Prometheus leaves: Prometheus handles metrics (numerical time-series), while Loki handles logs (textual event streams).

Why not just use Prometheus for logs?

Prometheus stores only numeric samples, so it has nowhere to put a line of log text. Logs are unstructured or semi-structured text with highly variable cardinality. Indexing the full content of every log line (the way Elasticsearch does) requires significant storage and compute. Loki takes a different approach: it indexes only the metadata labels (e.g., {job="nginx", host="web01"}) and stores the log content as compressed chunks. When you query, Loki filters by labels first (fast index lookup), then scans only the matching chunks for your search string.

This design makes Loki much cheaper to operate than a full-text search engine while still being effective for troubleshooting.
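The label-index-plus-chunks idea can be shown with a toy in-memory store. This is a conceptual sketch only and bears no resemblance to Loki's real storage engine; the class name and methods are made up for illustration.

```python
import zlib


class TinyLogStore:
    """Toy model: labels are indexed, log content is kept as compressed chunks."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of compressed chunks
        self._streams = {}

    def push(self, labels, lines):
        key = frozenset(labels.items())
        chunk = zlib.compress("\n".join(lines).encode("utf-8"))
        self._streams.setdefault(key, []).append(chunk)

    def query(self, selector, needle):
        wanted = set(selector.items())
        hits = []
        for key, chunks in self._streams.items():
            # Step 1: label filter. No log content is touched for non-matching streams.
            if not wanted <= key:
                continue
            # Step 2: decompress and scan only the chunks of matching streams.
            for chunk in chunks:
                for line in zlib.decompress(chunk).decode("utf-8").splitlines():
                    if needle in line:
                        hits.append(line)
        return hits


store = TinyLogStore()
store.push({"job": "nginx", "host": "web01"}, ["GET / 200", "GET /admin 500"])
store.push({"job": "sshd", "host": "web01"}, ["Failed password for root", "error 500 unrelated"])
print(store.query({"job": "nginx"}, "500"))
```

The sshd stream also contains the string "500", but it is never decompressed because its labels do not match the selector. That skipped work is where the cost saving comes from.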

How logs get into Loki

Loki does not pull logs the way Prometheus pulls metrics. Instead, agents push logs to Loki:

Promtail — the default Loki agent. Runs on each host, tails log files (e.g., /var/log/syslog, container logs), attaches labels, and pushes to Loki’s HTTP API
Grafana Alloy — the newer, unified collector from Grafana Labs that replaces Promtail, Node Exporter, and other agents with a single binary
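What an agent pushes is a small JSON document. The sketch below builds a payload in the shape accepted by Loki's HTTP push endpoint (/loki/api/v1/push): a list of streams, each with a label set and [timestamp-in-nanoseconds, line] pairs. It only constructs the body and sends nothing; the labels and log lines are placeholders, and real agents add batching, retries, and position tracking.

```python
import json
import time


def loki_push_payload(labels, lines, now_ns=None):
    """Build the JSON body for one stream; timestamps are nanosecond strings."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Offset each line by 1 ns so entries within the stream keep their order.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": dict(labels), "values": values}]})


body = loki_push_payload(
    {"job": "nginx", "host": "web01"},
    ["GET / 200", "GET /admin 500"],
    now_ns=1_700_000_000_000_000_000,
)
print(body)
```

The labels in the stream object are the only part Loki indexes; the log lines themselves end up in compressed chunks.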

Grafana queries Loki using LogQL (Log Query Language), which has a syntax similar to PromQL. You can view logs alongside metrics in the same Grafana dashboard.

Proxmox-specific monitoring

Proxmox VE has built-in monitoring in its web UI, but it is limited. Several tools extend it.

Built-in Proxmox monitoring

The Proxmox web interface shows real-time and short-history graphs for each node, VM, and container:

CPU usage (percentage)
RAM usage (used / total)
Disk read/write throughput
Network traffic (in/out per interface)

This is useful for quick checks but lacks long-term history, custom dashboards, and alerting.

ProxMenux

ProxMenux is a community-built tool that adds hardware-level monitoring to Proxmox that the default UI does not expose:

CPU and chipset temperatures from hardware sensors
SMART (Self-Monitoring, Analysis and Reporting Technology) disk health data — predicting drive failures before they happen
Docker container health status for Docker workloads running inside Proxmox VMs or LXC containers

Pulse

Pulse is a lightweight monitoring dashboard purpose-built for Proxmox environments. It provides detailed metrics for the host and all VMs/containers with built-in alert thresholds. Pulse is simpler to deploy than a full Prometheus + Grafana stack and serves as a middle ground between the basic Proxmox UI and a full observability platform.

Hardware monitoring essentials

Regardless of which monitoring stack you choose, these are the critical hardware metrics to track in a homelab:

Metric	Why it matters	How to collect
CPU temperature	Detect thermal throttling — when the CPU gets too hot, it reduces clock speed, causing performance degradation without any obvious error	Node Exporter (`node_hwmon_temp_celsius`), `lm-sensors`
SMART disk data	SMART (Self-Monitoring, Analysis and Reporting Technology) attributes like reallocated sectors, pending sectors, and wear leveling count can warn of drive failures days or weeks before they happen	`smartmontools` (`smartctl`), exported through Node Exporter's textfile collector or a dedicated SMART exporter
RAM usage per VM/container	Identify memory leaks (gradual increase over days) and over-provisioning (VMs allocated 8 GB but using 2 GB)	Proxmox API, Node Exporter
Disk I/O latency	High latency (above ~10ms for SSDs, ~20ms for HDDs) indicates storage bottlenecks that degrade all workloads on that disk	Node Exporter (`node_disk_read_time_seconds_total` divided by `node_disk_reads_completed_total`, plus the write equivalents), `iostat`
Network bandwidth utilization	Saturated links cause packet drops and retransmissions — especially important if your homelab serves media (Plex/Jellyfin) or backups across the network	Node Exporter (`node_network_receive_bytes_total`)
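Disk latency is not exposed as a single gauge; it is usually derived from two counters. As a hedged sketch, average read latency over an interval is the growth in time spent reading divided by the growth in completed reads, which is what Node Exporter's node_disk_read_time_seconds_total and node_disk_reads_completed_total allow you to compute. The dictionary keys and sample values below are invented for the example.

```python
def avg_read_latency_ms(prev, cur):
    """Average latency per read between two counter snapshots, in milliseconds."""
    ops = cur["reads_completed"] - prev["reads_completed"]
    if ops <= 0:
        return None  # no reads in the interval (or a counter reset)
    busy_seconds = cur["read_time_seconds"] - prev["read_time_seconds"]
    return 1000.0 * busy_seconds / ops


earlier = {"read_time_seconds": 10.0, "reads_completed": 1000}
later = {"read_time_seconds": 12.5, "reads_completed": 1500}
print(avg_read_latency_ms(earlier, later))
```

Here 500 reads took 2.5 seconds in total, so the average is 5 ms per read, which you would then compare against the thresholds in the table.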

Recommended progression

Not everything needs to be set up on day one. A practical progression for homelab monitoring:

Start (minimal effort, covers the basics)

Proxmox built-in monitoring for quick VM/container health checks
Uptime Kuma for “is it up?” checks on all your services
Time investment: ~30 minutes

Intermediate (per-host metrics and dashboards)

Prometheus as the central metrics store
Node Exporter on every host for hardware/OS metrics
Grafana with community dashboards (e.g., dashboard 1860 for Node Exporter)
Time investment: ~2-3 hours

Advanced (full observability)

Loki + Promtail/Alloy for centralized log aggregation
Alertmanager for routing alerts to Slack, email, or other channels with deduplication and silencing
SNMP (Simple Network Management Protocol) monitoring for network switches, routers, and access points — these devices cannot run Node Exporter, so Prometheus uses an SNMP exporter that translates SNMP OIDs (Object Identifiers — the hierarchical naming scheme SNMP uses to identify metrics on network devices) into Prometheus metrics
Time investment: ~1 day
