Ceph Distributed Storage
Why distributed storage
A single disk fails. A single server fails. When either happens, data is lost unless there is a copy somewhere else. Traditional solutions — mirroring disks inside one machine (RAID), or backing up to a second machine — work at small scale, but they hit hard limits:
- Capacity: a single server has a finite number of drive bays. Once you need tens or hundreds of terabytes, you must span multiple machines.
- Throughput: one server’s network and bus bandwidth caps total I/O. Spreading data across nodes lets clients read and write in parallel.
- Durability: if the entire server (power supply, motherboard, backplane) dies, local RAID does not help. You need replicas on physically separate machines.
- Availability: maintenance, firmware updates, and hardware swaps take nodes offline. A distributed system continues serving data while individual nodes are down.
Distributed storage solves all four by treating a cluster of commodity servers as a single storage pool. Data is automatically spread across nodes, replicated for durability, and rebalanced when nodes join or leave.
What is Ceph
Ceph is an open-source distributed storage system that provides three storage interfaces — object, block, and file — on top of a single unified cluster. It was created by Sage Weil as part of his PhD thesis at the University of California, Santa Cruz, in 2006. Ceph is now a project under the Linux Foundation and is widely deployed in OpenStack clouds, Kubernetes persistent volume backends, and enterprise storage environments.
Key properties:
- No single point of failure: every component can be replicated; there is no central metadata server for the core object layer.
- Self-healing: when a disk or node fails, the cluster automatically re-replicates data to restore the configured redundancy level.
- Horizontal scaling: adding nodes increases both capacity and throughput. The system rebalances automatically.
- Software-defined: runs on commodity hardware. No proprietary storage controllers.
Architecture — the RADOS layer
Everything in Ceph is built on RADOS (Reliable Autonomic Distributed Object Store). RADOS is the core storage layer. Regardless of whether you use Ceph for block devices, filesystems, or S3-compatible object storage, the data ultimately lives as objects inside RADOS.
An object in RADOS is a blob of data (up to several megabytes) plus a key (the object name) plus optional metadata (key-value pairs called xattrs — extended attributes).
Component overview
+-----------------+  +-----------------+  +-----------------+
|       RBD       |  |     CephFS      |  |       RGW       |
|     (Block)     |  |     (File)      |  |    (Object)     |
+--------+--------+  +--------+--------+  +--------+--------+
         |                    |                    |
         +--------------------+--------------------+
                              |
                              v
                +----------------------------+
                |          librados          |
                |   (RADOS client library)   |
                +-------------+--------------+
                              |
         +--------------------+--------------------+
         |                    |                    |
         v                    v                    v
+--------+--------+  +--------+--------+  +--------+--------+
|       MON       |  |       OSD       |  |       MDS       |
|    (Monitor)    |  | (Object Storage |  |    (Metadata    |
|                 |  |     Daemon)     |  |     Server)     |
+-----------------+  +--------+--------+  +-----------------+
                              |
                              v
                     +--------+--------+
                     |       MGR       |
                     |    (Manager)    |
                     +-----------------+
MON — Monitor
MON (Monitor) daemons maintain the cluster’s authoritative state. They store and distribute the cluster map, which is actually a collection of maps:
| Map | Purpose |
|---|---|
| Monitor map | List of MON daemons, their addresses, and the current epoch (version number) |
| OSD map | List of all OSDs (Object Storage Daemons), their status (up/down, in/out), and weight |
| PG map | PG (Placement Group) states, version numbers, and statistics |
| CRUSH map | Cluster topology (racks, rows, hosts, disks) plus placement rules |
Monitors use the Paxos consensus algorithm to agree on the current state. Because Paxos requires a majority to make progress, an odd number of monitors is recommended, typically 3 or 5: adding a fourth monitor to a three-monitor cluster raises the quorum size without letting the cluster survive any additional failure. Monitors also handle client authentication using the cephx protocol (a shared-secret scheme similar in spirit to Kerberos).
Monitors are lightweight. They do not handle data I/O — they only provide the map that lets clients and OSDs find each other.
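The majority rule behind the monitor count can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are made up for illustration and are not part of any Ceph tool.

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can fail while a majority is still reachable."""
    return monitors - quorum_size(monitors)


for n in range(1, 8):
    print(f"{n} monitors: quorum={quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Three and four monitors both tolerate a single failure, which is why even counts add cost without adding resilience.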
OSD — Object Storage Daemon
An OSD (Object Storage Daemon) typically manages a single physical storage device: one spinning disk, one SSD (Solid-State Drive), or one NVMe drive. A server with 10 drives runs 10 independent OSD processes.
Each OSD is responsible for:
- Storing objects: writing them to its local device (historically using a filesystem like XFS; modern Ceph uses BlueStore, which writes directly to the raw block device, bypassing the kernel filesystem entirely).
- Replicating: the primary OSD for a PG pushes writes to the secondary and tertiary OSDs.
- Heartbeating: OSDs ping each other. If an OSD stops responding, its peers report it to the monitors.
- Recovering: when a failed OSD comes back, or when a new OSD is added, data is redistributed automatically.
- Scrubbing: periodically comparing object metadata across replicas (light scrub) and, less frequently, reading all object data and verifying checksums (deep scrub) to detect bit-rot (silent data corruption).
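The replica comparison at the heart of scrubbing can be sketched as follows. This is a simplification under stated assumptions: `zlib.crc32` stands in for the checksum the OSD actually stores, and a simple majority vote stands in for Ceph's real repair logic. The `scrub` function and the sample data are hypothetical.

```python
import zlib
from collections import Counter


def scrub(replicas: dict[str, bytes]) -> list[str]:
    """Return the OSDs whose copy disagrees with the majority checksum."""
    checksums = {osd: zlib.crc32(data) for osd, data in replicas.items()}
    majority, _ = Counter(checksums.values()).most_common(1)[0]
    return [osd for osd, crc in checksums.items() if crc != majority]


copies = {
    "osd.2": b"hello ceph",
    "osd.5": b"hello ceph",
    "osd.7": b"hellp ceph",  # simulated bit-rot on one replica
}
print(scrub(copies))  # ['osd.7']
```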
A large production cluster might run hundreds or thousands of OSD daemons.
MDS — Metadata Server
MDS (Metadata Server) daemons are required only when using CephFS (the POSIX-compatible filesystem interface). They cache and manage filesystem metadata: the directory tree, file names, permissions, timestamps, and lock state.
MDS does not store file data — that still lives on OSDs. It only handles the metadata operations that POSIX demands (stat, readdir, rename, chmod). Block storage (RBD) and object storage (RGW) do not use MDS at all.
Multiple MDS daemons can run for high availability and to partition the namespace for performance (a feature called dynamic subtree partitioning — different MDS daemons handle different parts of the directory tree).
MGR — Manager
MGR (Manager) daemons handle monitoring, telemetry, and orchestration tasks that were historically bolted onto MON daemons. Responsibilities include:
- Exposing a Prometheus endpoint for metric scraping
- Running the Ceph Dashboard (a web UI for cluster management)
- Providing a REST API
- Managing device health telemetry (SMART data from disks)
- Orchestrating deployments via cephadm or Rook
Typically two MGR daemons run: one active, one standby.
MON vs MGR — why two management daemons?
Early Ceph had no MGR daemon: it first appeared in the Kraken release and became a required component in Luminous (2017). Until then, MON daemons handled everything: cluster state consensus, metrics collection, health checks, and the web dashboard. As clusters grew, MON daemons became overloaded. The consensus protocol (Paxos) that keeps cluster state consistent is sensitive to latency, and heavy metric collection or dashboard rendering in the same process could delay map updates and OSD heartbeat processing.
The MGR daemon was introduced to split responsibilities along a clear boundary:
| Concern | MON (consensus-critical) | MGR (operational) |
|---|---|---|
| Cluster map (authoritative state) | Yes — owns and distributes maps | No — reads maps from MON |
| Paxos consensus | Yes — participates in quorum | No — not part of quorum |
| Client authentication (cephx) | Yes | No |
| Metrics collection and aggregation | No | Yes — scrapes OSDs, exposes Prometheus endpoint |
| Web dashboard | No | Yes — runs the Ceph Dashboard module |
| Orchestration (deploying OSDs, etc.) | No | Yes — runs cephadm/Rook modules |
| Device health (SMART telemetry) | No | Yes |
| Number required | Odd (3 or 5) for quorum | 2 (one active, one standby) |
| Failure impact | Loss of quorum halts map updates — cluster cannot process topology changes | Dashboard and metrics go offline, but data I/O continues unaffected |
The design principle: MON must be fast, lightweight, and never stall on non-consensus work. Everything that is not strictly about maintaining cluster state lives in MGR, where a crash or slow module cannot delay Paxos rounds.
The CRUSH algorithm
The problem with lookup tables
A naive distributed storage system would maintain a central table mapping every object to the disk that stores it. This has three problems:
- Bottleneck: every read and write must consult the table.
- Scale: the table grows linearly with the number of objects (potentially billions).
- Single point of failure: if the table is lost, you cannot find anything.
How CRUSH works
CRUSH (Controlled Replication Under Scalable Hashing) replaces the lookup table with a deterministic, pseudo-random algorithm. Given an object name and the current cluster topology (the CRUSH map), any node — client or server — can independently compute which OSDs store that object. No central lookup is needed.
The computation proceeds in two stages:
Stage 1 — Object to PG:
pg_id = hash(object_name) mod num_placement_groups
The object name is hashed (using CRUSH’s hash function, which produces a 32-bit value), and the result is taken modulo the number of PGs (Placement Groups) in the pool. This maps the object to exactly one PG.
Stage 2 — PG to OSDs:
osd_set = CRUSH(pg_id, crush_map, placement_rule)
CRUSH takes the PG id, the cluster topology tree (the CRUSH map), and a placement rule (e.g., “pick 3 OSDs, each on a different host”) and deterministically selects a set of OSDs. The first OSD in the list is the primary — it coordinates writes and serves reads.
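The two stages can be sketched in a few lines of Python. This is an illustration only: `stable_hash` uses SHA-256 as a stand-in for Ceph's own hash function, and `pg_to_osds` uses plain rendezvous hashing rather than the real CRUSH algorithm, so it ignores device weights and the topology tree. All function names are hypothetical.

```python
import hashlib


def stable_hash(*parts: object) -> int:
    """32-bit hash that is stable across runs (stand-in for Ceph's hash)."""
    digest = hashlib.sha256("/".join(map(str, parts)).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Stage 1: map an object name to a placement group."""
    return stable_hash(object_name) % pg_num


def pg_to_osds(pg_id: int, osds: list[int], replicas: int = 3) -> list[int]:
    """Stage 2 (simplified): rank OSDs by a per-PG pseudo-random score.

    Rendezvous hashing, NOT real CRUSH: no weights, no failure domains.
    """
    ranked = sorted(osds, key=lambda osd: stable_hash(pg_id, osd), reverse=True)
    return ranked[:replicas]


osds = list(range(9))
pg = object_to_pg("photo.jpg", pg_num=128)
acting = pg_to_osds(pg, osds)
print(f"pg={pg} osds={acting} primary=osd.{acting[0]}")
```

Any client that knows the same OSD list and PG count computes the same answer, which is the property that removes the need for a central lookup.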
The CRUSH map as a topology tree
The CRUSH map is a hierarchical description of the cluster’s physical layout:
root default
+-- rack rack1
|   +-- host node-01
|   |   +-- osd.0 (ssd, 1TB)
|   |   +-- osd.1 (ssd, 1TB)
|   |   +-- osd.2 (hdd, 8TB)
|   +-- host node-02
|       +-- osd.3 (ssd, 1TB)
|       +-- osd.4 (hdd, 8TB)
+-- rack rack2
    +-- host node-03
    |   +-- osd.5 (ssd, 1TB)
    |   +-- osd.6 (hdd, 8TB)
    +-- host node-04
        +-- osd.7 (ssd, 1TB)
        +-- osd.8 (hdd, 8TB)
Placement rules reference this hierarchy. For example, a rule might say: “select 3 OSDs, each from a different host.” Another rule might say: “select 3 OSDs, each from a different rack” — providing rack-level fault tolerance. The hierarchy can include datacenters, rooms, rows, racks, and hosts.
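A failure-domain rule over the example tree can be sketched like this. It is a toy model, not CRUSH itself: the `TOPOLOGY` dictionary mirrors the tree above, the scoring function is a SHA-256 stand-in, and device weights and classes are ignored. The names `place` and `failure_domains` are hypothetical.

```python
import hashlib

# Topology from the tree above: rack -> host -> OSD ids.
TOPOLOGY = {
    "rack1": {"node-01": [0, 1, 2], "node-02": [3, 4]},
    "rack2": {"node-03": [5, 6], "node-04": [7, 8]},
}


def score(*parts: object) -> int:
    """Deterministic pseudo-random score (stand-in for CRUSH's hashing)."""
    digest = hashlib.sha256("/".join(map(str, parts)).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def failure_domains(level: str) -> dict[str, list[int]]:
    """Group OSD ids by rack or by host."""
    if level == "rack":
        return {rack: [o for osds in hosts.values() for o in osds]
                for rack, hosts in TOPOLOGY.items()}
    return {host: osds for hosts in TOPOLOGY.values()
            for host, osds in hosts.items()}


def place(pg_id: int, replicas: int, level: str) -> list[int]:
    """Pick one OSD from each of `replicas` distinct failure domains."""
    domains = failure_domains(level)
    if replicas > len(domains):
        raise ValueError(f"only {len(domains)} {level}s for {replicas} replicas")
    chosen = sorted(domains, key=lambda d: score(pg_id, d), reverse=True)[:replicas]
    return [max(domains[d], key=lambda osd: score(pg_id, osd)) for d in chosen]


print(place(pg_id=42, replicas=3, level="host"))
```

Note that the example tree has only two racks, so this sketch raises an error for a rule asking for three replicas on three different racks. The topology has to contain at least as many failure domains as the rule demands.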
Why CRUSH matters
Because CRUSH is deterministic and stateless, clients compute placement locally. Adding or removing an OSD changes the CRUSH map, but only a fraction of PGs need to move (roughly 1/n of the data, where n is the number of OSDs). This is far better than a hash table resize, which would move most of the data.
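The difference in data movement can be demonstrated with a small experiment. The sketch below compares naive modulo placement with rendezvous hashing, which is used here as a simplified stand-in for CRUSH (an assumption; real CRUSH also handles weights and hierarchy). It counts how many PGs change their primary when a cluster grows from 9 to 10 OSDs.

```python
import hashlib


def score(pg_id: int, osd: int) -> int:
    """Deterministic pseudo-random score for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg_id}/{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def by_modulo(pg_id: int, num_osds: int) -> int:
    """Naive placement: the owner changes for most PGs when num_osds changes."""
    return pg_id % num_osds


def by_rendezvous(pg_id: int, num_osds: int) -> int:
    """Pseudo-random ranking: a PG moves only if the new OSD outranks the rest."""
    return max(range(num_osds), key=lambda osd: score(pg_id, osd))


def moved_fraction(place, pgs: int, before: int, after: int) -> float:
    """Fraction of PGs whose owner differs between the two cluster sizes."""
    moved = sum(place(pg, before) != place(pg, after) for pg in range(pgs))
    return moved / pgs


print(f"modulo:     {moved_fraction(by_modulo, 4096, 9, 10):.0%} of PGs move")
print(f"rendezvous: {moved_fraction(by_rendezvous, 4096, 9, 10):.0%} of PGs move")
```

With modulo placement about nine in ten PGs move, while the pseudo-random ranking moves roughly one in ten, close to the 1/n figure quoted above.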
Placement Groups (PGs)
Objects are not mapped directly to OSDs. Instead, a two-level indirection is used:
Object ---[hash mod]--> PG ---[CRUSH]--> OSD set
Why the extra level? Tracking the state and replication of every individual object (billions of them) would be prohibitively expensive. PGs (Placement Groups) group objects into manageable units (typically thousands of objects per PG). The cluster tracks replication, recovery, and peering at the PG level, not the object level.
A common rule of thumb targets roughly 100-200 PG replicas per OSD. For example, a cluster with 100 OSDs and 3x replication at 100 PGs per OSD works out to 100 × 100 / 3 ≈ 3,300 PGs in total, which is normally rounded to a power of two such as 4096. The number is chosen at pool creation time and can be adjusted later (auto-scaling was introduced in Ceph Nautilus, 2019).
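The sizing arithmetic can be written down as a small helper. This is a rule-of-thumb sketch, not the autoscaler's actual logic; the function name and the default target of 100 PGs per OSD are assumptions for illustration.

```python
def suggested_pg_count(osds: int, replicas: int, target_per_osd: int = 100) -> int:
    """Rule-of-thumb total PG count, rounded to the nearest power of two."""
    raw = max(osds * target_per_osd / replicas, 1)
    lower = 1 << (int(raw).bit_length() - 1)  # largest power of two <= raw
    upper = lower * 2
    return lower if raw - lower < upper - raw else upper


print(suggested_pg_count(osds=100, replicas=3))  # 4096
```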
PG states indicate health:
| State | Meaning |
|---|---|
| active+clean | Normal: PG is serving I/O and all replicas are present |
| active+degraded | Serving I/O but missing one or more replicas (e.g., an OSD is down) |
| peering | OSDs are agreeing on PG state after a topology change |
| recovering | Re-replicating data to restore full redundancy |
| backfilling | Moving PG data to a new OSD after rebalancing |
| stale | MON has not received updates from the PG’s primary OSD |
Storage interfaces
RBD — RADOS Block Device
RBD (RADOS Block Device) presents a Ceph pool as a block device. A virtual machine or container sees what looks like a normal disk (e.g., /dev/rbd0), but the data is striped across objects in RADOS.
Key features:
- Thin provisioning: a 1 TB RBD image does not allocate 1 TB upfront. Space is consumed only as data is written.
- Snapshots: instant, copy-on-write snapshots.
- Cloning: create a writable clone from a snapshot (used for fast VM provisioning).
- Live migration: move a running VM’s disk between pools without downtime.
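Striping and thin provisioning can be illustrated with a toy image class. This is a sketch under assumptions: a fixed 4 MiB object size, an in-memory dictionary in place of RADOS, and a hypothetical `ThinImage` class that is unrelated to the real librbd API. Real RBD objects are themselves sparse, so actual allocation is finer-grained than shown here.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB per backing object (assumed)


class ThinImage:
    """Toy block image: backing objects exist only once they are written."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.objects: dict[int, bytearray] = {}  # object index -> data

    def write(self, offset: int, data: bytes) -> None:
        if offset < 0 or offset + len(data) > self.size:
            raise ValueError("write outside image bounds")
        while data:
            index, start = divmod(offset, OBJECT_SIZE)
            chunk = data[: OBJECT_SIZE - start]
            obj = self.objects.setdefault(index, bytearray(OBJECT_SIZE))
            obj[start : start + len(chunk)] = chunk
            offset += len(chunk)
            data = data[len(chunk):]

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        while length > 0:
            index, start = divmod(offset, OBJECT_SIZE)
            n = min(length, OBJECT_SIZE - start)
            obj = self.objects.get(index)
            out += obj[start : start + n] if obj else bytes(n)  # unwritten = zeros
            offset += n
            length -= n
        return bytes(out)

    def allocated(self) -> int:
        return len(self.objects) * OBJECT_SIZE


image = ThinImage(size=10**12)           # 1 TB virtual size
image.write(OBJECT_SIZE - 2, b"ceph")    # spans two backing objects
print(image.allocated() // 2**20, "MiB allocated")  # 8 MiB allocated
```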
RBD is the primary storage backend for Proxmox VE and OpenStack Cinder (block storage service). It is also used as a Kubernetes persistent volume via the CSI (Container Storage Interface) driver.
CephFS — Ceph File System
CephFS is a POSIX-compatible distributed filesystem. It requires MDS daemons to manage the namespace (directories, file names, permissions). File data is striped across RADOS objects on OSDs.
CephFS supports:
- Standard POSIX operations (open, read, write, stat, readdir, rename)
- Multiple active MDS daemons for metadata scaling
- Snapshots at the directory level
- Quotas
- Kernel and FUSE (Filesystem in Userspace) mount clients
CephFS is appropriate when workloads require a shared filesystem visible to multiple clients simultaneously (e.g., HPC — High-Performance Computing — scratch space, shared home directories).
RGW — RADOS Gateway
RGW (RADOS Gateway) exposes RADOS as an HTTP-based object storage service compatible with the Amazon S3 API and the OpenStack Swift API. Applications that use S3 SDKs (Software Development Kits) can point at an RGW endpoint instead of AWS.
RGW supports:
- Buckets and objects (S3 semantics)
- Multipart uploads
- Versioning
- Lifecycle policies (automatic expiration, transition to different storage classes)
- Multi-site replication (active-active across geographically separated clusters)
Request flow — how a PUT /bucket/photo.jpg S3 request reaches RADOS:
S3 Client (boto3, aws-cli, etc.)
|
| HTTPS PUT /bucket/photo.jpg
v
RGW (HTTP frontend: Beast or Civetweb)
| 1. Authenticates the request (S3 signature v4)
| 2. Resolves bucket -> RADOS pool
| 3. Splits the object into RADOS-sized chunks if needed
| 4. Writes each chunk via librados
v
librados
| Computes: pg_id = hash(chunk_name) mod num_pgs
| Looks up CRUSH map -> OSD set for this PG
v
Primary OSD
| Writes to local BlueStore
| Replicates to secondary + tertiary OSDs
| Returns ack to librados -> RGW -> client
v
HTTP 200 OK
RGW also stores bucket metadata (ACLs, versioning config, lifecycle rules) as separate RADOS objects in a dedicated metadata pool.
Is Amazon S3 built on Ceph?
No. Amazon S3 (Simple Storage Service) is Amazon’s proprietary, purpose-built distributed object storage system, launched in 2006. Its internals are not public, but it predates Ceph’s object gateway and uses Amazon’s own infrastructure. S3 and Ceph were developed independently.
The connection between S3 and Ceph is that the S3 API became a de-facto industry standard for object storage. Because millions of applications already knew how to talk to S3 (using AWS SDKs, the s3:// URI scheme, and the S3 REST API for operations like PutObject, GetObject, and ListBucket), other storage systems adopted the same API so that existing tools would work without modification. Ceph’s RGW is one such implementation. Others include:
- MinIO — a lightweight, S3-compatible object store often used for on-premises or Kubernetes deployments
- Wasabi, Backblaze B2, Cloudflare R2 — commercial cloud storage services offering S3-compatible APIs
- OpenStack Swift: an object store with its own native API (which RGW also supports) that can additionally expose an S3-compatible API through middleware
The practical effect: if you deploy RGW on your Ceph cluster, any application that uses an S3 SDK (boto3 for Python, AWS SDK for Java, etc.) can point at your RGW endpoint instead of s3.amazonaws.com and work without code changes. This is why RGW exists — it makes Ceph a drop-in replacement for S3 in self-hosted environments.
What Ceph does NOT replicate from S3: S3 is deeply integrated with AWS’s identity system (IAM — Identity and Access Management), billing, and global infrastructure (multi-region automatic replication, S3 Intelligent-Tiering). RGW provides its own user management and supports multi-site replication, but these are separate implementations, not ports of AWS internals.
Replication and erasure coding
Replication (default)
By default, Ceph stores 3 copies (replicas) of every object, each on a different OSD (and ideally on different hosts or racks, depending on the CRUSH rule). This is called 3x replication or size=3.
Write path for a replicated pool:
Client
|
| write(object)
v
Primary OSD (osd.2)
|
+---> Secondary OSD (osd.5) [replica 2]
|
+---> Tertiary OSD (osd.7) [replica 3]
|
| (ack after all replicas confirm)
v
Client receives write confirmation
The primary OSD receives the write, forwards it to the secondary and tertiary replicas, and only acknowledges the write to the client after all replicas have committed it to stable storage. This guarantees that a confirmed write survives the loss of any single OSD (or even two OSDs, with 3x replication).
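The acknowledgement rule can be modelled with a few in-memory objects. This is a toy simulation, not Ceph code: the `OSD` class and `replicated_write` function are hypothetical, and the sketch deliberately leaves out what a real cluster does when a replica is down (re-peering and continuing in a degraded state if enough replicas remain).

```python
class OSD:
    """In-memory stand-in for an OSD that can be marked down."""

    def __init__(self, name: str, up: bool = True) -> None:
        self.name = name
        self.up = up
        self.store: dict[str, bytes] = {}

    def commit(self, key: str, data: bytes) -> bool:
        """Persist the object locally; a down OSD cannot confirm."""
        if not self.up:
            return False
        self.store[key] = data
        return True


def replicated_write(primary: OSD, replicas: list[OSD], key: str, data: bytes) -> bool:
    """Ack the client only if the primary and every replica committed."""
    acks = [primary.commit(key, data)]
    acks += [osd.commit(key, data) for osd in replicas]
    return all(acks)


primary, secondary, tertiary = OSD("osd.2"), OSD("osd.5"), OSD("osd.7")
print(replicated_write(primary, [secondary, tertiary], "photo.jpg", b"..."))  # True

tertiary.up = False
print(replicated_write(primary, [secondary, tertiary], "photo.jpg", b"..."))  # False
```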
Cost: 3x replication means 200% storage overhead — 1 TB of data requires 3 TB of raw disk space.
Erasure coding
Erasure coding (EC) is an alternative that trades CPU time for storage efficiency, similar in concept to RAID (Redundant Array of Independent Disks) levels 5 and 6.
An erasure-coded pool is configured with k data chunks and m coding (parity) chunks. For example, EC 4+2 means:
- Each object is split into 4 data chunks
- 2 parity chunks are computed
- All 6 chunks are stored on different OSDs
- Any 4 of the 6 chunks are sufficient to reconstruct the object
Storage overhead: (k+m)/k = 6/4 = 1.5x (50% overhead, versus 200% for 3x replication).
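The overhead figures for both schemes follow from one line of arithmetic each. The helper names below are made up for illustration, and the 8+3 profile is just an extra example value.

```python
def replication_overhead(size: int) -> float:
    """Raw bytes stored per byte of user data with `size` full copies."""
    return float(size)


def erasure_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data with k data + m parity chunks."""
    return (k + m) / k


schemes = [
    ("3x replication", replication_overhead(3)),
    ("EC 4+2", erasure_overhead(4, 2)),
    ("EC 8+3", erasure_overhead(8, 3)),
]
for label, factor in schemes:
    print(f"{label}: {factor:.3f}x raw space, {(factor - 1) * 100:.1f}% overhead")
```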
Trade-offs:
- Reads are efficient when all chunks are available (parallel read from k OSDs).
- Writes are more expensive because parity must be computed (CPU cost) and partial overwrites require read-modify-write cycles.
- Recovery after a failure requires reading k chunks to reconstruct the lost chunk.
Erasure coding is typically used for cold or archival data where the storage savings outweigh the write performance penalty.
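The reconstruction idea can be shown with the simplest possible code: k data chunks plus a single XOR parity chunk, as in RAID 5. This is a deliberately reduced stand-in. Ceph's erasure-coded pools use more general codes that support m > 1 parity chunks, which this sketch does not implement. The `encode` and `recover` helpers are hypothetical.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal chunks (zero-padded) plus one XOR parity chunk."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    return chunks + [reduce(xor_bytes, chunks)]


def recover(chunks: list[Optional[bytes]]) -> list[Optional[bytes]]:
    """Rebuild a single missing chunk by XOR-ing the k surviving ones."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost chunk")
    if missing:
        survivors = [c for c in chunks if c is not None]
        chunks[missing[0]] = reduce(xor_bytes, survivors)
    return chunks


stored = encode(b"erasure coding demo", k=4)
stored[2] = None                       # simulate a failed OSD
restored = recover(stored)
print(b"".join(restored[:4]).rstrip(b"\0"))  # b'erasure coding demo'
```

Rebuilding the lost chunk reads all four surviving chunks, which matches the recovery cost described in the trade-offs above.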
When you need Ceph (and when you do not)
Ceph is a good fit when:
- You have 3 or more servers (to satisfy quorum and replication requirements)
- You need redundancy across physical machines, not just across disks in one machine
- You are building a cloud or virtualization platform (OpenStack, Proxmox VE, Kubernetes)
- You need tens of terabytes to petabytes of storage
- You want a unified system for block, file, and object storage
Ceph is overkill when:
- You have a single server or a small homelab — ZFS (a combined filesystem and volume manager with built-in checksumming and replication) or LVM (Logical Volume Manager — a Linux subsystem for flexible disk partitioning) on local disks is simpler, faster, and requires no network.
- Your total data fits comfortably on one machine with RAID.
- You do not need multi-node redundancy.
A minimum viable Ceph cluster typically requires 3 nodes, for example with a MON and one or more OSDs colocated on each node, which makes it impractical for single-machine setups.
Why enterprise SSDs matter for Ceph
Ceph performs many small random I/O operations during normal operation:
- Journaling: BlueStore (the default OSD backend since Ceph Luminous, 2017) keeps a WAL (Write-Ahead Log) and a RocksDB database (for object metadata), which are often placed on a fast device, typically an SSD or NVMe drive.
- Replication: with 3x replication, every client write becomes three disk writes (one per replica), and the primary OSD must forward the data over the network to the other two.
- Recovery and backfill: when an OSD fails and data is re-replicated, the cluster generates sustained random reads and writes across many OSDs.
- Scrubbing: periodic checksum verification reads every object.
Consumer SSDs are designed for bursty desktop workloads. Under sustained random writes, they throttle dramatically (often dropping from 500 MB/s to under 50 MB/s) as their internal SLC (Single-Level Cell) cache fills and the controller must write directly to slower TLC (Triple-Level Cell) or QLC (Quad-Level Cell) flash.
Enterprise SSDs differ in several critical ways:
| Property | Consumer SSD | Enterprise SSD |
|---|---|---|
| Sustained random write IOPS | Drops after cache fills | Consistent (power-loss protected, larger/better-managed cache) |
| Endurance (TBW) | 200-600 TBW (Terabytes Written — total data that can be written before flash cells wear out) | 3,000-30,000+ TBW |
| Power-loss protection | Usually none | Capacitors hold charge to flush volatile cache to flash on power failure |
| Over-provisioning | ~7% | 20-30% (more spare cells for wear leveling) |
The power-loss protection is especially important. Without it, an unexpected shutdown (power outage, kernel panic) can leave the OSD’s metadata database in a corrupted state, potentially causing data loss. Enterprise SSDs with PLP (Power-Loss Protection) capacitors guarantee that in-flight writes complete even if external power is cut.
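The endurance row in the table above can be turned into a rough lifetime estimate. The sketch below is back-of-the-envelope arithmetic only: the workload figures (2 TB of client writes per day, 12 OSDs) are invented for illustration, writes are assumed to spread evenly across drives, and write amplification inside BlueStore and the SSD is reduced to a single optional factor.

```python
def drive_lifetime_years(tbw_rating: float, client_tb_per_day: float,
                         replicas: int, osd_count: int,
                         write_amplification: float = 1.0) -> float:
    """Years until the rated TBW is reached, assuming evenly spread writes."""
    per_drive_tb_per_day = (
        client_tb_per_day * replicas * write_amplification / osd_count
    )
    return tbw_rating / per_drive_tb_per_day / 365


# Hypothetical workload: 2 TB/day of client writes, 3x replication, 12 OSDs.
for label, tbw in [("consumer, 600 TBW", 600), ("enterprise, 3,000 TBW", 3000)]:
    years = drive_lifetime_years(tbw, client_tb_per_day=2, replicas=3, osd_count=12)
    print(f"{label}: about {years:.1f} years")
```

Because replication multiplies every client write, per-drive wear depends on the cluster-wide write rate rather than on what any single client sees.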
See also
- Consumer vs Enterprise SSDs — detailed comparison of endurance, power-loss protection, and performance consistency
- NAND Flash — how the underlying storage technology works
- NVME — the communication protocol used by modern SSDs
- Block storage — the abstraction layer above physical storage
- Proxmox VE — virtualization platform with built-in Ceph integration