APFS Copy-on-Write and Clones

Copy-on-write semantics

In APFS, writes never overwrite existing blocks. When a file is modified:

  1. New blocks are allocated from the free pool
  2. The new data is written to those blocks
  3. The file’s extent tree is updated to point to the new blocks
  4. The old blocks are freed — unless something else still references them (a snapshot, a clone)

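This is what makes snapshots near-instant and initially free: taking a snapshot doesn’t copy any data, it just pins the current state of the B-trees so the old blocks can’t be freed. The cost arrives later, as files change and the snapshot keeps their old blocks alive.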
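The four write steps and the pinning effect of a snapshot can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions: ToyVolume, its reference counts, and the file name notes.txt are all invented here and do not mirror APFS's real on-disk structures.

```python
class ToyVolume:
    """Toy copy-on-write block store (hypothetical; not APFS's real layout)."""

    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))  # free pool of block numbers
        self.data = {}    # block number -> contents
        self.refs = {}    # block number -> how many owners reference it
        self.files = {}   # file name -> list of block numbers ("extent tree")

    def _allocate(self, contents):
        block = self.free.pop(0)
        self.data[block] = contents
        self.refs[block] = 1
        return block

    def _release(self, block):
        self.refs[block] -= 1
        if self.refs[block] == 0:          # nobody else references it
            del self.refs[block]
            del self.data[block]
            self.free.append(block)

    def create(self, name, chunks):
        self.files[name] = [self._allocate(c) for c in chunks]

    def write(self, name, index, contents):
        old = self.files[name][index]
        new = self._allocate(contents)     # steps 1 and 2: allocate, write
        self.files[name][index] = new      # step 3: repoint the file
        self._release(old)                 # step 4: free old block if unreferenced

    def snapshot(self):
        # Copy only the block lists (metadata) and pin every referenced block.
        frozen = {name: list(blocks) for name, blocks in self.files.items()}
        for blocks in frozen.values():
            for block in blocks:
                self.refs[block] += 1
        return frozen


vol = ToyVolume(8)
vol.create("notes.txt", ["v1-a", "v1-b"])  # uses blocks 0 and 1
vol.write("notes.txt", 0, "v2-a")          # block 0 is freed: nothing pins it
snap = vol.snapshot()                      # pins blocks 2 and 1, copies no data
vol.write("notes.txt", 1, "v3-b")          # block 1 survives: the snapshot pins it
```

After the last write, the live file points at new blocks while the snapshot still reaches the old contents of block 1, which is why that block cannot return to the free pool.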

Clones

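APFS exposes copy-on-write at the user level via the clonefile(2) system call. Cloning a file creates a second, independent file whose extents (contiguous ranges of blocks on disk) point at the same physical blocks as the original. No data is copied, so the storage cost is approximately zero (just metadata). Unlike hard links, the two files diverge from then on: a write to either one goes to newly allocated blocks and leaves the other unchanged.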
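As a rough sketch, clonefile(2) can be called from Python through ctypes. The wrapper name clone_file is made up for this example, and the argument types are an assumption based on the C prototype (two paths and an integer flags word); check the man page on your system before relying on it.

```python
import ctypes
import os
import sys


def clone_file(src, dst):
    """Clone src to dst via clonefile(2). Hypothetical helper; macOS only."""
    if sys.platform != "darwin":
        raise NotImplementedError("clonefile(2) is only available on macOS")
    # CDLL(None) exposes the symbols already loaded into the process,
    # which on macOS includes the system C library.
    libc = ctypes.CDLL(None, use_errno=True)
    # Assumed prototype: int clonefile(const char *src, const char *dst, uint32_t flags)
    libc.clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_uint32]
    libc.clonefile.restype = ctypes.c_int
    if libc.clonefile(os.fsencode(src), os.fsencode(dst), 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), src)
```

The destination must not already exist, and the call is expected to fail with an OSError when source and destination are not on the same clone-capable volume.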

macOS uses this internally constantly: cp -c, Finder copies within a single volume, and various system operations use cloning. (Time Machine’s local backups build on snapshots rather than clones.)

Why this breaks du

du works by summing st_blocks (the number of 512-byte blocks allocated to a file, as reported by stat(2)) for every file it encounters. Each clone is a separate file that reports the full st_blocks for its extents, so when two cloned files share extents, du counts those blocks twice, once per file. The reported total is higher than the physical storage actually consumed.
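The following is a simplified approximation of that accounting, not du's actual source. The function name du_style_bytes is invented, and the sketch ignores the blocks used by directories themselves. It skips repeated (st_dev, st_ino) pairs the way du avoids double-counting hard links, which also shows why clones slip through: each clone has its own inode.

```python
import os


def du_style_bytes(root):
    """Sum st_blocks * 512 for every file under root (hypothetical helper)."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # hard link to an inode already counted
            seen.add(key)
            # A clone has a distinct inode, so it is counted in full here
            # even when its blocks are shared with another file.
            total += st.st_blocks * 512
    return total
```

Nothing available through stat(2) tells the walker that two inodes share physical blocks, so there is no simple fix at this layer.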

The reverse problem with snapshots: blocks held exclusively by a snapshot (the old versions of files you’ve since modified) have no live directory entry. du never sees them. They consume real space that du cannot account for.
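A small worked example makes both errors concrete. It is a sketch with invented numbers: files are modelled as sets of block numbers, each block is treated as 1 MiB, and the names A and B are placeholders.

```python
def du_total(files):
    # du-style view: add up every file's blocks, shared or not.
    return sum(len(blocks) for blocks in files.values())


def physical_total(files, snapshots):
    # Physical view: each block counts once, whoever references it.
    used = set()
    for blocks in files.values():
        used |= blocks
    for blocks in snapshots:
        used |= blocks
    return len(used)


files = {"A": set(range(100))}   # A occupies blocks 0..99
snapshots = []
history = []


def record(step):
    history.append((step, du_total(files), physical_total(files, snapshots)))


record("create A")
files["B"] = set(files["A"])                         # clone shares every block
record("clone A to B")
snapshots.append(set(files["A"]) | set(files["B"]))  # snapshot pins blocks 0..99
# Rewrite 40 blocks of A: old blocks 0..39 are replaced by new blocks 100..139.
files["A"] = (files["A"] - set(range(40))) | set(range(100, 140))
record("snapshot, then rewrite 40 blocks of A")
del files["B"]
record("delete B")

for step, du_mib, physical_mib in history:
    print(f"{step}: du={du_mib} MiB, physical={physical_mib} MiB")
```

In this model the clone pushes the du-style total to 200 MiB while only 100 MiB is physically used, and after B is deleted du drops to 100 MiB while 140 MiB stays in use, because the snapshot still holds the 40 overwritten blocks.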

This is not a bug — it’s a fundamental limitation of any tool that walks the directory tree to measure storage. The directory tree only shows the live state of files. APFS’s block accounting operates at a layer below the directory tree.

See macOS Disk Reporting for the full analysis.

See also