APFS Copy-on-Write and Clones
Copy-on-write semantics
In APFS, writes never overwrite existing blocks. When a file is modified:
- New blocks are allocated from the free pool
- The new data is written to those blocks
- The file’s extent tree is updated to point to the new blocks
- The old blocks are freed — unless something else still references them (a snapshot, a clone)
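This is what makes snapshots instant and, at creation time, free: taking a snapshot copies no data. It only pins the current state of the B-trees so the old blocks can’t be freed. The cost shows up later, as live files diverge from the snapshot and the superseded blocks stay allocated.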
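The write path and the pinning effect of a snapshot can be sketched with a toy reference-counting model. This is a simplification for illustration, not APFS's actual on-disk structures: the ToyVolume class and its method names are hypothetical, each write replaces the whole file rather than individual extents, and a "snapshot" is just a saved copy of the name-to-blocks mapping.

```python
from collections import Counter


class ToyVolume:
    """Toy model of copy-on-write block accounting (hypothetical, not real APFS)."""

    def __init__(self):
        self.next_block = 0
        self.data = {}          # block id -> payload (the allocated blocks)
        self.refs = Counter()   # block id -> number of references to it
        self.files = {}         # file name -> list of block ids ("extent tree")
        self.snapshots = []     # each snapshot: file name -> list of block ids

    def _alloc(self, payload):
        block = self.next_block
        self.next_block += 1
        self.data[block] = payload
        return block

    def _retain(self, blocks):
        for b in blocks:
            self.refs[b] += 1

    def _release(self, blocks):
        for b in blocks:
            self.refs[b] -= 1
            if self.refs[b] == 0:       # nothing references it: really free it
                del self.refs[b]
                del self.data[b]

    def write(self, name, payloads):
        new_blocks = [self._alloc(p) for p in payloads]  # allocate and write
        self._retain(new_blocks)
        old_blocks = self.files.get(name, [])
        self.files[name] = new_blocks                    # repoint the file
        self._release(old_blocks)                        # free old, if unshared

    def snapshot(self):
        snap = {name: list(blocks) for name, blocks in self.files.items()}
        for blocks in snap.values():
            self._retain(blocks)                         # pin, copy nothing
        self.snapshots.append(snap)

    def used_blocks(self):
        return len(self.data)


vol = ToyVolume()
vol.write("a.txt", [b"v1"])
vol.write("a.txt", [b"v2"])
print(vol.used_blocks())   # 1: the v1 block was freed

vol.snapshot()
vol.write("a.txt", [b"v3"])
print(vol.used_blocks())   # 2: the v2 block is pinned by the snapshot
```

In this model the snapshot itself allocates nothing; it only raises reference counts, which is why the old block survives the next write.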
Clones
APFS exposes copy-on-write at the user level via the clonefile(2) system call. When you clone a file, you get two independent files whose extent records reference the same underlying extents (contiguous ranges of blocks on disk). No data is copied, so the storage cost is ≈ 0 (just metadata). The clones diverge only as they are modified: a write to either one allocates new blocks for the changed data and leaves the other untouched.
macOS relies on cloning constantly: cp -c, Time Machine’s local backups, and various system operations all use it.
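As a rough illustration, clonefile(2) can be called from Python through ctypes. This is a minimal sketch under some assumptions: the clone_file wrapper name is hypothetical, the symbol is looked up in the running process via ctypes.CDLL(None), flags are passed as 0, and both paths are assumed to be on the same APFS volume. On other platforms the wrapper simply refuses to run.

```python
import ctypes
import os
import sys


def clone_file(src, dst):
    """Clone src to dst with clonefile(2). macOS only (hypothetical wrapper)."""
    if sys.platform != "darwin":
        raise NotImplementedError("clonefile(2) is only available on macOS")

    # Resolve the symbol from the libraries already loaded into this process.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_uint32]
    libc.clonefile.restype = ctypes.c_int

    # flags=0: default behaviour. The call fails (returns -1, sets errno)
    # if, for example, dst already exists.
    if libc.clonefile(os.fsencode(src), os.fsencode(dst), 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), src)
```

On a Mac, clone_file("big.iso", "big-copy.iso") should return almost immediately regardless of file size, since only metadata is written.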
Why this breaks du
du works by summing st_blocks (the number of 512-byte blocks allocated to a file, as reported by stat(2)) for every file it encounters. It counts hard links only once because they share an inode, but each clone has its own inode and reports its full st_blocks value. When two cloned files share extents, du therefore counts those blocks twice, once per file. The reported total is higher than the physical storage actually consumed.
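The accounting du performs can be approximated in a few lines. The sketch below is not du's actual implementation: du_like_total is a hypothetical name, and it counts regular directory entries only, skipping the directories' own blocks. It does show the key point, which is that the only deduplication available is by inode, so files that share extents but have different inodes are each counted in full.

```python
import os


def du_like_total(root):
    """Sum allocated bytes under root the way du does: st_blocks per inode."""
    seen = set()    # (device, inode) pairs already counted
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue            # hard link to a file already counted
            seen.add(key)
            total += st.st_blocks * 512   # st_blocks is in 512-byte units
    return total
```

Nothing in the stat result says whether a file's blocks are shared with a clone, so a walker like this has no information with which to correct the total.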
The reverse problem with snapshots: blocks held exclusively by a snapshot (the old versions of files you’ve since modified) have no live directory entry. du never sees them. They consume real space that du cannot account for.
This is not a bug — it’s a fundamental limitation of any tool that walks the directory tree to measure storage. The directory tree only shows the live state of files. APFS’s block accounting operates at a layer below the directory tree.
See macOS Disk Reporting for the full analysis.
See also
- APFS - What It Is — overview of APFS design decisions
- APFS - Snapshots — how snapshots pin old blocks
- macOS Disk Reporting — all three failure modes of du on APFS