Apache Iceberg is a table metadata format that provides more flexibility than alternatives by decoupling the data layout on disk from its features. It works in the following way:

  • manifest files keep metadata for a single datafile
  • manifest lists keep track of lists of manifest files
  • metadata files keeps track of all the snapshots of the table, by having an array of manifests lists per each snapshot

When reading or writing, the query engine asks a catalog what is the current version of the table, and use the metadata file at the right version as an entrypoint.

Tip

For example in Hive, partitions match folders, making it impossible to change partitioning without rewriting entire tables

The following material provide deeper informations about Iceberg:

Additionally, you might want to deep dive in how to use Iceberg with popular platforms:

Reference Material

Apache Iceberg The Definitive Guide