Apache Iceberg expose metadata as traditional SQL tables, making easier to investigate issues, proactively identified improvements, etc. The important tables are:
- History
- Metadata log entries
- Snapshots
- Files
- Manifest
Note
This is a significant improvement compared to formats such as Hive, you’d have to depend on a particular engine implementing one-off commands, such as
SHOW PARTITIONS
History table
The history metadata table records the table’s evolution. Each row is composed of four fields:
made_current_atrepresents the exact timestamp when the corresponding snapshot was made the current snapshot.snapshot_idfield serves as a unique identifier for each snapshot.parent_idthe unique ID of the parent snapshot of the current snapshot.is_current_ancestor: indicates whether a snapshot is an ancestor of the table’s current snapshot. It differentiates snapshots that are part of the table’s present state lineage and those that have been invalidated from table rollbacks. The the history metadata table is useful for data recovery and version control as well as to identify table rollbacks.
-- Spark SQL
SELECT * FROM my_catalog.table.history;
-- Dremio
SELECT * FROM TABLE(table_history('catalog.table'))
-- Trino
SELECT * FROM "table$history"Metadata log entries
The metadata_log_entries metadata table keeps track of the evolution of the table by logging the metadata files generated during table updates:
- The
timestampfield records the exact date and time when the metadata was updated. - the
filefield indicates the location of the datafile that corresponds to that - The
latest_snapshot_idfield provides the identifier of the most recent snapshot at the time of the metadata update. - the
latest_schema_idfield contains the ID of the schema being used when the metadata log entry was created. - the
latest_sequence_numberfield signifies the order of the metadata updates. It’s an incrementing count that helps track the sequence of metadata changes over time.
Example query: find the latest snapshot per each schema
WITH Ranked_Entries AS (
SELECT
latest_snapshot_id,
latest_schema_id,
timestamp,
ROW_NUMBER() OVER(PARTITION BY latest_schema_id ORDER BY timestamp
DESC) as row_num
FROM
catalog.table.metadata_log_entries
WHERE
latest_schema_id IS NOT NULL
)
SELECT
latest_snapshot_id,
latest_schema_id,
timestamp AS latest_timestamp
FROM
Ranked_Entries
WHERE
row_num = 1
ORDER BY
latest_schema_id DESC;Snapshot
The snapshots metadata table is essential for tracking dataset versions and histories:
- the
committed_atfield signifies the precise timestamp when the snapshot was created - the
snapshot_idfield is a unique identifier for each snapshot. - the
operationfield lists a string of the types of operations that occurred, such as APPEND and OVERWRITE. - The
parent_idfield links to the snapshot ID of the snapshot’s parent - the
manifest_listfield offers detailed insights into the files comprising the snapshot. - the
summaryfield holds metrics about the snapshot, such as the number of added or deleted files, number of records, and other statistical data
Example query: Summarize operations by day
SELECT
operation,
COUNT(*) AS operation_count,
DATE(committed_at) AS date
FROM
catalog.table.snapshots
GROUP BY
operation,
DATE(committed_at)
ORDER BY
date;Snapshots committed_at vs history made_current_at
When new data is added or new files are added to the table, a snapshot is committed and immediately made the current state of the table. However, there are scenarios where the two diverge:
- Rollbacks would modify the history table
made_current_at, which wouldn’t correspond anymore to anycommitted_atof a snapshot - Concurrent operations: one snapshot can be committed without ever making it current
Files
The files metadata table showcases the current datafiles within a table:
content, represents the type of content in the file, with a 0 signifying a datafile, 1 a position delete file, and 2 an equality delete file.file_pathgives the exact location of each file.file_formatfield indicates the format of the datafile (Parquet, Avro, ORC)- the
spec_idfield corresponds to the partition spec ID that the file adheres to - the
partitionfield is a struct that describes the partition key value (i.e.{1999-01-01, 01}) - the
record_countfield reports the number of rNaNecords contained within each file, giving a measure of the file’s data volume. - the
file_size_in_bytesfield provides the total size of the file in bytes whilecolumn_sizesfurnishes the sizes of the individual columns. - the
value_counts,null_value_counts, andnan_value_countsfields provide the count of non-null, null, and NaN (Not a Number) values, respectively, in each column. lower_boundsandupper_boundsfields hold the minimum and maximum values in each column, providing essential insights into the data range within each file.- The
key_metadatafield contains implementation-specific metadata, if any exists. - The
split_offsetsfield provides the offsets at which the file is split into smaller segments for parallel processing. - The
equality_idsandsort_order_idfields correspond to the IDs relating to equality delete files, if any exist, and the IDs of the table’s sort order, if it has one.
Note
column_sizes,value_counts,null_value_counts,nan_value_counts,lower_boundsandupper_boundscolumns are all encoded as struct where the key represents the column index
Example query: Identifying candidate for compaction
SELECT
partition,
COUNT(*) AS num_files,
AVG(file_size_in_bytes) AS avg_file_size
FROM
catalog.table.files
GROUP BY
partition
ORDER BY
num_files DESC,
avg_file_size ASCExample query: Identifying partitions with data quality problems
SELECT
partition, file_path
FROM
catalog.table.files
WHERE
null_value_counts['3'] > 0
GROUP BY
partitionExample query: Listing files from previous snapshots
SELECT file_path, file_size_in_bytes
FROM catalog.table.files
VERSION AS OF <snapshot_id>;Manifests
The manifests metadata table details each of the table’s current manifest files.
- The
pathfield provides the filepath where the manifest is stored - The
lengthfield shows the size of the manifest file. - The
partition_spec_idfield indicates the specification ID of the partition that the manifest file is associated with - The
added_snapshot_idfield provides the ID of the snapshot that added this manifest file added_data_files_count,existing_data_files_count, anddeleted_data_files_countare self explanatory- the
partition_summariesfield is an array of field_summary structs that summarize partition-level statistics: contains_null, contains_nan, lower_bound, and upper_bound.
Warning
contains_nan could return null when the information isn’t available from the file’s metadata, which usually occurs when reading from a V1 Table
Tip
Statistics by default are collected on 32 columns
Example query: Searching manifests file shorter than average
WITH avg_length AS (
SELECT AVG(length) as average_manifest_length
FROM catalog.table.manifests
)
SELECT
path,
length
FROM
catalog.table.manifests
WHERE
length < (SELECT average_manifest_length FROM avg_length);Example query: Count of many files are added per snapshot
SELECT
added_snapshot_id,
SUM(added_data_files_count) AS total_added_data_files
FROM
catalog.table.manifests
GROUP BY
added_snapshot_id