Introduction
The Apache Iceberg SparkSessionExtension plays a critical role when using Iceberg with Spark:
- it provides necessary code to hook read and write operations to Apache Iceberg Catalogs
- it extends SQL DDL with statements for modifying partitions, equality identifiers for merge-on-read, write ordering, and write distribution mode
To use Iceberg catalogs, we need to start Spark with the iceberg-spark-runtime JAR as a dependency and set spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
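The configuration described above can be sketched in PySpark as follows. This is a minimal sketch under assumptions, not a definitive setup: it assumes the catalog named glue used in the examples in this guide is backed by AWS Glue with data in Amazon S3; the helper names iceberg_spark_conf and build_session and the warehouse path are hypothetical; and the runtime JAR coordinates are left as an optional parameter because they depend on your Spark and Iceberg versions.

```python
from typing import Dict, Optional

ICEBERG_EXTENSIONS = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"


def iceberg_spark_conf(catalog: str, warehouse: str,
                       runtime_package: Optional[str] = None) -> Dict[str, str]:
    """Return Spark properties that enable Iceberg for one catalog.

    Assumption: the catalog is AWS Glue backed and stores data in S3.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    conf = {
        # Registers the Iceberg SQL extensions (DDL additions, CALL procedures).
        "spark.sql.extensions": ICEBERG_EXTENSIONS,
        # Registers the named catalog as an Iceberg catalog.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }
    if runtime_package:
        # Maven coordinates of the iceberg-spark-runtime JAR, if not already on the classpath.
        conf["spark.jars.packages"] = runtime_package
    return conf


def build_session(conf: Dict[str, str]):
    """Create a SparkSession with the given properties (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import keeps the helper above dependency-free

    builder = SparkSession.builder.appName("iceberg-example")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a real bucket in place of the placeholder, a session could then be created with build_session(iceberg_spark_conf("glue", "s3://<your-bucket>/warehouse/")).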
Writing data
Insert Overwrite
To replace the data in an Iceberg table or partition with the result of a query, INSERT OVERWRITE is used. Apache Spark provides two overwrite modes for this operation:
- static (the default)
- dynamic
Static overwrite mode
In static overwrite mode, Spark converts the PARTITION clause into a filter (predicate) for determining which partitions to overwrite. If you run the query without the PARTITION clause, it will replace all partitions.
spark.sql("""
INSERT OVERWRITE glue.test.employees
PARTITION (region = 'EMEA')
SELECT *
FROM employee_source
""")
Tip
This mode cannot replace hidden partitions because the PARTITION clause can only reference table columns.
Dynamic overwrite mode
To configure dynamic overwrite mode, set the Spark configuration property spark.sql.sources.partitionOverwriteMode=dynamic. In this mode, any partitions that correspond to rows returned by the SELECT query are replaced.
In the following example, the query filters the employee_source table for EMEA rows only, so only the corresponding EMEA partition of the employee table is overwritten.
spark.sql("""
INSERT OVERWRITE glue.test.employee
SELECT * FROM employee_source
WHERE region = 'EMEA'
""")
Note
Dynamic overwrite mode is generally recommended when writing to Iceberg tables because it provides granular control over which partitions get overwritten based on the query’s outcome.
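As a sketch of how the mode can be switched on from code, the helper below sets the property on an existing session before issuing the statement. The function names are hypothetical, and the spark argument is assumed to be a SparkSession already configured for Iceberg. The second function uses the DataFrameWriterV2 overwritePartitions() call, which also replaces only the partitions present in the written data.

```python
DYNAMIC_MODE_KEY = "spark.sql.sources.partitionOverwriteMode"


def overwrite_matching_partitions(spark, target: str, source: str, where: str) -> str:
    """Run INSERT OVERWRITE in dynamic mode and return the SQL that was executed.

    Only partitions that receive rows from the filtered SELECT are replaced.
    """
    # Session-level setting; it stays in effect for later statements too.
    spark.conf.set(DYNAMIC_MODE_KEY, "dynamic")
    statement = (
        f"INSERT OVERWRITE {target}\n"
        f"SELECT * FROM {source}\n"
        f"WHERE {where}"
    )
    spark.sql(statement)
    return statement


def overwrite_matching_partitions_df(spark, target: str, source: str, where: str) -> None:
    """DataFrame API variant of the same dynamic partition overwrite."""
    spark.table(source).where(where).writeTo(target).overwritePartitions()
```

For the example above, the call would be overwrite_matching_partitions(spark, "glue.test.employee", "employee_source", "region = 'EMEA'").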
Delete from
Apache Iceberg supports two types of deletions depending on the filter condition specified:
- If the filter condition matches entire partitions of a table, a metadata-only delete is performed. This is a highly efficient operation because no data files are touched.
- If the filter condition matches only specific rows within a table, Iceberg rewrites the affected data files.
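The two cases can be illustrated with a pair of statements. This is only a sketch: it assumes the glue.test.employee table from the earlier examples is partitioned by region, and the employee_id column and its value are hypothetical.

```python
def delete_sql(table: str, condition: str) -> str:
    """Build a DELETE FROM statement for an Iceberg table."""
    return f"DELETE FROM {table} WHERE {condition}"


# Matches a whole partition (assuming partitioning by region):
# Iceberg can drop the partition's files in metadata only.
partition_delete = delete_sql("glue.test.employee", "region = 'EMEA'")

# Matches individual rows (hypothetical column): the data files that
# contain those rows are affected.
row_delete = delete_sql("glue.test.employee", "employee_id = 1234")
```

Each statement would be run with spark.sql(...), for example spark.sql(partition_delete).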
Table maintenance operations
The extension enables stored procedures in the catalog_name.system namespace, which we can invoke with CALL:
- expire_snapshots(table, older_than, retain_last), see Deletion of old data
- rewrite_data_files(table, strategy, sort_order, options), see Compaction strategies
- rewrite_manifests(table), see Rewriting manifests
- remove_orphan_files(table, older_than, dry_run), see Deletion of old data
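As a sketch of what invoking these procedures looks like, the snippet below renders CALL statements with named arguments. The call_procedure helper is hypothetical, the catalog name glue and the table test.employee follow the earlier examples, and the timestamp, retention count, and sort order are illustrative values only. Note that Iceberg's documented name for the snapshot retention argument is retain_last.

```python
def call_procedure(catalog: str, procedure: str, **named_args: str) -> str:
    """Render CALL <catalog>.system.<procedure>(name => value, ...).

    Values are inserted verbatim, so pass them as SQL literals
    (for example "'test.employee'" including the quotes).
    """
    args = ", ".join(f"{name} => {value}" for name, value in named_args.items())
    return f"CALL {catalog}.system.{procedure}({args})"


statements = [
    # Expire snapshots older than a cutoff while keeping the most recent five.
    call_procedure("glue", "expire_snapshots",
                   table="'test.employee'",
                   older_than="TIMESTAMP '2024-01-01 00:00:00.000'",
                   retain_last="5"),
    # Compact data files, sorting rows by region while rewriting.
    call_procedure("glue", "rewrite_data_files",
                   table="'test.employee'",
                   strategy="'sort'",
                   sort_order="'region ASC NULLS LAST'"),
    # Rewrite manifests to speed up scan planning.
    call_procedure("glue", "rewrite_manifests", table="'test.employee'"),
    # List orphan files without deleting them.
    call_procedure("glue", "remove_orphan_files",
                   table="'test.employee'",
                   dry_run="true"),
]
```

Each entry can be passed to spark.sql(...) on a session that has the Iceberg extension enabled; running remove_orphan_files with dry_run first is a cautious way to review what would be deleted.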