Introduction

The Apache Iceberg SparkSessionExtension plays a critical role when using Iceberg with Spark:

In order to use Iceberg Catalogs, we need to to start Spark passing the iceberg-spark-runtime JAR as a dependency and configuring spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Writing data

Insert Overwrite

To replace the data in an Iceberg table or partition with the result of a query, INSERT OVERWRITE is used. Apache Spark provides two overwrite modes for this operation:

  • static
  • dynamic (by default, the mode is static)

Static overwrite mode

In static overwrite mode, Spark converts the PARTITION clause into a filter (predicate) for determining which partitions to overwrite. If you run the query without the PARTITION clause, it will replace all partitions.

spark.sql("""
	INSERT OVERWRITE glue.test.employees
	PARTITION (region = 'EMEA')
	SELECT *
	FROM employee_source
	"""
)

Tip

This mode cannot replace hidden partitions because the PARTITION clause can only reference table columns.

Dynamic overwrite mode

To configure dynamic overwrite mode, set the Spark config property, spark.sql.sources.partitionOverwriteMode=dynamic. In this mode, any partitions that correspond to rows returned by the SELECT query are replaced: spai

Since we filter the employee_source table with only the EMEA region data, only the corresponding EMEA partition will be overwritten in the employee table.

spark.sql("""
	INSERT OVERWRITE glue.test.employee
	SELECT * FROM employee_source
	WHERE region = 'EMEA'
""")

Note

Dynamic overwrite mode is generally recommended when writing to Iceberg tables because it provides granular control over which partitions get overwritten based on the query’s outcome.

Delete from

Apache Iceberg supports two types of deletions depending on the filter condition specified:

  • If the filter condition matches entire partitions of a table, a metadata-only delete is performed. This is a highly efficient operation as no datafiles are touched.
  • If the delete condition matches specific rows within a table, Iceberg will rewrite the affected datafiles.

Table maintenance operations

The extension enable functions on the catalog_name.system we can invoke: