Apache Iceberg Best Practices in AWS

  • Use Iceberg format version 2.
  • Use the AWS Glue Data Catalog as your data catalog.
  • Use the AWS Glue Data Catalog as lock manager.
  • Use Zstandard (ZSTD) compression: set write.<file_type>.compression-codec to zstd in your Iceberg table properties (see the sketch below). ZSTD strikes a balance between GZIP and Snappy, offering good read/write performance without compromising compression ratio.
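
A minimal sketch, assuming Spark SQL with a Glue-backed Iceberg catalog named glue_catalog (the table and column names are illustrative), that applies format version 2 and ZSTD compression for Parquet data files at creation time:

    CREATE TABLE glue_catalog.db.table (
        id BIGINT,            -- illustrative columns
        created_at TIMESTAMP
    )
    USING iceberg
    TBLPROPERTIES (
        'format-version' = '2',
        'write.parquet.compression-codec' = 'zstd'
    );

For an existing table, the same properties can be applied with ALTER TABLE ... SET TBLPROPERTIES.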

Optimizing Storage

  • Enable S3 Intelligent-Tiering
  • Archive or delete historic snapshots

    • Delete old snapshots: expire_snapshots
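
      A minimal sketch (Spark SQL; the cutoff timestamp and retain_last value are illustrative) that expires snapshots older than the cutoff while keeping the most recent ten:

      CALL glue_catalog.system.expire_snapshots(
          table => 'db.table',
          older_than => TIMESTAMP '2024-01-01 00:00:00',
          retain_last => 10
      )
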
    • Set retention policies for specific snapshots: use Iceberg historical tags to mark the snapshots you want to keep and define a retention period for them.

      ALTER TABLE glue_catalog.db.table
      CREATE TAG `EOM-01` AS OF VERSION 30 RETAIN 365 DAYS
      

    • Archive old snapshots: S3 object tags + S3 Lifecycle rules

      # Tag every object the catalog writes with my_key1=my_val1
      spark.sql.catalog.my_catalog.s3.write.tags.my_key1=my_val1
      # Keep objects in S3 when Iceberg "deletes" them (e.g. during snapshot expiration)
      spark.sql.catalog.my_catalog.s3.delete-enabled=false
      # Instead, tag the would-be-deleted objects so an S3 Lifecycle rule can archive them
      spark.sql.catalog.my_catalog.s3.delete.tags.my_key=to_archive
      # Also tag written objects with their table and namespace names
      spark.sql.catalog.my_catalog.s3.write.table-tag-enabled=true
      spark.sql.catalog.my_catalog.s3.write.namespace-tag-enabled=true
      
    • Delete orphan files: remove_orphan_files (see the sketch after the VACUUM example below)
    • VACUUM statement (Athena): equivalent to running expire_snapshots + remove_orphan_files in Spark.
      VACUUM glue_catalog.db.table
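
      For the remove_orphan_files procedure mentioned above, a minimal sketch (Spark SQL; the cutoff timestamp is illustrative, and the procedure defaults to files older than three days when older_than is omitted):

      CALL glue_catalog.system.remove_orphan_files(
          table => 'db.table',
          older_than => TIMESTAMP '2024-01-01 00:00:00'
      )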
      

    -- Athena table properties that control VACUUM behavior (the values shown are the defaults)
    CREATE TABLE your_table (
        ...
    )
    TBLPROPERTIES (
        'vacuum_max_snapshot_age_seconds' = '432000', -- 5 days
        'vacuum_min_snapshots_to_keep' = '1',
        'vacuum_max_metadata_files_to_keep' = '100'
    );
    
