Apache Iceberg Best Practices in AWS

  • Use Iceberg format version 2.
  • Use the AWS Glue Data Catalog as your data catalog.
  • Use the AWS Glue Data Catalog as lock manager.
  • Use Zstandard (ZSTD) compression: set write.<file_type>.compression-codec to zstd in your Iceberg table properties (see the sketch below). ZSTD strikes a balance between GZIP and Snappy, offering good read/write performance without compromising compression ratio.
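
A minimal sketch, assuming Spark SQL with a Glue-backed Iceberg catalog named glue_catalog (the table and column names are illustrative), that applies format version 2 and ZSTD compression for Parquet data files at creation time:

    CREATE TABLE glue_catalog.db.table (
        id BIGINT,            -- illustrative columns
        created_at TIMESTAMP
    )
    USING iceberg
    TBLPROPERTIES (
        'format-version' = '2',
        'write.parquet.compression-codec' = 'zstd'
    );

For an existing table, the same properties can be applied with ALTER TABLE ... SET TBLPROPERTIES.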

Optimizing Storage

  • Enable S3 Intelligent-Tiering
  • Archive or delete historic snapshots

    • Delete old snapshots: expire_snapshots
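
      A minimal sketch (Spark SQL; the cutoff timestamp and retain_last value are illustrative) that expires snapshots older than the cutoff while keeping the most recent ten:

      CALL glue_catalog.system.expire_snapshots(
          table => 'db.table',
          older_than => TIMESTAMP '2024-01-01 00:00:00',
          retain_last => 10
      )
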
    • Set retention policies for specific snapshots: use Iceberg historical tags to mark the snapshots you want to keep and define a retention period for them.

      ALTER TABLE glue_catalog.db.table
      CREATE TAG `EOM-01` AS OF VERSION 30 RETAIN 365 DAYS
      

    • Archive old snapshots: S3 object tags + S3 Lifecycle rules

      # Tag every object the catalog writes with my_key1=my_val1
      spark.sql.catalog.my_catalog.s3.write.tags.my_key1=my_val1
      # Keep objects in S3 when Iceberg "deletes" them (e.g. during snapshot expiration)
      spark.sql.catalog.my_catalog.s3.delete-enabled=false
      # Instead, tag the would-be-deleted objects so an S3 Lifecycle rule can archive them
      spark.sql.catalog.my_catalog.s3.delete.tags.my_key=to_archive
      # Also tag written objects with their table and namespace names
      spark.sql.catalog.my_catalog.s3.write.table-tag-enabled=true
      spark.sql.catalog.my_catalog.s3.write.namespace-tag-enabled=true
      
    • Delete orphan files: remove_orphan_files (see the sketch after the VACUUM example below)
    • VACUUM statement (Athena): equivalent to running expire_snapshots + remove_orphan_files in Spark.
      VACUUM glue_catalog.db.table
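
      For the remove_orphan_files procedure mentioned above, a minimal sketch (Spark SQL; the cutoff timestamp is illustrative, and the procedure defaults to files older than three days when older_than is omitted):

      CALL glue_catalog.system.remove_orphan_files(
          table => 'db.table',
          older_than => TIMESTAMP '2024-01-01 00:00:00'
      )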
      

    -- Athena table properties that control VACUUM behavior (the values shown are the defaults)
    CREATE TABLE your_table (
        ...
    )
    TBLPROPERTIES (
        'vacuum_max_snapshot_age_seconds' = '432000', -- 5 days
        'vacuum_min_snapshots_to_keep' = '1',
        'vacuum_max_metadata_files_to_keep' = '100'
    );
    
