Iceberg Resources¶

Turbocharging Efficiency & Slashing Costs: Mastering Spark & Iceberg Joins with Storage-Partitioned ¶

Optimizing data pipelines has become synonymous with cost savings in cloud computing.
Apache Spark and new table formats like Iceberg, Delta, and Hudi are improving how large datasets are managed.
Expedia Group™ uses Spark and Iceberg to improve data processing workflows; storage-partitioned join (SPJ) is a feature that promises to greatly improve performance.
Apache Iceberg is a high-performance table format for large analytics datasets that overcomes the limitations of Hive tables by providing features such as ACID transactions, schema evolution, partition evolution, and hidden partitioning.
In distributed systems, non-broadcast joins are expensive operations due to data shuffling between nodes.
Data shuffling involves significant network I/O and can drastically impact performance.
Understanding Spark execution plans is essential for grasping performance.
Sort-merge join is the default strategy for handling non-broadcasted joins in Spark.
Iceberg sets write.spark.fanout.enabled to false by default, which forces a local sort before writing the data.
Shuffle-hash join improves performance over sort-merge join by eliminating the CPU-intensive sorting steps.
Storage-partitioned join (SPJ) removes 2 exchanges and 2 sorts.
The storage-partitioned join is a new approach built on the concept of bucketed joins.
Triggering SPJ requires the partition schema of the two joined Iceberg tables to be exactly the same and specific configurations to be applied.
Scenarios tested are based on real-world data, with data size referring to deserialized data into memory.
In Scenario 1, using a storage-partitioned join reduces the duration to just six minutes, with a cost of only $6.7, compared to 11 minutes and $12.2 for a sort-merge join.
By combining three use cases, Expedia anticipates saving $5,000 for their next data pipeline.
One of the most significant benefactors of the storage-partitioned join (SPJ) is the merge statement.
SPJ is not applicable for non-equi-joins.
Storage-partitioned joins, similar to the bucketed join, require both tables to follow the same partitioning schema, which can be challenging.

Iceberg Resources¶

Turbocharging Efficiency & Slashing Costs: Mastering Spark & Iceberg Joins with Storage-Partitioned¶

Turbocharging Efficiency & Slashing Costs: Mastering Spark & Iceberg Joins with Storage-Partitioned ¶