Iceberg Resources¶
Turbocharging Efficiency & Slashing Costs: Mastering Spark & Iceberg Joins with Storage-Partitioned¶
- Optimizing data pipelines has become synonymous with cost savings in cloud computing.
- Apache Spark and new table formats like Iceberg, Delta, and Hudi are improving how large datasets are managed.
- Expedia Group™ uses Spark and Iceberg to improve data processing workflows; storage-partitioned join (SPJ) is a feature that promises to greatly improve performance.
- Apache Iceberg is a high-performance table format for large analytics datasets that overcomes the limitations of Hive tables by providing features such as ACID transactions, schema evolution, partition evolution, and hidden partitioning.
- In distributed systems, non-broadcast joins are expensive operations due to data shuffling between nodes.
- Data shuffling involves significant network I/O and can drastically impact performance.
- Understanding Spark execution plans is essential for grasping performance.
- Sort-merge join is the default strategy for handling non-broadcasted joins in Spark.
- Iceberg sets write.spark.fanout.enabled to false by default, which forces a local sort before writing the data.
- Shuffle-hash join improves performance over sort-merge join by eliminating the CPU-intensive sorting steps.
- Storage-partitioned join (SPJ) removes 2 exchanges and 2 sorts.
- The storage-partitioned join is a new approach built on the concept of bucketed joins.
- Triggering SPJ requires the partition schema of the two joined Iceberg tables to be exactly the same and specific configurations to be applied.
- Scenarios tested are based on real-world data, with data size referring to deserialized data into memory.
- In Scenario 1, using a storage-partitioned join reduces the duration to just six minutes, with a cost of only $6.7, compared to 11 minutes and $12.2 for a sort-merge join.
- By combining three use cases, Expedia anticipates saving $5,000 for their next data pipeline.
- One of the most significant benefactors of the storage-partitioned join (SPJ) is the merge statement.
- SPJ is not applicable for non-equi-joins.
- Storage-partitioned joins, similar to the bucketed join, require both tables to follow the same partitioning schema, which can be challenging.