Data

What Netflix Shows Iceberg Still Hasn't Solved at Exabyte Scale

TLDR

After reading this article, you will learn:

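  • Why a company like Netflix, which has adopted Iceberg at large scale, still has to build new systems outside Iceberg in several scenarios
  • How Netflix's trade-offs across three use cases (Trino, ClickHouse, and LanceDB) reveal the scenarios where Iceberg still falls short
  • Two angles from which a typical data platform team can think its problems through before adopting Iceberg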

Inferring Iceberg's Design Challenges from Three OLAP Products

TLDR

After reading this article, you will learn:

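  • Why low-latency OLAP products such as ClickHouse, Firebolt, and StarTree, when integrating Iceberg, all end up adding their own caches, building their own readers, and distributing metadata planning at the same layers, and how these convergent choices point to gaps in the Iceberg spec itself
  • What Iceberg's three current structural gaps are
  • How the Iceberg community is making real progress on v4 metadata, the File Format API, and first-class indexes, gradually closing the gaps in the spec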

Efficient Column Update and Column Families: Iceberg's Response to AI/ML Wide Tables

TLDR

After reading this article, you will learn:

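  • Why Iceberg's existing row-level update mechanism becomes very expensive for the wide tables produced by AI/ML feature engineering
  • What Flip the Axis and Column Families, both under discussion in Iceberg, are, and why this direction covers two applications at once: column-level updates and layout flexibility
  • Why this direction is worth looking forward to, and what its competition with next-generation table formats (such as LanceDB) means for Iceberg's position going forward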

Re-thinking Iceberg Metadata Structure in v4

TLDR

After reading this article, you will learn:

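  • Why the three-tier metadata structure that originally supported PB-scale data falls short in newer scenarios such as small commits, streaming, and wide tables
  • What the Adaptive Metadata Tree proposed for Iceberg v4 is, and why the Iceberg community considers it a direction worth pursuing
  • What deserves credit in the v4 direction, where the concerns lie, and why the Iceberg community's continued evolution is still worth looking forward to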

Lessons from Slack: Operating Iceberg at 180PB Scale

TLDR

After reading this article, you will learn:

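  • How Slack sustains a 99.9% Iceberg maintenance success rate on a 180PB Data Lakehouse that ingests more than 300TB of data per day
  • Slack's thought process in developing IceChipper: 4 kinds of data, 5 design principles, and 3 rejected alternatives
  • The 3 major pain points Slack ran into while maintaining 4,000 Iceberg tables, and how they addressed them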

5 Practical Ways to Speed Up Your Apache Spark Queries

TLDR

After reading this article, you will learn how to:

  • Apply filters before joins to reduce data shuffling
  • Avoid premature collect() actions that cause memory bottlenecks
  • Replace UDFs with built-in functions for better performance
  • Optimize duplicate removal using efficient methods
  • Implement broadcast joins for small table operations

The Lakehouse Series: Apache Iceberg Overview

TLDR

After reading this article, you will learn:

  • Apache Iceberg's 3-tier metadata architecture (metadata files, manifest lists, and manifest files)
  • How Iceberg catalogs work, including the REST catalog standard for multi-engine compatibility
  • Query capabilities including time travel, incremental reads, and metadata queries
  • Spark procedures for snapshot management, metadata maintenance, and table migration

The Lakehouse Series: Apache Hudi Overview

TLDR

After reading this article, you will learn:

  • How Apache Hudi's timeline-based architecture tracks all table changes and enables time travel queries
  • The difference between Copy-on-Write (COW) and Merge-on-Read (MOR) storage types for different workload patterns
  • How Hudi organizes its data in a structured way with table services
  • The various query types Hudi supports, including snapshot, incremental, and read-optimized queries

The Lakehouse Series: From Data Lakes to Data Lakehouses

TLDR

After reading this article, you will learn:

  • What limitations traditional data lakes face
  • How data lakehouses merge the flexibility of data lakes with the structured management of data warehouses
  • What enterprise-grade capabilities define lakehouse architecture
  • What the major open-source lakehouse formats are