KC's Data & Life Notes

Data

Jun 16, 2026
in Data
9 min read

Iceberg Table Maintenance at Scale: Lessons from 6 Big Companies

After reading this article, you will be able to answer...

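How do the Iceberg Table Maintenance Services (TMS) that six companies each built compare architecturally?
Which common patterns did these independently developed systems converge on?
If you designed a TMS from scratch to scale across many teams and large workloads, which building blocks would be indispensable?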

Jun 15, 2026
in Data
10 min read

What Netflix Reveals About Iceberg's Unsolved Problems at Exabyte Scale

After reading this article, you will be able to answer...

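Why did Netflix still need to bring in so many additional systems after fully adopting Iceberg?
Which Iceberg shortcomings do the four use cases (table maintenance, Trino, ClickHouse, and LanceDB) reflect?
Which questions should a data platform team think through before adopting Iceberg?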

Jun 6, 2026
in Data
6 min read

Inferring Iceberg's Design Challenges from Three OLAP Products

After reading this article, you will be able to answer...

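Why did ClickHouse, Firebolt, and StarTree independently make similar choices when integrating Iceberg?
Which structural shortcomings of the Iceberg spec do these choices reflect?
How is the Iceberg community addressing these shortcomings?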

Jun 3, 2026
in Data
6 min read

Efficient Column Update and Column Families: Iceberg's Response to AI/ML Wide Tables

After reading this article, you will be able to answer...

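Why do Iceberg's existing row-level updates become expensive in wide-table scenarios?
What problems do Flip the Axis and Column Families aim to solve, and how?
What does this direction mean for the competition between Iceberg and newer table formats such as LanceDB?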

Jun 2, 2026
in Data
5 min read

Re-thinking Iceberg Metadata Structure in v4

After reading this article, you will be able to answer...

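Why is the three-tier metadata structure, in use for years, starting to fall short in new scenarios?
What does the Adaptive Metadata Tree proposed for Iceberg v4 aim to solve, and how?
What is promising about this direction, and where do the concerns lie?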

May 31, 2026
in Data
7 min read

Lessons from Slack: Operating Iceberg at 180PB Scale

After reading this article, you will be able to answer...

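At 180PB, with 300TB flowing in every day, how does Slack achieve a 99.9% Iceberg maintenance success rate?
Why is IceChipper designed the way it is? Which alternatives were rejected, and why?
What pitfalls do you run into when maintaining 4,000 Iceberg tables?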

Jul 15, 2025
in Data
11 min read

The Lakehouse Series: Apache Iceberg Overview

After reading this article, you will be able to answer...

How does Iceberg's 3-tier metadata architecture (metadata files, manifest lists, manifest files) work together?
What role do catalogs play in Iceberg, and why does the REST catalog standard matter for multi-engine compatibility?
What query capabilities does Iceberg unlock — time travel, incremental reads, metadata queries — and when would you use each?

Jul 8, 2025
in Data
8 min read

The Lakehouse Series: Apache Hudi Overview

After reading this article, you will be able to answer...

How does Hudi's timeline-based architecture track table changes and enable time travel?
When should you choose Copy-on-Write (COW) vs. Merge-on-Read (MOR), and what are the trade-offs?
What query types does Hudi support — snapshot, incremental, read-optimized — and how do they differ?

Jul 1, 2025
in Data
4 min read

The Lakehouse Series: From Data Lakes to Data Lakehouses

After reading this article, you will be able to answer...

What limitations do traditional data lakes face, and why aren't they enough?
How do data lakehouses merge the flexibility of data lakes with the structured management of data warehouses?
What are the major open-source lakehouse formats, and what enterprise-grade capabilities define the architecture?

Jun 2, 2025
in Data
8 min read

The Lakehouse Series: OLTP vs. OLAP (A Parquet Primer)

After reading this article, you will be able to answer...

What are the key differences between OLTP and OLAP workloads, and why does storage format matter?
How does Parquet organize data internally and optimize storage using techniques like dictionary encoding and RLE?
Where does Parquet fall short in today's data landscape?