2025

The Lakehouse Series: Apache Iceberg Overview

TLDR

After reading this article, you will learn:

  • Apache Iceberg's 3-tier metadata architecture (metadata files, manifest lists, and manifest files)
  • How Iceberg catalogs work, including the REST catalog standard for multi-engine compatibility
  • Query capabilities including time travel, incremental reads, and metadata queries
  • Spark procedures for snapshot management, metadata maintenance, and table migration

The Lakehouse Series: Apache Hudi Overview

TLDR

After reading this article, you will learn:

  • How Apache Hudi's timeline-based architecture tracks all table changes and enables time travel queries
  • The difference between Copy-on-Write (COW) and Merge-on-Read (MOR) storage types for different workload patterns
  • How Hudi organizes its data in a structured way with table services
  • The various query types Hudi supports, including snapshot, incremental, and read-optimized queries

The Lakehouse Series: From Data Lakes to Data Lakehouses

TLDR

After reading this article, you will learn:

  • What limitations traditional data lakes face
  • How data lakehouses merge the flexibility of data lakes with the structured management of data warehouses
  • What enterprise-grade capabilities define lakehouse architecture
  • What the major open-source lakehouse formats are

The Lakehouse Series: OLTP vs. OLAP (A Parquet Primer)

TLDR

After reading this article, you will learn:

  • The key differences between OLTP and OLAP workloads, and why storage format matters
  • How Parquet organizes data internally and reduces storage size using techniques like dictionary encoding and run-length encoding (RLE)
  • Where Parquet falls short in today's data landscape

How to Use MkDocs to Integrate GitHub Actions and Git Submodule for Cross-repo Documentation

TLDR

After reading this article, you will learn how to:

  • Use Git Submodule to centrally manage documentation sources across multiple projects
  • Configure GitHub Actions for cross-project automation and integration workflows
  • Utilize Reusable Workflows to share CI/CD scripts and reduce maintenance costs
  • Leverage MkDocs Monorepo Plugin to merge documentation from multiple projects into a single website