Blog
5 Practical Ways to Speed Up Your Apache Spark Queries
TLDR
After reading this article, you will learn how to:
- Apply filters before joins to reduce data shuffling
- Avoid premature `collect()` actions that cause memory bottlenecks
- Replace UDFs with built-in functions for better performance
- Optimize duplicate removal using efficient methods
- Implement broadcast joins for small table operations
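The sketch below, with made-up table and column names, puts the first and last tips together in PySpark: filter the large table before joining, and broadcast the small one so the join skips a full shuffle.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-query-tuning").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
orders = spark.read.parquet("s3://bucket/orders")    # large
regions = spark.read.parquet("s3://bucket/regions")  # small

# Filter *before* the join so far less data is shuffled across the cluster.
recent = orders.filter(F.col("order_date") >= "2024-01-01")

# Broadcast the small table: each executor joins locally, avoiding a shuffle.
joined = recent.join(F.broadcast(regions), on="region_id", how="inner")

# Prefer built-in functions over Python UDFs, and dropDuplicates() for dedup.
result = joined.withColumn("amount_usd", F.round(F.col("amount"), 2)) \
               .dropDuplicates(["order_id"])
```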
The Lakehouse Series: Apache Iceberg Overview
TLDR
After reading this article, you will learn:
- Apache Iceberg's 3-tier metadata architecture (metadata files, manifest lists, and manifest files)
- How Iceberg catalogs work, including the REST catalog standard for multi-engine compatibility
- Query capabilities including time travel, incremental reads, and metadata queries
- Spark procedures for snapshot management, metadata maintenance, and table migration
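As a taste of those query capabilities, here is a minimal PySpark sketch; it assumes a Spark session already configured with an Iceberg catalog named `demo`, and the database and table names are illustrative.

```python
# Assumes `spark` is a SparkSession with an Iceberg catalog named "demo".

# Time travel: query the table as of a past snapshot ID or timestamp.
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 4358109269873137077")
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-06-01 00:00:00'")

# Metadata query: list snapshots without scanning any data files.
spark.sql(
    "SELECT committed_at, snapshot_id, operation FROM demo.db.events.snapshots"
).show()

# Spark procedure: expire old snapshots to keep the metadata tree compact.
spark.sql(
    "CALL demo.system.expire_snapshots("
    "table => 'db.events', older_than => TIMESTAMP '2024-06-01 00:00:00')"
)
```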
The Lakehouse Series: Apache Hudi Overview
TLDR
After reading this article, you will learn:
- How Apache Hudi's timeline-based architecture tracks all table changes and enables time travel queries
- The difference between Copy-on-Write (COW) and Merge-on-Read (MOR) storage types for different workload patterns
- How Hudi organizes its data and maintains it with table services
- The various query types Hudi supports, including snapshot, incremental, and read-optimized queries
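A minimal PySpark sketch of those ideas, assuming an existing SparkSession `spark` and a DataFrame `df` of trip records (the table name, keys, and paths are illustrative):

```python
# Write a Copy-on-Write Hudi table; switch the table type to MERGE_ON_READ
# for write-heavy workloads that can defer merging to compaction.
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}
df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/trips")

# Incremental query: read only the commits added to the timeline after a
# given instant, instead of re-scanning the whole table.
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("s3://bucket/trips")
)
```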
The Lakehouse Series: From Data Lakes to Data Lakehouses
TLDR
After reading this article, you will learn:
- What limitations traditional data lakes face
- How data lakehouses merge the flexibility of data lakes with the structured management of data warehouses
- What enterprise-grade capabilities define lakehouse architecture
- What the major open-source lakehouse formats are
The Lakehouse Series: OLTP vs. OLAP (A Parquet Primer)
TLDR
After reading this article, you will learn:
- The key differences between OLTP and OLAP workloads, and why storage format matters
- How Parquet organizes data internally and optimizes data storage using compression techniques like dictionary encoding and RLE
- Where Parquet falls short in today's data landscape
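To see that internal layout for yourself, here is a small pyarrow sketch (file and column names are made up): a low-cardinality column is a natural fit for dictionary encoding plus RLE, and the file metadata exposes the row-group and column-chunk structure.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A column with few distinct values compresses well via dictionary encoding + RLE.
table = pa.table({
    "country": ["US", "US", "DE", "US", "DE"] * 200_000,
    "amount": list(range(1_000_000)),
})

pq.write_table(table, "sales.parquet", use_dictionary=True, compression="snappy")

# Inspect the column-oriented layout: row groups, per-column encodings, sizes.
meta = pq.ParquetFile("sales.parquet").metadata
print(meta)                          # row groups, row count, schema
print(meta.row_group(0).column(0))   # encodings and compressed size for "country"
```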
How to Use MkDocs to Integrate GitHub Actions and Git Submodule for Cross-repo Documentation
TLDR
After reading this article, you will learn how to:
- Use Git Submodule to centrally manage documentation sources across multiple projects
- Configure GitHub Actions for cross-project automation and integration workflows
- Use Reusable Workflows to share CI/CD scripts across repositories and reduce maintenance costs
- Leverage MkDocs Monorepo Plugin to merge documentation from multiple projects into a single website
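For a flavor of the end result, here is a sketch of the root `mkdocs.yml` using the Monorepo Plugin's `!include` directive; the project names and paths are illustrative, and each included sub-project keeps its own `mkdocs.yml`, pulled into this repo as a Git submodule.

```yaml
# Root mkdocs.yml (illustrative names and paths)
site_name: Central Docs
plugins:
  - monorepo
nav:
  - Home: index.md
  # Each !include pulls in a sub-project's own mkdocs.yml; the sub-projects
  # live in this repo as Git submodules under ./projects/.
  - Project A: '!include ./projects/project-a/mkdocs.yml'
  - Project B: '!include ./projects/project-b/mkdocs.yml'
```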