
Retail Lakehouse with Debezium, Kafka, Iceberg, and Trino

💡 Highlights

Debezium (MySQL CDC)

  • Implemented real-time, event-driven data pipelines by capturing change data capture (CDC) events from MySQL with Debezium and streaming them into Kafka for downstream analytics (see the connector sketch after this list).
  • Designed a non-intrusive CDC architecture that reads the MySQL binary log (binlog), requiring no changes to the source system while providing reliable, fault-tolerant delivery through Kafka Connect.
  • Improved system resilience and observability through Debezium's offset tracking and recovery features, enabling resumable pipelines and reliable data integration across distributed environments.
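
Below is a minimal, illustrative sketch of how the Debezium MySQL source connector can be declared as a Strimzi KafkaConnector resource, assuming a Kafka Connect cluster named my-connect-cluster (with the Debezium plugin installed) already exists. The connector name, credentials, database name, and topic prefix are placeholders, not the project's actual values; see kafka-debezium-mysql-connector for the real configuration.

cat <<'EOF' | kubectl apply -n kafka -f -
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: mysql-source
  labels:
    strimzi.io/cluster: my-connect-cluster   # must match the KafkaConnect cluster name
spec:
  class: io.debezium.connector.mysql.MySqlConnector
  tasksMax: 1
  config:
    database.hostname: mysql
    database.port: 3306
    database.user: debezium
    database.password: dbz                   # placeholder; use a Kubernetes Secret in practice
    database.server.id: 184054               # unique ID Debezium uses when reading the binlog
    topic.prefix: retail                     # prefix for the change-event topics
    database.include.list: retail            # capture only the retail database
    schema.history.internal.kafka.bootstrap.servers: my-cluster-kafka-bootstrap:9092
    schema.history.internal.kafka.topic: schema-changes.retail
EOF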

Kafka Cluster on Kubernetes (Strimzi)

  • Provisioned a fault-tolerant Kafka cluster on Kubernetes using the Strimzi Operator, enabling declarative configuration and seamless lifecycle management (see the sketch after this list).
  • Enabled KRaft (Kafka Raft metadata mode) with dual-role nodes, removing the dependency on ZooKeeper and simplifying the cluster architecture.
  • Designed for high availability by replicating Kafka topics and internal state across multiple brokers using replication factors and in-sync replicas (ISR).
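
The sketch below mirrors the Strimzi KRaft quickstart: a KafkaNodePool whose nodes take both the controller and broker roles, plus a Kafka resource with KRaft and node pools enabled. The cluster name, node pool name, namespace, storage size, and replication settings are illustrative; see kafka-cluster for the actual manifests.

kubectl create namespace kafka
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka

cat <<'EOF' | kubectl apply -n kafka -f -
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: dual-role
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 3
  roles:                             # each node acts as both KRaft controller and broker
    - controller
    - broker
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 10Gi
        deleteClaim: false
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled        # KRaft metadata mode, no ZooKeeper
spec:
  kafka:
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      default.replication.factor: 3  # replicate topic data across all three brokers
      min.insync.replicas: 2         # tolerate one broker failure without losing acknowledged writes
  entityOperator:
    topicOperator: {}
    userOperator: {}
EOF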

Iceberg Kafka Sink Connector

  • Ensured centralized commit coordination for Apache Iceberg via the Kafka Sink Connector, enabling consistent and atomic writes across distributed systems (see the connector sketch after this list).
  • Achieved exactly-once delivery semantics between Kafka and Iceberg tables, minimizing data duplication and ensuring data integrity.
  • Utilized the DebeziumTransform SMT to adapt Debezium CDC messages for compatibility with Iceberg's CDC feature, supporting real-time change propagation.
  • Enabled automatic table creation and schema evolution, simplifying integration and reducing operational overhead when ingesting data into Iceberg tables.
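
A hedged sketch of the Iceberg sink, again declared as a Strimzi KafkaConnector. The property names follow the Tabular iceberg-kafka-connect distribution and may differ in other versions of the connector; the topic, table, bucket, and control-topic names are placeholders. See kafka-iceberg-connector for the configuration actually used here.

cat <<'EOF' | kubectl apply -n kafka -f -
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: iceberg-sink
  labels:
    strimzi.io/cluster: my-connect-cluster
spec:
  class: io.tabular.iceberg.connect.IcebergSinkConnector
  tasksMax: 2
  config:
    topics: retail.retail.orders                      # Debezium change-event topic (placeholder)
    # reshape Debezium envelopes into the flat format Iceberg's CDC handling expects
    transforms: debezium
    transforms.debezium.type: io.tabular.iceberg.connect.transforms.DebeziumTransform
    # Glue as the Iceberg catalog, S3 as the warehouse
    iceberg.catalog.catalog-impl: org.apache.iceberg.aws.glue.GlueCatalog
    iceberg.catalog.warehouse: s3://my-lakehouse-bucket/warehouse
    iceberg.catalog.io-impl: org.apache.iceberg.aws.s3.S3FileIO
    # commits are coordinated centrally through a control topic for exactly-once writes
    iceberg.control.topic: control-iceberg
    iceberg.control.commit.interval-ms: 60000
    # target table, auto-creation, and schema evolution
    iceberg.tables: retail.orders
    iceberg.tables.auto-create-enabled: true
    iceberg.tables.evolve-schema-enabled: true
    iceberg.tables.cdc-field: _cdc.op                 # op field added by DebeziumTransform (assumed)
EOF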

Apache Iceberg Lakehouse

  • Adopted Apache Iceberg to bring ACID-compliant transactions and schema evolution to the data lake architecture.
  • Managed Iceberg tables using AWS Glue Data Catalog as the catalog layer and Amazon S3 as the storage layer.
  • Enabled data debugging and auditability through Iceberg's time travel and snapshot rollback features (see the query examples after this list).
  • Implemented branching and tagging to support write-audit-publish (WAP) workflows: isolated writes, data validation, and safe promotion in production.
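
The queries below illustrate time travel and snapshot rollback through the Trino CLI, assuming an iceberg catalog with a retail schema and an orders table; the names and the snapshot id are placeholders.

trino --server http://localhost:8080 --execute "
  -- list the snapshots Iceberg has recorded for the table
  SELECT snapshot_id, committed_at, operation
  FROM iceberg.retail.\"orders\$snapshots\"
  ORDER BY committed_at DESC"

trino --server http://localhost:8080 --execute "
  -- time travel: read the table as of a past point in time
  SELECT count(*)
  FROM iceberg.retail.orders
  FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'"

trino --server http://localhost:8080 --execute "
  -- roll the table back to a known-good snapshot (id is illustrative)
  CALL iceberg.system.rollback_to_snapshot('retail', 'orders', 8954597067493422955)"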

Trino Federated Query Engine

  • Integrated Trino to enable federated SQL queries across Apache Iceberg (on S3) and external systems such as BigQuery, improving analytical agility (configuration and query sketches follow this list).
  • Simplified data access across multiple data sources without data duplication, enabling ad-hoc analytics and reporting from a unified SQL interface.
  • Integrated Google OAuth 2.0 with Trino to enable token-based authentication, improving platform auditability and user accountability.
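
The snippets below sketch the Trino configuration this setup implies: an Iceberg catalog backed by AWS Glue, a BigQuery catalog, and Google as the OAuth 2.0 provider. File paths, the AWS region, the GCP project, and the example tables are placeholders, and the TLS keystore settings required for OAuth are omitted; see trino for the real configuration.

cat > etc/catalog/iceberg.properties <<'EOF'
connector.name=iceberg
iceberg.catalog.type=glue
hive.metastore.glue.region=us-east-1
EOF

cat > etc/catalog/bigquery.properties <<'EOF'
connector.name=bigquery
bigquery.project-id=my-gcp-project
EOF

cat >> etc/config.properties <<'EOF'
# Google as the OAuth 2.0 identity provider for token-based authentication
http-server.https.enabled=true
http-server.authentication.type=oauth2
http-server.authentication.oauth2.issuer=https://accounts.google.com
http-server.authentication.oauth2.client-id=<google-oauth-client-id>
http-server.authentication.oauth2.client-secret=<google-oauth-client-secret>
EOF

# federated join across the Iceberg lakehouse and BigQuery from one SQL interface
trino --execute "
  SELECT o.order_id, o.total, c.segment
  FROM iceberg.retail.orders AS o
  JOIN bigquery.marketing.customer_segments AS c
    ON o.customer_id = c.customer_id
  LIMIT 10"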

🏗️ Architecture

Architecture Overview (diagram)

Architecture Overview: Observability Engineering (diagram)

🗂️ What's Inside?

First, clone the repository:

mkdir -p ~/Projects
cd ~/Projects
git clone git@github.com:kuanchoulai10/retail-lakehouse.git

The project structure looks like this:

.
├── kafka-cluster                    # Strimzi-managed Kafka cluster (KRaft)
├── mysql                            # source MySQL database
├── kafka-debezium-mysql-connector   # Debezium CDC source connector
├── kafka-iceberg-connector          # Iceberg sink connector
├── trino                            # Trino cluster
├── spark                            # Spark cluster (WIP)
└── prometheus-grafana               # Prometheus and Grafana monitoring (WIP)

📑 Deployment Steps

The basic deployment path includes the following steps (a command-level sketch follows the list):

  • Deploy a Kafka Cluster via the Strimzi Operator
  • Deploy a MySQL Database
  • Deploy a Debezium Kafka Source Connector
  • Deploy an Iceberg Kafka Sink Connector
  • Deploy a Trino Cluster
  • Deploy a Spark Cluster (WIP)
  • Deploy Prometheus and Grafana (WIP)
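
As a rough end-to-end sketch, and assuming each directory ships Kubernetes manifests that can be applied directly (check each subdirectory for the exact commands), the path above boils down to:

kubectl create namespace kafka

kubectl apply -f kafka-cluster/                      # Strimzi-managed Kafka cluster (KRaft)
kubectl apply -f mysql/                              # source MySQL database
kubectl apply -f kafka-debezium-mysql-connector/     # Debezium CDC source connector
kubectl apply -f kafka-iceberg-connector/            # Iceberg sink connector
kubectl apply -f trino/                              # Trino cluster
# spark/ and prometheus-grafana/ are still work in progress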