# Retail Lakehouse with Debezium, Kafka, Iceberg, and Trino
## 💡 Highlights
### Debezium (Change Data Capture)
- Implemented real-time, event-driven data pipelines by capturing MySQL change events with Debezium change data capture (CDC) and streaming them into Kafka, enabling downstream analytics.
- Designed a non-intrusive CDC architecture leveraging database transaction logs (the MySQL binlog, or the PostgreSQL WAL), requiring no changes to source systems while providing reliable, at-least-once delivery and high fault tolerance via Kafka Connect.
- Improved system resilience and observability through Debezium's offset tracking and recovery features, enabling resumable pipelines and reliable data integration across distributed environments (a minimal connector-registration sketch follows this list).
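As a rough illustration, a connector like this can be registered through the Kafka Connect REST API. This is a minimal sketch, not the project's actual configuration: the hostnames, credentials, database and table names are placeholders, and the property names follow the Debezium 2.x MySQL connector documentation.

```python
# Hypothetical sketch: registering a Debezium MySQL source connector with a
# Kafka Connect worker via its REST API. All hostnames, credentials, and
# table names below are placeholders.
import requests

connector = {
    "name": "retail-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "retail-db.example.rds.amazonaws.com",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": "184054",   # must be unique within the MySQL cluster
        "topic.prefix": "retail",          # topics become retail.<db>.<table>
        "database.include.list": "shop",
        "table.include.list": "shop.orders,shop.order_items",
        # Debezium reads the binlog directly, so no triggers or schema
        # changes are required on the source database.
        "schema.history.internal.kafka.bootstrap.servers": "retail-cluster-kafka-bootstrap:9092",
        "schema.history.internal.kafka.topic": "schema-history.retail",
    },
}

resp = requests.post("http://connect:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())
```

Because connector offsets and schema history live in Kafka topics, a restarted worker resumes from the last committed binlog position instead of re-snapshotting the source database.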
### Kafka on Kubernetes (Strimzi)
- Provisioned a fault-tolerant Kafka cluster on Kubernetes using the Strimzi Operator, enabling declarative configuration and seamless lifecycle management.
- Enabled KRaft (Kafka Raft metadata mode) with dual-role nodes that act as both brokers and controllers, removing the dependency on ZooKeeper and simplifying the cluster architecture.
- Designed for high availability by replicating Kafka topics and internal state across multiple brokers, combining a replication factor greater than one with in-sync replicas (ISR); a provisioning sketch follows this list.
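The sketch below shows one way such a cluster might be declared from Python with the official `kubernetes` client. It assumes a Strimzi Operator version with KRaft and node-pool support is already installed; the namespace, cluster name, Kafka version, storage size, and replica counts are all placeholder assumptions, so treat this as a shape, not a verified manifest.

```python
# Hypothetical sketch: applying Strimzi Kafka/KafkaNodePool custom resources
# to provision a KRaft-mode cluster. Names, sizes, and versions are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
crd = client.CustomObjectsApi()

node_pool = {
    "apiVersion": "kafka.strimzi.io/v1beta2",
    "kind": "KafkaNodePool",
    "metadata": {"name": "dual-role", "namespace": "kafka",
                 "labels": {"strimzi.io/cluster": "retail-cluster"}},
    "spec": {
        "replicas": 3,                      # three nodes: KRaft quorum plus broker HA
        "roles": ["controller", "broker"],  # dual-role nodes, no ZooKeeper
        "storage": {"type": "persistent-claim", "size": "100Gi"},
    },
}

kafka = {
    "apiVersion": "kafka.strimzi.io/v1beta2",
    "kind": "Kafka",
    "metadata": {"name": "retail-cluster", "namespace": "kafka",
                 "annotations": {"strimzi.io/kraft": "enabled",
                                 "strimzi.io/node-pools": "enabled"}},
    "spec": {
        "kafka": {
            "version": "3.7.0",  # placeholder; match your Strimzi support matrix
            "listeners": [{"name": "plain", "port": 9092,
                           "type": "internal", "tls": False}],
            "config": {
                "default.replication.factor": 3,  # replicate topics across brokers
                "min.insync.replicas": 2,         # survive a single broker failure
            },
        },
        "entityOperator": {"topicOperator": {}, "userOperator": {}},
    },
}

crd.create_namespaced_custom_object("kafka.strimzi.io", "v1beta2", "kafka",
                                    "kafkanodepools", node_pool)
crd.create_namespaced_custom_object("kafka.strimzi.io", "v1beta2", "kafka",
                                    "kafkas", kafka)
```

With `default.replication.factor: 3` and `min.insync.replicas: 2`, producers using `acks=all` keep writing through the loss of any single broker.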
### Iceberg Kafka Sink Connector
- Ensured centralized commit coordination for Apache Iceberg via the Kafka Sink Connector, enabling consistent and atomic writes across distributed systems.
- Achieved exactly-once delivery semantics between Kafka and Iceberg tables, minimizing data duplication and ensuring data integrity.
- Utilized the `DebeziumTransform` SMT to adapt Debezium CDC messages for compatibility with Iceberg's CDC feature, supporting real-time change propagation.
- Enabled automatic table creation and schema evolution, simplifying integration and reducing operational overhead when ingesting data into Iceberg tables (see the configuration sketch below).
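A sink configuration along these lines would wire the pieces together. This is a hedged sketch: the property names follow the `iceberg-kafka-connect` project documentation as of this writing and should be checked against your connector version, and the catalog, bucket, topic, and table names are placeholders.

```python
# Hypothetical sketch: registering the Iceberg sink connector with the
# DebeziumTransform SMT. Topic, table, and bucket names are placeholders.
import requests

connector = {
    "name": "retail-iceberg-sink",
    "config": {
        "connector.class": "io.tabular.iceberg.connect.IcebergSinkConnector",
        "topics": "retail.shop.orders,retail.shop.order_items",
        "iceberg.tables": "lakehouse.orders,lakehouse.order_items",
        # Adapt Debezium change envelopes to the shape Iceberg CDC expects.
        "transforms": "debezium",
        "transforms.debezium.type": "io.tabular.iceberg.connect.transforms.DebeziumTransform",
        "iceberg.tables.cdc-field": "_cdc.op",
        # Create tables on first write and evolve schemas with upstream changes.
        "iceberg.tables.auto-create-enabled": "true",
        "iceberg.tables.evolve-schema-enabled": "true",
        # AWS Glue as the catalog layer, S3 as the warehouse storage layer.
        "iceberg.catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "iceberg.catalog.warehouse": "s3://retail-lakehouse/warehouse",
        "iceberg.catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # Worker tasks coordinate through a control topic so that files from
        # all tasks land in one atomic Iceberg commit.
        "iceberg.control.topic": "control-iceberg",
    },
}

requests.post("http://connect:8083/connectors", json=connector, timeout=30).raise_for_status()
```

The control-topic coordination is what delivers the exactly-once behavior described above: a single coordinator commits on behalf of all tasks, so partial writes never become visible to readers.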
### Apache Iceberg
- Adopted Apache Iceberg to bring ACID-compliant transactions and schema evolution to the data lake architecture.
- Managed Iceberg tables using AWS Glue Data Catalog as the catalog layer and Amazon S3 as the storage layer.
- Enabled data debugging and auditability through Iceberg's time travel and snapshot rollback features.
- Implemented branching and tagging for Write-Audit-Publish (WAP) workflows, supporting isolated writes, data validation, and safe promotion in production (see the sketch below).
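For a flavor of the time-travel and rollback workflow, the sketch below runs Trino Iceberg SQL through the `trino` Python client. Host, catalog, schema, and table names are placeholders, and the snapshot ID is invented; branch creation for WAP is typically managed through Spark's Iceberg SQL extensions and is omitted here.

```python
# Hypothetical sketch: inspecting snapshots, time traveling, and rolling back
# an Iceberg table via Trino. All identifiers below are placeholders.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=443,
                           user="analyst", http_scheme="https",
                           catalog="iceberg", schema="lakehouse")
cur = conn.cursor()

# List the table's snapshot history from the $snapshots metadata table.
cur.execute('SELECT snapshot_id, committed_at FROM "orders$snapshots" ORDER BY committed_at')
for snapshot_id, committed_at in cur.fetchall():
    print(snapshot_id, committed_at)

# Time travel: re-run a metric against an earlier snapshot to debug a report.
cur.execute("SELECT count(*) FROM orders FOR VERSION AS OF 4748840959787780545")
print(cur.fetchone())

# Roll the table back to a known-good snapshot if a bad write slipped through.
cur.execute("CALL iceberg.system.rollback_to_snapshot('lakehouse', 'orders', 4748840959787780545)")
```

Because rollback only rewrites table metadata to point at an earlier snapshot, it is near-instant regardless of table size.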
### Trino
- Integrated Trino to enable federated SQL queries across Apache Iceberg (S3) and external systems like BigQuery, improving analytical agility.
- Simplified data access across multiple data sources without data duplication, enabling ad-hoc analytics and reporting from a unified SQL interface.
- Integrated Google OAuth 2.0 with Trino to enable token-based authentication, improving platform auditability and user accountability (a federated-query sketch using this auth flow follows below).
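The sketch below shows roughly what such a federated query looks like from the `trino` Python client, authenticated via its built-in OAuth 2.0 flow. The catalog, schema, table, and column names are placeholder assumptions (in particular, the join keys between GA4 events and orders would depend on the actual identity model).

```python
# Hypothetical sketch: a federated join between an Iceberg table on S3 and a
# GA4 export in BigQuery, with OAuth 2.0 authentication. All identifiers are
# placeholders.
import trino
from trino.auth import OAuth2Authentication

conn = trino.dbapi.connect(
    host="trino.example.com", port=443, http_scheme="https",
    auth=OAuth2Authentication(),  # opens a browser login and caches the token
)
cur = conn.cursor()

# Join order facts (Iceberg on S3) with behavioral events (BigQuery) without
# copying data between clouds.
cur.execute("""
    SELECT e.campaign_id,
           count(DISTINCT e.user_pseudo_id) AS visitors,
           count(DISTINCT o.order_id)       AS orders,
           sum(o.order_total)               AS revenue
    FROM bigquery.analytics.events e
    LEFT JOIN iceberg.lakehouse.orders o
           ON o.user_id = e.user_id
          AND o.order_date = e.event_date
    GROUP BY e.campaign_id
""")
for row in cur.fetchall():
    print(row)
```

Since every query runs under the caller's own OAuth identity, access can be audited per user rather than through a shared service account.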
## Solution Architecture Overview
### Background
You are a solution architect at a fast-growing e-commerce company that processes hundreds of thousands of transactions and user interactions daily.
The current data setup is as follows:
- Order and transaction data is stored in AWS RDS (MySQL).
- User behavioral data—such as product views, cart additions, and ad clicks—is collected via Google Analytics 4 (GA4) and streamed directly into BigQuery.
- The team is familiar with both AWS and Google Cloud, and Amazon S3 is used as an intermediate storage layer for certain analytics workflows.
### Legacy Architecture
- Daily batch ETL jobs export data from RDS to feed internal reporting tools (e.g., Tableau or Google Looker Studio).
- GA4 user events are streamed into BigQuery in real time and queried for behavioral analytics.
- There is a clear data silo between MySQL and BigQuery, making it difficult to join and analyze user behavior alongside transactional data.
- Due to the large volume of data, the team prefers not to move data unnecessarily between systems to avoid latency and cost.
### Problems and Challenges
- Reporting latency is too high due to daily batch processing. Marketing teams cannot react in real time to campaign performance.
- There is no unified way to analyze both behavioral and transactional data together, which limits insight into user journeys and conversion paths.
- Data duplication or migration across platforms is undesirable due to storage and operational costs.
- There is no centralized authentication or audit logging mechanism to track query access and ensure data governance across multiple clouds.
### Business Requirements
- Marketing teams need real-time reports that combine website behavior and order data to quickly evaluate and adjust campaign performance.
- Product managers want to track funnel metrics like impressions, clicks, and conversions without waiting for daily batch reports.
- Data analysts need access to versioned, clean, and reliable data to support accurate ad-hoc analysis and troubleshoot issues when metrics look off.
- IT and data governance teams require secure, auditable access to data, with clear control over who can query what.