August 2025¶
Highlight of the Month¶
Summarize my biggest breakthrough, project, or insight this month:
This month I focused mainly on two things. First, I dove deep into Apache Iceberg performance tuning, learning how to optimize both write and read operations, manage table compaction, and implement effective partitioning strategies. This knowledge is crucial for maintaining efficient data lakes and ensuring high query performance. Second, I dedicated time to preparing for technical interviews, specifically for data lakehouse engineer roles.
What I Built, Published, or Experimented with¶
- Published Best Practices for Optimizing Apache Iceberg Workloads in AWS
- Published Deep Dive into Kafka Connect Iceberg Sink Connector
- Published Exactly Once Semantics in Kafka
- Published What's New in Apache Airflow 3
- Published 5 Practical Ways to Speed Up Your Apache Spark Queries
- Experimented with Pyrefly
- Experimented with colima for replacing Docker Desktop on Mac (and it was great!)
- Experimented with Claude Code
What I Learned¶
Short reflections on what I actually learned or became more confident in:
- Learned performance tuning techniques for maintaining Iceberg tables, including write optimization, read optimization, compaction, and partitioning strategies.
Reflections – Beyond Just Tech¶
Soft-skill insights or workflow/communication/process reflections:
After interviewing several times for data engineering roles and finally getting a job offer, I realized that:
- Architecture diagrams are super important for communicating my design ideas effectively, and they catch interviewers' attention right away.
- Preparing some materials after the first-round interview, based on the interviewer's introduction to the team and company, is very helpful for the follow-up interviews, as it shows my enthusiasm and interest in the role and company.
- Side projects are definitely a plus, as they demonstrate my passion and commitment to learning and growing in the field.
What I Consumed¶
A list of articles, papers, courses, or videos I read/watched/completed:
Performance Tuning & Optimization¶
- Spark Performance Tuning | Spark Docs
- Ch7 Optimizing and Tuning Spark Applications | Learning Spark
- Ch4 Optimizing the Performance of Apache Iceberg | The Definitive Guide of Apache Iceberg
- Best practices for optimizing Apache Iceberg workloads | AWS Docs
Real-time Data Processing (RisingWave, Debezium, Flink CDC)¶
SQLMesh¶
- SQLMesh sets a new precedent with support for multi-engine projects
- Multi-Repo guide
- Virtual Data Environments
- SQLMesh and virtual data environments
- Environments
Airflow¶
- Apache Airflow® 3 is Generally Available!
- Apache Airflow 3.0 Is Here: The Most Significant Release Yet
Iceberg & Hive¶
Goals for Next Month¶
Set 2–3 simple goals to stay focused and accountable:
- Explore observability and SRE practices, getting hands-on with OpenTelemetry, Prometheus, and Grafana
- Get more familiar with Trino performance tuning