June 2025¶
Highlight of the Month¶
Summarize my biggest breakthrough, project, or insight from this month:
This month I focused mainly on my side projects: Retail Lakehouse with Debezium, Kafka, Iceberg, and Trino, and A Unified SQL-based Data Pipeline. Beyond implementing them, I also watched videos and read articles about best practices for deploying, managing, and maintaining Trino, Iceberg, and Kafka clusters.
What I Created or Tried¶
What I built, experimented with, or implemented:
- Built a side project: Retail Lakehouse with Debezium, Kafka, Iceberg, and Trino
- Built a side project: A Unified SQL-based Data Pipeline
- Published a blog post: The Lakehouse Series: OLTP vs. OLAP (A Parquet Primer)
- Published a blog post: How to Use MkDocs to Integrate GitHub Actions and Git Submodule for Cross-repo Documentation
- Experimented with kubectl-ai
- I tried kubectl-ai as an MCP server to test its integration with VS Code Copilot Agent Mode, but nothing special happened.
- Experimented with Speechify
- I really like the ability to listen to articles and papers while doing other tasks. It helps me consume more content without feeling overwhelmed. What I like the most is that they have Snoop Dogg as a voice option, which adds a fun twist to the experience! Could you imagine listening to a data lakehouse article narrated by Snoop Dogg? ☠️
What I Learned¶
Short reflections on what I actually learned or became more confident in:
- Focus on OUTPUT, not INPUT. There is information overload everywhere these days, and it's easy to get lost in the sea of content. Instead of just consuming more, I should focus on creating something meaningful with the knowledge I gain (talk is cheap, show me the code).
Reflections – Beyond Just Tech¶
Soft-skill insights or workflow/communication/process reflections:
- Making small talk with speakers at social gatherings is a great way to build connections and learn from their experiences. In the past I over-focused on the technical side, but now I realize that soft skills and networking are equally important in the tech industry.
- My dad decided to give up active cancer treatment this month and to focus on quality of life instead. After a month of reflection, I realized that there are still many things I want to do and achieve in my life, not only in my career but also personally. Life is too short to focus only on work.
What I Consumed¶
A list of articles, papers, courses, or videos I read/watched/completed:
Read¶
- How to Build a One-of-a-Kind GitHub Profile! Plus Three Cool Designs and Applications 🚀
- Databricks Acquires Tabular, Aiming to Improve Data Compatibility
- Databricks to Acquire Open-Source Cloud Database Startup Neon for $1 Billion
- What Is a Lakebase?
- Openness
- Separation of storage and compute (the most important feature imo)
- Serverless
- Modern development workflow
- Built for AI agents
- Lakehouse integration
- What Is a Lakehouse? | Databricks Blog
- 愛好 AI Engineer Newsletter 🚀 Model Context Protocol (MCP) Application Development #27
- I really liked how the author described two different ways of building agents: one that relies on a customizable framework, and another that's more lightweight and built using just the core features of the programming language. It instantly reminded me of the old debates between TensorFlow 1.0 and PyTorch.
- After reading this article, I realized that the strength of senior engineers lies in their ability to quickly pick up new technologies and analyze different approaches logically with their own keen insights. This is a skill that I aspire to develop.
- How Agoda manages 1.8 trillion Events per day on Kafka
- 2-step logging approach.
- Multiple smaller Kafka clusters instead of 1 Large Kafka cluster per Data Center
- Agoda employs a robust Kafka auditing system by aggregating message counts via background threads in client libraries, routing audits to a dedicated Kafka cluster, and implementing monitoring and alerting mechanisms for audit messages.
- Agoda calculates cluster capacity by comparing each resource’s usage against its upper limit and taking the highest percentage to represent the dominant constraint at that moment.
- Agoda attributes cost back to teams, which transformed team mindsets, driving proactive cost management and accountability across the company.
- The new auth system empowers the Kafka team to control access, manage credentials, and protect sensitive data through fine-grained ACLs.
- Operational scalability is ensured through automated tooling that streamlines and simplifies system management.
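Agoda's capacity heuristic (take each resource's utilization against its limit and let the highest percentage represent the cluster) can be sketched in a few lines of Python. The resource names and numbers below are illustrative, not Agoda's actual values:

```python
def cluster_capacity_pct(usage: dict, limits: dict) -> float:
    """Return cluster capacity usage as the highest per-resource
    utilization percentage, i.e. the dominant constraint right now."""
    return max(usage[r] / limits[r] * 100 for r in limits)

# Illustrative numbers only: CPU at 40%, disk at 75%, network at 60%.
usage = {"cpu_cores": 40, "disk_tb": 7.5, "network_gbps": 6}
limits = {"cpu_cores": 100, "disk_tb": 10, "network_gbps": 10}

print(cluster_capacity_pct(usage, limits))  # disk dominates: 75.0
```

The nice property of this max-over-ratios approach is that it surfaces whichever resource will be exhausted first, even as the bottleneck shifts between CPU, disk, and network over time.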
- Scaling Kafka to Support PayPal’s Data Growth
- Cluster Management: Kafka Config Service, ACLs, PayPal Kafka Libraries, QA Environment
- Monitoring and Alerting
- Configuration Management
- Enhancements and Automation: Patching security vulnerabilities, Security Enhancements, Topic Onboarding, MirrorMaker Onboarding, Repartition Assignment Enhancements
- Getting Started with Pyright: A New Option for Python Type Checking
- DuckLake: SQL as a Lakehouse Format
- It simplifies lakehouses by using a standard SQL database for all metadata while still storing data in open formats like Parquet, just like BigQuery with Spanner and Snowflake with FoundationDB.
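To make the idea concrete, here is a minimal sketch (my own simplification, not DuckLake's actual schema) of keeping lakehouse metadata in an ordinary SQL database while the data itself lives in Parquet files, using SQLite as a stand-in catalog:

```python
import sqlite3

# Hypothetical, heavily simplified metadata catalog.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE snapshot (id INTEGER PRIMARY KEY, created_at TEXT);
CREATE TABLE data_file (
    path TEXT,                                   -- Parquet file in object storage
    snapshot_id INTEGER REFERENCES snapshot(id),
    row_count INTEGER
);
""")
con.execute("INSERT INTO snapshot VALUES (1, '2025-06-01')")
con.execute(
    "INSERT INTO data_file VALUES ('s3://lake/orders/0001.parquet', 1, 1000)"
)

# "Scan planning" becomes a plain SQL query over the metadata tables,
# instead of walking JSON/Avro manifest files in object storage.
files = con.execute(
    "SELECT path FROM data_file WHERE snapshot_id = 1"
).fetchall()
print(files)  # [('s3://lake/orders/0001.parquet',)]
```

Because snapshots and file lists are rows in transactional tables, commits and time travel reduce to ordinary SQL transactions and filters, which is exactly the simplification the DuckLake paper argues for.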
Watched¶
- Spark Structured Streaming with Kafka
- Understand RAFT without breaking your brain
- Data Lake at Wise powered by Trino and Iceberg
- I'm an ex-Google interviewer. You're doing LeetCode wrong.
- Data News: Snowflake/Databricks Announcements, Iceberg V3
- Semantic Layer in Snowflake (Semantic Views) and Databricks (Unity Catalog metric views)
- Snowflake Openflow
- Data News: DuckLake, Confluent’s TableFlow, New Book!
- Personally, I think DuckLake is a game-changer for lakehouses, but the speaker didn't think so.
- Perceived benefits of DuckLake's approach:
- Eliminates separate catalog abstraction
- Offloads scan planning
- Easier to get started
- Concerns and skepticism about DuckLake:
- Reintroducing database overhead
- Scaling concerns:
- Shared resources for scan planning
- Scalability of the central database
- Limits innovation and discovery
- Unclear details on scan planning
- Introducing Pyrefly: A new type checker and IDE experience for Python
- Why build Event-Driven AI systems?
- Why MCP really is a big deal.
- MCP offers pluggable, discoverable, and composable solutions that simplify complex integrations.
- Why Everyone’s Talking About MCP?
- It addresses the integration problem between \(M\) AI vendors and \(N\) tools: without a shared protocol, every pairing needs its own integration, an \(M \times N\) complexity problem. MCP reduces it to \(M + N\), since each side only implements the protocol once.
- Five primitives of MCP: resources, tools, prompts, roots and sampling.
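The \(M \times N\) vs. \(M + N\) point can be made concrete with a toy calculation (the counts below are made up for illustration):

```python
# Illustrative: 5 AI vendors integrating with 20 tools.
m_vendors, n_tools = 5, 20

# Without a shared protocol, each vendor writes a custom adapter per tool.
pairwise = m_vendors * n_tools
# With MCP, each vendor and each tool implements the protocol once.
with_mcp = m_vendors + n_tools

print(pairwise, with_mcp)  # 100 25
```

The gap widens quickly: at 50 vendors and 200 tools, pairwise integrations would number 10,000 versus 250 protocol implementations.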
Completed Courses¶
- GitHub Copilot 進階開發實戰
- Customize chat responses in VS Code
- Instruction files and prompt files are used to customize chat responses in VS Code.
- Prompt engineering for Copilot Chat
- MCP Course
Goals for Next Month¶
Set 2–3 simple goals to stay focused and accountable:
- I've started submitting my resume to companies, so I need to prepare for interviews.
- Publish a series of blog posts about best practices for deploying, managing, and maintaining Trino and Kafka clusters, and how big companies use them in production.
- Focus on verbal output, not just written output.
- Host a series of mock interviews with people in the same community to practice my soft skills and networking abilities.