June 2025¶
Highlight of the Month¶
Summarize my biggest breakthrough, project, or insight this month:
What I Consumed¶
A list of articles, papers, courses, or videos I read/watched/completed:
Read¶
- Databricks to acquire open-source cloud database startup Neon for US$1 billion
- What Is a Lakebase?
- Openness
- Separation of storage and compute (the most important feature imo)
- Serverless
- Modern development workflow
- Built for AI agents
- Lakehouse integration
- An Introduction to the Hudi and Flink Integration
- Building a Real-time Data Lake with Flink CDC
- Evolution to the Data Lakehouse
- What is a data lakehouse? | Databricks Docs
- What Is a Lakehouse? | Databricks Blog
- What’s New in Apache Iceberg Format Version 3?
- Apache Iceberg™ v3: Moving the Ecosystem Towards Unification
- 12-Factor Agents
- Practical Guide for Model Selection for Real‑World Use Cases
- 愛好 AI Engineer Newsletter 🚀 Model Context Protocol (MCP) Application Development #27
- I really liked how the author described two different ways of building agents: one that relies on a customizable framework, and another that's more lightweight and built using just the core features of the programming language. It instantly reminded me of the old debates between TensorFlow 1.0 and PyTorch.
- After reading this article, I realized that the strength of senior engineers lies in their ability to quickly pick up new technologies and analyze different approaches logically with their own keen insights. This is a skill that I aspire to develop.
- Featurestore at Agoda: How We Optimized Dragonfly for High-Performance Caching
- How Agoda manages 1.8 trillion Events per day on Kafka
- 2-step logging approach.
- Multiple smaller Kafka clusters per data center instead of one large cluster
- Agoda employs a robust Kafka auditing system by aggregating message counts via background threads in client libraries, routing audits to a dedicated Kafka cluster, and implementing monitoring and alerting mechanisms for audit messages.
- Agoda calculates cluster capacity by comparing each resource’s usage against its upper limit and taking the highest percentage to represent the dominant constraint at that moment.
- Agoda attributes costs back to teams, which transformed mindsets and drove proactive cost management and accountability across the company.
- The new auth system empowers the Kafka team to control access, manage credentials, and protect sensitive data through fine-grained ACLs
- Operational scalability is ensured through automated tooling that streamlines and simplifies system management.
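The dominant-constraint capacity calculation described above can be sketched in a few lines (the function name and sample numbers are mine, not Agoda's):

```python
def cluster_capacity_pct(usage: dict[str, float], limits: dict[str, float]) -> float:
    # Compare each resource's usage against its upper limit and take the
    # highest percentage as the dominant constraint at that moment.
    return max(usage[r] / limits[r] for r in usage) * 100

usage = {"cpu_pct": 40.0, "disk_gb": 750.0, "net_mbps": 300.0}
limits = {"cpu_pct": 100.0, "disk_gb": 1000.0, "net_mbps": 1000.0}
print(cluster_capacity_pct(usage, limits))  # 75.0 -> disk is the dominant constraint
```

Reporting a single "dominant constraint" number makes it obvious which resource to scale next, even when every other resource is mostly idle.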
- Scaling Kafka to Support PayPal’s Data Growth
- Cluster Management: Kafka Config Service, ACLs, PayPal Kafka Libraries, QA Environment
- Monitoring and Alerting
- Configuration Management
- Enhancements and Automation: Patching security vulnerabilities, Security Enhancements, Topic Onboarding, MirrorMaker Onboarding, Repartition Assignment Enhancements
- Revolutionizing Real-Time Streaming Processing: 4 Trillion Events Daily at LinkedIn
- Getting Started with Pyright: A New Option for Python Type Checking
- Data versioning as your ‘Get out of jail’ card – DVC vs. Git-LFS vs. dolt vs. lakeFS
- Unity Catalog | GitHub
- Exploring the Architecture of Apache Iceberg, Delta Lake, and Apache Hudi
- Hudi vs Iceberg vs Delta Lake: Data Lake Table Formats Compared
- Big Metadata: When Metadata is Big Data
- Vortex: A Stream-oriented Storage Engine For Big Data Analytics
- DuckLake: SQL as a Lakehouse Format
- It simplifies lakehouses by using a standard SQL database for all metadata while still storing data in open formats like Parquet, just like BigQuery with Spanner and Snowflake with FoundationDB.
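A toy illustration of that idea (a plain SQL database as the entire catalog, with data files only referenced by path; this uses sqlite3 as a stand-in and is not DuckLake's actual schema or API):

```python
import sqlite3

# Hypothetical minimal catalog: snapshots plus the data files each one contains.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY, committed_at TEXT);
CREATE TABLE data_files (snapshot_id INTEGER, path TEXT, row_count INTEGER);
""")

# A "commit" is just a transactional insert into the metadata tables.
with con:
    con.execute("INSERT INTO snapshots VALUES (1, '2025-06-01')")
    con.executemany(
        "INSERT INTO data_files VALUES (?, ?, ?)",
        [(1, "s3://bucket/part-000.parquet", 1000),
         (1, "s3://bucket/part-001.parquet", 500)],
    )

# Scan planning becomes one SQL query instead of walking manifest files.
files = [p for (p,) in con.execute(
    "SELECT path FROM data_files WHERE snapshot_id = 1 ORDER BY path"
)]
print(files)
```

The appeal is that ACID commits and snapshot isolation come for free from the metadata database's transactions, while the data itself stays in open Parquet files.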
- GitHub MCP Exploited: Accessing private repositories via MCP
Watched¶
- Data News: Snowflake/Databricks Announcements, Iceberg V3
- Semantic Layer in Snowflake (Semantic Views) and Databricks (Unity Catalog metric views)
- Snowflake Openflow
- Data News: DuckLake, Confluent’s TableFlow, New Book!
- Personally, I think DuckLake is a game-changer for lakehouses, but the speaker didn't agree.
- Perceived benefits of DuckLake's approach:
- Eliminates separate catalog abstraction
- Offloads scan planning
- Easier to get started
- Concerns and skepticism about DuckLake:
- Reintroducing database overhead
- Scaling concerns:
- Shared resources for scan planning
- Scalability of the central database
- Limits innovation and discovery
- Unclear details on scan planning
- How I build Agentic MCP Servers for Claude Code (Prompts CHANGE Everything)
- Apache Iceberg V3 and Beyond
- Apache Iceberg V3 Ahead
- Architecting an Iceberg Lakehouse
- Introducing Pyrefly: A new type checker and IDE experience for Python
- Tampa Bay DE Meetup: The Who, What and Why of Data Lake Table Formats (Iceberg, Hudi, Delta Lake)
- Watch a Complete NOOB Try DuckDB and DuckLake for the first time
- Introducing DuckLake
- Next steps: the ability to import from and export to existing lakehouse formats like Iceberg, and the ability to talk to more databases.
- Why build Event-Driven AI systems?
- Why MCP really is a big deal.
- MCP offers pluggable, discoverable, and composable solutions that simplify complex integrations.
- Why Everyone’s Talking About MCP?
- It addresses the integration problem where \(M\) AI clients each implementing \(N\) tools requires \(M \times N\) bespoke integrations; with MCP, each client and each tool implements the protocol once, reducing this to \(M + N\).
- Five primitives of MCP: resources, tools, prompts, roots and sampling.
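The complexity claim above is just arithmetic; a toy sketch (function names and the sample sizes are mine):

```python
def adapters_without_mcp(m_clients: int, n_tools: int) -> int:
    # Every AI client writes a bespoke adapter for every tool.
    return m_clients * n_tools

def adapters_with_mcp(m_clients: int, n_tools: int) -> int:
    # Each client and each tool implements the protocol exactly once.
    return m_clients + n_tools

print(adapters_without_mcp(10, 50))  # 500 bespoke integrations
print(adapters_with_mcp(10, 50))     # 60 protocol implementations
```

The gap widens as the ecosystem grows, which is why a shared protocol matters more with every new vendor and tool.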
Completed Courses¶
- GitHub Copilot Advanced Development in Practice
- Customize chat responses in VS Code
- Instruction files and Prompt files are used to customize the chat responses in VS Code.
- Prompt engineering for Copilot Chat
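As a concrete example of an instruction file, assuming the repo-wide location `.github/copilot-instructions.md` that VS Code documents (the content below is an invented sample, not from the course):

```markdown
<!-- .github/copilot-instructions.md — applied to every Copilot chat request in this repo -->
- We use Python 3.12 with type hints everywhere; prefer pathlib over os.path.
- Run tests with pytest; every new module needs a matching test file.
```

Prompt files work the other way around: instead of being applied automatically, a `*.prompt.md` file is a reusable prompt you invoke on demand in chat.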
- MCP Course
What I Created or Tried¶
What I built, experimented with, or implemented:
- Built a side project: Retail Lakehouse with Flink, Kafka, Iceberg, and Trino
- Built a side project: A Unified SQL-based Data Pipeline
- Published a blog post: The Lakehouse Series: DuckLake — The Next Big Thing?
- Published a blog post: The Lakehouse Series: Hudi vs Iceberg vs Delta Lake — Format Wars Begin
- Published a blog post: The Lakehouse Series: OLTP vs. OLAP (A Parquet Primer)
- Published a blog post: How to Use MkDocs to Integrate GitHub Actions and Git Submodule for Cross-repo Documentation
- Experimented with Pyrefly
- Experimented with Speechify
- I really like the ability to listen to articles and papers while doing other tasks. It helps me consume more content without feeling overwhelmed. What I like the most is that they have Snoop Dogg as a voice option, which adds a fun twist to the experience! Could you imagine listening to a data lakehouse article narrated by Snoop Dogg? ☠️
- Experimented with DuckDB and DuckLake
- Experimented with FastMCP v2
- Experimented with kubectl-ai
- I've tried kubectl-ai as an MCP server to test the integration with VS Code Copilot Agent Mode, but nothing special happened.
What I Learned¶
Short reflections on what I actually learned or became more confident in:
Reflections – Beyond Just Tech¶
Soft-skill insights or workflow/communication/process reflections:
Goals for Next Month¶
Set 2–3 simple goals to stay focused and accountable:
- How Discord Stores Trillions of Messages
- Neon
- MindsDB
- Data Catalog
- Trino
- Data Lake at Wise powered by Trino and Iceberg
- Running Trino as exabyte-scale data warehouse
- Empowering self-serve data analytics with a text-to-SQL assistant at LinkedIn
- Best practices and insights when migrating to Apache Iceberg for data engineers
- Many clusters and only one gateway - Starburst, Naver, and Bloomberg at Trino Summit 2023
- Visualizing Trino with Superset - Preset at Trino Summit 2023
- Trino workload management - Airbnb at Trino Summit 2023