Blog
5 Practical Ways to Speed Up Your Apache Spark Queries
TLDR
After reading this article, you will learn how to:
- Apply filters before joins to reduce data shuffling
- Avoid premature `collect()` actions that cause memory bottlenecks
- Replace UDFs with built-in functions for better performance
- Optimize duplicate removal using efficient methods
- Implement broadcast joins for small table operations
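The sketch below, with made-up table and column names, puts the first and last tips together in PySpark: filter the large table before joining, and broadcast the small one so the join skips a full shuffle.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-query-tuning").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
orders = spark.read.parquet("s3://bucket/orders")    # large
regions = spark.read.parquet("s3://bucket/regions")  # small

# Filter *before* the join so far less data is shuffled across the cluster.
recent = orders.filter(F.col("order_date") >= "2024-01-01")

# Broadcast the small table: each executor joins locally, avoiding a shuffle.
joined = recent.join(F.broadcast(regions), on="region_id", how="inner")

# Prefer built-in functions over Python UDFs, and dropDuplicates() for dedup.
result = joined.withColumn("amount_usd", F.round(F.col("amount"), 2)) \
               .dropDuplicates(["order_id"])
```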
The Lakehouse Series: Apache Iceberg Overview
TLDR
After reading this article, you will learn:
- Apache Iceberg's 3-tier metadata architecture (metadata files, manifest lists, and manifest files)
- How Iceberg catalogs work, including the REST catalog standard for multi-engine compatibility
- Query capabilities including time travel, incremental reads, and metadata queries
- Spark procedures for snapshot management, metadata maintenance, and table migration
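As a taste of those query capabilities, here is a minimal PySpark sketch; it assumes a Spark session already configured with an Iceberg catalog named `demo`, and the database and table names are illustrative.

```python
# Assumes `spark` is a SparkSession with an Iceberg catalog named "demo".

# Time travel: query the table as of a past snapshot ID or timestamp.
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 4358109269873137077")
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-06-01 00:00:00'")

# Metadata query: list snapshots without scanning any data files.
spark.sql(
    "SELECT committed_at, snapshot_id, operation FROM demo.db.events.snapshots"
).show()

# Spark procedure: expire old snapshots to keep the metadata tree compact.
spark.sql(
    "CALL demo.system.expire_snapshots("
    "table => 'db.events', older_than => TIMESTAMP '2024-06-01 00:00:00')"
)
```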
The Lakehouse Series: Apache Hudi Overview
TLDR
After reading this article, you will learn:
- How Apache Hudi's timeline-based architecture tracks all table changes and enables time travel queries
- The difference between Copy-on-Write (COW) and Merge-on-Read (MOR) storage types for different workload patterns
- How Hudi organizes its data and maintains it with table services
- The various query types Hudi supports, including snapshot, incremental, and read-optimized queries
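A minimal PySpark sketch of those ideas, assuming an existing SparkSession `spark` and a DataFrame `df` of trip records (the table name, keys, and paths are illustrative):

```python
# Write a Copy-on-Write Hudi table; switch the table type to MERGE_ON_READ
# for write-heavy workloads that can defer merging to compaction.
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}
df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/trips")

# Incremental query: read only the commits added to the timeline after a
# given instant, instead of re-scanning the whole table.
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("s3://bucket/trips")
)
```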
The Lakehouse Series: From Data Lakes to Data Lakehouses
TLDR
After reading this article, you will learn:
- What limitations traditional data lakes face
- How data lakehouses merge the flexibility of data lakes with the structured management of data warehouses
- What enterprise-grade capabilities define lakehouse architecture
- What the major open-source lakehouse formats are
The Lakehouse Series: OLTP vs. OLAP (A Parquet Primer)
TLDR
After reading this article, you will learn:
- The key differences between OLTP and OLAP workloads, and why storage format matters
- How Parquet organizes data internally and optimizes data storage using compression techniques like dictionary encoding and RLE
- Where Parquet falls short in today's data landscape
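To see that internal layout for yourself, here is a small pyarrow sketch (file and column names are made up): a low-cardinality column is a natural fit for dictionary encoding plus RLE, and the file metadata exposes the row-group and column-chunk structure.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A column with few distinct values compresses well via dictionary encoding + RLE.
table = pa.table({
    "country": ["US", "US", "DE", "US", "DE"] * 200_000,
    "amount": list(range(1_000_000)),
})

pq.write_table(table, "sales.parquet", use_dictionary=True, compression="snappy")

# Inspect the column-oriented layout: row groups, per-column encodings, sizes.
meta = pq.ParquetFile("sales.parquet").metadata
print(meta)                          # row groups, row count, schema
print(meta.row_group(0).column(0))   # encodings and compressed size for "country"
```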
How to Use MkDocs to Integrate GitHub Actions and Git Submodule for Cross-repo Documentation
TLDR
After reading this article, you will learn how to:
- Use Git Submodule to centrally manage documentation sources across multiple projects
- Configure GitHub Actions for cross-project automation and integration workflows
- Use Reusable Workflows to share CI/CD scripts across repositories and reduce maintenance costs
- Leverage MkDocs Monorepo Plugin to merge documentation from multiple projects into a single website
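For a flavor of the end result, here is a sketch of the root `mkdocs.yml` using the Monorepo Plugin's `!include` directive; the project names and paths are illustrative, and each included sub-project keeps its own `mkdocs.yml`, pulled into this repo as a Git submodule.

```yaml
# Root mkdocs.yml (illustrative names and paths)
site_name: Central Docs
plugins:
  - monorepo
nav:
  - Home: index.md
  # Each !include pulls in a sub-project's own mkdocs.yml; the sub-projects
  # live in this repo as Git submodules under ./projects/.
  - Project A: '!include ./projects/project-a/mkdocs.yml'
  - Project B: '!include ./projects/project-b/mkdocs.yml'
```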