
Ch3 Structured APIs

Figure: When errors are detected using the Structured APIs

At the core of the Spark SQL engine are the Catalyst Optimizer and Project Tungsten. Together, these support the high-level DataFrame and Dataset APIs and SQL queries.

Figure: Spark SQL and its stack
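Both routes land on the same engine: a DataFrame expression and an equivalent SQL string are compiled by Catalyst into the same optimized plan. A minimal sketch, where the SparkSession setup and the "numbers" view name are illustrative, not from the book:

// In Scala
import org.apache.spark.sql.SparkSession

// Illustrative setup; app name and master are assumptions
val spark = SparkSession.builder()
  .appName("CatalystDemo")
  .master("local[*]")
  .getOrCreate()

val df = spark.range(1, 1000).toDF("id")
df.createOrReplaceTempView("numbers")

// DataFrame API version
val viaApi = df.filter(df("id") % 2 === 0)
// SQL version; Catalyst compiles both to the same optimized plan
val viaSql = spark.sql("SELECT * FROM numbers WHERE id % 2 = 0")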

The Catalyst Optimizer takes a computational query and converts it into an execution plan through four transformational phases (each resulting plan can be inspected with explain(true), as sketched below):

  1. Analysis
    • Column and table names are resolved by consulting an internal Catalog, Spark SQL's programmatic interface to metadata such as columns, data types, functions, tables, and databases
  2. Logical Optimization
    • Applying a standard rule-based optimization approach, the Catalyst optimizer first constructs a set of multiple plans and then, using its cost-based optimizer (CBO), assigns a cost to each plan
    • These plans are laid out as operator trees
  3. Physical Planning
    • Spark SQL generates an optimal physical plan for the selected logical plan, using physical operators that match those available in the Spark execution engine
  4. Code Generation
    • The final phase of query optimization involves generating efficient Java bytecode to run on each machine; Project Tungsten's whole-stage code generation collapses the whole query into a single function, eliminating virtual function calls

Figure: A Spark computation's four-phase journey
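To see these phases on a concrete query, ask Spark to print its plans. A minimal sketch, reusing the SparkSession from the earlier snippet; explain(true) prints the parsed, analyzed, and optimized logical plans followed by the physical plan:

// In Scala
import org.apache.spark.sql.functions.col

// Trivial query to run through Catalyst
val demoDF = spark.range(1, 100).filter(col("id") > 10)

// Prints the parsed logical plan, analyzed logical plan,
// optimized logical plan, and the physical plan
demoDF.explain(true)

// On Spark 3.0+, explain("codegen") shows the Java code produced
// by whole-stage code generation (phase 4)
demoDF.explain("codegen")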

A concrete example:

// In Scala
// Users DataFrame read from a Parquet table
val usersDF  = ...
// Events DataFrame read from a Parquet table
val eventsDF = ...
// Join two DataFrames
val joinedDF = usersDF
  .join(eventsDF, usersDF("id") === eventsDF("uid"))
  .filter(eventsDF("date") > "2015-01-01")

Figure: An example of a specific query transformation
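To watch the optimizer work on this query, print its plans. In the optimized logical plan you should typically see the date filter pushed down below the join, toward the events scan (predicate pushdown), which is the transformation the figure illustrates:

// In Scala
// Print the plans Catalyst produces for the join-and-filter query above
joinedDF.explain(true)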