# Enabling Business-Friendly Access and Optimizing Access Control Through Customized dbt Project Configuration

## Problem Statement
By default, dbt creates all datasets in a single GCP project. This behavior does not meet our requirements, for the following reasons:

- Insufficient IAM isolation: managing all environments in one project increases the risk of permission misconfigurations.
- Resource conflicts: other GCP services used alongside BigQuery may conflict across environments when they share the same project.
- Unfriendly table naming for downstream users: final table names include technical prefixes like `fct_` or `dim_`, which may confuse non-technical users such as marketers or editors.

To address this, we customize dbt's behavior to:

- Isolate each environment in its own GCP project.
- Assign datasets by domain in staging and production.
- Simplify final table names for easier consumption.
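One way to realize the per-environment project isolation is to define one BigQuery target per GCP project in `profiles.yml`. A minimal sketch, in which the profile name, project IDs (`acme-stg`, `acme-prod`), and default datasets are placeholders:

```yaml
# profiles.yml -- one output (target) per GCP project; names are hypothetical
my_profile:
  target: stg
  outputs:
    stg:
      type: bigquery
      method: oauth
      project: acme-stg        # staging GCP project
      dataset: news            # default dataset; overridden per domain
      location: US
      threads: 4
    prod:
      type: bigquery
      method: oauth
      project: acme-prod       # production GCP project
      dataset: hidden          # default dataset for upstream models
      location: US
      threads: 4
```

Selecting the environment is then a matter of `dbt run --target stg` or `--target prod`.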
## Data Sink
All raw data ingested through tools such as Airbyte, Fivetran, or custom data pipelines initially lands in the sink project.

Datasets in this project are source-oriented, organized by their origin: for example, `food_ga4`, `app_store`, `google_play`, etc.
This separation provides a clear boundary between raw source data and downstream transformations.
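To keep models pointed at the sink project explicitly, the raw tables can be declared as dbt sources with `database` set to the sink project. A sketch, where the project ID and table name are illustrative:

```yaml
# models/sources.yml -- raw data lives in the sink project (project ID is hypothetical)
version: 2

sources:
  - name: food_ga4
    database: acme-sink      # BigQuery project of the sink environment
    schema: food_ga4         # source-oriented dataset
    tables:
      - name: events
```

Models then reference raw data via `{{ source('food_ga4', 'events') }}` instead of hard-coding project and dataset names.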
## Staging Environment

- The GCP project used is the `stg` project.
- All models are deployed in datasets by domain (e.g., `news`, `food`, `fashion`, etc.).
- Model names retain their original names, such as `stg_*`, `int_*`, `fct_*`, `dim_*`.
- All models are configured to expire after a certain duration to reduce storage costs.
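The staging-only expiration can be expressed with dbt-bigquery's `hours_to_expiration` config, conditioned on `target.name` so production tables are unaffected. A per-model sketch, in which the model name, source, and 72-hour window are assumptions:

```sql
-- models/staging/news/stg_articles.sql (hypothetical model)
{{
    config(
        materialized = 'table',
        -- expire staging tables after 72h (assumed window); leave prod unset
        hours_to_expiration = 72 if target.name == 'stg' else none
    )
}}

select * from {{ source('news_cms', 'articles') }}
```

The same expression can instead be applied once for the whole project under `models:` in `dbt_project.yml`.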
## Production Environment

- The GCP project used is the `prod` project.
- Final models (data products) are deployed in datasets by domain (e.g., `news`, `food`, `fashion`, etc.).
- All their upstream models are deployed in the `hidden` dataset in the same project, in order to encapsulate implementation details.
- Model names are simplified to aliases only, without technical prefixes, to make them more user-friendly for non-technical stakeholders (e.g., `articles`, `app_installs`, `fb_summary_daily`).
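Dropping the technical prefix only in production can be done by overriding dbt's `generate_alias_name` macro, so staging keeps the original model names. A sketch, assuming targets named `stg` and `prod` and that final models set an `alias` in their config:

```sql
-- macros/generate_alias_name.sql
-- Keep original model names everywhere except prod, where the configured
-- alias (if any) replaces them, e.g. fct_articles -> articles.
{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
    {%- if target.name == 'prod' and custom_alias_name is not none -%}
        {{ custom_alias_name | trim }}
    {%- else -%}
        {{ node.name }}
    {%- endif -%}
{%- endmacro %}
```

A final model would then carry `{{ config(alias='articles') }}` and surface as `articles` in prod while remaining `fct_articles` in stg.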
## Implementation Summary

- Customize macros.
- Use `if`/`else` conditions on `target.name` in `dbt_project.yml` to dynamically configure settings based on the environment.
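The dataset routing described above is typically implemented by overriding dbt's `generate_schema_name` macro. A sketch under two assumptions: final data products live under a `marts/` folder, and domain datasets are set via a custom schema config:

```sql
-- macros/generate_schema_name.sql
-- stg : every model goes to its domain dataset (custom schema)
-- prod: only final models (marts/) go to their domain dataset;
--       everything upstream is routed to `hidden`
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if target.name == 'prod' -%}
        {%- if 'marts' in node.fqn and custom_schema_name is not none -%}
            {{ custom_schema_name | trim }}   {# domain dataset, e.g. news #}
        {%- else -%}
            hidden                            {# encapsulate upstream models #}
        {%- endif -%}
    {%- elif custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}       {# stg: domain dataset for all models #}
    {%- else -%}
        {{ target.schema }}
    {%- endif -%}
{%- endmacro %}
```

The folder check is just one way to identify final models; a dedicated config flag or naming convention would work equally well.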
## Cost Optimization

- Partitioned tables.
- Incremental models with the `insert_overwrite` strategy: dbt generates a `merge` statement that replaces entire partitions in the destination table.
  - Static partitions:

    ```sql
    {% set partitions_to_replace = [
        'timestamp(current_date)',
        'timestamp(date_sub(current_date, interval 1 day))'
    ] %}

    {{
        config(
            materialized = 'incremental',
            incremental_strategy = 'insert_overwrite',
            partition_by = {
                'field': 'event_time',
                'data_type': 'timestamp',
                'granularity': 'day'
            },
            partitions = partitions_to_replace
        )
    }}

    with events as (

        select * from {{ ref('events') }}

        {% if is_incremental() %}
        -- recalculate yesterday + today
        where timestamp_trunc(event_time, day) in ({{ partitions_to_replace | join(',') }})
        {% endif %}

    ),

    ... rest of model ...
    ```

  - `on_schema_change: append_new_columns` to pick up newly added source columns without a full refresh.
- Materialize a model as a `table` when it is read by multiple downstream models, to avoid recomputing the same data repeatedly.
## Data Mesh

- How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
- Data Mesh Principles and Logical Architecture

The four key principles from the articles above: domain-oriented decentralized data ownership, data as a product, self-serve data infrastructure as a platform, and federated computational governance.

### Domain-oriented Decentralized Data Ownership

- From Medallion to DDD architecture
- CODEOWNERS

### Data Products