Orchestrating dbt with Dagster: A Powerful Duo for Data Pipelines

Vaibhav Srivastava
5 min read · Jun 19, 2024


What is Dagster?
Dagster is a tool specifically designed to manage the creation, scheduling, and monitoring of data pipelines. Here’s a breakdown of what that means:

  • Data pipelines are automated workflows that take in raw data, process it (clean, transform, etc.), and output it in a usable format. Imagine an assembly line for data!
  • Dagster acts as the conductor of this data assembly line. It helps you define the different stages of the pipeline (data ingestion, transformation, etc.) and ensures they run in the correct order and at the scheduled times.
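To make the “conductor” idea concrete, below is a minimal sketch of a Dagster pipeline (the asset names and logic are illustrative, not from a real project). Two assets are defined, and the second declares a dependency on the first simply by naming it as a parameter, so Dagster always runs them in the correct order:

from dagster import Definitions, asset

@asset
def raw_orders():
    # Ingestion stage: pull in raw data (hard-coded here for illustration)
    return [{"id": 1, "amount": 40}, {"id": 2, "amount": 2}]

@asset
def order_totals(raw_orders):
    # Transformation stage: the raw_orders parameter tells Dagster this
    # asset depends on raw_orders and must run after it
    return sum(order["amount"] for order in raw_orders)

# The Definitions object is what the Dagster tooling loads and shows in the UI
defs = Definitions(assets=[raw_orders, order_totals])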

What is dbt (data build tool)?
dbt (data build tool) is an open-source tool that specifically focuses on transforming data within a data warehouse. It helps analysts and engineers collaborate on building reliable and maintainable data transformations.

In simpler terms, dbt acts like an assembly line for transforming data within your data warehouse. It provides a structured approach for analysts and engineers to write, test, and deploy data transformations using familiar SQL code.

Why do we need Dagster for dbt (data build tool)?
While dbt excels at transforming data with SQL, it focuses on that specific task. Dagster comes in to orchestrate the bigger picture, offering several benefits when used together:

  • Unified Orchestration: Dagster can schedule and run dbt models alongside other data processing tasks, like Spark jobs or Python functions, within a single data pipeline.
  • Granular Control: Dagster treats each dbt model as a separate “asset,” so you can materialize, monitor, and re-run individual models instead of the whole project.
  • Dependency Management: Dagster excels at defining dependencies between data assets, so a dbt model can wait on upstream work (an ingestion job, for example) and run only after it succeeds.
  • Enhanced Observability: Dagster provides deep observability into dbt execution, surfacing per-model run history, logs, and failures in its UI.
  • Collaboration and Scalability: Dagster’s design promotes collaboration by allowing teams to focus on their own areas while offering a unified view of the entire pipeline.

In essence, Dagster complements dbt by providing a powerful orchestration layer, giving you more control, visibility, and scalability in your data workflows.

Hands-on: dbt with Dagster
Setting up a new dbt Project using Snowflake

Once the project is created, make sure your folders are mapped correctly in the dbt_project.yml file.

The dagster-dbt library allows you to seamlessly integrate dbt models into your Dagster pipelines; dagster-webserver provides the Dagster UI:

https://pypi.org/project/dagster-dbt/
https://pypi.org/project/dagster-webserver/
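Once both packages are installed, the integration centers on the @dbt_assets decorator from dagster-dbt. Here is a minimal sketch (the paths and names are placeholders, and it assumes dbt has already generated target/manifest.json via dbt parse or dbt compile):

from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

# manifest.json is produced by `dbt parse` / `dbt compile`
DBT_MANIFEST = Path("dbt_dagster/target/manifest.json")

@dbt_assets(manifest=DBT_MANIFEST)
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Every dbt model in the manifest becomes a Dagster asset;
    # `dbt build` runs models, seeds, snapshots, and tests
    yield from dbt.cli(["build"], context=context).stream()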

Create a Dagster project
Dagster has a command for scaffolding a Dagster project from an existing dbt project:

dagster-dbt project scaffold --project-name dbt_dagster_project --dbt-project-dir=dbt_dagster

Start Dagster

Once you have a project structure, you can start the development server and access the Dagster UI.

  • Navigate to your project directory in the terminal.
  • Run the following command:
DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster dev 
This starts the development server. You might be prompted to open the UI in your browser.

The Asset catalog page lists all assets in your Dagster deployment, which can be filtered by asset key and/or asset group.

The Runs page lists all job runs, which can be filtered by job name, run ID, execution status, or tag. Click a run ID to open the Run details page and view details for that run.

Dagster allows you to define schedules that automatically trigger jobs at specific intervals. Here’s a breakdown of setting up schedules in Dagster:

1. Defining the Schedule:
There are two main ways to define a schedule in Dagster:

  • ScheduleDefinition class: a declarative way to attach a cron schedule to a job.
  • @schedule decorator: offers a more granular approach, letting a Python function build the run configuration for each scheduled tick.
Image: no existing schedules for this project yet

2. Specifying the Cron Expression: The cron expression defines the scheduling pattern for your job.

Dagster also supports libraries like croniter for more complex scheduling patterns.
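Putting the two steps together, here is a sketch of a ScheduleDefinition that runs a job over all assets on a daily cron expression (the job and schedule names are illustrative):

from dagster import AssetSelection, ScheduleDefinition, define_asset_job

# A job that materializes every asset in the project
daily_dbt_job = define_asset_job(
    name="daily_dbt_job",
    selection=AssetSelection.all(),
)

# "0 6 * * *" is a cron expression: run every day at 06:00
daily_dbt_schedule = ScheduleDefinition(
    job=daily_dbt_job,
    cron_schedule="0 6 * * *",
)

The schedule also has to be passed to the project’s Definitions object (for example, Definitions(schedules=[daily_dbt_schedule], ...)) before Dagster will pick it up.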
After adding the schedule to the schedules.py file, it becomes visible in the UI.

In Dagster, materialization refers to the process of executing a computation defined by an asset and persisting the results in a designated storage location.

Image: materialization in progress
Image: materialization completed
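To see materialization outside the UI, Dagster’s materialize() helper executes assets in dependency order and persists their outputs through the configured IO manager. A minimal sketch with toy asset names:

from dagster import asset, materialize

@asset
def raw_events():
    # Toy upstream data
    return [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 0}]

@asset
def active_users(raw_events):
    # Derived asset; its output is persisted by the IO manager
    return [event["user"] for event in raw_events if event["clicks"] > 0]

if __name__ == "__main__":
    # Executes both assets in dependency order and stores the results
    result = materialize([raw_events, active_users])
    assert result.success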

Benefits of Materialization:

  • Data Persistence: Materialization ensures your generated data is stored persistently for future use by downstream assets or jobs in your pipeline.
  • Data Lineage: Dagster tracks the lineage of assets, allowing you to understand how each asset is derived from its upstream dependencies.
  • Versioning: Assets can be versioned, enabling you to track changes and revert to previous versions if necessary.

Finally, Dagster integrates with dbt to leverage dbt tests as asset checks:

  • Dagster’s asset system treats data units like tables or models as assets.
  • dbt tests can be loaded into Dagster and act as checks on the corresponding assets.
  • When a dbt test related to an asset is run, it validates the data within that asset.
Asset checks are currently an experimental feature in Dagster. The API might change in future releases.
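Here is a sketch of how this can be enabled through the dagster-dbt translator settings (this assumes a recent dagster-dbt version; since asset checks are experimental, the exact API may differ):

from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import (
    DagsterDbtTranslator,
    DagsterDbtTranslatorSettings,
    DbtCliResource,
    dbt_assets,
)

# Opt in to loading dbt tests as Dagster asset checks (experimental)
translator = DagsterDbtTranslator(
    settings=DagsterDbtTranslatorSettings(enable_asset_checks=True)
)

@dbt_assets(
    manifest=Path("dbt_dagster/target/manifest.json"),
    dagster_dbt_translator=translator,
)
def checked_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # `dbt build` runs models and their tests; test results surface in
    # the UI as asset checks on the corresponding assets
    yield from dbt.cli(["build"], context=context).stream()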

Conclusion:

By combining dbt’s data transformation capabilities with Dagster’s orchestration power, you can build robust and scalable data pipelines. This unified approach simplifies management, enhances data quality, and streamlines your overall data engineering workflow. Whether you’re a seasoned data professional or just starting your journey, exploring dbt and Dagster together can unlock a new level of efficiency and control over your data.

And that’s a wrap!

I appreciate you and the time you took out of your day to read this! Please watch out (follow & subscribe) for more. Cheers!
