Building dbt Model Lineage in Python
Introduction
In the world of data engineering, dbt has become the de facto standard for transforming data in the warehouse. As a project grows from a handful of models to hundreds or even thousands, understanding the web of dependencies—the lineage—becomes critically important. While dbt docs provides a powerful visual graph, there are many scenarios where you need programmatic access to this lineage for advanced automation and analysis.
This article will guide you through building a Python script to parse a dbt project, extract model dependencies from ref() functions, and construct a complete lineage file that you can use for powerful, automated analysis.
The "Why": The Power of Programmatic Lineage
Why do we need to go beyond the visual graph?
- Automated Impact Analysis: When you modify a core model, you need to know every downstream model and exposure that will be affected. A script can instantly generate this list for your analysis and control, preventing unexpected breakages.
- Project Auditing & Governance: Programmatically analyze your project to find unused models, identify the most critical "hub" models with many dependencies, or enforce governance rules (e.g., "staging models should not reference other staging models").
- Version-Controlled Lineage: dbt's interactive graph always shows the lineage "as of now". Building the lineage programmatically and saving it to a CSV file lets us put the lineage itself under version control, opening up the possibility of tracking how the lineage changes over time.
Let's build the tool to do it.
The Core Task: Parsing ref() with Regular Expressions
The foundation of dbt lineage is the {{ ref('model_name') }} macro. Our first step is to reliably extract the model_name from any given SQL file. Python's built-in re module for regular expressions is perfect for this.
The regex pattern we'll use (written as a raw string) is:
([rR][eE][fF]\(['"])(.+?)(['"]\))
Let's break it down:
- ([rR][eE][fF]\(['"]): matches the literal string ref (case-insensitive, thanks to [rR], [eE], [fF]) followed by the opening parenthesis ( and either a single ' or double " quote.
- (.+?): the most important part. It's a non-greedy capture group that matches one or more of any character. This captures the model name itself.
- (['"]\)): matches the closing quote (either " or ') and the closing parenthesis.
We can wrap this logic in a clean, reusable Python function. Pre-compiling the regex with re.compile() is a best practice for efficiency when a pattern is used many times, as it is here.
import re

# Pre-compile the regex for efficiency
get_ref = re.compile(r"""([rR][eE][fF]\(['"])(.+?)(['"]\))""")

def get_parents(sql_content: str) -> list:
    """
    Parses a string of SQL content and extracts all unique dbt refs.
    """
    refs = get_ref.findall(sql_content)
    if not refs:
        return []
    # The model name is the second captured group in our regex (index 1)
    ref_list = [r[1] for r in refs]
    # Return a sorted list of unique parent models
    return sorted(set(ref_list))
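To see the function in action, here is a quick self-contained check. The model names and SQL below are invented for illustration, and the pattern is redefined so the snippet runs on its own:

```python
import re

# Same pattern as above, redefined so this snippet is self-contained
get_ref = re.compile(r"""([rR][eE][fF]\(['"])(.+?)(['"]\))""")

def get_parents(sql_content: str) -> list:
    refs = get_ref.findall(sql_content)
    return sorted({r[1] for r in refs})

# Invented SQL: mixed quote styles, mixed case, and a duplicate ref
sample_sql = """
select o.order_id, c.name
from {{ ref('stg_orders') }} o
join {{ Ref("stg_customers") }} c on o.customer_id = c.customer_id
left join {{ ref('stg_orders') }} again on 1 = 1
"""

parents = get_parents(sample_sql)
print(parents)  # ['stg_customers', 'stg_orders']
```

Note that the mixed-case Ref( is matched, and the duplicate ref to stg_orders is deduplicated by the set.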
Building the Dictionary of Direct Parents
Now that we can parse a single file, we need to apply this logic to every model in our project. We'll use the modern pathlib module to recursively scan our models directory for all .sql files.
The following script iterates through each file, reads its content, and uses our get_parents function to build a dictionary that maps each model name (the key) to the list of its direct parents (the value).
import pathlib

parents_dict = {}

# Point this to your dbt project's models directory
model_directory = pathlib.Path(r'C:\data_analytics\dbt-finance\models')

# Recursively find all .sql files in the directory
sql_files = model_directory.rglob("*.sql")

for file_path in sql_files:
    with open(file_path, 'r', encoding='utf-8') as infile:
        # The model name is the file name without the .sql extension
        model_name = file_path.stem
        sql_content = infile.read()
    parents_dict[model_name] = get_parents(sql_content)

# The result is a dictionary like below:
# {'child_model': ['parent_model_1', 'parent_model_2'], ...}
# print(parents_dict)
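If you don't have a dbt project on disk to try this against, the same dictionary-building logic can be exercised on in-memory SQL strings. The model names and SQL here are invented, standing in for the .sql files scanned above:

```python
import re

get_ref = re.compile(r"""([rR][eE][fF]\(['"])(.+?)(['"]\))""")

# Hypothetical mini "project": model name -> SQL source
fake_models = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "orders_enriched": """
        select *
        from {{ ref('stg_orders') }} o
        join {{ ref('stg_customers') }} c using (customer_id)
    """,
}

# Same shape as parents_dict: each model mapped to its direct parents
parents_dict = {
    name: sorted({m[1] for m in get_ref.findall(sql)})
    for name, sql in fake_models.items()
}
print(parents_dict["orders_enriched"])  # ['stg_customers', 'stg_orders']
```

Staging models with no refs simply get an empty parent list, which is exactly what the recursion in the next section expects.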
Documenting Full Lineage Using Recursion
Recursion is a natural fit here. The dictionary tells us the direct parents of each model; with recursion, we can find the parents of each parent, the parents of each grandparent, and so on. We can then save that child-parent-grandparent chain to a CSV file, indenting each line by generation.
def print_lineage(model, parents_dict, level=0):
    lines = ["\t" * level + model]
    if model in parents_dict:
        for parent in parents_dict[model]:
            lines.extend(print_lineage(parent, parents_dict, level + 1))
    return lines
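A quick sanity check on a hand-built dictionary (model names invented) shows the indentation-by-generation at work:

```python
def print_lineage(model, parents_dict, level=0):
    # One line per model, indented one tab per generation
    lines = ["\t" * level + model]
    if model in parents_dict:
        for parent in parents_dict[model]:
            lines.extend(print_lineage(parent, parents_dict, level + 1))
    return lines

demo = {
    "orders_enriched": ["stg_customers", "stg_orders"],
    "stg_customers": [],
    "stg_orders": [],
}

for line in print_lineage("orders_enriched", demo):
    print(line.replace("\t", "    "))
# orders_enriched
#     stg_customers
#     stg_orders
```

One caveat worth knowing: this function assumes the graph has no cycles. dbt itself rejects circular refs, so a valid project is safe, but a hand-edited dictionary with a cycle would recurse forever.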
Then we call this function for every model and save the results to a CSV file.
# every model name is a key of the dictionary
parents_key = list(parents_dict.keys())

# define the output file path
output = r'C:\data_analytics\dbt-finance\models\dbt_finance_lineage.csv'

# open the file for writing (truncates any existing file)
with open(output, 'w') as f:
    for model in parents_key:
        for line in print_lineage(model, parents_dict):
            f.write(line + '\n')
        # separator between different models
        f.write('=' * 63 + '\n')
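As a taste of the automated impact analysis mentioned at the start, the same parents_dict can be inverted to answer "what breaks downstream if this model changes?". This is a sketch with invented model names, not part of the script above:

```python
from collections import defaultdict

# Hypothetical direct-parent mapping, same shape as parents_dict above
parents_dict = {
    "stg_orders": [],
    "stg_customers": [],
    "orders_enriched": ["stg_orders", "stg_customers"],
    "finance_report": ["orders_enriched"],
}

# Invert the mapping: parent -> list of direct children
children = defaultdict(list)
for child, parents in parents_dict.items():
    for parent in parents:
        children[parent].append(child)

def downstream(model, children, seen=None):
    """Return every model that transitively depends on `model`."""
    seen = set() if seen is None else seen
    for child in children.get(model, []):
        if child not in seen:
            seen.add(child)
            downstream(child, children, seen)
    return seen

print(sorted(downstream("stg_orders", children)))  # ['finance_report', 'orders_enriched']
```

Changing stg_orders affects both orders_enriched and, through it, finance_report, which is exactly the list you'd want before merging a change to a core model.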
Sample Output
Below is a short sample of the output CSV file, showing the lineage of a few SQL models in dbt.
Conclusion
With this CSV file on hand, it is easy to see the dependencies of any model. Because it is a plain file, we can version control it with Git, which in turn lets us view how the lineage changes over time.
This approach provides several key benefits:
- Automation-Ready: The generated lineage file can be consumed by other scripts for automated impact analysis
- Version Control: Track how your data model relationships evolve over time
- Governance: Easily identify problematic dependencies or unused models
- Documentation: Create a clear, text-based representation of your data pipeline architecture
This guide demonstrates a practical, scriptable approach to extracting and analyzing dbt model dependencies.