Hi, I’m Steve with GrowthLoop. Over the last 2 years, I’ve used dbt daily to build data models on Redshift, Snowflake, and BigQuery.
This is the start of a series of articles on tips, best practices, and general knowledge about dbt and how to use it to build complex data models in your Cloud Data Warehouse. This series will focus on use in BigQuery but is generally applicable to your data warehouse of choice.
First, what is dbt?
dbt (data build tool) is an open-source command-line tool that uses compilable SQL select statements to allow data analysts to transform the data in their data warehouses more effectively and with the best practices of software development. It the middle layer between your raw data, in whatever form you get it in, and your transformed data in the format you need.
There are all kinds of useful features that ship with dbt including an automatic DAG generator, automatic documentation platform, and many macros. For more on dbt, how to set it up, and what it is exactly, check out the official dbt website, or our upcoming posts.
Let's Start with the Basics
Before diving into dbt and advanced SQL tricks, let’s get our SQL basics down. One important but easily overlooked aspect of SQL is the order in which a statement actually executes.
Like the PEMDAS monstrosity above, the way you write SQL can cause confusion for readers including yourself. To reduce confusion, keep the following logical processing order in mind.
- FROM & JOIN: What tables to get the base data
- WHERE: Filters on the base data
- GROUP BY: Aggregates the base data
- HAVING: Filters on the aggregate data
- WINDOW Functions : Row-level aggregation
- SELECT: Fetch data by column name and alias
- DISTINCT: Deduplicate the fetched data
- UNION: Append additional query's results
- ORDER BY: Sort data
- LIMIT & OFFSET: Discard all rows but X rows starting at Y
So what does that mean?
There are several implications to the execution order and knowing them will make you a better SQL user.
Data reductions are best done at the top of the funnel ie JOIN and WHERE
Nine times out of ten if your query is inefficient, you have an issue in your JOINs or you have not filtered as much as you should in your WHEREs. JOINs, in particular, can be dangerous because, if you are not careful, you can inadvertently have multiple records satisfy your ON clauses that will cause duplicates.
Combat this by choosing the most appropriate JOIN type, vigorously reviewing your ON statements and defining exactly what information you need to include in your WHERE filters.
For example, if you wished to analyze recent customer payment trends, limit your query to be the smallest subset of data you can use to answer your question.
👎 DON'T
select
c.name
, o.amount_paid
from customers as c
outer join orders as o
on c.id = o.customer_id
👍 DO
select
c.name
, o.amount_paid
from customers as c
inner join orders as o
on c.id = o.customer_id
and c.start_date
where c.start_date > '2021-01-01'
Aliased columns are not addressable in WHERE clauses
Because aliasing does not occur until step 6, you cannot use the new names in your WHERE statements. This has been the cause of many frustrations and long repeated lines in queries.
Combat this by aliasing in an upstream CTE or dbt model (more on this later).
Aliased columns are addressable in ORDER BY clauses
On the other hand, aliased columns are available in ORDER BY clauses because they are the penultimate operation to run.
A final note.
This order is generally the order in which queries will be run by your database engine however the engine’s optimizer will change this order if it detects an improvement in possible efficiency. CAST functions or WHERE clauses on columns used in joins often will be run first to save on compute resources.
There are many other optimizations that database engines use in production but, I am honestly not an expert in the deeper layers of optimization.
Look out for the next article where we’ll be diving into using dbt. Thanks for reading!
✌️ Steve