Incremental models
Overview
Incremental models are built as tables in your data lake. The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, dbt transforms only the rows in your source data that you tell dbt to filter for, inserting them into the target table, which is the table that has already been built.
Often, the rows you filter for on an incremental run will be the rows in your source data that have been created or updated since the last time dbt ran. As such, on each dbt run, your model gets built incrementally.
Using an incremental model limits the amount of data that needs to be transformed, vastly reducing the runtime of your transformations. This improves warehouse performance and reduces compute costs.
Using incremental materializations
Like the other materializations built into dbt, incremental models are defined with select statements, with the materialization defined in a config block.
{{
config(
materialized='incremental'
)
}}
select ...
To use incremental models, you also need to tell dbt:
- How to filter the rows on an incremental run.
- The uniqueness constraint of the model (if any).
Filtering rows on an incremental run
To tell dbt which rows it should transform on an incremental run, wrap valid SQL that filters for these rows in the is_incremental() macro.
Often, you'll want to filter for "new" rows, as in rows that have been created since the last time dbt ran this model. The best way to find the timestamp of the most recent run of this model is by checking the most recent timestamp in your target table. dbt makes it easy to query your target table by using the {{ this }} variable.
For example, a model that includes a computationally slow transformation on a column can be built incrementally, as follows:
{{
config(
materialized='incremental'
)
}}
select
*,
my_slow_function(my_column)
from raw_app_data.events
{% if is_incremental() %}
-- this filter will only be applied on an incremental run
where event_time > (select max(event_time) from {{ this }})
{% endif %}
Optimizing your incremental model
For more complex incremental models that make use of Common Table Expressions (CTEs), you should consider the impact of the position of the is_incremental() macro on query performance.
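For instance, filtering inside the first CTE that reads the source table, before any joins or transformations, lets the engine prune data as early as possible. A minimal sketch reusing the raw_app_data.events source and my_slow_function from the example above:
{{
    config(
        materialized='incremental'
    )
}}

with events as (

    select * from raw_app_data.events

    {% if is_incremental() %}
    -- filtering here, in the first CTE, limits the rows every later CTE has to process
    where event_time > (select max(event_time) from {{ this }})
    {% endif %}

),

enriched as (

    select
        *,
        my_slow_function(my_column) as my_transformed_column
    from events

)

select * from enriched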
Defining a uniqueness constraint (optional)
A unique_key determines whether a record has new values and should be updated. Using unique_key, you can ensure that each row from the source table is represented by a single row in your incremental model, without duplicates. Not specifying a unique_key will result in append-only behavior, which means dbt inserts all rows returned by the model's SQL into the preexisting target table without regard for whether the rows represent duplicates.
This optional parameter for incremental models specifies a field (or combination of fields) that can uniquely identify each row within your model. You can define unique_key in a configuration block at the top of your model, and it can be either a single column name or a list of column names. The unique_key should be supplied in your model definition as a string representing a simple column, or as a list of single-quoted column names that can be used together, for example, ['col1', 'col2', …].
In cases where you need multiple columns in combination to uniquely identify each row, we recommend you pass these columns as a list (unique_key = ['user_id', 'session_number']) rather than a string expression (unique_key = 'concat(user_id, session_number)'). By using the first syntax, which is more universal, dbt can ensure that the columns are templated into your incremental model materialization in a way that's appropriate to your database.
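For example, a model keyed on that column combination might look like the following sketch (raw_app_data.sessions and its columns are hypothetical names used only for illustration):
{{
    config(
        materialized='incremental',
        unique_key=['user_id', 'session_number']
    )
}}

-- raw_app_data.sessions is assumed to hold one row per user session
select
    user_id,
    session_number,
    session_start,
    session_end
from raw_app_data.sessions

{% if is_incremental() %}
-- pick up sessions that started on or after the newest session already in the target table;
-- rows with a matching (user_id, session_number) pair are updated rather than duplicated
where session_start >= (select max(session_start) from {{ this }})
{% endif %}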
When you define a unique_key, you'll see this behavior for each row of "new" data returned by your dbt model:
- If the same unique_key is present in the "new" and "old" model data, dbt will update/replace the old row with the new row of data. IOMETE (Iceberg) uses the MERGE INTO syntax for this.
- If the unique_key is not present in the "old" data, dbt will insert the entire row into the table. In other words, it falls back to append-only behavior for that row.
dbt-iomete supports two incremental strategies (only for Iceberg tables):
- merge (default)
- insert_overwrite (optional)
The merge incremental strategy uses the unique_key configuration, but the insert_overwrite strategy does not use unique_key because it operates on partitions of data rather than individual rows.
For more information, see About incremental_strategy.
unique_key example
Consider a model that calculates the number of daily active users (DAUs) based on an event stream. As source data arrives, you will want to recalculate the number of DAUs for both the day that dbt last ran and any days since then. The model would look as follows:
{{
config(
materialized='incremental',
unique_key='date_day'
)
}}
select
date_trunc('day', event_at) as date_day,
count(distinct user_id) as daily_active_users
from raw_app_data.events
{% if is_incremental() %}
-- this filter will only be applied on an incremental run
where date_day >= (select max(date_day) from {{ this }})
{% endif %}
group by 1
Building this model incrementally without the unique_key parameter would result in multiple rows in the target table for a single day – one row for each time dbt runs on that day. Including the unique_key parameter instead ensures the existing row is updated.
How do I rebuild an incremental model?
If your incremental model logic has changed, the transformations on your new rows of data may diverge from the historical transformations, which are stored in your target table. In this case, you should rebuild your incremental model.
To force dbt to rebuild the entire incremental model from scratch, use the --full-refresh flag on the command line. This flag will cause dbt to drop the existing target table in the database before rebuilding it for all-time.
dbt run --full-refresh --select my_incremental_model+
It's also advisable to rebuild any downstream models, as indicated by the trailing +.
For detailed usage instructions, check out the dbt run documentation.
Understanding incremental models
When should I use an incremental model?
It's often desirable to build models as tables in your data lakehouse since downstream queries are more performant. While the table materialization also creates your models as tables, it rebuilds the table on each dbt run. These runs can become problematic in that they use a lot of compute when either:
- source data tables have millions, or even billions, of rows.
- the transformations on the source data are computationally expensive (that is, take a long time to execute); for example, complex Regex functions, or UDFs, are being used to transform data.
Like many things in programming, incremental models are a trade-off between complexity and performance. While they are not as straightforward as the view and table materializations, they can lead to significantly better performance of your dbt runs.
Understanding the is_incremental() macro
The is_incremental() macro will return True if:
- the destination table already exists in the database
- dbt is not running in full-refresh mode
- the running model is configured with materialized='incremental'
Note that the SQL in your model needs to be valid whether is_incremental() evaluates to True or False.
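To make that concrete, here is roughly how the earlier events model resolves in each case (a sketch, not actual compiled output; analytics.events stands in for whatever {{ this }} resolves to in your project):
-- First run, or `dbt run --full-refresh`: is_incremental() is false,
-- so the filter is omitted and all source rows are transformed.
select
    *,
    my_slow_function(my_column)
from raw_app_data.events

-- Subsequent runs: is_incremental() is true, so only new rows are selected
-- and then merged/inserted into the existing target table.
select
    *,
    my_slow_function(my_column)
from raw_app_data.events
where event_time > (select max(event_time) from analytics.events)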
How do incremental models work behind the scenes?
The IOMETE platform supports incremental models for Iceberg tables, which is the default table format. The incremental build will not work if file_format is explicitly set to something other than iceberg (for example, file_format = 'parquet').
Apache Iceberg’s ACID Transaction management is used to ensure this is executed as a single unit of work.
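Conceptually, when a unique_key is configured with the merge strategy, the new rows selected on an incremental run are applied with a statement along these lines (a simplified sketch of the Iceberg MERGE INTO pattern, not the exact SQL the adapter generates; the table and source names are illustrative):
merge into analytics.daily_active_users as target
using new_rows as source                        -- "new_rows" stands for the model's filtered result set
on target.date_day = source.date_day            -- join on the configured unique_key
when matched then
    update set *                                -- existing rows with the same key are replaced
when not matched then
    insert *                                    -- unseen keys are appended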
What if the columns of my incremental model change?
New on_schema_change config in dbt version v0.21.0
Incremental models can now be configured to include an optional on_schema_change parameter to enable additional control when incremental model columns change. These options enable dbt to continue running incremental models in the presence of schema changes, resulting in fewer --full-refresh scenarios and saving query costs. You can configure the on_schema_change setting as follows.
Option 1. Globally
In dbt_project.yml:
models:
  +on_schema_change: "sync_all_columns"
Option 2. Per model
Setting on the model configuration:
{{
config(
materialized='incremental',
unique_key='date_day',
on_schema_change='fail'
)
}}
The possible values for on_schema_change are:
- ignore: Default behavior (see below).
- fail: Triggers an error message when the source and target schemas diverge.
- append_new_columns: Append new columns to the existing table. Note that this setting does not remove columns from the existing table that are not present in the new data.
- sync_all_columns: Adds any new columns to the existing table and removes any columns that are now missing. Note that this is inclusive of data type changes.
None of the on_schema_change behaviors backfill values in old records for newly added columns. If you need to populate those values, we recommend running manual updates or triggering a --full-refresh.
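For example, a one-off manual backfill could look roughly like the following (my_schema.my_incremental_model and new_column are hypothetical names used only for illustration):
-- backfill a newly added column for historical rows written before the schema change
update my_schema.my_incremental_model
set new_column = 'default_value'
where new_column is null;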
Default behavior
This is the behavior if on_schema_change: ignore, which is set by default, and on older versions of dbt.
If you add a column to your incremental model and execute a dbt run, this column will not appear in your target table.
Similarly, if you remove a column from your incremental model and execute a dbt run, this column will not be removed from your target table.
Instead, whenever the logic of your incremental model changes, execute a full-refresh run of both your incremental model and any downstream models.
About incremental_strategy
On the dbt-iomete adapter, an optional incremental_strategy config controls the code that dbt uses to build incremental models. The adapter supports two incremental strategies:
- merge (default, Iceberg tables only)
- insert_overwrite (optional)
Configuring incremental strategy
The incremental_strategy config can either be specified on specific models or for all models in your dbt_project.yml file.
Option 1. Globally
In dbt_project.yml:
models:
  +incremental_strategy: "insert_overwrite"
Option 2. Per model
Setting on the model configuration:
{{
config(
materialized='incremental',
unique_key='date_day',
incremental_strategy='insert_overwrite',
...
)
}}
select ...
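Because insert_overwrite replaces whole partitions rather than matching individual rows, it is typically paired with a partitioned table. The sketch below assumes the adapter accepts a partition_by config for Iceberg tables in the same way dbt-spark does; treat it as an illustration and check the adapter documentation before relying on it:
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by=['date_day']
    )
}}

select
    date_trunc('day', event_at) as date_day,
    count(distinct user_id) as daily_active_users
from raw_app_data.events

{% if is_incremental() %}
-- reprocess the most recent day(s); the matching partitions in the target table are overwritten
where date_trunc('day', event_at) >= (select max(date_day) from {{ this }})
{% endif %}

group by 1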
Strategy-specific configs
Changelog:
- v0.20.0: Introduced merge_update_columns
- v0.21.0: Introduced on_schema_change
If you are using the merge strategy and have specified a unique_key, by default, dbt will entirely overwrite matched rows with new values.
On the dbt-iomete adapter, which supports the merge strategy, you may optionally pass a list of column names to a merge_update_columns config. In that case, dbt will update only the columns specified by the config and keep the previous values of other columns.
{{
config(
materialized = 'incremental',
unique_key = 'id',
merge_update_columns = ['email', 'ip_address'],
...
)
}}
select ...