
dbt Sources: Documenting and Testing Raw Data Before Transformation

James Okafor
Lead Data Engineer
November 3, 2027 · 10 min read

dbt sources are YAML definitions that describe the raw database tables and views that dbt models build on. They provide a governed entry point into the raw data layer — with freshness assertions, data tests, and documentation — so that transformation models have a reliable, tested foundation rather than direct references to undocumented upstream tables.

The raw tables in question are the ones written to your warehouse by data ingestion pipelines, ELT tools, or operational database replication. Without sources, a dbt model references an upstream table with a direct schema.table reference. With sources, the same table is referenced through the source() function, which adds documentation, testing, freshness monitoring, and lineage tracking that a direct table reference cannot provide.

What Sources Enable

A source definition looks like this:

    version: 2

    sources:
      - name: salesforce
        database: raw
        schema: salesforce
        description: Raw Salesforce data loaded by Fivetran
        tables:
          - name: accounts
            description: One row per Salesforce account
            loaded_at_field: _fivetran_synced
            freshness:
              warn_after: {count: 12, period: hour}
              error_after: {count: 24, period: hour}
            columns:
              - name: id
                description: Salesforce account ID
                tests:
                  - not_null
                  - unique
              - name: name
                description: Account name
                tests:
                  - not_null

And in a staging model:

    SELECT
        id as account_id,
        name as account_name,
        industry
    FROM {{ source('salesforce', 'accounts') }}

The source() function compiles to the fully qualified table reference (raw.salesforce.accounts) and also registers the dependency in dbt's lineage graph, so the model appears downstream of the source in the dbt DAG. Direct schema.table references are invisible to dbt's lineage tracking.

Source Freshness Monitoring

The most operationally valuable feature of sources is freshness checking. The dbt source freshness command compares the maximum value of a timestamp column (loaded_at_field) against the current time and reports whether the source is within the configured freshness thresholds.

Run freshness checking with:

dbt source freshness

This command queries every source table that has loaded_at_field and freshness thresholds configured and reports one of three statuses:

- **Pass** — the source is within the warn threshold

- **Warn** — the source is older than warn_after but newer than error_after

- **Error** — the source is older than error_after
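As a minimal sketch of how thresholds can be shared across tables, the fragment below declares loaded_at_field and freshness once at the source level so that every table inherits them, tightens the thresholds for one table, and opts another out with freshness: null. The opportunities and record_types tables and their threshold values are hypothetical and exist only for illustration. Depending on the dbt version, these keys may instead belong under a config block, so check the documentation for the version in use.

```yaml
version: 2

sources:
  - name: salesforce
    database: raw
    schema: salesforce
    # Declared once here; every table below inherits these unless it overrides them.
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: accounts          # inherits the 12h / 24h thresholds

      - name: opportunities     # hypothetical table with tighter thresholds
        freshness:
          warn_after: {count: 2, period: hour}
          error_after: {count: 6, period: hour}

      - name: record_types      # hypothetical static lookup table
        freshness: null         # skipped by dbt source freshness
```

Setting defaults at the source level keeps per-table configuration limited to the exceptions, which makes it harder to forget freshness on a newly added table.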

Freshness checking is the first line of defence against silent data pipeline failures. An extract that stops refreshing will not produce obvious errors in downstream models — the models run successfully, they just process stale data. Without freshness monitoring, stale data propagates silently through the transformation layer into dashboards and reports. With freshness monitoring, the pipeline failure is detected at the source before it reaches downstream consumers.

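Freshness checking is typically run as a step in the dbt orchestration workflow, before the main dbt build. Because dbt source freshness exits with a non-zero status when a source is past its error_after threshold, the orchestrator can skip model execution if source data is stale.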

Source Tests

Source columns can be tested the same way model columns can — using the same not_null, unique, accepted_values, and relationships tests. Source tests run on the raw data before transformation:

    - name: status
      tests:
        - accepted_values:
            values: ['open', 'closed', 'pending', 'cancelled']

Source tests serve a different purpose than model tests. Model tests verify that transformation logic produces correct outputs. Source tests verify that raw inputs match expectations. A source test that fails indicates a problem with the upstream data pipeline or source system, not with the dbt transformation logic.

Testing sources separately from models also makes debugging faster: when a test fails, you know immediately whether the issue is in the source data (source test failure) or in the transformation logic (model test failure).
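A relationships test on a source can be sketched in the same way. The fragment below assumes a hypothetical contacts table with an account_id column (neither appears in the original example) and checks that every value exists in the accounts table's id column. It is meant to be merged into the existing salesforce source definition rather than added as a second one, and the warn severity is an illustrative choice, not a recommendation.

```yaml
version: 2

sources:
  - name: salesforce
    tables:
      - name: contacts              # hypothetical table, for illustration only
        columns:
          - name: account_id        # hypothetical foreign key column
            tests:
              - not_null
              # Every contact should point at an account present in the raw data.
              - relationships:
                  to: source('salesforce', 'accounts')
                  field: id
                  config:
                    severity: warn  # report violations without failing the run
```

A warn severity can suit upstream issues that the analytics team cannot fix directly; the default severity of error is the stricter option when downstream models depend on the relationship holding.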

Source Documentation

Sources support the same documentation features as models: description fields at the source, table, and column levels, rendered in the dbt docs site. Well-documented sources are the foundation for data discovery — they describe the origin, meaning, and quality characteristics of the raw data that feeds every downstream transformation.

Source documentation should include:

- The system the data comes from (Salesforce, Stripe, PostgreSQL operations database)

- The tool or process that loads the data (Fivetran, Airbyte, custom pipeline)

- The refresh cadence

- Known data quality issues or limitations

- Contact information for the team responsible for the data

This documentation is especially valuable for onboarding new analytics engineers and for debugging data quality issues in production.
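One way to capture the items above in the source definition itself is sketched below. The loader, description, and meta properties are used here; the keys inside meta (owner, contact, refresh_cadence) are arbitrary names chosen for this example, and every value is a placeholder to be replaced with real information. Some dbt versions may expect meta nested under a config block.

```yaml
version: 2

sources:
  - name: salesforce
    database: raw
    schema: salesforce
    loader: fivetran                  # tool or process that loads the data
    description: >
      Raw Salesforce CRM data (source system), loaded by Fivetran.
      Known limitations: list any known data quality issues here.
    meta:
      # Hypothetical keys and placeholder values; meta accepts any key-value pairs.
      owner: "name of the responsible team"
      contact: "channel or email for questions"
      refresh_cadence: "expected load interval"
    tables:
      - name: accounts
        description: One row per Salesforce account
```

Keeping ownership and cadence in meta rather than only in free text makes them available as structured fields in the generated documentation and manifest.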

Staging Models and the Source Convention

The standard dbt project convention is that staging models — the first layer of transformation — are the only models that reference sources directly. All other models reference staging models, not sources.

This convention produces a clean separation:

- Sources: raw, as-loaded data with source() references

- Staging: one-to-one with sources, applying basic cleaning and renaming

- Marts: business logic built on staging models, never on sources directly

The convention ensures that if a source schema changes, the impact is confined to the corresponding staging model. Downstream mart models do not need to be updated because they never reference the source directly.

Source Overrides for Testing

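dbt can point source references at different physical tables per environment. The database and schema properties of a source accept Jinja, so they can be set from --vars or from the active target, and sources defined in an installed package can be redefined with the overrides property where the dbt version supports it. Either approach lets CI runs read test data rather than production source tables. This is useful for CI environments where the raw tables may not contain appropriate test data or where running dbt tests against production operational systems is undesirable.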
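A minimal sketch of the variable-based approach follows, assuming a hypothetical variable named salesforce_schema. The source keeps its name, so source('salesforce', 'accounts') calls in staging models stay unchanged, and only the schema the source resolves to varies.

```yaml
version: 2

sources:
  - name: salesforce
    database: raw
    # Defaults to the production schema; a CI run can pass a different value.
    # salesforce_schema is a hypothetical variable name chosen for this example.
    schema: "{{ var('salesforce_schema', 'salesforce') }}"
    tables:
      - name: accounts
```

A CI job could then supply the variable on the command line, for example dbt build --vars '{"salesforce_schema": "salesforce_ci"}', where salesforce_ci is a hypothetical schema holding test data.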

Our data architecture practice designs dbt project structures including source configuration, freshness monitoring, and staging architecture for enterprise analytics teams — contact us to discuss your dbt project design.
