dbt Sources: How to Configure, Document, and Test Your Data Sources

dbt sources define the raw tables that your transformations build on — the ingested data from Fivetran, Airbyte, or custom pipelines. Properly configured sources enable source freshness testing, consistent referencing across models, and clear lineage from raw data through to marts. This guide covers every aspect of dbt source configuration.

Sources serve three purposes: they make source-to-mart lineage explicit and visible, they enable freshness checks that alert when upstream data is stale, and they give raw tables a single reference point that can be updated in one place if the underlying schema changes. Most dbt projects underutilise them.

Defining Sources in schema.yml

Sources are defined in schema.yml files (or equivalently in a dedicated sources.yml file) within the models directory. A minimal source definition:

    version: 2

    sources:
      - name: salesforce
        database: raw
        schema: salesforce
        tables:
          - name: opportunity
          - name: account
          - name: contact

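The source name is the identifier you pass as the first argument to the source() function, as in source('salesforce', 'opportunity'); the table name is the second argument. The database and schema keys tell dbt where to find the raw tables, and tables lists the individual source tables.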

In model SQL, reference source tables using the source() macro:

    select *
    from {{ source('salesforce', 'opportunity') }}

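At compile time this resolves to the fully qualified table name (here, raw.salesforce.opportunity), so the query is functionally equivalent to referencing the raw table directly. The difference is that dbt uses the source() call to build the lineage graph, include the source in the docs site, and run source freshness checks.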
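If the physical table name differs from the name you want to use in models, the optional identifier property maps one to the other. This is one way the "update in one place" benefit shows up in practice. A minimal sketch, assuming a hypothetical raw table called opportunity_v2 (that name is illustrative and not part of the setup above):

```yaml
version: 2

sources:
  - name: salesforce
    database: raw
    schema: salesforce
    tables:
      - name: opportunity            # name used in source('salesforce', 'opportunity')
        identifier: opportunity_v2   # hypothetical physical table name in the warehouse
```

Models keep calling source('salesforce', 'opportunity'); only this file changes if the raw table is renamed or moved.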

Source Descriptions and Documentation

Sources should be documented with the same rigour as models. The source definition supports descriptions at the source level, table level, and column level:

    sources:
      - name: salesforce
        description: "Salesforce CRM data synced via Fivetran. Reflects current state of Salesforce objects as of last Fivetran sync."
        database: raw
        schema: salesforce
        tables:
          - name: opportunity
            description: "Salesforce opportunities. One row per opportunity. Includes all stages from prospecting through closed-won and closed-lost."
            columns:
              - name: id
                description: "Salesforce opportunity ID. Primary key."
              - name: amount
                description: "Opportunity value in USD. Null for opportunities without an entered amount."
              - name: stage_name
                description: "Current opportunity stage. Values: Prospecting, Qualification, Needs Analysis, Value Proposition, Decision Makers, Perception Analysis, Proposal/Price Quote, Negotiation/Review, Closed Won, Closed Lost."

Source documentation appears in dbt docs alongside model documentation, giving analysts and downstream users a complete picture of the lineage from raw source through to marts.

Source Freshness Testing

Source freshness testing is one of the most valuable features in dbt sources. It defines expected data freshness and alerts when the data is overdue.

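    sources:
      - name: salesforce
        database: raw
        schema: salesforce
        # Each threshold needs both a count and a period.
        # The counts below are illustrative; match them to your sync schedule.
        freshness:
          warn_after:
            count: 12
            period: hour
          error_after:
            count: 24
            period: hour
        loaded_at_field: _fivetran_synced
        tables:
          - name: opportunity
          - name: account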

The loaded_at_field is the column that indicates when the row was last loaded into the raw table. Fivetran adds _fivetran_synced to every table it manages. Other ingestion tools use different column names; check your ingestion tool's documentation.

Running dbt source freshness evaluates the maximum loaded_at_field value for each source table against the configured thresholds:

- If the most recent row is older than the warn_after threshold, a warning is returned

- If older than the error_after threshold, an error is returned

**Pipeline integration:** Run dbt source freshness before dbt run in your production pipeline. A freshness error indicates upstream data has not arrived — proceeding to run dbt models on stale data produces stale outputs that may actively mislead business users. The freshness check is the early warning system.

**Per-table overrides:** Individual tables can override the source-level freshness configuration:

    tables:
      - name: opportunity
        freshness:
          warn_after:
            count: 6        # illustrative; a count is required alongside period
            period: hour
        loaded_at_field: _fivetran_synced
      - name: account
        freshness: null     # No freshness check: this table changes infrequently

Tables where no freshness check is appropriate (rarely-updated reference tables, static seed data synced weekly) can disable freshness checking explicitly with freshness: null.

Source Tests

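Sources support the same generic tests as models: not_null, unique, accepted_values, and relationships. Source tests query the raw tables directly, so they can run before any model is built (for example, with dbt test --select "source:salesforce"). A typical definition: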

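    tables:
      - name: opportunity
        columns:
          - name: id
            tests:
              - not_null
              - unique
          - name: stage_name
            tests:
              # List every stage from the column description; any value
              # missing here makes the test fail for rows in that stage.
              - accepted_values:
                  values:
                    - 'Prospecting'
                    - 'Qualification'
                    - 'Needs Analysis'
                    - 'Value Proposition'
                    - 'Decision Makers'
                    - 'Perception Analysis'
                    - 'Proposal/Price Quote'
                    - 'Negotiation/Review'
                    - 'Closed Won'
                    - 'Closed Lost'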

Source-level tests catch upstream data quality issues before they propagate through the transformation pipeline. A unique test on opportunity.id that fails indicates the source system produced duplicate records — this is an upstream problem that should be fixed at the source, not silently worked around in the staging model.
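The relationships test is listed among the generic tests but is the least familiar of the four on sources. It checks referential integrity between two source tables. A sketch, assuming the opportunity table carries an account_id column that points at account.id (the column name is an assumption; check how your ingestion tool names it):

```yaml
version: 2

sources:
  - name: salesforce
    database: raw
    schema: salesforce
    tables:
      - name: account
      - name: opportunity
        columns:
          - name: account_id   # assumed column name, not confirmed by the source system
            tests:
              - relationships:
                  to: source('salesforce', 'account')
                  field: id
```

A failure here means some opportunities reference accounts that are missing from the raw account table, which often points to a partial or out-of-order sync.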

**Test severity on sources:** Most source tests should be warn severity, not error. Source data quality issues are upstream problems that the data team did not cause and cannot always fix immediately. Error severity would block the entire pipeline for upstream issues that may resolve on the next sync. Warn severity keeps the pipeline running while making the issue visible.
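Severity is set per test through its config block. A minimal sketch that downgrades the two tests on opportunity.id to warnings:

```yaml
version: 2

sources:
  - name: salesforce
    database: raw
    schema: salesforce
    tables:
      - name: opportunity
        columns:
          - name: id
            tests:
              - not_null:
                  config:
                    severity: warn   # report the failure but do not fail the run
              - unique:
                  config:
                    severity: warn
```

Tests that guard assumptions your models cannot survive without, such as a primary key used in joins, may still deserve error severity; that is a judgment call per project.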

Organising Sources

Large dbt projects benefit from a deliberate source organisation strategy. Options:

**One sources.yml per source system:** A sources.yml file in the staging directory for each source system: salesforce_sources.yml, stripe_sources.yml, postgres_sources.yml. Easy to find all tables from a given source system.

**One sources.yml per domain:** A sources.yml for CRM sources, one for finance sources, one for product analytics sources. Aligns with domain-based model organisation.

**Centralised sources.yml:** All sources in a single file. Simple for small projects; unwieldy at scale.

The most common and maintainable approach for mid-size projects: one sources.yml per source system in a dedicated sources/ directory at the root of the models/ directory.

Source Configurations and Meta Fields

Sources support config blocks for additional properties:

    sources:
      - name: salesforce
        config:
          tags: ['crm', 'salesforce']
          meta:
            owner: "Sales Operations"
            pii: true
            access_tier: "restricted"

Tags enable selecting sources by category in dbt commands (dbt source freshness --select tag:crm). Meta fields store arbitrary metadata that appears in the dbt docs site, is written to the manifest.json artifact, and is accessible via the dbt Cloud Discovery API (formerly the Metadata API).

The owner and pii meta fields are particularly useful for governance: they make data ownership and sensitivity classification explicit and visible in the documentation without requiring a separate catalogue.

When to Add a Source vs. Use ref()

The distinction:

- Use source() for raw tables created by an ingestion layer (Fivetran, Airbyte, custom ETL) that dbt did not create

- Use ref() for tables created by dbt models

Never use raw SQL table references in dbt models — always either source() or ref(). This ensures that dbt's lineage graph is complete and that source freshness testing covers all upstream dependencies.

Our data engineering consulting practice implements dbt transformation architectures with complete source documentation and freshness testing — contact us to discuss dbt architecture for your data environment.
