
dbt Seeds: Managing Static Reference Data in Your Analytics Project

Austin Duncan
Project Manager & Data Strategist
November 3, 2027 · 9 min read

dbt seeds are CSV files that dbt loads into the data warehouse as tables, providing a version-controlled mechanism for managing static reference data — dimension lookup tables, configuration values, business rules, and other small datasets that do not change frequently and should be part of the analytics codebase rather than a separate data pipeline.

Because the CSV files are committed to the dbt project repository, seeds give you a version-controlled, auditable way to manage the small, static reference datasets that analytics pipelines need: country code lookups, product category mappings, fiscal calendar definitions, cost centre hierarchies, and similar data. This is data that fits neither a live data source (it is too small to warrant an ingestion pipeline) nor a hardcoded calculation (it is too complex to maintain as a CASE WHEN statement).

What Seeds Are For

Seeds occupy a specific niche in the data stack. They are appropriate when:

**The data changes infrequently** — seeds are loaded when dbt seed is run. For data that changes daily or more frequently, a proper data pipeline is more appropriate. Seeds are right for data that changes monthly, quarterly, or less.

**The data is maintained by the analytics team** — seeds are in the analytics codebase. They are appropriate for reference data where the analytics team owns the definition — fiscal calendars, analytical mappings, business rules the data team maintains. They are not appropriate for reference data maintained by other systems (product catalogs maintained by a product database, for example).

**The dataset is small** — seeds are CSV files committed to version control. Large CSV files make the repository slow to clone and difficult to review in pull requests. As a rule of thumb, seeds should be under 10,000 rows. Larger reference datasets should be in the warehouse and referenced as sources.

**History and auditability matter** — because seeds are in version control, every change to a seed is recorded with a commit message, timestamp, and author. This audit trail is valuable for reference data where the history of changes matters — knowing when a fiscal calendar definition changed, for instance.

Seed Configuration

Project-level seed settings, such as the target schema and column types, go in dbt_project.yml. Per-seed documentation and tests go in YAML properties files alongside the CSVs in the seeds/ directory:

dbt_project.yml:

    seeds:
      your_project_name:
        +schema: seed_data
        country_codes:
          +column_types:
            country_code: varchar(2)
            country_name: varchar(100)

YAML documentation for a seed:

    version: 2

    seeds:
      - name: fiscal_calendar
        description: Company fiscal calendar mapping calendar dates to fiscal periods
        columns:
          - name: calendar_date
            description: Calendar date
            tests:
              - not_null
              - unique
          - name: fiscal_year
            description: Fiscal year number (e.g., 2024 for FY2024)
          - name: fiscal_quarter
            description: Fiscal quarter (Q1-Q4 within fiscal year)

Seeds support the same documentation and testing features as models — column descriptions and data tests are applied to the loaded seed table.
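Because a loaded seed is referenced with ref() like any other dbt relation, it can also be the target of a relationships test on a model. The sketch below assumes a hypothetical orders model with a country_code column, checked against the country_codes seed configured earlier. The model name, file path, and column are illustrative only.

```yaml
# models/schema.yml (hypothetical file and model names)
version: 2

models:
  - name: orders              # assumed model, for illustration only
    columns:
      - name: country_code
        tests:
          # Every country_code in orders must exist in the seed table
          - relationships:
              to: ref('country_codes')
              field: country_code
```

A failing test here would usually mean either bad upstream data or a code that still needs to be added to the seed CSV.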

Common Seed Use Cases

**Country and region mappings** — a CSV mapping ISO country codes to country names, regions, and sub-regions. Used in models that need to roll up international data into geographic segments.

**Fiscal calendar** — a CSV with one row per calendar date, with columns for fiscal year, fiscal quarter, fiscal month, fiscal week, and fiscal period. This enables accurate fiscal period reporting without requiring date arithmetic in every model.

**Cost centre hierarchy** — a CSV mapping cost centre codes to their parent hierarchy — department, division, business unit. Used for allocating shared costs or rolling up financial data through the organisation structure.

**Product category hierarchy** — a CSV mapping individual SKUs or product codes to their analytical category hierarchy. Maintained by the analytics team based on how products should be grouped for reporting, which may differ from the product database's own categorisation.

**Exchange rate overrides** — a CSV of historical exchange rates used for currency conversion, where the analytics team wants explicit control over the rates used rather than pulling live rates from an external API.

Loading Seeds

Run seeds with:

dbt seed

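By default, dbt seed reloads every seed on each run: if the table already exists, dbt (on most adapters) truncates it and reinserts the rows from the CSV rather than dropping it. If a seed's columns have changed, run dbt seed --full-refresh to drop and recreate the table with the new schema. To avoid reloading seeds that are large or rarely change, limit the run to the seeds you need with --select, for example dbt seed --select country_codes.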

Seeds can be run as part of a larger dbt workflow:

dbt seed && dbt run && dbt test

This ensures seed tables are always up to date before model execution begins.

Seeds in Testing Environments

Seeds are especially useful in CI and development environments for providing test reference data. A seed that maps test customer IDs to test customer attributes enables CI tests to run against realistic data without depending on production sources.

For environments that need to test against realistic data sizes, seeds can be split into a full version (for production) and a truncated version (for CI), with the appropriate version selected via dbt target configuration.
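One way to sketch this, assuming the full file lives in seeds/full/ and the truncated copy in seeds/ci/ (both directory names are hypothetical), is to enable each folder only for the matching target in dbt_project.yml. The target name prod and the seed name customer_attributes are also assumptions. The pattern relies on dbt accepting two seeds with the same name as long as only one of them is enabled, which is worth verifying on your dbt version.

```yaml
# dbt_project.yml (sketch; directory, seed, and target names are assumptions)
seeds:
  your_project_name:
    full:
      # seeds/full/customer_attributes.csv loads only on the prod target
      +enabled: "{{ (target.name == 'prod') | as_bool }}"
    ci:
      # seeds/ci/customer_attributes.csv loads on every other target
      +enabled: "{{ (target.name != 'prod') | as_bool }}"
```

With this layout, models can use the same ref() call in every environment, since only one copy of the seed is enabled for any given target.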

When Not to Use Seeds

**Transactional data** — seeds are inappropriate for any data that changes as a result of business operations. Use sources to reference operational tables, not seeds.

**Large datasets** — CSV files above a few thousand rows are cumbersome to review in pull requests and slow to load on every dbt seed run. Large reference datasets belong in the warehouse as source tables.

**Data maintained by other teams** — if the data is owned by another team's system, reference it as a source. Only use seeds for data the analytics team owns and maintains.
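In the cases above, the table is declared as a source instead of being loaded as a seed. A minimal sketch, assuming a hypothetical product_db source whose product_catalog table is loaded into a product schema by another team's pipeline:

```yaml
# models/staging/sources.yml (hypothetical path and names)
version: 2

sources:
  - name: product_db            # assumed source name
    schema: product             # assumed warehouse schema
    tables:
      - name: product_catalog   # owned and loaded by the product team
```

Models would then read the table with {{ source('product_db', 'product_catalog') }} rather than ref().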

