dbt Documentation: How to Write It, Maintain It, and Make It Actually Useful

James Okafor
Data & Cloud Engineer
July 14, 2027 · 10 min read

dbt provides a documentation layer that, when used well, makes an analytics codebase genuinely self-documenting. When used poorly — descriptions that restate the column name, stale documentation that does not match the current model — it creates noise rather than value. This guide covers what good dbt documentation looks like and how to maintain it at scale.

The documentation layer consists of column descriptions in schema.yml files, model-level descriptions, and the generated docs site. When business users, analysts, and new team members can find what they need in the dbt docs without asking someone, the documentation is working. When the docs site contains hundreds of undescribed columns, outdated descriptions that do not match the current model logic, and model names that require domain knowledge to parse, the documentation is noise.

The difference between useful dbt documentation and noise is almost entirely about what you write in each description — not about how much you write.

What Good Column Documentation Looks Like

Column documentation should answer the questions that a new analyst would ask before using the column:

- What does this column mean in business terms?

- What is the unit or format?

- What values are valid or typical?

- Are there any important edge cases or known quirks?

What bad column documentation looks like: a description that restates the column name. If the column is named 'customer_id' and the description says "The customer ID", the documentation has provided no information beyond what the name already conveyed.

Compare:

**Not useful**: "customer_lifetime_value: The customer lifetime value."

**Useful**: "customer_lifetime_value: The total revenue earned from this customer across all closed transactions to date, in USD. Excludes refunded transactions. This is a historical figure; see 'predicted_lifetime_value' for the forward-looking estimate. New customers who have not yet transacted will show 0."

The useful description:

- States the precise definition (total revenue, closed transactions, USD)

- States what is excluded (refunds) — important for correct interpretation

- Distinguishes from a related concept (predicted LTV) that users might confuse it with

- States the expected value for a specific edge case (new customers)
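Descriptions that only restate the column name can be caught mechanically. The following is a minimal sketch, not a dbt feature: the function name `restates_name` and the filler-word list are assumptions made for illustration, and the check is deliberately crude (it flags a description only when every meaningful word in it already appears in the column name).

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase word tokens, ignoring filler words that carry no meaning."""
    filler = {"the", "a", "an", "of", "for", "this", "is"}  # assumed list
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in filler}


def restates_name(column_name: str, description: str) -> bool:
    """True when the description adds no words beyond the column name."""
    desc_tokens = _tokens(description)
    # An empty description is a different problem (missing, not restated).
    if not desc_tokens:
        return False
    return desc_tokens <= _tokens(column_name)


if __name__ == "__main__":
    print(restates_name("customer_id", "The customer ID"))
    print(restates_name(
        "customer_lifetime_value",
        "The total revenue earned from this customer, in USD.",
    ))
```

A check like this only catches the worst case. It cannot tell whether a longer description is accurate or answers the four questions above, so it complements review rather than replacing it.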

Model-Level Documentation

Model descriptions should explain why the model exists — the analytical purpose it serves — and what it produces, including the grain.

What not to write: a technical description of what the SQL does. Anyone reading the model can see the SQL; restating it in prose adds nothing.

What to write: the business question the model answers, the grain (one row per what?), any important caveats about the data, and any known data quality issues that users should be aware of.

Example model description:

"Orders summary for the finance and revenue analytics team. One row per calendar month, per sales region, per product category. Revenue is recognised at shipment date, not order date. Excludes internal test orders (flagged in the orders source). Known issue: pre-2022 orders do not have region assigned (region is null) due to a data gap in the legacy order system."

This description tells a finance analyst everything they need to know before deciding whether this model serves their needs — and what limitations to be aware of if they use it.
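If a team agrees to state the grain with a fixed phrase, its absence can be detected automatically. The sketch below assumes the convention "one row per" (a team convention, not something dbt enforces) and reads a dict shaped like the `nodes` section of dbt's manifest.json artifact. The function name `models_missing_grain` is hypothetical.

```python
def models_missing_grain(manifest: dict) -> list[str]:
    """Return unique_ids of models whose description never states a grain.

    Assumes the team convention that every model description contains
    the phrase "one row per".
    """
    missing = []
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and so on
        description = (node.get("description") or "").lower()
        if "one row per" not in description:
            missing.append(unique_id)
    return sorted(missing)


if __name__ == "__main__":
    example = {
        "nodes": {
            "model.shop.orders_summary": {
                "resource_type": "model",
                "description": "One row per calendar month, per sales region.",
            },
            "model.shop.stg_orders": {
                "resource_type": "model",
                "description": "Staged orders.",
            },
        }
    }
    print(models_missing_grain(example))
```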

Documentation for Sources

Source documentation is often neglected because sources are external — "not our data." But analysts connecting to sources need the same information they need for any other data asset.

For each source table, document:

- What system it comes from and what it represents

- The refresh frequency or lag

- Known data quality issues in the source (for example, the system allows nulls in a field that should always be populated, or the data arrives with a 24-hour lag)

- Any important differences between the source's field names and what the fields mean in a business context

Source documentation at this level prevents analysts from misinterpreting source data or assuming freshness that the source cannot provide.
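Neglected sources are easy to list. As a rough sketch, the function below walks a dict shaped like the `sources` section of manifest.json and reports source tables, and the columns declared under them, that have no description. The function name `undocumented_sources` and the `<table>` marker are illustrative assumptions. It only detects missing text; whether a description actually covers the refresh lag or known quality issues still needs a human reader.

```python
def undocumented_sources(manifest: dict) -> dict[str, list[str]]:
    """Map each source table to the declared items that lack a description."""
    gaps: dict[str, list[str]] = {}
    for source in manifest.get("sources", {}).values():
        label = f"{source.get('source_name')}.{source.get('name')}"
        missing = []
        if not (source.get("description") or "").strip():
            missing.append("<table>")  # marker for the table-level description
        for column_name, column in (source.get("columns") or {}).items():
            if not (column.get("description") or "").strip():
                missing.append(column_name)
        if missing:
            gaps[label] = missing
    return gaps
```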

Maintaining Documentation at Scale

Documentation maintenance is the harder problem. Documentation written at model creation becomes inaccurate as models evolve. A description that was correct in version 1 of the model may be wrong after version 3 changed what the model includes.

**Documentation review in pull requests**: Add documentation accuracy to the code review checklist. When a model is modified, the reviewer checks whether the existing column descriptions are still accurate and whether the model-level description reflects the current design. This catches documentation drift before it reaches main.

**New columns require descriptions before merge**: Set a project norm that any pull request adding a new column must include a column description. Enforce this in review; do not merge undescribed columns. This stops documentation debt from accumulating in the first place.
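The "no undescribed new columns" norm can be backed by a CI check. The sketch below is one possible approach, assuming two manifest.json artifacts have already been loaded as dicts: one from the main branch (`base`) and one from the pull request (`head`). The function name is hypothetical. Note that the manifest lists only columns declared in YAML, so a column that exists in the SQL but was never added to schema.yml will not be caught here.

```python
def new_undescribed_columns(base: dict, head: dict) -> list[tuple[str, str]]:
    """Columns declared in `head` but not in `base` that have no description."""
    offenders = []
    base_nodes = base.get("nodes", {})
    for unique_id, node in head.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        base_columns = base_nodes.get(unique_id, {}).get("columns") or {}
        for name, column in (node.get("columns") or {}).items():
            is_new = name not in base_columns
            if is_new and not (column.get("description") or "").strip():
                offenders.append((unique_id, name))
    return sorted(offenders)
```

In a CI job, a non-empty result would fail the build and print the offending model and column names for the author to fix.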

**Periodic documentation audits**: Quarterly, review the dbt docs site for the most-used models. Are the descriptions still accurate? Are there columns that have been added since the last review without descriptions? The audit should be lightweight — a 2-hour review for the top 20 models, not a comprehensive review of every model.
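A coverage number per model gives the quarterly audit a starting point. This is a minimal sketch under stated assumptions: the manifest dict mirrors manifest.json, and the list of most-used models is supplied by the caller (how usage is measured is outside this example). The function name `description_coverage` is illustrative.

```python
def description_coverage(manifest: dict, model_names: list[str]) -> dict[str, float]:
    """Fraction of declared columns with a non-empty description, per model."""
    wanted = set(model_names)
    coverage: dict[str, float] = {}
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        name = node.get("name")
        if name not in wanted:
            continue
        columns = node.get("columns") or {}
        if not columns:
            coverage[name] = 0.0  # nothing declared counts as undocumented
            continue
        described = sum(
            1 for c in columns.values() if (c.get("description") or "").strip()
        )
        coverage[name] = described / len(columns)
    return coverage
```

Coverage says nothing about accuracy, so the number is best used to decide where the two hours of manual review should go.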

**Use meta fields for structured information**: dbt's meta field in schema.yml accepts arbitrary key-value pairs that appear in the generated docs. Use meta for structured information you want recorded consistently: owner (the team responsible), sensitivity_level (for access control documentation), source_system (for lineage documentation). Because meta is structured, it can be queried programmatically, for example from the manifest.json artifact or through the dbt Cloud Discovery API (formerly the Metadata API).
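Consistency of meta keys is straightforward to verify. The sketch below assumes the three keys named above are required on every model and reads them from a manifest.json-shaped dict. Depending on dbt version and where meta was declared, it may appear at the node's top level or under `config`, so the example merges both. The function name and the required-key tuple are illustrative choices.

```python
REQUIRED_META_KEYS = ("owner", "sensitivity_level", "source_system")


def missing_meta(
    manifest: dict, required: tuple[str, ...] = REQUIRED_META_KEYS
) -> dict[str, list[str]]:
    """Map each model's unique_id to the required meta keys it lacks."""
    problems: dict[str, list[str]] = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        config_meta = (node.get("config") or {}).get("meta") or {}
        node_meta = node.get("meta") or {}
        meta = {**config_meta, **node_meta}  # node-level values win on conflict
        missing = [key for key in required if key not in meta]
        if missing:
            problems[unique_id] = missing
    return problems
```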

Making the Docs Site Accessible

The dbt docs site (generated with 'dbt docs generate', then served locally with 'dbt docs serve' or hosted through dbt Cloud) is only useful if people know it exists and can find what they need in it.

**Publish and promote the docs site**: Deploy the docs site to a URL that all analysts can access. Link to it from the data team's Notion/Confluence pages, Slack channel descriptions, and onboarding documentation. A docs site that no one visits is as useful as no docs.

**Search-first navigation**: The dbt docs site's search is its most useful navigation feature for large projects with many models. Ensure model names and descriptions use the business terms that users would search for — not data engineering shorthand.
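One way to test whether business vocabulary is discoverable is to take a list of terms analysts actually use and see which ones match nothing. The sketch below is a rough approximation: it does a case-insensitive substring match against model names and descriptions in a manifest.json-shaped dict, which is simpler than the docs site's real search. The function name and the example terms are assumptions.

```python
def unmatched_search_terms(manifest: dict, terms: list[str]) -> list[str]:
    """Return the terms that appear in no model name or description."""
    searchable = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        text = f"{node.get('name') or ''} {node.get('description') or ''}"
        searchable.append(text.lower())
    return [
        term for term in terms
        if not any(term.lower() in text for text in searchable)
    ]
```

Terms that come back unmatched are candidates for either a rename or a description that uses the word analysts would type.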
