
What Is a Data Dictionary? Documenting the Meaning of Your Data

Austin Duncan
Project Manager
August 21, 2028 · 8 min read

A data dictionary is a structured reference document that defines the meaning, source, format, and ownership of data elements in a system. This guide explains what a data dictionary contains, why it matters for analytics and governance, and how dbt makes it easier to maintain.

A data dictionary is a reference document that defines the data elements in a system — what each field means, where it comes from, what values it can take, who owns it, and how it relates to other fields. It is the authoritative source of record for the semantics of data, not just its structure.

Every organization has implicit data dictionaries — the knowledge that lives in the heads of the analysts and data engineers who built the tables. "revenue" in the sales table includes refunds but excludes tax; "active" in the user status field means paid in the last 30 days; "customer" excludes internal test accounts. When that knowledge is documented, it is a data dictionary. When it is not, it is tribal knowledge that walks out the door with every departure.

What a Data Dictionary Contains

For each documented data element:

**Name:** The field name as it appears in the database or data model, typically in snake_case for warehouse tables.

**Description:** A plain-language explanation of what the field represents. Not what the data type is — that is self-evident from the schema. What it means: "Revenue from customer orders, net of refunds, excluding tax and shipping. Includes subscription renewals. Does not include professional services revenue."

**Source:** Where the data originates — which source system, which API endpoint, which upstream table in the transformation pipeline. Lineage in prose form.

**Business owner:** The person or team responsible for the accuracy and meaning of this field — typically the business stakeholder who owns the underlying process (Finance owns revenue definition, Product owns user activity metrics).

**Data type and format:** INTEGER, VARCHAR(255), DATE, TIMESTAMP — plus format conventions where relevant (ISO 8601 for dates, UTC for timestamps).

**Accepted values:** For categorical fields, the set of valid values and what each means. For example, status may be 'active', 'churned', 'trial', or 'paused', each with a description of what the value means and how a record moves into or out of that state.

**Nullability:** Whether the field can be NULL and under what circumstances.

**Calculation logic:** For derived or calculated fields, the business logic or formula. "Days since last login, calculated from auth_events.last_login_at and CURRENT_DATE."

**Notes and caveats:** Historical quirks, known data quality issues, deprecated interpretations. "Prior to 2022-03-01, this field was populated from a different source and may be inconsistent. Filter to created_at >= '2022-03-01' for reliable data."
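As a rough sketch, the attributes above could be captured for a single field in a tool-agnostic record like the one below. The key names, the nullability rule, and the placeholder note are illustrative assumptions rather than any standard; the calculation text and the owner reuse the examples given above.

```yaml
# Illustrative data dictionary entry.
# Key names are hypothetical; no standard schema is implied.
- name: days_since_last_login
  description: Days elapsed since the user last logged in.
  source: auth_events.last_login_at
  business_owner: Product          # Product owns user activity metrics
  data_type: INTEGER
  accepted_values: null            # not a categorical field
  nullable: true                   # assumption: NULL when the user has never logged in
  calculation: >
    Days since last login, calculated from auth_events.last_login_at
    and CURRENT_DATE.
  notes: >
    Placeholder. Record historical quirks, known data quality issues,
    or deprecated interpretations here.
```

Whatever format is chosen, the point is that each attribute has a fixed place, so a missing owner or an undocumented caveat is visible as an empty key rather than silently absent.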

Why Data Dictionaries Fail

Most organizations have attempted a data dictionary and abandoned it. The failure mode is almost always the same: the dictionary is created as a separate artifact (a Confluence page, a spreadsheet, a standalone tool) that must be manually updated whenever the data model changes. Within months, the dictionary is out of date. It describes tables that no longer exist, uses field names that have been renamed, and omits columns that have been added. An out-of-date data dictionary is worse than no dictionary — it actively misleads.

The solution is not better discipline; it is co-location. Documentation that lives with the code it describes survives; documentation that lives separately does not.

dbt Documentation as Data Dictionary

dbt's YAML schema files provide data dictionary functionality co-located with transformation logic. Each model and column can have a description, tests, and metadata defined in YAML alongside the model's SQL.
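A minimal sketch of such a schema file follows. The file path, the model name dim_customers, the column names, and the meta keys are hypothetical; description, tests, and meta are standard dbt properties, though the exact spelling varies by dbt version (newer releases prefer data_tests over tests, for example).

```yaml
# models/marts/schema.yml (hypothetical path)
version: 2

models:
  - name: dim_customers            # hypothetical model
    description: One row per customer, excluding internal test accounts.
    meta:
      owner: finance-analytics     # hypothetical owning team
    columns:
      - name: customer_id
        description: Primary key for the customer.
        tests:
          - unique
          - not_null
      - name: status
        description: Current lifecycle state of the customer.
        tests:
          - accepted_values:
              values: ['active', 'churned', 'trial', 'paused']
      - name: net_revenue
        description: >
          Revenue from customer orders, net of refunds, excluding tax and
          shipping. Includes subscription renewals. Does not include
          professional services revenue.
        meta:
          business_owner: Finance  # free-form metadata; key name is an assumption
```

The tests double as executable documentation: if a new status value appears upstream, the accepted_values test fails, which prompts an update to the documented list.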

When a model is renamed, the developer updates both the SQL and the YAML in the same pull request. When a column is deprecated, the description is updated. When a test is added to validate a business rule, the documentation describes it. The dbt docs generate command builds a browsable documentation site from these YAML definitions, and dbt docs serve hosts it locally. The result is a data dictionary that is version-controlled in the same repository as the transformation code and changes in the same pull requests.

dbt column descriptions can be reused across models with docs blocks: a field's canonical description is written once in a Markdown file and referenced with the doc() function wherever the field appears.
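One way to share a description between models is dbt's docs block mechanism, sketched below. The docs block name, the Markdown file path, and both model names are hypothetical; the docs/enddocs tags and the doc() function are dbt features.

```yaml
# The canonical text lives once in a Markdown file, for example
# models/docs.md (hypothetical path), containing:
#
#   {% docs net_revenue %}
#   Revenue from customer orders, net of refunds, excluding tax and
#   shipping. Includes subscription renewals.
#   {% enddocs %}
#
# Each model that exposes the field then references the block by name.
version: 2

models:
  - name: fct_orders               # hypothetical model
    columns:
      - name: net_revenue
        description: '{{ doc("net_revenue") }}'

  - name: fct_monthly_revenue      # hypothetical model
    columns:
      - name: net_revenue
        description: '{{ doc("net_revenue") }}'
```

Editing the docs block changes the description everywhere it is referenced the next time the documentation site is generated, so the definition cannot drift between models.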

Data Catalog vs Data Dictionary

A data dictionary describes the meaning of specific fields and tables. A data catalog is a broader system that indexes all data assets in an organization's environment, providing search, lineage, quality monitoring, and governance workflow on top of the documented definitions.

Tools like Atlan, Alation, Collibra, and DataHub are data catalogs — they integrate with dbt, Tableau, Snowflake, and other systems to pull metadata and present a unified discovery interface. The data dictionary (the field definitions and descriptions) feeds into the catalog; the catalog provides the organizational infrastructure around it.

For organizations building their first data documentation practice, starting with dbt's built-in YAML documentation provides immediate, low-cost value. For larger organizations with complex multi-system environments and governance requirements, a dedicated catalog tool may be warranted.

Our data architecture services and BI strategy practice help organizations build data documentation practices, from initial dbt schema documentation through catalog tool implementation and governance workflow design. Contact us to discuss your data governance requirements.
