
What Is a Data Catalog? Finding and Trusting Your Data Assets

Austin Duncan
Project Manager & Data Strategist
March 15, 2028 · 10 min read

This guide explains what a data catalog does, what it cannot do alone, and how it fits into a broader data governance strategy.

A data catalog is a centralized inventory of an organization's data assets — tables, dashboards, columns, pipelines, and data products — with the metadata needed to find, understand, and trust them. It answers the questions data analysts and engineers ask constantly: what data do we have, where does it live, what does this column mean, who owns this table, is this the right version to use, and is this data fresh?

The Problem Without a Catalog

At a small organization, these questions are answered by knowing the right person to ask or by having been the person who built the pipeline. At scale, this breaks down. Data teams grow, the number of tables multiplies, people leave and take tribal knowledge with them. An analyst spends hours determining whether to use "users" or "dim_users" or "users_final" — and makes the wrong choice. Two teams build the same metric with different business logic, producing different numbers from the same source, leading to conflicting reports in the same board meeting.

These are symptoms of missing metadata, not missing data. The data exists. The problem is that it cannot be found, understood, or trusted without significant effort.

What a Data Catalog Contains

**Technical metadata** — the structural facts: table names, column names, data types, row counts, null rates, primary keys, foreign keys. This is the schema-level information that tells you what exists and how it is structured. Modern catalogs ingest technical metadata automatically from warehouses and databases via connectors.
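
As a rough illustration of automatic ingestion, the sketch below harvests technical metadata from a SQLite database using Python's standard library. The function name `harvest_technical_metadata` and the sample `dim_users` table are assumptions made for this example; real catalog connectors are considerably more involved, but the idea is the same: read the schema, then compute simple statistics.

```python
import sqlite3


def harvest_technical_metadata(conn: sqlite3.Connection, table: str) -> dict:
    """Collect schema-level facts for one table: columns, types, keys, row count, null rates."""
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
    # Identifiers cannot be bound as parameters, so only pass trusted table names here.
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]

    column_facts = []
    for _cid, name, col_type, _notnull, _default, pk in columns:
        # COUNT(col) skips NULLs, so the gap to COUNT(*) is the number of NULLs.
        non_null = conn.execute(f'SELECT COUNT("{name}") FROM "{table}"').fetchone()[0]
        null_rate = (row_count - non_null) / row_count if row_count else 0.0
        column_facts.append(
            {"name": name, "type": col_type, "primary_key": bool(pk), "null_rate": null_rate}
        )
    return {"table": table, "row_count": row_count, "columns": column_facts}


# Hypothetical sample data, held in memory so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY, email TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO dim_users VALUES (?, ?, ?)",
    [(1, "a@example.com", "US"), (2, None, "DE"), (3, "c@example.com", None), (4, None, "FR")],
)
metadata = harvest_technical_metadata(conn, "dim_users")
```

Here `metadata` records four rows, marks `user_id` as the primary key, and reports a null rate of 0.5 for `email` and 0.25 for `country`.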

**Operational metadata** — the runtime facts: when was this table last updated, how often does the underlying pipeline run, what was the row count at the last load, has this table failed quality checks recently? Operational metadata answers whether the data is current and healthy.
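
One way to picture operational metadata is as a small record per table plus a freshness rule. The sketch below is a minimal example under assumed names: `OperationalMetadata`, `freshness_status`, and the 1.5x grace factor are illustrative choices, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class OperationalMetadata:
    table: str
    last_loaded_at: datetime      # when the pipeline last wrote to the table
    expected_interval: timedelta  # how often the pipeline is supposed to run
    last_row_count: int           # row count recorded at the last load


def freshness_status(meta: OperationalMetadata, now: datetime, grace: float = 1.5) -> str:
    """Return "fresh" if the last load is within the expected interval times a grace factor."""
    age = now - meta.last_loaded_at
    return "fresh" if age <= meta.expected_interval * grace else "stale"


# Hypothetical daily table, last loaded at 02:00 UTC on March 14.
dim_users = OperationalMetadata(
    table="dim_users",
    last_loaded_at=datetime(2028, 3, 14, 2, 0, tzinfo=timezone.utc),
    expected_interval=timedelta(hours=24),
    last_row_count=120_000,
)
next_morning = freshness_status(dim_users, now=datetime(2028, 3, 15, 9, 0, tzinfo=timezone.utc))
a_day_later = freshness_status(dim_users, now=datetime(2028, 3, 16, 9, 0, tzinfo=timezone.utc))
```

With a 24-hour interval and the assumed grace factor, the table counts as fresh 31 hours after its last load (`next_morning`) and stale after 55 hours (`a_day_later`).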

**Logical metadata** — the business context: what does this table represent, what business process generates the "revenue" column, which team owns this dataset, what are the known limitations or caveats? This is the metadata that cannot be inferred from schema — it requires human documentation, though AI-assisted tools can draft it from usage patterns.

**Lineage** — the upstream and downstream relationships: which pipeline produces this table, which tables feed into it, which dashboards consume it. Lineage is a component of catalog metadata; the combination of a catalog and lineage tracking creates a navigable map of the data environment.
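
Lineage is naturally a directed graph, and impact analysis is a walk over it. The following sketch assumes a hypothetical set of assets stored as a plain adjacency dictionary; production catalogs persist lineage in their own stores, but the traversal logic is similar.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets in its list.
LINEAGE = {
    "raw.users": ["staging.stg_users"],
    "staging.stg_users": ["marts.dim_users"],
    "marts.dim_users": ["dashboard.retention", "marts.fct_revenue"],
    "marts.fct_revenue": ["dashboard.board_report"],
}


def downstream_assets(lineage: dict[str, list[str]], start: str) -> list[str]:
    """Breadth-first walk returning every asset that depends on `start`, nearest first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guard against revisiting shared or cyclic paths
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream_assets(LINEAGE, "staging.stg_users")
# impacted == ["marts.dim_users", "dashboard.retention",
#              "marts.fct_revenue", "dashboard.board_report"]
```

Running the same walk over reversed edges would answer the upstream question: which sources feed a given dashboard.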

**Usage metadata** — who queries this table, how often, which dashboards depend on it, which tables are most used and which are abandoned. Usage metadata helps prioritize documentation effort and identify assets that are business-critical despite lacking formal ownership.
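
Usage metadata can be derived from a warehouse's query history. The sketch below works on a simplified, hypothetical log of `(user, table)` pairs; real query logs need SQL parsing to extract table references, which is omitted here.

```python
from collections import Counter

# Hypothetical query-log records: (user, table queried).
QUERY_LOG = [
    ("ana", "dim_users"), ("ben", "dim_users"), ("ana", "fct_revenue"),
    ("cal", "dim_users"), ("ana", "dim_users"),
]
KNOWN_TABLES = ["dim_users", "fct_revenue", "users_final"]


def usage_summary(query_log, known_tables):
    """Count queries and distinct users per table, and list tables nobody queried."""
    query_counts = Counter(table for _user, table in query_log)
    users_by_table = {}
    for user, table in query_log:
        users_by_table.setdefault(table, set()).add(user)
    summary = {
        t: {"queries": query_counts[t], "distinct_users": len(users_by_table.get(t, set()))}
        for t in known_tables
    }
    unused = [t for t in known_tables if query_counts[t] == 0]
    return summary, unused


summary, unused = usage_summary(QUERY_LOG, KNOWN_TABLES)
# summary["dim_users"] == {"queries": 4, "distinct_users": 3}
# unused == ["users_final"]
```

In this toy log, `dim_users` is clearly the table to document first, while `users_final` is a candidate for deprecation.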

Active vs Passive Cataloguing

Most catalogs start as passive — they ingest metadata from connected systems but do not enforce standards. You can search for a table, see its schema and lineage, and read any documentation that exists. This is useful, but its value depends entirely on the quality of documentation that has been added.

Active cataloguing integrates the catalog into workflows. dbt models generate documentation (column descriptions, tests, tags) as part of the transformation build — the catalog pulls this documentation automatically. Data quality check results are published to the catalog, so freshness and quality status are visible alongside schema metadata. Ownership assignments in the catalog trigger notifications when dependent assets fail.
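
To make the dbt integration concrete: dbt writes a `manifest.json` artifact whose `nodes` entries carry model and column descriptions, and a catalog can read it after each build. The sketch below computes documentation coverage per model from a manifest-shaped dictionary. The function name and the sample data are assumptions, and the sample is heavily simplified compared with a real manifest.

```python
def documentation_coverage(manifest: dict) -> dict[str, float]:
    """Share of declared columns with a non-empty description, per dbt model."""
    coverage = {}
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and other node types
        columns = node.get("columns", {})
        if not columns:
            coverage[node["name"]] = 0.0  # no columns declared at all
            continue
        documented = sum(
            1 for col in columns.values() if (col.get("description") or "").strip()
        )
        coverage[node["name"]] = documented / len(columns)
    return coverage


# Simplified stand-in for the parsed contents of target/manifest.json.
sample_manifest = {
    "nodes": {
        "model.analytics.dim_users": {
            "resource_type": "model",
            "name": "dim_users",
            "columns": {
                "user_id": {"name": "user_id", "description": "Surrogate key for a user."},
                "country": {"name": "country", "description": ""},
            },
        },
        "model.analytics.fct_revenue": {
            "resource_type": "model",
            "name": "fct_revenue",
            "columns": {},
        },
        "test.analytics.not_null_dim_users_user_id": {
            "resource_type": "test",
            "name": "not_null_dim_users_user_id",
            "columns": {},
        },
    }
}
coverage = documentation_coverage(sample_manifest)
# coverage == {"dim_users": 0.5, "fct_revenue": 0.0}
```

Only columns declared in the project's YAML files appear in the manifest, so a figure like this measures coverage of declared columns rather than of every column in the warehouse.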

The shift from passive to active cataloguing is what separates a catalog that is genuinely used from one that is populated once and gradually ignored.

Common Data Catalog Tools

**dbt Docs** — for teams running dbt, the built-in documentation site is a functional catalog for the transformation layer. It shows all models, their lineage, column descriptions (when documented), the tests defined on them, and source dependencies. It is limited to dbt-managed assets and does not cover ingestion pipelines, BI dashboards, or undocumented source tables.

**DataHub** (open-source, originally developed at LinkedIn) — comprehensive metadata platform with ingestion connectors for warehouses, databases, BI tools (Tableau, Looker, Power BI), transformation and orchestration tools (dbt, Spark, Airflow), and streaming systems. Provides lineage, schema history, ownership, and search. The most complete open-source option for full-stack coverage.

**Apache Atlas** — widely used in Hadoop/HDP/CDP environments. Strong lineage from Hive, Spark, and Sqoop. Less well-suited to modern cloud warehouse stacks.

**Alation** — commercial data catalog and data intelligence platform. Strong search and AI-assisted documentation, data stewardship workflows, and governance features. Enterprise pricing.

**Collibra** — governance-focused commercial platform with strong policy management, business glossary, and compliance workflows. Common in regulated financial services and healthcare organizations.

**Monte Carlo** — data observability platform that includes catalog features alongside monitoring and alerting. Particularly strong on operational metadata (freshness, row count anomalies, schema changes).

**Atlan** — modern SaaS catalog with strong BI tool integrations, AI-assisted documentation, and collaboration features. Popular with data teams on Snowflake and dbt.

What a Data Catalog Cannot Do Alone

A catalog is a discovery and documentation layer, not a governance enforcement layer. It records what exists and surfaces metadata about it; it does not prevent bad data from being published, enforce schema standards, or make data quality problems visible before they reach users.

Effective data management requires a catalog plus: data quality monitoring (to detect problems when they occur), lineage (to understand impact), access control (to enforce who can see what), and data quality testing at the pipeline level (to prevent bad data from loading in the first place). The catalog becomes the surface through which all of this is visible — but the other components must exist for the catalog to surface meaningful information.
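
Pipeline-level testing, the last item above, is the piece that stops bad data before it loads. A minimal sketch of such a gate follows; `check_batch`, the column names, and the two rules (unique key, no nulls in required columns) are illustrative assumptions, and dedicated testing tools cover far more cases.

```python
def check_batch(rows: list[dict], key: str, required: list[str]) -> list[str]:
    """Return human-readable failures; an empty list means the batch may load."""
    failures = []
    keys = [row.get(key) for row in rows]
    if len(keys) != len(set(keys)):
        failures.append(f"duplicate values in key column '{key}'")
    for column in required:
        missing = sum(1 for row in rows if row.get(column) is None)
        if missing:
            failures.append(f"{missing} null value(s) in required column '{column}'")
    return failures


# Hypothetical batches about to be loaded into dim_users.
good_batch = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": "b@example.com"},
]
bad_batch = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 1, "email": None},
]

good_result = check_batch(good_batch, key="user_id", required=["user_id", "email"])
bad_result = check_batch(bad_batch, key="user_id", required=["user_id", "email"])
# good_result == []
# bad_result == ["duplicate values in key column 'user_id'",
#                "1 null value(s) in required column 'email'"]
```

A pipeline would refuse to load when the list is non-empty and could publish the failures to the catalog, so the table's quality status is visible next to its schema.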

When to Invest in a Catalog

For small data teams (one to three people, one warehouse, a manageable number of tables), informal documentation in dbt and a shared Notion page may be sufficient. The overhead of maintaining a separate catalog tool may not be worth it.

Signals that a catalog is warranted:

- Analysts routinely spend significant time finding the "right" table among many candidates

- Multiple versions of the same metric exist with no clear canonical source

- Key staff turnover creates significant knowledge gaps about what data means and where it lives

- Compliance requirements demand documented data ownership and access controls

- The number of active data assets has grown beyond what informal documentation can cover

