
Metadata Management: Building the Foundation for Data Discovery and Governance

Austin Duncan
Managing Director & Principal Data Architect
August 4, 2027 · 11 min read


Metadata management is the systematic collection, organisation, and maintenance of information about data assets — what exists, where it lives, what it means, who owns it, and how it relates to other data. It is the infrastructure that makes data governance operational rather than aspirational. Without metadata management, data governance programmes produce policies that are difficult to enforce and catalogues that are quickly out of date. With it, governance becomes tractable: ownership is discoverable, sensitive data is findable, lineage is traceable, and impact analysis is fast.

The Types of Metadata

Metadata falls into three categories, each serving different use cases:

**Technical metadata** describes the physical structure of data: table and column names, data types, row counts, partition structure, storage location, and schema history. Technical metadata is typically generated automatically by the data platform (warehouses, lakes, and streaming platforms expose it through system tables and APIs) and should not require manual maintenance. It is the foundation — without accurate technical metadata, everything else is built on uncertain ground.
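As a minimal sketch of what automatic collection of technical metadata can look like, the snippet below reads table names, column types, and row counts from a database's own system tables. It uses an in-memory SQLite database purely as a stand-in for a warehouse; the `harvest_technical_metadata` function and the `orders` table are illustrative assumptions, and a real platform would expose the same information through its own system views or APIs.

```python
import sqlite3


def harvest_technical_metadata(conn: sqlite3.Connection) -> list[dict]:
    """Collect table names, column names and types, and row counts."""
    assets = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = [
            {"name": row[1], "type": row[2]}
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        assets.append({"table": table, "columns": columns, "row_count": row_count})
    return assets


# Hypothetical example table standing in for a warehouse asset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, placed_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 9.5, '2024-01-01')")

technical_metadata = harvest_technical_metadata(conn)
```

Because everything here comes from the platform itself, a job like this can run on a schedule and overwrite the previous snapshot, which is what keeps technical metadata free of manual upkeep.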

**Business metadata** describes what data means in business terms: field definitions, business context, owner, domain, and any relevant business rules. This is the metadata that makes data discoverable and usable for new consumers. It requires human input — it cannot be generated automatically — and it degrades over time as the business evolves without the metadata being updated. The primary challenge in metadata management programmes is maintaining business metadata currency, not collecting it initially.

**Operational metadata** describes how data assets are used: who accesses them, how often, from which tools, in which queries. Query logs, access patterns, and BI platform usage statistics are the primary sources. Operational metadata serves several functions: identifying the most actively used and therefore most critical assets; finding data that appears to be abandoned (potentially a candidate for deprecation); and understanding the blast radius of a proposed change (which users and tools depend on a specific table or field).
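The three uses of operational metadata described above can be sketched from a simple query log. The log format, the table names, and the 90-day idle threshold below are all assumptions made for illustration; real query logs vary by platform and usually need parsing before they reach this shape.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical query-log rows: (table, user, tool, day of access).
QUERY_LOG = [
    ("sales.orders", "ana", "looker", date(2024, 6, 1)),
    ("sales.orders", "ben", "notebook", date(2024, 6, 3)),
    ("sales.orders", "ana", "looker", date(2024, 6, 4)),
    ("legacy.orders_v1", "cal", "notebook", date(2023, 11, 2)),
]


def summarise_usage(log):
    """Aggregate query count, distinct users and tools, and last access per table."""
    summary = defaultdict(
        lambda: {"queries": 0, "users": set(), "tools": set(), "last_access": None}
    )
    for table, user, tool, day in log:
        stats = summary[table]
        stats["queries"] += 1
        stats["users"].add(user)
        stats["tools"].add(tool)
        if stats["last_access"] is None or day > stats["last_access"]:
            stats["last_access"] = day
    return dict(summary)


def abandoned(summary, today, idle_days=90):
    """Tables whose most recent access is older than the idle threshold."""
    cutoff = today - timedelta(days=idle_days)
    return sorted(t for t, s in summary.items() if s["last_access"] < cutoff)


usage = summarise_usage(QUERY_LOG)
stale_tables = abandoned(usage, today=date(2024, 6, 30))
```

The `users` and `tools` sets for a table give a first approximation of its blast radius. Tables that never appear in the log are not covered by this sketch; finding them would require comparing the log against the harvested list of all tables.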

The Data Catalogue

A data catalogue is the interface through which metadata is exposed to users — the searchable directory of what data exists, what it means, and how to access it. The catalogue is only as good as the metadata behind it; catalogues populated with automatically harvested technical metadata and sparse business metadata are useful for engineers but not for analysts or business users who need semantic context.

Building a useful catalogue requires a deliberate metadata collection workflow:

**Automated harvest** from the warehouse, data lake, and BI tools for technical metadata: schemas, row counts, last-updated timestamps, access logs. This should run continuously without manual intervention.

**Domain ownership assignment** — mapping every data asset to an owning team. This is often the most politically difficult part of metadata management: assets that are used by many teams may have unclear ownership. The practical resolution is to distinguish between producers (the team responsible for the quality and currency of the data) and consumers (teams that use but do not maintain it), and to require every asset to have an identified producer owner.

**Business definitions for high-value assets**: prioritising semantic documentation for the datasets that are most widely used and most likely to be misunderstood. Attempting to document everything at launch is how metadata initiatives stall. As a rough 80/20 heuristic, documenting the most heavily used fifth of assets tends to deliver most of the governance value.

**Review and decay management** — a process for periodically reviewing and updating business metadata, and for deprecating assets that are no longer used. Metadata that is never reviewed develops a large gap between what it says and what is true, which erodes user trust in the catalogue.
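The workflow steps above can be expressed as three simple checks over catalogue entries. This is a sketch under stated assumptions: the `CatalogueEntry` fields, the 20% cut-off, and the 180-day review interval are hypothetical choices, not features of any particular catalogue tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class CatalogueEntry:
    name: str
    monthly_queries: int
    producer_owner: Optional[str] = None
    definition: Optional[str] = None
    last_reviewed: Optional[date] = None


def missing_owner(entries):
    """Assets with no identified producer owner."""
    return [e.name for e in entries if not e.producer_owner]


def documentation_backlog(entries, top_fraction=0.2):
    """Undocumented assets among the most heavily used fraction."""
    ranked = sorted(entries, key=lambda e: e.monthly_queries, reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))
    return [e.name for e in ranked[:top_n] if not e.definition]


def due_for_review(entries, today, max_age_days=180):
    """Documented assets whose definition has not been reviewed recently."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        e.name
        for e in entries
        if e.definition and (e.last_reviewed is None or e.last_reviewed < cutoff)
    ]


# Hypothetical catalogue contents.
entries = [
    CatalogueEntry("sales.orders", 900, "sales-eng", "One row per order", date(2023, 1, 10)),
    CatalogueEntry("sales.customers", 1200, "sales-eng"),
    CatalogueEntry("ops.shipments", 300),
    CatalogueEntry("finance.invoices", 150, "finance-data", "One row per invoice", date(2024, 5, 1)),
    CatalogueEntry("tmp.scratch", 2),
]
today = date(2024, 6, 30)

unowned = missing_owner(entries)
backlog = documentation_backlog(entries)
review_queue = due_for_review(entries, today)
```

Running checks like these on a schedule turns ownership assignment, prioritised documentation, and review into standing work queues rather than one-off launch tasks.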

Metadata and Data Discovery

The business value of metadata management is primarily realised through data discovery: analysts and data scientists finding data assets they did not know existed, rather than building new assets that duplicate existing ones.

Data discovery requires three things:

**Search** — the ability to find data assets by name, description, domain, or keyword. Most catalogue tools provide this; the quality of search results depends on the completeness and accuracy of the business metadata.

**Lineage** — the ability to navigate from a dashboard metric back through the transformation layers to the source data, or forward from a source table to all the downstream assets that depend on it. Lineage enables both impact analysis (what breaks if I change this?) and quality investigation (why does this metric look wrong?).

**Trust signals**: indicators that help users assess whether a data asset is appropriate for their use case. Certified or endorsed status (the asset has been reviewed and validated by the owning team), last-updated timestamp, known quality issues, and documentation completeness all serve as trust signals. Without trust signals, users fall back on asking the data team directly instead of self-serving from the catalogue, which defeats the purpose.
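The two directions of lineage navigation can be illustrated with a small graph traversal. The asset names and the edge map below are hypothetical; in practice a catalogue tool would derive the edges from transformation code or query logs, but the traversal logic is the same.

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> assets built directly from it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue"],
    "staging.customers": ["marts.revenue"],
    "marts.revenue": ["dashboard.monthly_revenue"],
}


def invert(edges):
    """Flip the edge direction so each asset maps to its direct sources."""
    reverse = {}
    for parent, children in edges.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return reverse


def reachable(edges, start):
    """Breadth-first search returning every asset reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def downstream(asset):
    """Impact analysis: what could break if this asset changes?"""
    return reachable(LINEAGE, asset)


def upstream(asset):
    """Quality investigation: which sources feed this asset?"""
    return reachable(invert(LINEAGE), asset)


impacted = downstream("raw.orders")
sources = upstream("dashboard.monthly_revenue")
```

Here `impacted` answers "what breaks if I change `raw.orders`?" and `sources` answers "where could a wrong number on the revenue dashboard have come from?".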

Tooling Considerations

The metadata management tooling landscape includes:

**Open-source catalogues** — Apache Atlas (enterprise-grade, complex to operate), DataHub (LinkedIn-origin, strong lineage, active community), OpenMetadata (newer, API-first, integrates with most modern data stack tools). All provide automated ingestion from common data platforms and a catalogue UI.

**Commercial catalogues** — Alation, Collibra, Atlan. Higher total cost; typically stronger governance workflow features (approval processes, stewardship workflows, policy management) that open-source tools do not fully cover.

**Warehouse-native catalogues**: Snowflake's built-in data governance features, and BigQuery's integration with Google Cloud's catalogue service (Data Catalog, since folded into Dataplex). These are limited to the specific platform and are insufficient on their own for multi-platform environments.

For most organisations, the practical starting point is a lightweight open-source catalogue (DataHub or OpenMetadata) with automated ingestion from the warehouse and BI tools, manual business metadata entry for high-priority assets, and an ownership registry maintained in the catalogue. Catalogue tooling can be upgraded when the use cases and user base justify it; starting with a simpler tool and building usage habits is more valuable than starting with an enterprise platform that is underused.

Our data architecture and data governance practice helps organisations implement metadata management programmes that produce usable catalogues and operational governance — contact us to discuss your metadata management strategy.
