Enterprise Data Catalog: Implementation Guide for Large Organisations


An enterprise data catalog makes data discoverable, documented, and governed at scale. Without one, data knowledge lives in individuals' heads, Confluence pages that are never updated, and Slack threads that become unsearchable six months later. When a new analyst joins and asks "where do I find customer revenue data?", they get a different answer from every person they ask.

This guide covers the implementation decisions: tool selection, metadata ingestion strategy, the stewardship model, and the governance workflows that make a catalog actually used rather than shelfware.

What a data catalog actually needs to do

Before selecting a tool, define what your catalog must provide:

**Discovery**: An analyst searching "customer churn" should find the churn model, the churn fact table, the churn calculation definition, and the reports that surface churn metrics. Search across table names, column names, descriptions, and tags.
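As a rough illustration of what searching across names, columns, descriptions, and tags means, here is a minimal sketch. The `Asset` class, the sample catalog entries, and the substring-count scoring are all hypothetical; real catalog tools use proper search indexes and relevance ranking.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Hypothetical catalog entry; real tools store far richer metadata."""
    name: str
    description: str = ""
    columns: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)

    def searchable_text(self) -> str:
        # Combine every searchable field into one lowercase string.
        return " ".join([self.name, self.description, *self.columns, *self.tags]).lower()


def search(assets, query):
    """Return names of assets matching every query term, most mentions first."""
    terms = query.lower().split()
    hits = []
    for asset in assets:
        text = asset.searchable_text()
        if terms and all(term in text for term in terms):
            score = sum(text.count(term) for term in terms)
            hits.append((score, asset.name))
    # Highest score first; ties broken alphabetically for stable output.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [name for _, name in hits]


# Illustrative sample data only.
CATALOG = [
    Asset("fct_churn", "Monthly customer churn facts", ["customer_id", "churned_at"], ["churn", "certified"]),
    Asset("churn_model", "Predicts churn probability per customer", ["customer_id", "churn_score"], ["ml"]),
    Asset("dim_product", "Product dimension", ["product_id"], ["core"]),
]

print(search(CATALOG, "customer churn"))
```

The point of the sketch is that a match in a column name or tag counts just as much as a match in the table name, so poorly named tables remain findable if they are documented.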

**Context**: Not just "this table exists" but "this table is the source of truth for revenue, it is owned by the data team, it is refreshed daily at 6am, and it should be preferred over the legacy finance_revenue table which is deprecated."

**Lineage**: Where does this data come from? What reports or models depend on it? If I change the revenue calculation in this model, what downstream assets will be affected?
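The downstream-impact question is a graph traversal. The sketch below assumes lineage is available as a simple mapping from each asset to the assets it feeds; the asset names and the `LINEAGE` structure are made up for illustration, and catalog tools expose lineage through their own APIs rather than a dictionary like this.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
LINEAGE = {
    "raw.orders": ["stg_orders"],
    "stg_orders": ["fct_revenue"],
    "fct_revenue": ["revenue_dashboard", "churn_model"],
    "churn_model": ["churn_report"],
}


def downstream_assets(asset, edges):
    """Return every asset reachable downstream of `asset`, in breadth-first order."""
    seen = {asset}  # guards against cycles and duplicate visits
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Which assets are affected if the revenue calculation changes?
print(downstream_assets("fct_revenue", LINEAGE))
```

Breadth-first order lists direct dependents before indirect ones, which is usually the order in which owners need to be notified.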

**Quality indicators**: Is this data trustworthy? When was it last refreshed? What quality tests passed or failed last run? Is it certified or just a work-in-progress?

**Access**: Who can see this data? How do I request access if I do not have it?

A catalog that provides discovery and context but not lineage or quality indicators is a documentation tool, not a governance tool. The cost of an enterprise catalog is usually justified only when the tool delivers all five capabilities.

Tool selection framework

The leading commercial catalog tools (Collibra, Alation, Atlan, Microsoft Purview) and the main open-source options have different strengths:

**Collibra**: Strongest governance workflow engine. Policy management, stewardship workflows, issue tracking, data quality rules with remediation workflows. Requires significant configuration — Collibra is a platform, not a plug-and-play product. Most appropriate for heavily regulated industries where governance workflows must be documented and auditable. Expensive.

**Alation**: Strongest query-based intelligence. Alation mines query history from connected databases to automatically identify popular tables, common column usage, and query patterns. Its "Query Intelligence" feature, which surfaces the queries other users have written against a table, is a key differentiator. Good for organisations with heavy SQL analyst populations.

**Atlan**: Strongest modern data stack integration. Native connectors for dbt (imports dbt model descriptions and lineage from manifest.json), Airflow, Fivetran, Looker, Tableau, and Snowflake. Active metadata — lineage and usage statistics are updated automatically from tool integrations rather than requiring manual maintenance. Lower implementation overhead than Collibra or Alation. Growing fast; the product has matured significantly.

**Microsoft Purview**: Strongest fit for the Microsoft ecosystem. Native integration with Azure Storage, Synapse, Fabric, SQL Server, Power BI, and Office 365. If your organisation's data estate is primarily on Azure and Microsoft tools, Purview provides catalog, lineage, and classification with minimal incremental cost (included in many Azure licensing tiers). Not competitive for non-Microsoft environments.

**OpenMetadata / DataHub**: Open-source options. Lower licensing cost, but they require engineering investment to deploy and maintain. DataHub originated at LinkedIn and has a large community. OpenMetadata is newer, with a strong connector ecosystem. Appropriate for organisations with the engineering capacity to run the platform and an aversion to commercial tool licensing.

Metadata ingestion strategy

A catalog is only as good as its metadata. The two approaches:

**Automated ingestion (pull)**: The catalog connector scans connected systems — Snowflake, BigQuery, Redshift, dbt, Airflow, Tableau — and extracts table schemas, column definitions, lineage, query history, and data quality test results. This keeps metadata current without manual effort. Set automated scans to run on a schedule (daily for schemas and lineage; near-real-time for usage statistics).

**Manual curation (push)**: Human-authored descriptions, business context, stewardship ownership, sensitivity classifications, and certification status. This cannot be automated — it requires the data team and domain stewards to author and maintain content.

In practice: automate everything automatable (schema, lineage, statistics, test results, query history); require humans only for the content that requires judgment (business descriptions, ownership assignment, certification decisions).

The biggest catalog failure mode is treating manual curation as the primary ingestion method. If descriptions are manually authored and not updated, they rot. If lineage is manually drawn and not automatically maintained, it becomes wrong when pipelines change. Automate first; curate on top.
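One way to picture "automate first; curate on top" is a merge rule in which scheduled scans may overwrite machine-derived fields but never touch human-authored ones. The field names and the `apply_scan` helper below are hypothetical, a sketch of the principle rather than any vendor's ingestion API.

```python
# Hypothetical set of fields that an automated scan is allowed to overwrite.
# Everything else (description, owner, sensitivity, certification) is curated.
AUTOMATED_FIELDS = {"columns", "row_count", "last_refreshed", "upstream"}


def apply_scan(entry, scan_result):
    """Return a new catalog entry with automated fields refreshed.

    Curated fields are left untouched even if the scan reports a value for them.
    """
    updated = dict(entry)
    for key, value in scan_result.items():
        if key in AUTOMATED_FIELDS:
            updated[key] = value
    return updated


# Illustrative data: a curated entry and the output of a nightly scan.
entry = {
    "description": "Source of truth for revenue",
    "owner": "data-team",
    "columns": ["id", "amount"],
    "row_count": 100,
}
scan = {
    "columns": ["id", "amount", "currency"],
    "row_count": 120,
    "description": "",  # scanners often return empty descriptions
}

refreshed = apply_scan(entry, scan)
print(refreshed)
```

With this split, a schema change shows up in the catalog the next day without anyone editing it, and the steward's description survives the refresh.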

Stewardship model

A data catalog without assigned stewards is a searchable graveyard of undocumented tables. Stewardship assigns human responsibility for catalog content:

**Domain stewards**: Business domain experts (Finance, Marketing, Operations) responsible for the business-layer descriptions, metric definitions, and certification decisions within their domain. They answer "what does this metric mean for the business?"

**Technical stewards**: Data engineers or analytics engineers responsible for the transformation logic, technical implementation notes, and lineage within their domain models. They answer "how is this metric calculated?"

**Data owners**: Senior stakeholders accountable for a data domain's accuracy and completeness. Not day-to-day stewards, but approvers for certification decisions and escalation points for quality issues.

Define the stewardship matrix before deploying the catalog: which data domains exist, who is the domain steward and owner for each, and what they are expected to maintain. A catalog with documented stewardship assignments is governable; a catalog with anonymous undocumented assets is not.
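A stewardship matrix can be kept as plain structured data and checked for gaps before launch. The domains, people, and role keys below are invented for illustration; in practice the matrix would live in the catalog tool or a governance register.

```python
# Hypothetical stewardship matrix: one row per data domain, one column per role.
STEWARDSHIP = {
    "finance": {"domain_steward": "a.finance", "technical_steward": "b.engineer", "owner": "cfo"},
    "marketing": {"domain_steward": "c.marketing", "technical_steward": None, "owner": "cmo"},
    "operations": {"domain_steward": None, "technical_steward": None, "owner": None},
}

REQUIRED_ROLES = ("domain_steward", "technical_steward", "owner")


def unassigned_roles(matrix):
    """Return {domain: [missing roles]} for every domain with a gap."""
    gaps = {}
    for domain, roles in matrix.items():
        missing = [role for role in REQUIRED_ROLES if not roles.get(role)]
        if missing:
            gaps[domain] = missing
    return gaps


print(unassigned_roles(STEWARDSHIP))
```

Running a check like this as a pre-deployment gate makes unowned domains visible before they become anonymous assets in the catalog.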

Certification workflow

Certified data assets signal that a table, metric, or report has been reviewed, meets quality standards, and is endorsed as the authoritative source for its business concept. The certification workflow:

1. Asset author (analytics engineer) submits a certification request with documentation complete

2. Domain steward reviews: business description accurate, metric definition agreed, known limitations documented

3. Data owner approves certification

4. Asset receives the Certified badge in the catalog

Only a handful of assets should be Certified in most catalogs — the core dimensional models, the primary metric definitions, the canonical reporting tables. Certification is not a quality score; it is an endorsement of authoritative status.
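The four steps above amount to a small state machine in which each transition is reserved for one role. The sketch below uses hypothetical status and role names; commercial catalogs implement this through their own workflow engines.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    SUBMITTED = "submitted"
    STEWARD_REVIEWED = "steward_reviewed"
    CERTIFIED = "certified"


# (current status, acting role) -> next status. Role names are illustrative.
TRANSITIONS = {
    (Status.DRAFT, "author"): Status.SUBMITTED,
    (Status.SUBMITTED, "domain_steward"): Status.STEWARD_REVIEWED,
    (Status.STEWARD_REVIEWED, "data_owner"): Status.CERTIFIED,
}


def advance(status, role):
    """Move an asset to the next certification state, or refuse if the role may not act."""
    try:
        return TRANSITIONS[(status, role)]
    except KeyError:
        raise PermissionError(
            f"{role} cannot advance an asset in state {status.value}"
        ) from None


# Walk one asset through the full workflow.
status = Status.DRAFT
for role in ("author", "domain_steward", "data_owner"):
    status = advance(status, role)
print(status)
```

Because the table has no shortcut from draft to certified, an author cannot certify their own asset, and the data owner cannot approve something the domain steward has not reviewed.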

Making the catalog actually used

The most common catalog implementation failure is low adoption — the catalog is populated, launched, and then ignored. Drivers of adoption:

**Catalog as the first stop for data questions**: Mandate that the answer to "where is the customer data?" is a catalog link, not a Slack message. Record the answers to the most common data questions as catalog entries, and link to those entries from onboarding documentation.

**Integrate with the workflow**: The catalog should be accessible from the tools analysts already use. Atlan has a Chrome extension; Alation integrates with SQL editors; Collibra integrates with Jira. Catalog access within existing tools reduces the friction of switching context.

**Make staleness visible**: Show the last-updated date on every catalog asset. A table description last updated 18 months ago is visibly stale. Stewards who can see that their assets are stale are far more likely to update them; stewards who cannot see the age of their content will not.
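A staleness report can be as simple as comparing each asset's last-updated date with a cutoff. The 180-day threshold, the asset names, and the dates below are assumptions chosen for illustration; pick a threshold that matches how quickly your domains change.

```python
from datetime import date, timedelta


def stale_assets(last_updated, today, max_age_days=180):
    """Return (asset, age_in_days) pairs older than the threshold, oldest first.

    `max_age_days` is a hypothetical default, not a recommended standard.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = [
        (name, (today - updated).days)
        for name, updated in last_updated.items()
        if updated < cutoff
    ]
    return sorted(stale, key=lambda item: -item[1])


# Illustrative data: when each asset's description was last edited.
LAST_UPDATED = {
    "fct_revenue": date(2024, 5, 20),
    "dim_customer": date(2022, 12, 1),
    "legacy_finance_revenue": date(2023, 6, 15),
}

for name, age in stale_assets(LAST_UPDATED, today=date(2024, 6, 1)):
    print(f"{name}: description last updated {age} days ago")
```

Sending each steward only their own rows from a report like this keeps the list short enough to act on.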

For the tooling context, see data catalog tools. Our data architecture consulting practice designs and implements enterprise catalog programs — book a free consultation to discuss your data governance maturity.
