BlogData Architecture

Data Lineage: What It Is, Why It Matters, and How to Implement It

Austin Duncan
Austin Duncan
Managing Director & Principal Data Architect
·May 25, 202610 min read

Data lineage tracks where data comes from, how it has been transformed, and where it flows downstream. It has always mattered for data governance. AI makes it non-negotiable. Here is what lineage is, how it works technically, and what good lineage implementation looks like in practice.

The quick answer

Data lineage is the documented record of where a piece of data came from, what transformations it went through, and where it flows to downstream. It answers the question: "where did this number come from?" When a finance analyst disputes a revenue figure, lineage tells you which pipeline produced it, which source system it originated from, and which transformation logic was applied. When a regulator asks how a model's output was derived, lineage is the audit trail. When an AI system produces a recommendation, lineage determines whether the data feeding it can be trusted. Lineage has always been valuable. AI governance and data regulation have made it non-negotiable.

What data lineage tracks

Lineage operates at multiple levels of granularity:

**Pipeline-level lineage** — which pipelines flow from which sources to which destinations. System A feeds Pipeline 1, which writes to Table B, which feeds Dashboard C. This is the coarsest level — useful for impact analysis ("if Source A changes, what downstream assets are affected?") but not sufficient for debugging or compliance.

**Column-level lineage** — which source column produced which destination column, through which transformation. The revenue_usd column in the Gold layer was derived from the amount column in the raw transaction table, converted using the exchange_rate column from the currency dimension, filtered to exclude refunded transactions, and aggregated by month. Column-level lineage is what compliance and debugging actually require.

**Transformation-level lineage** — the specific code or query that performed each transformation step. Column A → transformation logic → Column B. For dbt-based pipelines, this is captured automatically from the dbt model SQL.

**Record-level lineage** — which source records contributed to which destination records. The most granular level — needed for GDPR right-to-erasure compliance (to find and delete a specific person's data across all downstream tables) and for fraud investigation (to trace a specific transaction through all derived datasets). Record-level lineage is the most expensive to capture and store at scale.

Most enterprise lineage implementations capture pipeline-level and column-level lineage. Record-level lineage is reserved for datasets with specific regulatory requirements.

Why lineage matters now more than ever

**AI governance.** AI systems that make business decisions need explainable, trustworthy inputs. Regulators and boards increasingly require that AI outputs be traceable to their data sources — that the data feeding the model is documented, governed, and of known quality. Without lineage, the AI system is a black box that consumes unknown data. With lineage, every input can be traced to its origin, its quality checks, and its transformation history.

**Data regulation.** GDPR Article 30 (Records of Processing Activities), HIPAA's audit requirements, PCI DSS's data flow documentation, and SOC 2's change management controls all require that organisations know where data comes from and where it goes. Data lineage is the technical foundation for demonstrating compliance. Without it, responding to a regulatory inquiry requires manual investigation that takes weeks and produces incomplete answers.

**Impact analysis for change management.** When a source system changes — a database schema is updated, a field is renamed, a new source replaces an old one — lineage identifies every downstream pipeline, table, and dashboard that will be affected. Without lineage, changes are deployed and problems are discovered when dashboards break and analysts complain. With lineage, impact is assessed before deployment and changes are made in the right sequence.

**Debugging data quality issues.** When a number is wrong in a dashboard, the debugging process without lineage is: identify the dashboard, trace the workbook's data source, identify the query, investigate the source table, check the pipeline that populates it, check the upstream source. This can take hours. With column-level lineage, the path is visible in a catalogue tool — click on the metric, see every transformation step back to the raw source.

How lineage is captured

Lineage can be captured through different mechanisms depending on the tools in the stack.

**Automatic lineage from dbt.** dbt captures column-level lineage automatically from the SQL in dbt models. When a dbt model references another model's column, dbt records that dependency. dbt's lineage graph is visible in dbt Docs and is available via the dbt API for ingestion into a data catalogue. For organisations using dbt for transformation, a significant portion of column-level lineage is captured without additional work.

**API-based lineage from Tableau and Power BI.** Tableau's Metadata API exposes lineage between Tableau published data sources and workbooks — which workbooks consume which data sources, which fields map to which database columns. Power BI's REST API exposes semantic model lineage. These APIs allow data catalogue tools to ingest BI-layer lineage automatically.

**Open Lineage for pipeline orchestration.** OpenLineage is an open standard for lineage metadata from pipeline orchestration tools. Apache Airflow, Spark, dbt, and other tools emit OpenLineage events that can be ingested by compatible catalogues (Marquez, Microsoft Purview, Atlan). For organisations with diverse orchestration tooling, OpenLineage provides a common format that avoids building separate integrations per tool.

**Manual lineage documentation.** For legacy pipelines, undocumented SQL stored procedures, or SSIS packages without API-based lineage capture, lineage must be documented manually — mapping source columns to destination columns by reading and understanding the transformation logic. This is slow and becomes stale quickly as code changes. Manual documentation should be a temporary measure while migrating to automated lineage capture.

**Lineage inference from query logs.** Some catalogue tools (Monte Carlo, Atlan) infer lineage from query logs — observing which tables and columns are referenced together in queries over time and inferring relationships from those patterns. This approach can surface lineage for assets that are not covered by other automated methods, but the inferred lineage may be incomplete or incorrect. Useful as a supplement to explicit lineage capture, not as a primary mechanism.

Where lineage lives: the data catalogue

Lineage metadata is most useful when it is accessible through a data catalogue — a searchable, browsable inventory of an organisation's data assets with lineage, quality, and ownership information attached.

For organisations on Azure, Microsoft Purview is the natural lineage catalogue — it connects to Azure Data Factory, Azure Synapse, SQL databases, and Power BI to capture lineage automatically, and supports scanning of on-premise sources. Purview's lineage graph is navigable in the Azure portal, with upstream and downstream views for any selected asset.

For organisations with diverse cloud and on-premise tooling, Atlan, Alation, and Collibra are widely-used catalogues that support lineage ingestion from a broader set of sources (dbt, Airflow, Tableau, Power BI, Snowflake, Databricks) and provide more sophisticated lineage search and impact analysis features.

For dbt-centric organisations, dbt's built-in lineage graph (dbt Docs) provides a visual lineage view for the transformation layer, though it does not extend to the source ingestion layer or the BI consumption layer without integration with a broader catalogue.

Column-level lineage in practice

Column-level lineage is the level of granularity that most compliance and debugging use cases require. Here is what it looks like in practice in a dbt-based architecture:

A Gold layer metric — say, net_revenue_usd in the finance.revenue_summary model — is traceable through:

1. **finance.revenue_summary** (Gold layer, dbt model): net_revenue_usd = SUM(gross_amount_usd) - SUM(refund_amount_usd), filtered to transaction_status = 'completed'

2. **finance.orders_silver** (Silver layer, dbt model): gross_amount_usd = raw_amount * exchange_rate, from the bronze ingestion + currency lookup join

3. **raw.transactions** (Bronze layer): raw_amount ingested from the payments processor API, loaded via Fivetran, refreshed hourly

4. **Payments Processor API** (source system): transaction records from Stripe, scoped to the production account

A data analyst, a compliance officer, or an AI governance team can follow this chain end-to-end. If the net_revenue_usd figure is disputed, the debug path is clear. If a payment is later flagged as fraudulent and needs to be excluded, every downstream asset that contains it can be identified and recalculated.

Common lineage implementation mistakes

**Capturing pipeline lineage but not column lineage.** Knowing that Pipeline A feeds Table B is useful for impact analysis but insufficient for debugging and compliance. Column-level lineage — which specific columns flow where — is what most governance use cases require. If you are investing in a lineage capability, invest in column-level from the start.

**Lineage in the catalogue that does not reflect current state.** Lineage that is captured once and not updated becomes misleading quickly. Source schemas change. dbt models are refactored. New pipelines are added without updating the catalogue. Lineage must be refreshed automatically (via API) or through a disciplined process triggered by code changes. Stale lineage is worse than no lineage in some respects — it misleads investigators and auditors.

**Building lineage tooling before lineage governance.** A data catalogue with lineage requires a governance process: someone reviews new data sources and ensures they are catalogued, pipeline changes trigger lineage refresh, quality issues are flagged in the catalogue. Without these processes, the catalogue becomes out-of-date and unused. Tool before process produces an expensive empty catalogue.

**Treating BI-layer lineage as optional.** The lineage path from source system to BI report is only complete if the BI layer is included. Lineage that stops at the database table and does not continue to the Tableau workbook or Power BI report misses the final consumption step — the one that is most visible to business stakeholders and most likely to be questioned.

FAQs

What is the difference between data lineage and data provenance?

Data lineage and data provenance are closely related terms that are sometimes used interchangeably. In most enterprise data contexts, lineage refers to the documented flow of data through a system (source → pipeline → destination), while provenance refers more specifically to the origin and history of a particular data record or dataset. Provenance is often used in scientific and academic contexts. For enterprise data governance purposes, lineage is the more common term and covers the same concept.

Does dbt provide complete lineage?

dbt provides lineage for the transformation layer — the dbt models that transform data from source to destination within the data warehouse or lakehouse. It does not provide lineage for: the ingestion layer (how data gets from source systems into the Bronze layer), or the BI consumption layer (how BI tools consume warehouse data). For complete end-to-end lineage, dbt's lineage graph needs to be supplemented with ingestion tool lineage (Fivetran, ADF) and BI tool lineage (Tableau Metadata API, Power BI REST API), typically via a data catalogue that integrates all three.

How does lineage support GDPR compliance?

GDPR requires organisations to know what personal data they hold, where it came from, and where it flows to — the "records of processing activities" requirement under Article 30. Data lineage, when it covers personal data flows specifically, provides the technical foundation for this requirement. When a data subject submits a right-to-erasure request, lineage identifies all downstream tables and systems that hold a copy of that person's data, enabling complete erasure. Without lineage, erasure is manual and incomplete.

What is business lineage vs technical lineage?

Technical lineage traces data at the system level — pipelines, tables, columns, transformation code. Business lineage translates this into business terms — "the revenue figure in the Q1 board report comes from the orders system, filtered to completed transactions, converted at monthly average exchange rates, and excludes returns processed after the quarter end." Business lineage is what executives and compliance teams need; technical lineage is what data engineers and systems need. Both are best maintained in the same catalogue — the business lineage layer sits on top of the technical lineage layer, using the same underlying metadata.

Our data architecture consulting practice designs and implements lineage capabilities as part of data governance programmes — from dbt-based column lineage for dbt-first architectures to enterprise catalogue implementations using Microsoft Purview or Atlan. If your organisation needs lineage for compliance, AI governance, or debugging efficiency, book a free 30-minute audit and we will assess what lineage coverage you currently have and what is missing.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →