BlogData Architecture

Data Lineage: Tools, Approaches, and Why It Matters for Data Quality

Austin Duncan
Austin Duncan
Managing Director & Principal Data Architect
·June 17, 202712 min read

Data lineage answers the question every data team gets asked eventually: where does this number come from? When a metric looks wrong, when a regulatory audit requires proof of data provenance, when a schema change breaks a downstream dashboard — lineage is the map that makes these investigations tractable rather than requiring days of manual tracing.

Data lineage is the record of how data moves through a system: from its source, through transformations, to its final form in a dashboard, report, or ML model. At its most basic, lineage answers the question "where does this data come from?" At its most useful, it answers: "which upstream changes will affect this output?" and "which downstream outputs depend on this source?"

Lineage matters in three practical scenarios that arise in every data organisation. First, when a business user challenges a metric — "this revenue number looks wrong" — lineage enables a rapid investigation rather than hours of manual tracing. Second, when a schema change is made upstream, lineage shows the blast radius: which pipelines, data sources, and dashboards are affected. Third, for regulatory requirements (SOC 2, GDPR, financial audit, HIPAA), data lineage documents data provenance and flow as evidence of data governance controls.

Column-Level vs Table-Level Lineage

There are two granularity levels for lineage:

**Table-level (coarse-grained) lineage** records that Table B was derived from Table A. This is sufficient for blast-radius analysis: if Table A changes, Table B may be affected. It is not sufficient for impact analysis at the column level: you cannot tell which specific columns in Table B depend on which columns in Table A.

**Column-level (fine-grained) lineage** records that the 'revenue' column in 'mart_orders' is derived from the 'amount' column in 'raw_transactions' through a transformation chain. This is necessary for precise impact analysis and for satisfying regulatory requirements about specific field provenance.

Column-level lineage is harder to capture. It requires either parsing the SQL or code that produces each transformation, or instrumenting the transformation runtime to capture column-level reads and writes. Both approaches have limitations for complex transformations (subqueries, dynamic SQL, complex Python transformations).

Lineage in dbt

dbt captures table-level lineage automatically from the DAG it builds when parsing model dependencies. The 'ref()' and 'source()' macros declare dependencies explicitly; dbt's 'docs generate' command produces a full DAG visualisation of all models and their relationships.

dbt's lineage is accurate and always up-to-date because it is derived from the model code itself — not from runtime tracking. If model B references model A with 'ref("model_a")', the lineage edge exists regardless of whether the models have been run.

For column-level lineage within dbt, the dbt Metadata API (available in dbt Cloud) provides column-level lineage for models where dbt can parse the SQL and trace column references. Complex transformations (macros that generate SQL dynamically, models that use SQL the parser cannot trace) produce incomplete column-level lineage.

Lineage in Data Catalogs

Dedicated data catalog tools (Atlan, Alation, Collibra, DataHub, OpenMetadata) capture and display lineage alongside other metadata — descriptions, ownership, quality scores, usage statistics.

These tools capture lineage through multiple mechanisms:

- **API-based capture**: Query the data warehouse information schema and compare source-target table relationships in transformation jobs

- **Query log parsing**: Parse the query history from the warehouse to infer which tables read data from which other tables

- **Native connector integration**: Connect to dbt, Spark, Airflow, and other pipeline tools to capture lineage from their native metadata rather than inferring it from query logs

- **Manual curation**: Allow data owners to document lineage relationships that automated capture misses

Query log-based lineage is the most common approach for capturing lineage across heterogeneous environments (multiple tools, languages, platforms). The limitation is that query log parsing infers lineage from historical execution rather than from the authoritative definition of the transformation — meaning missing or incomplete logs produce gaps.

OpenLineage: The Open Standard

OpenLineage is an open standard for lineage metadata, sponsored by the Linux Foundation and now supported by Airflow, Spark, dbt, Flink, and several catalog tools. The standard defines a common event format that pipeline tools emit when they run — a "START" event when a job begins, a "COMPLETE" event with input and output datasets when it finishes.

Tools that emit OpenLineage events (Airflow with the OpenLineage plugin, Spark with the OpenLineage Spark integration) produce lineage metadata automatically that any OpenLineage-compatible catalog or lineage backend can consume. The backend receives events in real-time as jobs run and builds a lineage graph.

For organisations instrumenting new pipelines, designing for OpenLineage from the start provides the best path to reliable, real-time lineage. For existing pipelines, adding OpenLineage instrumentation is a migration project, but the standard's wide framework support reduces the effort compared to proprietary instrumentation.

Tableau Lineage via the Metadata API

For BI layer lineage — understanding which published data sources feed which workbooks, and which dashboards use which calculated fields — Tableau's Metadata API (GraphQL-based) provides rich lineage information.

The Metadata API exposes:

- Which data sources are connected to which workbooks

- Which fields in a data source are used by which views in connected workbooks

- Which calculated fields reference which underlying data source fields

- Database table and column information for data sources with Tableau Catalog enabled

Tableau Catalog (included in the Data Management add-on for Tableau Server, included natively in Tableau Cloud) surfaces this lineage in the Tableau UI — showing the downstream workbooks that depend on each data source, and warning when a certified data source is modified. For organisations managing large Tableau environments, Catalog lineage is operationally valuable for impact analysis before making changes.

Practical Lineage Implementation

For a practical lineage capability that covers most organisational needs:

1. **Start with dbt DAG documentation**: If you use dbt, 'dbt docs serve' provides accurate table-level lineage for all dbt models with no additional tooling. Publish the dbt docs site so the data team can reference it.

2. **Add Tableau Metadata API for the BI layer**: Use the API or Tableau Catalog to understand workbook-to-data-source dependencies and field usage. This closes the gap between the transformation layer and the consumption layer.

3. **Expand to a catalog tool when governance requirements justify it**: Column-level lineage, business metadata, data quality scores, and cross-platform lineage across multiple tools requires a dedicated catalog. Evaluate based on specific regulatory or governance requirements rather than implementing a catalog before the need is clear.

Our data architecture practice designs data lineage and governance frameworks that match your regulatory and operational needs — contact us to discuss building data lineage capability in your environment.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →