
What Is Data Lineage? Tracking Data From Source to Dashboard

Obed Tsimi
Founder & Lead Tableau Architect
March 8, 2028 · 10 min read

Data lineage documents the journey of data from its source systems through transformations to its final destination in reports and dashboards. This guide explains why lineage matters, how modern tools like dbt and DataHub capture it automatically, and what regulated industries require.

Data lineage is the documented record of data's journey through a system — from its origin in source systems, through the transformations applied to it, to its final appearance in reports and dashboards. It answers the question: where did this number come from, and what happened to the data along the way?

Lineage is how you answer "why did this metric change last week?" Lineage is how a data breach investigation determines which downstream reports were affected. Lineage is how a regulated institution demonstrates to an auditor that the revenue figure in its board report can be traced to validated source transactions.

Column-Level vs Table-Level Lineage

Lineage operates at different granularities:

**Table-level lineage** documents which tables feed into which other tables. Table A is transformed into Table B, which is joined with Table C to produce Table D, which populates Dashboard E. This is the coarser level — sufficient for understanding broad data flows and identifying which pipelines are upstream of a given report.

**Column-level lineage** tracks individual fields through the transformation chain. The "net revenue" column in the finance dashboard traces back through three dbt models, joining two source tables, applying a specific business rule for refund handling, and originating from the "amount" field in the transactions table. Column-level lineage is what you need to answer precise questions about specific metrics and to conduct impact analysis when a source column changes.

Column-level lineage is significantly harder to capture than table-level lineage, because it requires parsing SQL transformation logic to understand how output columns are derived from input columns — not just which tables are connected.
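To make the two granularities concrete, here is a minimal sketch in Python. The table and column names are hypothetical, loosely following the "net revenue" example above, and the edge list is hand-written rather than parsed from SQL. It shows that table-level lineage can always be derived from column-level lineage, but not the other way around.

```python
from typing import NamedTuple


class Column(NamedTuple):
    table: str
    name: str


# Hypothetical column-level edges: (source column, derived column).
COLUMN_EDGES = [
    (Column("transactions", "amount"), Column("stg_payments", "gross_amount")),
    (Column("refunds", "refund_amount"), Column("stg_payments", "refund_amount")),
    (Column("stg_payments", "gross_amount"), Column("fct_revenue", "net_revenue")),
    (Column("stg_payments", "refund_amount"), Column("fct_revenue", "net_revenue")),
]


def table_edges(column_edges):
    """Collapse column-level edges into the coarser table-level view."""
    return sorted(
        {(src.table, dst.table) for src, dst in column_edges if src.table != dst.table}
    )


def origins(target, column_edges):
    """Trace a column back to the source columns that have no upstream edge."""
    parents = [src for src, dst in column_edges if dst == target]
    if not parents:
        return {target}
    found = set()
    for parent in parents:
        found |= origins(parent, column_edges)
    return found


print(table_edges(COLUMN_EDGES))
print(origins(Column("fct_revenue", "net_revenue"), COLUMN_EDGES))
```

The hard part in practice is producing `COLUMN_EDGES` in the first place, which is what requires parsing the SQL.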

How Modern Tools Capture Lineage

**dbt** (data build tool) generates lineage automatically as a byproduct of how transformations are defined. Each dbt model is a SQL SELECT statement that references other models using the ref() function. dbt parses these references to build a complete DAG (directed acyclic graph) of table dependencies. The dbt docs site generated from a project includes a visual lineage graph for every model. Column-level lineage is available in dbt Cloud's Explorer — it parses the SQL in each model to trace individual column derivations.
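The idea of deriving a DAG from ref() calls can be sketched in a few lines of Python. This is a simplification, not how dbt works internally: dbt renders the Jinja in each model, whereas this sketch uses a regular expression, handles only the single-argument form of ref(), and uses made-up model names and SQL.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

# Hypothetical models: name -> SQL text.
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "fct_orders": """
        select o.*, c.segment
        from {{ ref('stg_orders') }} o
        join {{ ref("stg_customers") }} c using (customer_id)
    """,
    "revenue_dashboard": "select segment, sum(amount) from {{ ref('fct_orders') }} group by 1",
}


def build_dag(models):
    """Map each model to the set of models it references (its parents)."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def run_order(dag):
    """Return an order in which every model comes after its parents."""
    return list(TopologicalSorter(dag).static_order())


dag = build_dag(MODELS)
print(dag["fct_orders"])
print(run_order(dag))
```

The same dependency map serves two purposes: it is the lineage graph, and it determines a valid build order.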

**Apache Atlas** is an open-source metadata and governance framework, widely used with the Hadoop ecosystem. Atlas captures lineage from Hive, Spark, and Sqoop operations through hooks that run inside these systems as jobs execute. It provides a REST API and UI for browsing lineage graphs.

**DataHub** (open-sourced by LinkedIn) is a modern data catalog and metadata platform with strong lineage capabilities. DataHub ingests metadata from data warehouses (Snowflake, BigQuery, Redshift), transformation tools (dbt, Spark), and BI tools (Tableau, Looker, Power BI) to build a unified lineage graph across the entire data stack. Its column-level lineage support for Snowflake and BigQuery is among the most complete available.

**Marquez** (open-sourced by WeWork, now a Linux Foundation project) implements the OpenLineage standard — an open specification for capturing lineage events from any data pipeline. OpenLineage clients for Airflow, Spark, dbt, and other tools emit lineage events in a standard format, which Marquez or other compatible platforms collect and visualize.
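For a sense of what a lineage event looks like, here is a simplified sketch of an OpenLineage-style run event built as a plain Python dictionary. The top-level field names follow the general shape of the OpenLineage run event as I understand it; the namespaces, job name, and producer URL are placeholders, and a real event also carries a schemaURL and optional facets. Check the current specification before relying on any of this.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, event_type="COMPLETE"):
    """Build a simplified lineage event: one job run, its inputs and outputs."""

    def dataset(name):
        # "warehouse" is a hypothetical dataset namespace.
        return {"namespace": "warehouse", "name": name}

    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/my-pipeline",  # placeholder producer URI
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "analytics", "name": job_name},  # hypothetical namespace
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
    }


event = build_run_event(
    "build_fct_orders", ["stg_orders", "stg_customers"], ["fct_orders"]
)
print(json.dumps(event, indent=2))
```

Because every tool emits the same structure, a collector only needs to link each event's inputs to its outputs to assemble a cross-tool lineage graph.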

**Monte Carlo, Atlan, and Alation** are commercial data observability and catalog platforms that include lineage as part of broader data governance and observability products.

Regulatory Requirements for Lineage

Financial services, healthcare, and other regulated industries face requirements that, in practice, can only be met with documented lineage:

**BCBS 239** (the Basel Committee on Banking Supervision's principles for risk data aggregation and risk reporting) applies to systemically important banks. In practice, meeting it means being able to trace any data element in a risk report back to its source, with the transformations along the way fully documented, which amounts to end-to-end lineage.

**GDPR and CCPA** data privacy regulations require organizations to know where personal data is stored and how it flows through systems. Impact analysis when a data subject requests deletion requires lineage — you need to know every downstream table and report that contains the individual's data.

**SOX (Sarbanes-Oxley)** financial reporting controls require that financial figures in public company reports are auditable. Auditors expect to trace reported revenue, expense, and balance figures to validated source transactions.

**HIPAA** requires knowing where protected health information is stored and accessed — which requires lineage documentation for any system processing PHI.

In regulated contexts, lineage is not a convenience feature — it is an audit requirement.

Lineage for Impact Analysis

One of the most practical applications of lineage is impact analysis: understanding what will break or change if a source system changes.

A source system is deprecating the "legacy_customer_id" column and replacing it with a new "customer_uuid" field. Before making this change, you need to know: which tables use this column? Which dbt models reference it? Which dashboards display metrics derived from it? Which scheduled reports will break?

Without lineage, answering this question requires manually searching SQL code across dozens of transformation files — slow, error-prone, and often incomplete. With column-level lineage, the answer is a graph query: show me everything downstream of this column. The result is a precise list of affected assets, enabling a migration plan rather than a breakage incident.
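The "everything downstream of this column" query is a plain graph traversal. The sketch below assumes the lineage graph is already available as an adjacency map; the asset names are hypothetical, and a real catalog would expose this through its own API or query language rather than an in-memory dictionary.

```python
from collections import deque

# Hypothetical lineage edges: asset -> assets that read from it.
DOWNSTREAM = {
    "crm.customers.legacy_customer_id": ["stg_customers.customer_id"],
    "stg_customers.customer_id": [
        "dim_customers.customer_id",
        "fct_orders.customer_id",
    ],
    "dim_customers.customer_id": ["dashboard:customer_360"],
    "fct_orders.customer_id": [
        "dashboard:revenue_by_customer",
        "report:weekly_churn",
    ],
}


def downstream_of(asset, edges):
    """Breadth-first walk returning every asset reachable from `asset`."""
    seen, queue, order = set(), deque([asset]), []
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


for affected in downstream_of("crm.customers.legacy_customer_id", DOWNSTREAM):
    print(affected)
```

Breadth-first order lists the nearest dependents first, which is a reasonable order in which to plan the migration.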

Lineage for Debugging

When a metric changes unexpectedly, lineage provides the diagnostic starting point.

The monthly active users metric dropped 15% this week. Is this a real change in user behavior, or a data pipeline problem? Lineage shows the path: the metric is computed in a Tableau view, sourced from a dbt model that aggregates a staging table, which receives data from a Fivetran sync of the events database. Checking each stage in the lineage graph — sync logs, dbt model run timestamps, row counts at each stage — localizes whether the problem is upstream (the sync failed) or in the transformation (a filter was changed).
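The stage-by-stage check can be sketched as a walk along the lineage path from source to dashboard, stopping at the first stage whose volume moved more than expected. Stage names, values, and the 5% tolerance are all made up for illustration; in a real setup the numbers would come from sync logs and warehouse queries.

```python
from dataclasses import dataclass


@dataclass
class StageCheck:
    name: str
    value_today: int      # row count or metric value at this stage
    value_last_week: int  # same measure one week earlier


# Hypothetical lineage path, ordered from source to dashboard.
PATH = [
    StageCheck("fivetran_sync.events", 1_000_000, 1_010_000),
    StageCheck("stg_events", 1_000_000, 1_010_000),
    StageCheck("fct_monthly_active_users", 42_500, 50_000),
    StageCheck("tableau.mau_view", 42_500, 50_000),
]


def first_divergence(path, tolerance=0.05):
    """Return (stage name, relative change) for the first stage that moved
    by more than `tolerance`, or None if every stage looks stable."""
    for stage in path:
        change = (stage.value_today - stage.value_last_week) / stage.value_last_week
        if abs(change) > tolerance:
            return stage.name, round(change, 3)
    return None


print(first_divergence(PATH))
```

In this made-up data the sync and staging stages are stable and the drop first appears in the dbt model, so the place to look is the transformation logic rather than the sync.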

Without lineage, this investigation is unstructured: grep for the metric name, ask colleagues, check pipeline logs without knowing which ones are relevant. With lineage, the investigation is systematic.

Implementing Lineage in Practice

The most practical starting point for most analytics teams is dbt lineage, which comes for free with any dbt project. The dbt docs site provides table-level lineage, and dbt Cloud Explorer adds column-level lineage. For teams running dbt, this covers the transformation layer, though not the ingestion or BI layers on either side of it.
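If you want dbt's table-level lineage outside the docs site, one option is to read it from the manifest artifact. The sketch below assumes dbt has written `target/manifest.json` and that the file contains a top-level `parent_map` keyed by node ID; verify both against the manifest schema for your dbt version. The `SAMPLE` dictionary and its node IDs are a hypothetical stand-in so the example runs without a dbt project.

```python
import json
from pathlib import Path


def load_manifest(path="target/manifest.json"):
    """Read a dbt manifest from disk (path is dbt's default target directory)."""
    return json.loads(Path(path).read_text())


def model_edges(manifest):
    """Return sorted (parent, child) pairs for every model in the manifest."""
    edges = []
    for child, parents in manifest.get("parent_map", {}).items():
        if not child.startswith("model."):
            continue  # skip tests, seeds, and other node types
        for parent in parents:
            edges.append((parent, child))
    return sorted(edges)


# Hypothetical stand-in for a real manifest.
SAMPLE = {
    "parent_map": {
        "model.shop.stg_orders": ["source.shop.raw.orders"],
        "model.shop.fct_orders": ["model.shop.stg_orders"],
        "test.shop.not_null_fct_orders_id": ["model.shop.fct_orders"],
    }
}

for parent, child in model_edges(SAMPLE):
    print(f"{parent} -> {child}")
```

With a real project, `model_edges(load_manifest())` would return the same kind of edge list, which can then be loaded into whatever graph tooling you already use.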

For broader lineage across ingestion tools and BI tools, DataHub with OpenLineage integration provides the most complete open-source coverage. It requires setup and maintenance but provides a unified view of lineage across the full stack.

Commercial platforms (Monte Carlo, Atlan) bundle lineage with data observability, cataloguing, and quality monitoring. They suit organizations that need a fully supported product rather than open-source infrastructure they must run and maintain themselves.

Our data architecture practice designs data governance and lineage frameworks for regulated and complex analytical environments — contact us to discuss your lineage and governance requirements.
