BlogData Architecture

Data Lineage Architecture: Tracing Data from Source to Dashboard

James Okafor
James Okafor
Senior Data Engineer
·August 4, 202711 min read

Data lineage is the record of where data comes from, what transformations it passes through, and where it ends up. It is the infrastructure that makes data quality debugging tractable, regulatory audit answerable, and impact analysis fast enough to use before a change ships rather than after it breaks something.

Data lineage is the record of where data comes from, what transformations it passes through, and where it ends up. It is the infrastructure that makes data quality debugging tractable, regulatory audits answerable, and impact analysis fast enough to use before a change ships rather than after it breaks something. Most data organisations understand conceptually why lineage matters; far fewer have lineage that is accurate, current, and detailed enough to be useful in practice.

What Lineage Enables

**Impact analysis** — answering the question: if I change this table or field, what downstream assets will be affected? Without lineage, impact analysis requires manually tracing dependencies, asking multiple teams, and making judgment calls about completeness. With lineage, it is a graph traversal: find the node for the changed asset, enumerate all downstream nodes, notify the owners. The difference between a change that breaks production dashboards silently and one that is coordinated properly is often the availability of accurate downstream lineage.

**Root cause analysis for data quality incidents** — answering the question: why does this metric look wrong? Tracing a metric in a dashboard back through the transformation layers to the source data is the standard debugging process for data quality problems. Without lineage, this requires tribal knowledge of the data architecture. With lineage, it is a navigation exercise: follow the graph from the broken dashboard to the data source, checking at each layer whether the anomaly is present.

**Regulatory audit** — regulations including GDPR, CCPA, HIPAA, and financial reporting requirements create obligations to demonstrate where specific data originates, how it is processed, and where it is used. For data subject access requests (DSARs) and right-to-erasure requests, lineage is the mechanism that makes it possible to find every system where a specific person's data exists. For financial audit, lineage demonstrates that reported figures are derived from auditable source systems through documented transformations.

**PII and sensitive data tracking** — understanding where personally identifiable information flows through the data stack, from the source systems that collect it through the transformation layers to the downstream analytics tables and dashboards that may expose it. Lineage from a PII perspective enables data privacy programmes to be operational rather than aspirational.

Lineage Granularity

Lineage exists at different levels of granularity, with different costs to collect and different use cases served:

**Table-level lineage** records that Table B was derived from Table A. This is the most common and easiest to collect — most orchestration tools and warehouse query logs provide table-level lineage implicitly. It is sufficient for impact analysis at the table level but insufficient for column-level debugging or PII tracking.

**Column-level lineage** records that Column Y in Table B was derived from Column X in Table A, through a specific transformation. Column-level lineage enables precise impact analysis (a change to Column X affects Column Y but not Column Z) and PII tracking (the email field in the customer table flows into the analytics staging table but was masked before reaching the reporting table). Column-level lineage is significantly more expensive to collect — it requires parsing SQL transformations to extract column dependencies, which requires either a SQL parser or instrumentation within the transformation tool.

**Row-level lineage** records which specific rows in a target table correspond to which specific rows in source tables. This is rarely stored at scale (the volume is proportional to the data itself) but is essential for data subject access requests — finding every record associated with a specific customer across all tables in the data warehouse.

Collection Mechanisms

Lineage data comes from multiple sources that need to be integrated:

**Orchestration tools** (Airflow, Prefect, Dagster) track which jobs read from and write to which tables. This provides table-level lineage for pipeline-executed transformations, with the limitation that it records the declared dependencies in the pipeline code, which may not always match the actual SQL executed.

**Warehouse query logs** capture the actual SQL executed against the warehouse. Parsing query logs extracts table-level lineage from actual queries rather than declared dependencies — more accurate but requires a SQL parser and ongoing log analysis. BigQuery and Snowflake both provide query history APIs that can be used for lineage extraction.

**dbt** generates column-level lineage natively for dbt-managed transformations. The dbt manifest includes the complete dependency graph at the model level, and dbt's column-level lineage (in dbt Cloud and through community tooling) extends this to the column level for dbt SQL models. For organisations using dbt as the primary transformation layer, dbt is the highest-quality lineage source.

**BI tools** (Tableau, Power BI, Looker) expose lineage from dashboards and reports back to the underlying data sources via their respective APIs. Tableau's Metadata API is particularly rich — it exposes, for each published view, the data sources it depends on, and for each calculated field, the underlying columns referenced. Integrating BI tool lineage with warehouse lineage creates end-to-end lineage from source system to dashboard.

Lineage in Practice

Collecting lineage is straightforward in principle; maintaining accurate lineage over time is the harder problem. Lineage that is collected once and never updated becomes inaccurate as the pipeline evolves. New tables are added; old tables are deprecated; transformations are refactored. Without continuous lineage harvesting, the lineage graph diverges from the actual data architecture.

The practical implementation:

**Automated continuous harvesting** — lineage extraction runs continuously (or daily at minimum) from each source: orchestrator logs, warehouse query history, dbt manifest exports, BI platform APIs. Manual lineage documentation does not scale and is not maintained.

**Lineage as part of the CI/CD pipeline** — for dbt and similar code-managed transformations, lineage is extracted from the transformation code and updated in the lineage catalogue on each deployment. Changes to transformation logic automatically update the lineage graph.

**Column-level lineage for high-priority assets** — full column-level lineage across an entire data platform is expensive. A practical approach is to implement column-level lineage for the tables and fields that are highest-priority for PII tracking and impact analysis, and table-level lineage for the rest.

**Alerting on lineage breaks** — when a table that has downstream dependents is dropped or renamed, alert the owners of those dependents. Lineage that enables impact analysis before a change is more valuable than lineage that enables root cause analysis after a breakage.

Our data architecture practice designs lineage infrastructure as part of broader data platform and governance implementations — contact us to discuss data lineage for your organisation.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →