Legacy data pipelines accumulate business logic over years: transformations, filters, exception handling, and adjustments that were added to solve specific problems and never documented. Modernising these pipelines requires extracting and preserving that logic — not just rewriting the SQL in a new framework.
Legacy data pipelines accumulate business logic over years: transformations, filters, exception handling, and adjustments that were added to solve specific problems and never documented. The SQL that powers a 10-year-old reporting system is not just a technical implementation — it is a record of business decisions, regulatory interpretations, and operational workarounds that are often not documented anywhere else in the organisation. Modernising these pipelines requires extracting and preserving that logic, not just rewriting the SQL in a new framework.
The Legacy Business Logic Problem
The most dangerous assumption in a legacy modernisation project is that the legacy SQL is doing what it appears to do. Legacy SQL is often doing what it appeared to need to do in the context of a specific problem, at a specific point in time, by a developer who is no longer at the organisation. The comments, if any, describe what the code does. They do not explain why — the business reason, the regulatory requirement, or the specific data anomaly it was written to address.
Common patterns of undocumented business logic:
**Magic number filters** — WHERE status_code NOT IN (5, 7, 12). These codes correspond to specific business conditions that were known at the time of writing. Today, nobody is sure what codes 5, 7, and 12 mean in the current system, or whether the conditions they represent are still relevant.
**Silent zero-handling** — ISNULL(revenue_amount, 0) or COALESCE(quantity, 0). These handle cases where null values exist in the source. If the target system handles nulls differently, or if the nulls have changed meaning, this logic may produce incorrect results silently.
**Date arithmetic workarounds** — date adjustments that compensate for fiscal year boundary handling, timezone conversions, or system-specific date encoding. These are the most common source of "everything looks right but the numbers are slightly wrong" problems after migration.
**Incremental update logic** — the conditions that determine whether a record should be inserted, updated, or skipped. Legacy incremental logic is often brittle: it works under normal conditions but fails under edge cases (late-arriving records, records with the same key but different business entity, records that need to be both inserted and updated in the same batch).
The Documentation Phase
Before any code is written in the new framework, existing pipeline logic must be documented at the business rule level — not just the SQL level. This documentation is the specification for the new implementation.
The documentation process:
**SQL analysis** — read every line of every stored procedure, view, and ETL job. For each transformation step, identify what it does in plain language. Group related transformation steps into logical units (filtering, mapping, enrichment, aggregation). Flag unclear transformations for domain expert review.
**Business rule interviews** — for each flagged transformation, interview the domain expert who understands the business context. The question is not "what does this code do?" but "what is the business reason this transformation exists?" The answer often reveals a regulatory requirement, an exception to a standard rule, or a historical data quality issue that may or may not still be present.
**Output comparison** — for each logical data entity, compare the output of the legacy pipeline against expected outputs for a representative set of test cases. Anomalies in the legacy output may reveal bugs in the legacy logic that should be corrected in the new implementation, not reproduced.
The documentation phase typically produces surprises. Transformations that appear to be standard data cleaning are revealed to be business rule applications. Filters that appear to exclude bad data are revealed to be business policy exclusions. Understanding these surprises before modernisation begins determines whether the new implementation will produce the same outputs as the legacy system — or better outputs, where the legacy logic contained bugs.
Migration Approach: Big Bang vs. Parallel Run
**Big bang migration** replaces the legacy pipeline with the new pipeline at a defined cutover point. The new pipeline is tested in a non-production environment; on cutover day, the legacy pipeline is stopped and the new pipeline starts. It is simpler operationally but higher-risk: if the new pipeline has errors not discovered in testing, the business is running on incorrect data until the errors are found and fixed.
**Parallel run migration** operates both pipelines simultaneously for a defined period, comparing outputs. Discrepancies are investigated; the new pipeline is corrected or the legacy pipeline is corrected (if the legacy logic was wrong). The parallel run continues until all discrepancies are understood and resolved, then the legacy pipeline is decommissioned.
Parallel run is the standard approach for business-critical pipelines where output correctness is verifiable — financial reporting, operational metrics used in regular decision-making, customer-facing data. It is more expensive (operating two pipelines and comparing outputs requires engineering time) but significantly more reliable.
Technology Selection for the New Pipeline
The choice of transformation framework for the new pipeline should be driven by the team's capabilities and the operational requirements, not by novelty:
**dbt** is the standard choice for warehouse-native transformation. It is appropriate when the transformation logic can be expressed in SQL, when the team has or can build SQL and dbt proficiency, and when the operational requirements (scheduling, monitoring, lineage) are met by dbt's orchestration capabilities.
**Python-based pipelines** (using Pandas, PySpark, or similar) are appropriate when the transformation logic requires procedural processing that is awkward in SQL, when the data volumes require distributed processing, or when the team's capability is stronger in Python than SQL.
**Managed ETL/ELT tools** (Fivetran, Airbyte for extraction; dbt or native warehouse transforms for transformation) are appropriate when the primary challenge is reliable extraction from diverse sources, and transformation logic is relatively standard.
The most common mistake is choosing the technology before understanding the transformation requirements — leading to discovering mid-project that the chosen technology is poorly suited to the specific transformations required.
Our data architecture practice modernises legacy data pipelines for organisations upgrading their analytics infrastructure — contact us to discuss your legacy modernisation programme.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →