The situation
The client had built their Azure data estate incrementally over three years. Each new reporting requirement was met by adding a new pipeline, a new storage account, or a new compute cluster. Nobody had stepped back to review the architecture as a whole. By the time we were engaged, Azure costs were growing 65% year-over-year with no corresponding increase in business output.
The deeper problem was that analysts had stopped trusting the data. Pipeline failures were silent: when a job failed, nobody knew until a dashboard went blank or a stakeholder complained. Nobody could say how fresh any given dataset was. The same metric appeared with different values in different reports because different pipelines computed it differently.
The data engineering team was spending most of their capacity on incident response and reconciliation work rather than building new data products. The business had invested significantly in Azure infrastructure and headcount, but the analytics ROI had plateaued. The CFO had started asking questions.
What the assessment found
- 23 separate storage accounts, many with overlapping content and no governance over which was authoritative. Storage costs were 40% higher than necessary due to redundant data copies.
- Databricks clusters were provisioned at maximum size and left running 24/7. Compute was the largest cost line and almost entirely idle during off-peak hours.
- No pipeline observability. Jobs ran on Azure Data Factory with no alerting, no run logging to a monitored destination, and no SLA tracking. Failures were invisible.
- dbt was used in one part of the pipeline, but raw SQL transformations ran elsewhere with no version control. The same business logic was implemented in multiple places with slight differences, producing metric drift.
- The Tableau layer pulled from a mix of certified and uncertified data sources. Analysts had created their own workaround extracts because the official data sources were unreliable.
The restructure
We redesigned the platform around a Delta Lake medallion architecture with Unity Catalog governance. The 23 storage accounts were consolidated to a single Azure Data Lake Storage Gen2 account, with containers organised by medallion layer (bronze, silver, gold). Redundant data copies were eliminated. Storage costs dropped immediately.
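The container layout itself is simple. A minimal sketch, assuming the azure-storage-file-datalake and azure-identity packages and a placeholder account URL (the real platform also layers Unity Catalog external locations and access policies on top):

```python
# Minimal sketch of the consolidated lake layout. The account URL is a
# placeholder, not the client's actual account.
from azure.core.exceptions import ResourceExistsError
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<your-account>.dfs.core.windows.net"  # placeholder

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())

# One container per medallion layer instead of 23 scattered storage accounts.
for layer in ("bronze", "silver", "gold"):
    try:
        service.create_file_system(file_system=layer)
    except ResourceExistsError:
        pass  # container already exists; safe to re-run
```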
Databricks clusters were reconfigured with auto-scaling and auto-termination. Job clusters were sized to the workload rather than provisioned at maximum. Spot instances were introduced for non-critical transformation work. Compute costs fell by more than 60%.
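The exact cluster policies were client-specific, but the shape of the change is easy to sketch. The job-cluster spec below uses the Databricks Jobs API "new_cluster" format; the node type, worker counts, and Spark version are illustrative assumptions rather than the client's actual values.

```python
# Illustrative Databricks job-cluster spec (Jobs API "new_cluster" block).
# Node type, worker counts, and Spark version are assumptions, not the
# client's real configuration.
import json

job_cluster = {
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_D8ds_v5",
        # Scale with the workload instead of pinning a maximum-size cluster.
        "autoscale": {"min_workers": 2, "max_workers": 8},
        # Run non-critical transformation work on Azure spot VMs, keeping one
        # on-demand node and falling back to on-demand if spot is unavailable.
        "azure_attributes": {
            "first_on_demand": 1,
            "availability": "SPOT_WITH_FALLBACK_AZURE",
        },
    }
}

print(json.dumps(job_cluster, indent=2))
```

Job clusters like this terminate on their own once the run finishes; the remaining all-purpose clusters were given an idle auto-termination window instead of running around the clock.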
All transformation logic was migrated to dbt with standardised metric definitions across the platform. Unity Catalog replaced ad-hoc access control, providing a single governance layer for all data assets with lineage tracking. Every metric had one definition, documented in code, tested on every run.
Pipeline observability was built using Azure Monitor and a custom alerting layer that tracked job success rates and data freshness SLAs and flagged row-count anomalies. The data engineering team got a real-time dashboard of platform health, the first time they had visibility into what was actually running.
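The real alerting layer sits on Azure Monitor, but the checks themselves are simple. Below is a minimal sketch of two of them, a freshness SLA check and a row-count anomaly check, with placeholder table names and thresholds and a print statement standing in for the alert hook.

```python
# Sketch of two of the checks described above: a data-freshness SLA check and a
# simple row-count anomaly check. Table names, thresholds, and the alert hook
# are illustrative placeholders, not the production implementation.
from datetime import datetime, timedelta, timezone
from statistics import median


def check_freshness(last_loaded_at: datetime, sla: timedelta) -> bool:
    """Return True if the table's latest load is within its freshness SLA."""
    return datetime.now(timezone.utc) - last_loaded_at <= sla


def check_row_count(history: list[int], todays_count: int, tolerance: float = 0.3) -> bool:
    """Return True if today's row count is within ±tolerance of the recent median."""
    baseline = median(history)
    return abs(todays_count - baseline) <= tolerance * baseline


def alert(message: str) -> None:
    # Placeholder: the real platform raises an Azure Monitor alert here.
    print(f"ALERT: {message}")


# Example run with made-up numbers.
if not check_freshness(datetime(2024, 1, 1, tzinfo=timezone.utc), timedelta(hours=6)):
    alert("gold.daily_sales breached its 6-hour freshness SLA")
if not check_row_count([10_200, 10_050, 9_980, 10_310], todays_count=4_100):
    alert("gold.daily_sales row count is anomalous vs. its recent median")
```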
The Tableau reporting layer was rebuilt on top of the new certified gold layer. Analyst workaround extracts were decommissioned. The certified data sources became the single source of truth.
The outcome
Total Azure cost reduction was 42% within the first full billing cycle after cutover. The cost reduction came from three areas: storage consolidation (24% of the total saving), compute right-sizing (61%), and elimination of redundant jobs (15%).
With the platform stable and observable, the data engineering team shifted from incident response to product delivery. In the six months following the restructure, the team delivered four times the number of governed data products compared to the six months prior — without any additional headcount.
Pipeline observability went from zero to 100% coverage. The first month after go-live, the alerting system caught three pipeline failures that would previously have gone unnoticed until a stakeholder reported a broken dashboard. Response time to data incidents dropped from hours to minutes.