BlogData Engineering

The Analytics Engineering Toolchain: What to Use and When

Austin Duncan
Austin Duncan
Managing Director & Principal Data Architect
·May 2, 202712 min read

Analytics engineering has converged on a set of core tools — dbt for transformation, Git for version control, data warehouses for compute, and orchestration tools for scheduling. But the choices within each category have multiplied. This guide covers the toolchain decisions that matter and how to evaluate the options for your specific context.

Analytics engineering as a discipline has converged around a set of tools that, when used together, produce the most reliable and maintainable analytical data pipelines the industry has seen. The convergence is not total — there are meaningful choices within each category — but the core stack is recognisable: a cloud data warehouse for storage and compute, dbt for SQL-based transformation, Git for version control, an orchestration tool for scheduling, and a BI tool for visualisation.

Understanding why this stack emerged is more useful than memorising tool names. The stack exists because it applies software engineering practices (version control, testing, CI/CD, documentation) to data pipelines — practices that were absent from earlier analytics workflows and whose absence was the primary source of reliability failures.

The Data Warehouse Layer

The analytics warehouse is the platform on which everything else runs. The core choices are Snowflake, BigQuery, Redshift, and Databricks.

**Snowflake**: the default choice for organisations without a strong existing cloud vendor commitment. Near-perfect separation of storage and compute — you pay for compute only when it runs. Virtual warehouses allow multiple isolated compute environments on the same data. Strong multi-cloud capability. The economics work well for workloads with variable concurrency.

**BigQuery**: the best choice for Google Cloud organisations. Serverless — no compute to provision or manage. Pricing is per-query based on bytes processed rather than on provisioned compute time. Excellent for data lake-adjacent workloads and for organisations already using GCP services. Disadvantage: less predictable costs for high-concurrency workloads.

**Redshift**: the right choice for organisations deeply invested in AWS with existing Redshift deployments or strong data team Redshift expertise. Less architectural flexibility than Snowflake or BigQuery but well-integrated with the AWS ecosystem.

**Databricks**: the right choice when the workload includes significant data science and ML workloads alongside analytics. Databricks unifies Spark-based data engineering with SQL analytics and ML training. More complex to operate than Snowflake or BigQuery for pure analytics use cases, but eliminates the need for a separate ML platform.

The selection criteria: existing cloud vendor relationships and data team expertise matter more than abstract capability differences at the margin. Snowflake, BigQuery, and Redshift are all excellent for analytics workloads. Choose the one that fits your existing ecosystem and that your team knows.

dbt: The Transformation Standard

dbt (data build tool) has become the standard for SQL-based data transformation. It brings software engineering practices to SQL: version control, testing, documentation, modular composition, CI/CD.

The core dbt workflow: write SQL SELECT statements that define the desired output (what data should the table contain, and how is it derived from upstream sources). dbt handles the DDL (CREATE TABLE, CREATE VIEW) and materialisation strategy (table, view, incremental, ephemeral). Tests are defined in YAML alongside the models. Documentation is generated from the same YAML. The entire project is version controlled in Git.

The productivity gain from dbt relative to managing SQL transformation scripts manually is significant: tests run automatically in CI; documentation is always current because it is generated from code; refactoring is safe because tests catch regressions; new team members can understand the data model by reading the code rather than reverse-engineering undocumented SQL.

dbt Cloud is the managed service version — execution, scheduling, CI/CD, and documentation hosting included. dbt Core is the open-source version that runs anywhere but requires infrastructure management. For most teams, dbt Cloud is worth the cost for the operational simplification.

Version Control: Git is Non-Negotiable

Every piece of code in the analytics stack — SQL models, pipeline configurations, infrastructure definitions, dbt projects, Airflow DAGs — must be in Git. This is not optional. It is the foundation on which every other practice depends.

Why: reproducibility (any past state of the pipeline can be restored), auditability (every change is attributed to a person, time-stamped, and associated with a reason), collaboration (multiple engineers can work on the same codebase simultaneously), and CI/CD (automated testing and deployment triggered by Git events).

The version control workflow for analytics teams should mirror software engineering practice: feature branches for work in progress, pull request review before merge to main, CI pipeline that runs dbt compilation and tests on every PR, protected main branch requiring PR approval.

A data team that writes SQL directly to a production database and does not version control the SQL is operating without safety equipment. The absence of version control is not a maturity limitation — it is a technical debt that compounds with every change.

Orchestration: Scheduling and Dependency Management

The orchestration layer schedules and manages the execution of data pipelines. It ensures that pipeline B runs after pipeline A completes, that failure in A prevents B from running, that alerts fire when scheduled runs miss their windows, and that manual retries can be triggered without risk of double-execution.

**Apache Airflow**: the most widely adopted open-source orchestrator. DAGs (Directed Acyclic Graphs) define pipeline dependencies in Python. Extensive operator library for common tasks (dbt run, SQL queries, HTTP calls, S3 operations). Astronomer (managed Airflow) and AWS MWAA (managed Airflow on AWS) eliminate infrastructure management. The complexity of operating Airflow at scale is the primary drawback for smaller teams.

**Dagster**: modern orchestrator with stronger asset-centric model (pipelines produce assets; dependencies are between assets, not tasks). Better built-in observability than Airflow. Growing adoption but smaller community than Airflow.

**Prefect**: Python-first orchestrator with simpler deployment model than Airflow. Prefect Cloud handles scheduling and execution without requiring a self-hosted server. Good for teams that want orchestration without the Airflow operational overhead.

**dbt Cloud jobs**: for teams where dbt is the primary transformation tool, dbt Cloud's built-in scheduling may be sufficient without a separate orchestrator. dbt Cloud jobs trigger dbt runs on a schedule or via API. Limitations: less flexibility for complex multi-step pipelines with non-dbt steps.

The selection depends on team size and pipeline complexity. A small team running dbt-only pipelines may be well-served by dbt Cloud scheduling. A mature platform with complex multi-system pipelines needs a full orchestrator.

Ingestion Tools

Getting data from source systems into the warehouse is the ingestion problem. The choices are managed connectors vs custom pipelines.

**Fivetran**: the market leader for managed connectors. Pre-built connectors for 300+ SaaS applications, databases, and cloud services. The connector runs, normalises the data, and loads it to the warehouse automatically. Zero maintenance from the data team — Fivetran handles schema changes, API updates, and connector maintenance. Expensive at scale (pricing based on monthly active rows), but often cheaper than the engineering cost of maintaining custom connectors.

**Airbyte**: open-source alternative to Fivetran with a large connector library. Self-hosted (Airbyte Cloud for managed). Lower per-row cost than Fivetran but requires infrastructure management. Connector quality is variable — Fivetran connectors are generally more reliable. The right choice for teams with strong data engineering capacity who want to reduce tool costs.

**Stitch**: simpler and less expensive than Fivetran for basic use cases. Good for smaller organisations with straightforward ingestion requirements.

**Custom pipelines**: for sources without managed connectors (internal APIs, custom databases, proprietary systems), custom Python pipelines are necessary. Use SQLAlchemy for database ingestion, httpx for API ingestion, and the patterns described in production Python data engineering: idempotent loads, delete-then-insert, Pydantic validation, exponential backoff for APIs.

Transformation Beyond dbt

dbt handles SQL transformation well. For transformations that require Python — ML feature engineering, complex data cleaning that SQL cannot express cleanly, calling external APIs during transformation — two approaches:

**dbt Python models**: dbt supports Python models that run on the data warehouse's Python execution environment (Snowpark for Snowflake, BigQuery DataFrames for BigQuery). This keeps Python transformation within the dbt project, with the same testing and documentation infrastructure.

**Separate Python processing step**: for complex Python logic that does not fit the dbt model, a separate Python processing step in the pipeline — scheduled by the orchestrator, running before or after the dbt run — handles the Python work and writes results to the warehouse for dbt to pick up.

Data Quality: Where dbt Tests Are Not Enough

dbt tests are the first layer of data quality monitoring. For production data environments, additional tooling provides deeper quality coverage:

**Elementary** is an open-source dbt package that adds anomaly detection on top of dbt tests — automatically identifying unusual values, volume anomalies, and distribution shifts without requiring manually defined thresholds. It integrates directly with the dbt project.

**Monte Carlo, Accel Data (formerly Soda), and Great Expectations** are standalone data observability platforms with more comprehensive monitoring, alerting, and investigation workflows than dbt tests alone. Appropriate for larger data platforms where data quality SLAs are critical and the investment in dedicated observability tooling is justified.

Our data engineering practice designs and implements analytics engineering toolchains — contact us to discuss your data stack selection and implementation.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →