Data architects work across a broader toolset than most technical roles. The job spans conceptual design, logical modeling, physical implementation, documentation, governance, and infrastructure provisioning — and each layer has its own tooling. This guide covers the tools in each category, what practitioners actually use them for, and how these tools fit together in a mature data architecture practice.
Data modeling tools
Data modeling tools help architects define entity-relationship diagrams, logical schemas, and physical database designs.
**erwin Data Modeler**: The enterprise standard for relational data modeling, particularly in regulated industries (banking, insurance, healthcare). Supports conceptual, logical, and physical models; forward and reverse engineering from databases; and model comparison and synchronisation. Expensive and complex; most appropriate for large organisations with formal data governance requirements.
**Lucidchart / draw.io**: Used for conceptual and logical ER diagrams where precision matters less than communication. Most architects use these for whiteboard-style entity modeling, communicating designs to stakeholders, and documenting high-level architecture. Free (draw.io) or low-cost (Lucidchart). Not suitable for generating DDL or managing physical models.
**dbt**: The de facto logical-to-physical modeling tool in modern data engineering. SQL models with YAML schema definitions are effectively a code-first approach to physical data modeling. Many organisations have replaced traditional modeling tools with dbt for the warehouse layer because dbt is version-controlled, testable, and deployable via CI/CD.
**SqlDBM / Vertabelo**: Cloud-based ER diagramming tools that connect to databases and generate DDL. More lightweight than erwin; appropriate for smaller teams that need diagramming plus schema management without enterprise tool complexity.
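To make "generate DDL" concrete, here is a minimal sketch of forward engineering: a small logical model rendered as a CREATE TABLE statement. The `Column` and `Table` classes and the `to_ddl` function are hypothetical names chosen for illustration; they are not the API of erwin, SqlDBM, or Vertabelo, and the type names are generic SQL rather than any specific dialect.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    sql_type: str
    nullable: bool = True
    primary_key: bool = False


@dataclass
class Table:
    name: str
    columns: list[Column] = field(default_factory=list)


def to_ddl(table: Table) -> str:
    """Render one table of the model as a CREATE TABLE statement."""
    lines = []
    for col in table.columns:
        null_clause = "" if col.nullable else " NOT NULL"
        lines.append(f"    {col.name} {col.sql_type}{null_clause}")
    # Emit a table-level primary key constraint if any column is flagged.
    pk_cols = [c.name for c in table.columns if c.primary_key]
    if pk_cols:
        lines.append(f"    PRIMARY KEY ({', '.join(pk_cols)})")
    body = ",\n".join(lines)
    return f"CREATE TABLE {table.name} (\n{body}\n);"


# Hypothetical example entity.
customer = Table(
    name="customer",
    columns=[
        Column("customer_id", "INTEGER", nullable=False, primary_key=True),
        Column("email", "VARCHAR(255)", nullable=False),
        Column("signup_date", "DATE"),
    ],
)

print(to_ddl(customer))
```

Real modeling tools add dialect-specific types, foreign keys, and model-versus-database comparison on top of this basic idea.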
Data catalog and metadata management
Data catalogs provide discovery and documentation for data assets — what tables exist, what they contain, who owns them, and what business concepts they represent.
**Alation**: Enterprise data catalog with lineage, stewardship workflows, and a query editor. AI-assisted documentation and classification. Most common in mid-market and enterprise financial services and healthcare environments. Expensive.
**Collibra**: Enterprise data governance platform with deep workflow capabilities — data stewardship, policy management, issue tracking, and glossary management. More governance-oriented than Alation; requires significant configuration. Most appropriate for heavily regulated industries.
**Atlan**: Modern data catalog with strong dbt and modern data stack integration. Its active metadata approach pulls lineage automatically from dbt, Airflow, and query logs, and it includes collaboration features. A better fit than legacy catalog tools for teams already working with dbt and the modern data stack.
**dbt Docs**: For teams already on dbt, dbt Docs generates a documentation site from model descriptions, column-level YAML, and lineage from the project's DAG. Not a full catalog — no search across multiple tools, no stewardship workflows — but zero-cost documentation generation for the dbt project.
**OpenMetadata / DataHub**: Open-source data catalog platforms. DataHub originated at LinkedIn. Both support multi-source metadata ingestion, lineage, and governance workflows. Appropriate for organisations that want catalog capabilities without commercial tool cost and have engineering capacity to run the platform.
Data lineage tools
Lineage tools map data flow from source to consumption — which tables feed which, which reports depend on which data sources, what breaks if a column is renamed.
**Monte Carlo**: Data observability platform that detects data quality anomalies and maps lineage from SQL queries, dbt models, and pipeline metadata. More focused on observability than governance — alerts when data deviates from expected patterns.
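The idea of alerting when data deviates from expected patterns can be illustrated with a simple z-score check on daily row counts. This is a generic sketch, not Monte Carlo's detection method; the function name, the sample counts, and the threshold of 3 standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(history: list[int], latest: int, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily row counts for one table.
daily_row_counts = [10_120, 9_980, 10_050, 10_200, 9_900, 10_075]

print(is_anomalous(daily_row_counts, 10_100))  # a typical day
print(is_anomalous(daily_row_counts, 4_300))   # a sharp drop in volume
```

Production observability tools typically account for seasonality and trend as well; a fixed threshold like this one is only a starting point.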
**Datafold**: Data diff and lineage tool focused on CI/CD for data pipelines. Compares data across environments (development vs production) before deployment. Lineage from dbt, query logs, and data source connections. Particularly useful for validating dbt model changes before merging.
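Comparing data across environments comes down to a keyed diff between two versions of the same table. The sketch below shows the concept with an in-memory SQLite database; it is not Datafold's implementation, and the `diff_tables` function and the `prod_orders` / `dev_orders` tables are hypothetical.

```python
import sqlite3


def diff_tables(conn, left: str, right: str, key: str, columns: list[str]) -> dict:
    """Compare two tables by primary key and report added, removed, and changed keys.

    Table and column names are interpolated into SQL directly, so pass
    trusted identifiers only.
    """
    cols = ", ".join([key] + columns)
    left_rows = {row[0]: row[1:] for row in conn.execute(f"SELECT {cols} FROM {left}")}
    right_rows = {row[0]: row[1:] for row in conn.execute(f"SELECT {cols} FROM {right}")}
    shared = left_rows.keys() & right_rows.keys()
    return {
        "only_in_left": sorted(left_rows.keys() - right_rows.keys()),
        "only_in_right": sorted(right_rows.keys() - left_rows.keys()),
        "changed": sorted(k for k in shared if left_rows[k] != right_rows[k]),
    }


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE prod_orders (order_id INTEGER PRIMARY KEY, status TEXT, amount REAL);
    CREATE TABLE dev_orders  (order_id INTEGER PRIMARY KEY, status TEXT, amount REAL);
    INSERT INTO prod_orders VALUES (1, 'paid', 10.0), (2, 'paid', 25.0), (3, 'refunded', 5.0);
    INSERT INTO dev_orders  VALUES (1, 'paid', 10.0), (2, 'paid', 27.5), (4, 'paid', 8.0);
    """
)

result = diff_tables(conn, "prod_orders", "dev_orders", "order_id", ["status", "amount"])
print(result)
```

Loading both tables into memory only works for small samples; at warehouse scale the comparison is usually pushed down into SQL or done on hashed segments.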
**Built into dbt, Spark, Airflow**: Modern tooling often provides lineage natively. dbt's manifest.json contains full DAG lineage for the dbt project. Airflow captures task-level lineage via OpenLineage integrations. On Databricks, Spark lineage is available through Unity Catalog, including for Delta Live Tables pipelines; outside Databricks, the OpenLineage Spark integration can play a similar role. Assembled together, these provide most of the lineage a team needs without a dedicated lineage tool.
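As a rough illustration of how far dbt's manifest.json alone can go, the sketch below answers "what is downstream of this model?" by walking the manifest's `child_map`. It assumes the manifest contains a `child_map` key mapping each node ID to its direct children; the `downstream_of` function and the node IDs in the sample are hypothetical, and the hand-written dictionary stands in for a real file loaded with `json.load`.

```python
from collections import deque


def downstream_of(manifest: dict, unique_id: str) -> list[str]:
    """Return every node reachable from `unique_id` via the manifest's child_map."""
    child_map = manifest.get("child_map", {})
    seen = set()
    queue = deque([unique_id])
    # Breadth-first walk over direct children until no new nodes appear.
    while queue:
        current = queue.popleft()
        for child in child_map.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Hand-written stand-in for a manifest; in a real project this would come from
# something like json.load(open("target/manifest.json")).
manifest = {
    "child_map": {
        "source.shop.raw.orders": ["model.shop.stg_orders"],
        "model.shop.stg_orders": [
            "model.shop.fct_orders",
            "test.shop.not_null_stg_orders_order_id",
        ],
        "model.shop.fct_orders": ["exposure.shop.revenue_dashboard"],
        "exposure.shop.revenue_dashboard": [],
    }
}

print(downstream_of(manifest, "model.shop.stg_orders"))
```

The same traversal answers the "what breaks if a column is renamed" question at model granularity; column-level impact needs SQL parsing or a tool that provides it.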
Infrastructure provisioning
**Terraform**: The standard tool for provisioning cloud data infrastructure as code: Snowflake databases and warehouses, BigQuery datasets, Databricks clusters, S3 buckets and IAM roles. Terraform's state tracks the resources it manages, and each plan shows only the changes needed to reach the declared configuration, which surfaces manual configuration drift instead of letting it accumulate unnoticed. HCP Terraform (managed) reduces operational complexity for smaller teams.
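The "track what exists, apply only changes" idea can be shown as a diff between a desired configuration and recorded state. This is a conceptual sketch only, not how Terraform is implemented; the `plan` function, the resource names, and their attributes are invented for the example and do not follow any provider's schema.

```python
def plan(desired: dict[str, dict], state: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired resources with recorded state and list the actions needed."""
    return {
        # Declared but not yet recorded in state.
        "create": sorted(desired.keys() - state.keys()),
        # Present in both, but attributes differ.
        "update": sorted(
            name for name in desired.keys() & state.keys() if desired[name] != state[name]
        ),
        # Recorded in state but no longer declared.
        "delete": sorted(state.keys() - desired.keys()),
    }


# Hypothetical resources and attributes.
desired = {
    "warehouse.transform": {"size": "MEDIUM"},
    "database.analytics": {"retention_days": 7},
    "bucket.raw_landing": {"versioning": True},
}
state = {
    "warehouse.transform": {"size": "SMALL"},
    "database.analytics": {"retention_days": 7},
    "bucket.old_exports": {"versioning": False},
}

print(plan(desired, state))
```

Unchanged resources (here `database.analytics`) produce no action, which is what makes repeated applies safe.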
**Pulumi**: Infrastructure as code using general-purpose programming languages (Python, TypeScript) rather than Terraform's HCL. Better for teams that prefer their infrastructure configuration to be maintainable Python rather than a domain-specific language.
**Cloud-native IaC (CDK, Bicep)**: AWS CDK (TypeScript/Python), Azure Bicep, and Google Cloud Deployment Manager are cloud-specific IaC tools. More native to each cloud's concepts; less portable across clouds than Terraform. Appropriate for teams committed to a single cloud.
Version control and CI/CD for data
**Git (GitHub, GitLab, Bitbucket)**: Version control for dbt models, Terraform configurations, pipeline code, documentation, and data quality tests. This is table stakes — all data engineering artifacts should be in version control.
**GitHub Actions / GitLab CI**: CI/CD pipelines for data code. A well-designed data CI/CD pipeline runs dbt tests on a development environment, validates Terraform plans before apply, runs data quality checks on affected models, and creates automated pull request environments with Datafold or dbt Cloud's CI integration.
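The sequence of checks described above can be sketched as a small script that a CI job might call. The directory names (`infra`, `analytics`, `prod-artifacts`) and the `ci_steps` / `run_ci` helpers are hypothetical; the commands themselves are standard Terraform and dbt CLI invocations, with `state:modified+` selecting changed dbt models and everything downstream of them. The example runs in dry-run mode, so nothing is executed.

```python
import subprocess


def ci_steps(state_dir: str) -> list[tuple[str, list[str]]]:
    """Return (working directory, command) pairs for a pull-request check."""
    return [
        ("infra", ["terraform", "validate"]),
        ("infra", ["terraform", "plan", "-input=false"]),
        # Build and test only changed models and their downstream dependents,
        # deferring unchanged upstream models to the production artifacts.
        (
            "analytics",
            ["dbt", "build", "--select", "state:modified+", "--defer", "--state", state_dir],
        ),
    ]


def run_ci(steps: list[tuple[str, list[str]]], dry_run: bool = True) -> int:
    """Run each step in order and stop at the first failure."""
    for cwd, cmd in steps:
        print(f"[{cwd}] $ {' '.join(cmd)}")
        if dry_run:
            continue
        completed = subprocess.run(cmd, cwd=cwd)
        if completed.returncode != 0:
            return completed.returncode
    return 0


steps = ci_steps("prod-artifacts")
exit_code = run_ci(steps, dry_run=True)
```

In practice these steps usually live directly in a GitHub Actions or GitLab CI workflow file; a script like this is mainly useful for running the same checks locally.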
**dbt Cloud**: Managed dbt with a built-in CI/CD runner, documentation hosting, and IDE. For teams using dbt, dbt Cloud is the primary deployment and orchestration surface. CI jobs run on pull requests; production jobs run on schedule.
Query and exploration tools
**DBeaver**: Database client supporting virtually every database and warehouse; the Community Edition is free and open-source. Used for ad-hoc SQL querying, schema exploration, and DDL management. A common default on data engineers' machines.
**DataGrip (JetBrains)**: IDE-quality SQL editor with intelligent completion, schema navigation, and execution history. More powerful than DBeaver for intensive SQL development; paid.
**Jupyter notebooks**: For analytical prototyping and exploratory data analysis on the data the architecture produces. Not an architecture tool per se, but data architects use them for profiling data quality, testing schema designs, and documenting analytical use cases.
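A typical notebook cell for profiling data quality might look like the sketch below, which computes null rate, distinct count, and range for one column. It uses only the standard library so it runs anywhere; the `profile_column` function and the sample rows are hypothetical, and many teams would reach for pandas or SQL instead.

```python
def profile_column(rows: list[dict], column: str) -> dict:
    """Summarise one column: row count, null rate, distinct values, min and max."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "null_rate": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


# Hypothetical sample of a customer table.
sample_rows = [
    {"customer_id": 1, "country": "DE"},
    {"customer_id": 2, "country": None},
    {"customer_id": 3, "country": "FR"},
    {"customer_id": 4, "country": "DE"},
]

print(profile_column(sample_rows, "country"))
print(profile_column(sample_rows, "customer_id"))
```

A high null rate or an unexpectedly low distinct count on a supposed key column is often the first sign that a schema design does not match the data it has to hold.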
The mature architecture team's toolset
A mature data architecture practice typically combines: dbt for physical modeling and transformation, Terraform for infrastructure, GitHub plus CI/CD for version control and deployment, one catalog tool (Atlan, DataHub, or Collibra depending on budget and governance requirements), and Lucidchart or draw.io for communicating designs. The commercial catalog and governance tools add significant value at enterprise scale; at startups and scale-ups, dbt Docs plus GitHub is often sufficient.
For the architectural principles behind these tools, see data governance framework and data lineage. Our data architecture consulting practice designs tooling stacks for data organisations at all levels of maturity — book a free architecture review.