What Is Data Federation? Querying Across Sources Without Moving Data

Data federation enables querying across multiple data sources as if they were a single database — without copying or moving the data. This guide explains how federated query engines work, when they are the right tool, and where they fall short compared to a centralized data warehouse.

Data federation is the ability to query across multiple, distributed data sources as if they were a single unified database — without physically copying or centralizing the data. A federated query engine sits in front of multiple data stores and handles the translation, pushdown, and result assembly transparently to the querying user.

Where a data warehouse achieves integration by copying and consolidating data into a central store, data federation achieves it by querying sources in place at query time. The data stays where it lives; the federation layer makes it appear unified.

How Federated Query Engines Work

When a federated query is submitted, the engine:

1. Parses the query and identifies which tables are in which underlying sources

2. Translates the query (or relevant portions of it) into the native query language of each source

3. Pushes as much of the filter and aggregation logic as possible down to each source ("predicate pushdown") — reducing the data that must be moved across the network

4. Retrieves the partial results from each source

5. Joins, aggregates, and assembles the final result set in the federation layer

Predicate pushdown is critical to federated query performance. If a query filters on customer_id = 12345 and joins a Snowflake table with a Postgres table, the federation engine pushes the customer_id filter to both sources before retrieving results — rather than pulling all rows from both tables and filtering locally.

Federated Query Engines

**Trino** (formerly PrestoSQL, open-source) is the most widely deployed federated query engine. It supports connectors for dozens of data stores: Hive/S3, Iceberg, Delta Lake, PostgreSQL, MySQL, Redshift, Snowflake, BigQuery, Kafka, Elasticsearch, and more. A single SQL query can join a table on S3, a Postgres operational database, and a Redshift table. Trino is the query engine behind Amazon Athena.

**Amazon Athena** — managed serverless query service built on Trino. Primarily used for querying S3 (Parquet, CSV, ORC, Avro) but also supports federated queries to RDS, DynamoDB, Redshift, and other sources via the Athena Query Federation SDK. Pay-per-byte-scanned pricing.

**Google BigQuery Omni and Federated Queries** — BigQuery supports external tables that query data in GCS, Google Drive, and Bigtable. BigQuery Omni extends this to data in AWS S3 and Azure Blob Storage, enabling cross-cloud queries without data movement.

**Databricks** — Unity Catalog supports federated queries to external data systems using Lakehouse Federation, enabling SQL queries from Databricks against Snowflake, Redshift, MySQL, PostgreSQL, and other sources.

**dbt + external sources** — dbt's external source and source freshness features do not federate at query time, but they integrate external source metadata into the dbt model graph. Actual federated query execution still requires an underlying engine with federation capability.

When Federation Is the Right Choice

**Avoiding data movement for regulatory or compliance reasons** — some datasets cannot or should not leave their original system. Patient health data stored in a HIPAA-compliant database may need to be queried for analytics but cannot be copied to a separate analytics warehouse. Federation enables analytical queries without physically extracting and loading the data.

**Reducing data duplication and pipeline complexity** — in environments where a dataset is queried infrequently and building a full ingestion pipeline is not justified, federated queries provide ad-hoc access without pipeline investment.

**Cross-system joins for operational analytics** — an operations team needs to join real-time inventory data from a MySQL operational database with historical order data in Redshift. Building and maintaining a pipeline to copy inventory data to Redshift for a join that happens infrequently may not be worth the effort. A federated query handles this at query time.

**Data virtualization for reporting** — for reporting use cases where query latency is acceptable (measured in seconds or minutes), federation provides a "virtual" unified view of data across systems without the ETL investment of physical consolidation.

Where Federation Falls Short

**Performance** — federated queries across heterogeneous systems are almost always slower than queries against a single co-located data warehouse. The network round-trips, the lack of shared statistics for query optimization, and the limited pushdown capability for complex queries all contribute to higher latency. For interactive BI workloads with sub-second response time requirements, federation is not appropriate.

**Query optimization is harder** — query optimizers for single-engine warehouses have detailed statistics about data distribution, column cardinality, and storage layout. Federated engines have partial visibility into this information, making optimal execution plan selection harder. Poorly optimized federated queries can be extremely slow.

**Data freshness and consistency** — federated queries see data as it exists in source systems at query time. If two source systems are updated at different frequencies or if a query spans data at different transaction snapshots, results may be inconsistent. Ensuring consistent point-in-time reads across multiple sources is difficult.

**Operational complexity** — maintaining connectors, managing credentials to multiple source systems, monitoring for connector failures, and debugging query failures across heterogeneous systems is more complex than managing a single warehouse.

Federation vs a Data Warehouse

These are not competing architectures — they serve different needs.

A data warehouse is appropriate when: query performance is critical, data needs to be transformed and cleaned before analysis, historical data must be preserved in an analytics-optimized form, or the analytical workload is large enough to justify the ingestion pipeline investment.

Federation is appropriate when: data cannot be moved due to compliance requirements, the query is infrequent and building a pipeline is not justified, ad-hoc cross-system access is needed, or as a complement to a warehouse for accessing data that has not yet been ingested.

Most production data environments use both: a central warehouse for the primary analytical workload, and federated query capability for edge cases and operational queries.

Our data architecture practice designs data platform architectures that combine centralized warehousing with federated query where appropriate — contact us to discuss the right integration pattern for your data environment.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →