Trino: Federated Query Engine for Multi-Source Analytics

What Trino (formerly PrestoSQL) is, how it enables SQL queries across Hive, S3, Snowflake, PostgreSQL, and other sources without data movement, and when federated query makes sense versus ETL into a central warehouse.

Trino (originally Facebook's Presto, forked and renamed) is an open-source distributed SQL query engine designed to query data wherever it lives — data lakes on object storage, relational databases, NoSQL stores, cloud data warehouses — without moving the data to a central location. A single Trino query can join a Hive table on S3, a PostgreSQL table, and a Snowflake view and return a result set from all three.

Understanding what Trino is good at, where it falls short, and when federated query is the right architectural choice versus ETL into a central warehouse is essential for data architects designing analytical platforms.

What Trino Is

Trino is a massively parallel processing (MPP) SQL query engine. It does not store data — it queries data in-place using connectors that translate Trino SQL into the native query language of each source system. Trino's coordinator distributes queries across worker nodes, which execute sub-queries against data sources in parallel and return results to the coordinator for final assembly.

Key connectors include:

- Hive (Parquet, ORC, Avro on S3/ADLS/GCS via Hive Metastore or AWS Glue)

- Iceberg (native Iceberg tables)

- Delta Lake (Delta format on object storage)

- PostgreSQL, MySQL, SQL Server (relational databases)

- Snowflake, BigQuery, Redshift (cloud data warehouses)

- Kafka (streaming data)

- MongoDB, Elasticsearch (NoSQL)

- HTTP (REST APIs as virtual tables)

How Trino Differs from a Data Warehouse

A cloud data warehouse (Snowflake, BigQuery) stores your data in its own managed storage and executes queries against that storage. Trino does not store data — it is a query engine only. You bring your storage; Trino brings the query execution.

This distinction matters for:

- **Latency**: queries against data in object storage via Trino have higher latency than queries against a cloud warehouse's native storage, which is co-located with compute

- **Query pushdown**: Trino pushes predicates and aggregations to data sources where possible (pushdown to Snowflake's native engine, pushdown to PostgreSQL). The degree of pushdown depends on the connector; non-pushdown queries pull data to Trino workers and process it there

- **Cost**: no warehouse storage costs; pay for Trino cluster compute only. For organisations that already have data in object storage, Trino adds query capability without duplicating data into a warehouse

- **Governance**: access control must be configured in Trino (and separately in each data source), not in a single governance platform

When Federated Query Makes Sense

**Cross-source analytical queries without ETL.** If the business question requires joining data from a PostgreSQL operational database, a Snowflake data warehouse, and a Parquet lake on S3, the alternative to Trino is three separate ETL pipelines to centralise the data before querying. Trino allows the query now, without the pipeline investment.

**Data exploration during migration.** When migrating from one platform to another (on-premises Hive to cloud warehouse), Trino enables querying both the old and new data during the transition — validating that migration is correct by comparing results between systems.

**Avoiding data warehouse vendor lock-in for lake data.** Organisations that store data in open formats (Parquet, Iceberg) on object storage can use Trino to query it without committing to a specific warehouse. If the commercial relationship changes, the query engine changes; the data does not need to move.

**Multi-warehouse queries.** Organisations with data distributed across multiple cloud accounts (one Snowflake account per business unit, multiple regional data warehouses) can use Trino to federate queries across accounts — though latency is higher than native warehouse queries.

When to Use ETL Into a Central Warehouse Instead

For most analytical workloads, centralising data via ETL into a cloud warehouse produces better outcomes than federated query:

**Performance predictability.** Federated query performance depends on network latency, source system load, and connector efficiency — all variable. Cloud warehouse query performance is more predictable and generally faster for analytical workloads.

**Query complexity.** Complex analytical queries (multiple joins, aggregations, window functions) execute more efficiently when all data is in the same system than when data must be transferred across network.

**Governance.** A central warehouse with well-defined access controls, row-level security, and audit logging is simpler to govern than a federated architecture where data governance is distributed across multiple source systems.

**Use federated query as a bridge, not a destination.** The most effective use of Trino in mature data architectures is as a temporary bridge — enabling queries before ETL is complete, enabling exploration without commitment — not as a permanent replacement for centralised analytical data.

Trino vs Starburst vs Athena

**Starburst** is the commercial distribution of Trino, maintained by the company founded by Trino's original developers. Starburst adds enterprise features to open-source Trino: enhanced security (OAuth, LDAP, column masking), data products (logical tables with governed access), and managed deployment options (Starburst Galaxy cloud service, Starburst Enterprise for self-managed).

**Amazon Athena** is AWS's managed Trino service. Athena runs queries against S3 using the Trino engine — no cluster to manage, pay per query. Athena is the appropriate choice for ad-hoc queries against S3 data on AWS without managing Trino infrastructure. For cross-source federation or higher-performance interactive analytics, self-managed Trino or Starburst Galaxy provides more control.

**Dremio** is a related but distinct product — a semantic layer and query engine with a different architecture (data reflection/virtualisation model). For organisations that want Trino's federation with a stronger semantic layer and managed service, Dremio is worth evaluating.

For data architecture design including federated query and lakehouse architecture on AWS and other platforms, our data architecture consulting practice can advise on the right approach — contact us to discuss your requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →