BlogPower BI

Power BI Dataflows: Reusable Data Preparation with Power Query Online

Obed Tsimi
Obed Tsimi
Founder & Senior Tableau Architect
·July 19, 20269 min read

Power BI dataflows are cloud-based ETL using Power Query Online — reusable data preparation steps that run on refresh and produce tables available to multiple Power BI datasets. This guide covers when to use them, how they work, and the governance model.

Power BI dataflows are cloud-based data preparation pipelines built with Power Query Online. They extract data from sources, transform it using Power Query's M language, and produce tables stored in Azure Data Lake Storage Gen2 — tables that multiple Power BI datasets can reference without each dataset doing its own data preparation. This guide covers what dataflows are, when they make sense, how to build them, and the governance model for managing them at scale.

What a dataflow actually is

A dataflow is a collection of entities (tables) defined by Power Query transformation steps, stored in a workspace. When you refresh a dataflow, it runs the Power Query steps against the source data and stores the result in Azure Data Lake Storage Gen2, in Common Data Model (CDM) format. Power BI datasets that connect to the dataflow read from the stored tables rather than running the extraction and transformation themselves.

The key distinction from a regular dataset connection: with a direct connection, each dataset extracts and transforms data independently. With a dataflow, the extraction and transformation happen once, in the dataflow, and multiple datasets consume the output. Shared transformation logic, executed once.

Why use dataflows

**Reusability**: If 10 Power BI datasets all need the same customer dimension — standardised name formatting, a joined segment column, deduplicated records — each dataset doing this transformation independently is wasteful and inconsistent. A dataflow defines the transformation once; 10 datasets reference the dataflow entity.

**Performance**: Dataflow entities are pre-computed and stored in ADLS. A dataset importing from a dataflow reads pre-computed tables rather than running transformation logic against the source at import time. For complex transformations against slow sources, this can significantly reduce dataset refresh time.

**Separation of concerns**: Data preparation (extracting from sources, cleaning, standardising) is separated from data modeling (defining measures, relationships, hierarchies). The dataflow team maintains shared tables; the dataset author builds analytical models on top of them.

**Reduced source system load**: Instead of 10 datasets each querying the source database at 6am, one dataflow queries the database at 5:45am and all datasets import from the dataflow output. Reduces concurrent source queries.

Dataflow types

**Standard dataflows**: Refresh on demand or on schedule. Results stored in ADLS Gen2 behind the scenes (managed by Microsoft). Available in any workspace with Power BI Pro or Premium Per User licences.

**Premium dataflows (with DirectQuery)**: Require Premium Per User or Premium capacity. Enable DirectQuery against the dataflow entities (rather than only import), computed entities (entities derived from other dataflow entities using Power Query without re-reading the source), and linked entities (reference an entity from another dataflow without copying it). Premium dataflows also support incremental refresh — refreshing only new or changed records rather than the full entity.

For most organisations, standard dataflows plus import mode datasets are sufficient. Premium features become relevant when datasets are too large for full import refresh or when incremental processing is required.

Building a dataflow

In Power BI Service: go to the workspace → New → Dataflow. The dataflow editor is Power Query Online — the same M-based editor as Power BI Desktop's query editor, but running in the browser.

Define entities by connecting to data sources and applying transformation steps. Each entity is a named output table. The transformation steps are M queries — the same language as Power Query in Power BI Desktop, Excel Power Query, and Azure Data Factory's Power Query activity.

Best practices for building dataflows:

**Staging entities before transformation entities**: Create a staging entity that mirrors the source exactly (minimal transformation), then a separate entity that applies business logic on top of the staging entity. This mirrors the medallion architecture pattern and makes the source data auditable without running full source extraction again.

**Use parameters for environment configuration**: Parameterise connection strings, server names, and date filters so the dataflow can be promoted from development to production workspaces without manual editing.

**Document each entity**: Add descriptions to entities and columns. Power BI Service surfaces these descriptions in dataset connections and in Purview catalog scans.

Dataflows and Purview

Microsoft Purview scans Power BI workspaces and indexes dataflow metadata — entity definitions, transformation logic, and column descriptions. If your organisation uses Purview for data governance, well-documented dataflows become searchable, governable assets in the catalog.

The lineage view in Purview traces data from the dataflow source through to reports — a complete end-to-end lineage for any Power BI artifact built on a dataflow.

When dataflows are not the right answer

**When only one dataset uses the data**: The overhead of building and maintaining a dataflow is not justified if only one dataset benefits. Just connect the dataset directly to the source.

**When the transformation is complex enough to need code**: Power Query Online can handle complex transformations, but some are better expressed in Python, SQL, or dbt. If you find yourself fighting Power Query to express business logic, the logic should live in the data warehouse — in a dbt model or stored procedure — not in a Power Query dataflow.

**When the data volume exceeds Power Query's capacity**: Very large datasets (hundreds of millions of rows) may exceed what Power Query handles efficiently. For those volumes, the transformation and aggregation should happen in the data warehouse, and the dataflow (or dataset) should import pre-aggregated data.

Governance model

Dataflows live in Power BI workspaces. Workspace access controls determine who can build, edit, or read dataflows. In large organisations, shared or certified dataflows should live in dedicated workspaces with restricted write access — only the data team can modify them. Report authors have read access to use dataflow entities in their datasets.

Certified dataflows (marked with the Certified badge) signal that a dataflow has been reviewed and approved as an authoritative source. The certification workflow requires a Power BI admin or a user with endorsement permissions to certify the dataflow. Certified entities appear first in search results and are visually differentiated in the dataflow entity picker.

For the broader Power BI architecture context, see power bi deployment guide and power bi performance optimization. Our Tableau consulting practice also covers Power BI architecture and deployment — book a scoping call to discuss your self-service analytics requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →