Tableau Extracts: Creation, Optimisation, and Refresh Strategy

Tableau extracts are local copies of data stored in the Hyper columnar format. They are the primary performance lever for Tableau dashboards. This guide covers extract creation, size optimisation, incremental refresh, and when to use live connections instead.

Tableau extracts are local copies of data stored in Tableau's Hyper columnar format — a high-performance analytical database designed for fast aggregation queries. An extract separates Tableau's query engine from the source database, enabling sub-second dashboard performance regardless of source complexity. This guide covers extract creation, size optimisation, refresh strategies, and when to use live connections instead.

What extracts do

When a workbook uses an extract, the data pipeline is: source database → extract (.hyper file) → Tableau view rendering. Tableau's VizQL engine sends queries to the Hyper database, not to the source. The Hyper database is columnar and designed for the exact query patterns Tableau produces — aggregations, filtered group-bys, sorted results. It is significantly faster than most source databases for these workloads.

Extracts are stored on the Tableau Server (for published data sources) or on the author's machine (for embedded extracts in .twbx files). Published extract data sources are shared across workbooks — multiple workbooks can connect to the same extract, and refreshing the extract updates data for all workbooks simultaneously.

Creating an effective extract

The default extract includes all rows and all columns from the connected table. For large tables, this is rarely optimal. Extract creation options in Tableau Desktop (Data → Extract Data):

**Filters**: Add extract filters to limit which rows are included. A sales extract filtered to the last 3 years excludes historical data that is never queried. A customer extract filtered to active customers excludes churned customers if they are not needed for analysis. Extract filters are applied at extract creation time, not at query time — they reduce the extract size permanently.

**Aggregation**: Aggregate visible measures at detail level of visible dimensions. For an hourly events table that is always analyzed at the daily level, pre-aggregating to daily during extract creation reduces extract size by potentially 24x. Only use aggregation if the analysis never requires sub-daily granularity.

**Top N rows**: Useful for creating sample extracts for development without creating full production extracts. Not useful for production extracts.

**Hidden fields**: Fields hidden in the data source (Tableau Desktop Data pane → right-click field → Hide) are excluded from extracts. Hide fields that are not needed for any analysis — raw system fields, internal IDs used only for joins, technical columns. The Hyper engine only indexes and compresses visible fields.

Extract size and performance

Extract size is the primary determinant of extract creation time, refresh time, and query performance. The target size depends on your use case — a workbook extract for a single dashboard might be 100MB; a published enterprise data source extract might be 20GB.

Factors that increase extract size:

- High-cardinality string columns (free-text notes, long descriptions, URLs)

- Many columns (especially unused ones that are not hidden)

- Many rows without appropriate filtering

Factors that improve extract compression (reduce size):

- Low-cardinality string columns (state, status, category) — highly compressible in columnar format

- Numeric columns — very efficiently stored

- Date columns — efficiently encoded

For very large extracts (10GB+), consider whether the extract is trying to serve too many workbooks with different data requirements. Multiple smaller, purpose-built extracts are often more efficient than one large general-purpose extract.

Incremental refresh

Full extract refresh drops and rebuilds the entire extract from the source database. For a 50GB extract refreshing daily, this is expensive and time-consuming. Incremental refresh updates only new rows:

Configure incremental refresh by specifying the column to use for identifying new rows (typically a created_at or updated_at timestamp, or an auto-incrementing ID) and the incremental refresh column. On each incremental run, Tableau queries the source for rows where the incremental column is greater than the maximum value currently in the extract.

**Incremental refresh limitations**: Incremental refresh only appends new rows — it does not update existing rows or delete rows. If source records are updated or deleted after initial load, the extract will be out of sync. For tables with frequent updates or deletes, full refresh is required for accuracy. Incremental refresh is appropriate for append-only event tables, log tables, and slowly growing dimensional data.

**Combining full and incremental**: Configure a full refresh weekly or monthly to correct any drift, with incremental refresh daily to capture new rows efficiently. The weekly full refresh ensures the extract is accurate; the daily incremental keeps it current without full rebuild cost.

Refresh scheduling and performance

Tableau Server's backgrounder process handles extract refreshes. Each scheduled refresh task runs on a backgrounder; if multiple refreshes are scheduled simultaneously, they queue. Schedule refreshes to complete before peak usage — typically between midnight and 5am for workday usage patterns.

For extracts that must be current as of business hours, a 4am refresh with a 5am completion target gives buffer for failures before users arrive. For extracts that update only with overnight batch loads, schedule after the batch completes rather than at a fixed time.

**Staggering refresh schedules**: Do not schedule all extracts at the same time. Multiple extracts refreshing simultaneously saturate backgrounder capacity and extend refresh duration for all of them. Stagger large extract refreshes across the overnight window.

Live connections vs extracts

**Use a live connection when**:

- Data freshness is critical — operational dashboards, real-time monitoring

- The source database is fast enough to serve Tableau queries without extract performance

- The extract would be too large to create and refresh efficiently

- Row-level security must be enforced at the database level (live connections pass the user's identity to the database; extracts cannot do this efficiently)

**Use an extract when**:

- The source database is slow for analytical queries (OLTP databases, underpowered data warehouses)

- Multiple workbooks share the same data and batch refreshes are acceptable

- The network connection to the source is slow or unreliable

- Query performance is the primary requirement

For the Tableau Server administration context, see tableau server admin guide and tableau caching strategies. Our Tableau consulting practice optimises Tableau extract architectures for enterprise deployments — book a free audit if your extract refreshes are slow or taking too long.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →