Salesforce Data Integration: Connecting CRM Data to Your Analytics Stack

The architecture patterns for integrating Salesforce into a cloud data warehouse — Fivetran and Airbyte connectors, direct Salesforce API extraction, handling incremental sync, custom object challenges, and the data model design decisions that determine whether Salesforce data is analytically useful or just technically present.

Salesforce is one of the most common data sources in enterprise analytics stacks, and one of the most reliably painful ones to integrate. The Salesforce data model is complex, the API has rate limits and data volume constraints that catch teams by surprise, custom objects and fields proliferate differently in every implementation, and the incremental sync pattern requires careful design to avoid duplicates and missed updates.

This guide covers the practical architecture of Salesforce to data warehouse integration — the ingestion options, incremental sync design, custom object handling, and the data model decisions that determine whether Salesforce data is genuinely analytically useful.

Ingestion Options

**Fivetran Salesforce connector** is the most widely deployed managed option. Fivetran handles the Salesforce API interaction, change detection (using the SystemModstamp field for incremental sync), schema mapping to warehouse tables, and automatic adaptation to new custom objects and fields. It syncs all standard and custom objects you select into your warehouse, one table per Salesforce object.

Fivetran's Salesforce connector uses the Salesforce Bulk API 2.0 for initial loads and large syncs, and the REST API for incremental updates. It handles the Salesforce 30-day systemic change tracking window (the Salesforce API only returns changes from the last 30 days via delta queries) by maintaining its own change log.

The practical limitation: Fivetran's Salesforce schema is flat — each Salesforce object becomes a table, but relationships between objects (the Lookup and Master-Detail relationships that define the Salesforce data model) are represented by ID columns. You still need to model the relationships in your warehouse or dbt project.

**Airbyte Salesforce connector** is the open-source alternative. Functionally similar to Fivetran for most use cases — it uses the Salesforce Bulk API for extraction, handles custom fields and objects, and syncs incrementally. Airbyte is self-hosted (or Airbyte Cloud as managed), which gives more control over the sync schedule and schema customisation but requires operational management.

**Direct Salesforce API extraction** is appropriate when you need more control than a managed connector provides — custom field selection, transformation during extraction, or extraction triggered by Salesforce Platform Events. Implementation options:

- Salesforce REST API (SOQL queries over HTTPS, suitable for targeted extractions and low-volume sync)

- Salesforce Bulk API 2.0 (designed for large data exports — up to 150MB per batch, async processing)

- Salesforce Streaming API / Platform Events (event-driven, near-real-time delivery of record changes)

- Salesforce SOAP API (legacy, rarely used for new integrations)

For data engineering teams implementing direct API extraction, the simple-salesforce Python library provides a clean interface to the REST and Bulk APIs.

**Salesforce Data Export service** is a manual option — Salesforce generates weekly CSV exports of all data that can be downloaded from the Salesforce UI. Not suitable for production analytics pipelines but useful for ad-hoc historical data loads or initial migration seeding.

Incremental Sync Design

Full re-extraction of Salesforce data is expensive and slow for large orgs — an Account table with 500,000 records takes minutes via the Bulk API and may hit rate limits when run multiple times per day. Incremental sync is required for production pipelines.

**SystemModstamp-based sync** is the standard approach. Every Salesforce standard object has a SystemModstamp field — a timestamp updated whenever any field on the record changes. An incremental query:

SELECT Id, Name, AccountNumber, Industry, ... FROM Account

WHERE SystemModstamp >= 2024-03-15T06:00:00Z

ORDER BY SystemModstamp ASC

Fivetran and Airbyte both use this pattern. The warehouse-side implementation stores the high-water mark (the maximum SystemModstamp from the previous sync) and passes it to the next sync query.

**Soft deletes:** Salesforce does not hard-delete records by default — it marks them with IsDeleted = true. Standard incremental sync does not capture these unless you query the Recycle Bin explicitly or use the Salesforce recycle bin API. If historical accuracy matters (a deleted contact should not still appear as active), include IsDeleted in your sync and filter it appropriately in your transformation layer.

**Hard deletes:** When records are permanently deleted (emptied from the Recycle Bin), they disappear from all API queries. Fivetran handles this with a nightly full sync comparison. For direct API implementations, the Salesforce getDeleted() API returns permanently deleted records within a time window. If you need to track deletions, call this API periodically and mark deleted records in your warehouse.

Custom Objects and Fields

Every Salesforce org accumulates custom objects (suffixed with __c — e.g., Contract__c, Project__c, Territory__c) and custom fields on standard objects. These are org-specific and undocumented outside the Salesforce configuration.

**Discovery:** Use the Salesforce Metadata API or Schema Explorer (in SOQL: DESCRIBE on an object) to programmatically enumerate custom objects and fields. Fivetran and Airbyte do this automatically during connector setup.

**Naming:** Custom object API names are salesforce_object_name__c; custom field API names are field_name__c. These API names are what appear in queries and what your warehouse tables will be named (minus the __c suffix in some connectors). The user-visible label (what appears in the Salesforce UI) may differ from the API name — document the mapping.

**Poorly designed custom objects:** Many Salesforce implementations have custom objects that were created for operational purposes and are not analytically clean — missing foreign keys, inconsistent picklist values, free-text fields used as structured data. Identify this during the integration design phase and design appropriate transformation logic to handle it rather than expecting clean data from the source.

Data Model Design for Analytics

Raw Salesforce tables in the warehouse are not analytically useful. The Salesforce data model is optimised for CRM operations, not analytics. Useful analytics requires modelling:

**Unified customer/account model.** Salesforce has Account (company), Contact (person), and Lead (pre-conversion prospect). Analytics often needs a unified model: all people who interact with the company, regardless of whether they are a Contact (post-conversion) or Lead (pre-conversion). Build a unified customer dimension that merges these, handling the Lead-to-Contact conversion event.

**Opportunity pipeline model.** The Salesforce Opportunity object tracks deals. For pipeline analytics, you need: current pipeline value by stage, stage velocity (average time per stage), conversion rates between stages, and historical snapshots (what did the pipeline look like last quarter vs now). The raw Opportunity table gives current state; SCD Type 2 snapshots are required for accurate historical pipeline analysis.

**Activity timeline.** Task and Event objects in Salesforce represent all logged activities — calls, emails, meetings. For activity-based analytics (which accounts have been touched recently, what is the cadence of outreach by rep), you need a unified activity timeline model that joins Task and Event to the Account, Contact, and Opportunity they are related to.

**Picklist standardisation.** Salesforce picklist fields (Industry, LeadSource, Stage) accumulate historical values from previous configurations — values that were renamed or removed still appear on old records. Standardise these in the transformation layer with explicit mapping tables rather than passing raw picklist values to analytical models.

Common Integration Failures

**Rate limit exhaustion.** Salesforce API rate limits vary by edition (typically 100–1,000 API calls per user per 24 hours for REST API; Bulk API is separate). Managed connectors handle rate limit backoff automatically. Direct API implementations must implement exponential backoff and rate limit monitoring.

**Missing incremental capture for junction objects.** Many Salesforce junction objects (OpportunityContactRole, AccountContactRelation) do not have SystemModstamp fields or have them behave inconsistently. Full re-sync of these small tables is often the right approach.

**Timezone handling.** Salesforce stores DateTime fields in UTC. The Salesforce UI converts to the user's timezone. Queries return UTC. Your warehouse should store all timestamps in UTC and handle timezone conversion at the reporting layer — not in extraction.

**Formula field limitations.** Salesforce formula fields are computed in the Salesforce UI but are not always available via the Bulk API. Test formula field availability in your Salesforce edition before relying on them in extraction.

Our data engineering consulting practice designs and implements Salesforce data integrations as part of broader analytics stack builds — contact us to discuss your Salesforce integration requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →