Most modern data stacks depend on API integrations — pulling data from SaaS systems, payment processors, marketing platforms, and operational APIs that do not have native connectors. Reliable API data integration requires patterns that handle rate limits, authentication, pagination, and the inevitable changes that occur when APIs evolve without notice.
API data integration is the process of extracting data from systems that expose HTTP APIs — REST, GraphQL, or proprietary formats — and loading it into an analytical or operational data system. Most modern SaaS applications expose APIs for data access; most enterprise data stacks depend on these integrations for sales data from Salesforce, payment data from Stripe, marketing data from HubSpot, product analytics from Amplitude, or operational data from dozens of other systems.
Reliable API data integration is substantially harder than it appears from the "just call the API" starting point. The failure modes — rate limits, authentication expiry, pagination bugs, schema changes, and transient failures — are all common, and handling them correctly is what separates integrations that work reliably from integrations that produce incidents.
Authentication Patterns and Expiry
Most APIs use OAuth 2.0, API keys, or session tokens. Each has different expiry and refresh behaviour that must be handled correctly.
**API keys** are the simplest: a static credential passed in a header or query parameter. API keys typically do not expire but should be rotated periodically (quarterly is reasonable). Store API keys in a secrets manager (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager) — never in code or environment files committed to git. Retrieve at runtime; never log or print API key values.
**OAuth 2.0 access tokens** typically expire in 60 minutes or less. The OAuth 2.0 Client Credentials flow (for server-to-server integration without a user) exchanges client_id and client_secret for an access token. Access tokens must be refreshed before expiry; a pipeline that starts with a valid token but runs for over an hour may fail when the token expires mid-run. Implement proactive token refresh: check token expiry before each API call and refresh if the remaining validity is less than a defined threshold (5 minutes is typical).
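The proactive refresh check can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `TokenManager` and the `fetch_token` callable are hypothetical names, and `fetch_token` is assumed to perform the Client Credentials request and return the access token together with its lifetime in seconds.

```python
import time
from typing import Callable, Optional, Tuple


class TokenManager:
    """Caches an access token and refreshes it shortly before expiry."""

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, float]],
        refresh_threshold_s: float = 300.0,  # 5 minutes, as suggested above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # fetch_token is a hypothetical callable returning
        # (access_token, expires_in_seconds) from the token endpoint.
        self._fetch_token = fetch_token
        self._threshold = refresh_threshold_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get_token(self) -> str:
        """Return a token with at least the threshold of validity remaining."""
        remaining = self._expires_at - self._clock()
        if self._token is None or remaining < self._threshold:
            self.force_refresh()
        return self._token

    def force_refresh(self) -> None:
        """Fetch a new token and record when it will expire."""
        token, expires_in = self._fetch_token()
        self._token = token
        self._expires_at = self._clock() + expires_in
```

Calling `get_token()` immediately before every request keeps the check cheap: the token endpoint is only contacted when the cached token is missing or close to expiry.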
**OAuth 2.0 with refresh tokens** (for user-delegated integrations) additionally issues a refresh token alongside the access token. Refresh tokens live longer (typically days to weeks) but expire eventually. An expired refresh token cannot itself be refreshed: a pipeline whose refresh token expired during an idle period will fail until a user re-authorises the integration. This is common when a scheduled pipeline runs rarely and the refresh token expires between runs.
Always handle 401 Unauthorized responses in API integrations: detect the response, attempt a token refresh, and retry the request once. A 401 that is not retried fails the entire pipeline run for what is often a recoverable condition.
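A minimal sketch of the refresh-and-retry-once rule follows. The function and parameter names are hypothetical: `send` is assumed to issue the HTTP request with the given headers and return a `(status, body)` pair, while `get_token` and `refresh_token` stand in for whatever token handling the integration already has.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, dict]  # (HTTP status code, parsed body)


def call_with_reauth(
    send: Callable[[Dict[str, str]], Response],
    get_token: Callable[[], str],
    refresh_token: Callable[[], None],
) -> Response:
    """Send a request; on a 401, refresh the token and retry exactly once."""
    status, body = send({"Authorization": f"Bearer {get_token()}"})
    if status != 401:
        return status, body

    # The token was rejected: refresh it and try one more time.
    refresh_token()
    # A second 401 is returned to the caller rather than retried again,
    # since it most likely indicates a genuine authorisation problem.
    return send({"Authorization": f"Bearer {get_token()}"})
```

Limiting the retry to a single attempt avoids looping forever when the credentials themselves are invalid.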
Rate Limiting
Nearly every API enforces rate limits: caps on the number of requests allowed in a time window. Exceeding the limit typically returns a 429 Too Many Requests response. Mishandled rate limits are one of the most common causes of failure in API integrations.
The standard handling pattern: on receipt of a 429 response, read the 'Retry-After' header, which many rate-limited APIs set to the number of seconds to wait (the HTTP specification also allows an HTTP date), wait for that duration, and retry the request. If no 'Retry-After' header is present, use exponential backoff with jitter: start at 1 second, double on each successive failure, and add random jitter to prevent synchronised retry storms.
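The 429 handling described above might look like the sketch below. It is illustrative only: `send` is a hypothetical callable returning `(status, headers, body)`, the jitter is a simple additive variant (other jitter schemes exist), and the header lookup assumes a plain mapping with the exact 'Retry-After' casing.

```python
import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str], object]  # (status, headers, body)


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Parse Retry-After as delay-seconds or an HTTP date; None if unusable."""
    value = headers.get("Retry-After")  # assumes this exact header casing
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Start at `base`, double per attempt (capped), add up to 1s of jitter."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0.0, 1.0)


def send_with_rate_limit(
    send: Callable[[], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, object]:
    """Retry on 429, preferring the server's Retry-After over local backoff."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            wait = retry_after_seconds(headers)
            sleep(backoff_delay(attempt) if wait is None else wait)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Passing `sleep` as a parameter is a design convenience: it lets the retry logic be tested without actually waiting.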
Beyond reactive handling, design integrations to stay within rate limits proactively:
- Know the rate limit for each API endpoint before building the integration (documented in the API reference)
- Calculate the request volume your integration will generate per minute based on page size and total record count
- Add request throttling (time.sleep between requests) if your calculated volume approaches the limit
- Use bulk endpoints where available — one request returning 1000 records is cheaper than 1000 requests returning one record each
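The proactive steps above can be sketched as a request-count estimate plus a simple throttle. This is a minimal illustration: `required_requests` and `Throttle` are hypothetical names, and the throttle spaces requests evenly rather than modelling any particular API's limit window.

```python
import math
import time
from typing import Callable, Optional


def required_requests(total_records: int, page_size: int) -> int:
    """Number of page requests needed to read every record once."""
    return math.ceil(total_records / page_size)


class Throttle:
    """Spaces calls evenly so at most `max_per_minute` requests are issued."""

    def __init__(
        self,
        max_per_minute: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = now + self._interval
```

For example, 10,000 records at a page size of 100 need 100 requests; against a limit of 120 requests per minute, calling `Throttle(120).wait()` before each request spaces them 0.5 seconds apart.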
Pagination
APIs rarely return all results in a single response. Pagination is the mechanism for iterating through large result sets, and incomplete pagination handling is a common source of silent data loss in API integrations.
Common pagination patterns:
**Offset pagination**: Request includes offset and limit parameters; response includes total record count. Iterate by incrementing offset by limit until offset reaches or exceeds total count. Limitation: changes between page requests shift record positions. A record inserted ahead of the current offset pushes later records onto the next page, so some appear twice; a deletion ahead of the current offset pulls records back onto a page already read, so some are skipped. Acceptable for stable data sets; problematic for high-churn data.
**Cursor-based pagination**: Response includes a cursor (opaque token) pointing to the next page. Request the next page by passing the cursor. More reliable than offset pagination for changing data sets but requires storing and using the cursor correctly across requests.
**Link header pagination**: Response includes a 'Link' header with URLs for the next, previous, first, and last pages. Follow the 'next' URL until there is no next link. GitHub's API uses this pattern.
**GraphQL pagination**: Typically implemented with 'after' cursor and 'hasNextPage' from a Connection type. Iterate until hasNextPage is false.
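The cursor, Link header, and GraphQL patterns share one loop shape: fetch a page, then follow a "next" pointer until none is returned. A minimal sketch is shown below. The names are hypothetical, `fetch_page` is assumed to return `(items, next_pointer)`, and the Link header parser is deliberately simplified (it does not handle commas inside quoted parameters).

```python
import re
from typing import Callable, Dict, Iterator, List, Optional, Tuple

_LINK_RE = re.compile(r"<([^>]*)>\s*;\s*([^,]*)")
_REL_RE = re.compile(r'rel\s*=\s*"?([^";]+)"?')


def parse_link_header(value: str) -> Dict[str, str]:
    """Map each rel value (next, prev, first, last) to its URL."""
    links: Dict[str, str] = {}
    for url, params in _LINK_RE.findall(value):
        match = _REL_RE.search(params)
        if match:
            for rel in match.group(1).split():
                links[rel] = url
    return links


def iter_pages(
    fetch_page: Callable[[Optional[str]], Tuple[List[dict], Optional[str]]],
    start: Optional[str] = None,
) -> Iterator[dict]:
    """Yield items from every page, following the next pointer until absent.

    The pointer may be an opaque cursor, a GraphQL end cursor, or the
    'next' URL taken from a Link header.
    """
    pointer = start
    seen = set()
    while True:
        items, next_pointer = fetch_page(pointer)
        yield from items
        if not next_pointer:
            return
        if next_pointer in seen:
            # Guard against an API that keeps returning the same pointer.
            raise RuntimeError(f"repeated pagination pointer: {next_pointer!r}")
        seen.add(next_pointer)
        pointer = next_pointer
```

The repeated-pointer guard is a defensive choice: without it, a misbehaving API can turn the loop into an endless stream of duplicate records.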
Always verify that pagination is complete: log the total records fetched and compare to the count reported by the API (if available). A pagination bug that stops after the first page raises no error; it silently loads only a fraction of the available records.
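A completeness check for offset pagination could be sketched as follows. The names are hypothetical, and `fetch_page(offset, limit)` is assumed to return the page's items together with the total count reported by the API.

```python
import logging
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger("pagination")


class IncompletePaginationError(RuntimeError):
    """Raised when fewer (or more) records were fetched than the API reported."""


def fetch_all_offset(
    fetch_page: Callable[[int, int], Tuple[List[dict], int]],
    limit: int = 100,
) -> List[dict]:
    """Read every page, then compare the fetched count to the reported total."""
    records: List[dict] = []
    offset = 0
    total: Optional[int] = None
    while total is None or offset < total:
        items, total = fetch_page(offset, limit)
        if not items:
            break  # an empty page ends the loop; the check below decides if that is valid
        records.extend(items)
        # Advance by what was actually returned, in case the server
        # caps the page size below the requested limit.
        offset += len(items)

    logger.info("fetched %d records; API reported %s", len(records), total)
    if total is not None and len(records) != total:
        raise IncompletePaginationError(
            f"fetched {len(records)} records but API reported {total}"
        )
    return records
```

An exact-match check suits stable data sets; for high-churn sources, where the total can change during the run, a small tolerance or a warning instead of an exception may be more practical.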
Schema Changes and Forward Compatibility
APIs change. Fields are added, fields are deprecated, data types change, enumeration values are added. Integrations that are not designed for change break when the upstream API evolves.
Design integrations to be forward-compatible with additive changes:
- When deserialising API responses, ignore unknown fields rather than failing on unrecognised keys. A rigid deserialiser that fails when an unexpected field is present will break when the API adds a new field.
- Store the full API response payload in a raw or staging table before transforming. If the schema changes and the transformation breaks, the raw data is available for reprocessing once the transformation is updated.
- Monitor for schema changes: log field names and types on each run and alert when the response schema differs from the expected schema. Detect changes before they break downstream consumers.
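Two of the points above, ignoring unknown fields and detecting schema drift, can be sketched as below. The `Customer` model and all function names are hypothetical, and the type comparison uses Python type names only, so a field that is sometimes null would be reported as a type change.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Mapping


@dataclass
class Customer:  # hypothetical target model
    id: str
    email: str


def from_payload(cls, payload: Mapping[str, Any]):
    """Build a dataclass instance, silently ignoring unrecognised keys."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in payload.items() if k in known})


def observed_schema(record: Mapping[str, Any]) -> Dict[str, str]:
    """Describe a record as {field name: Python type name}."""
    return {key: type(value).__name__ for key, value in record.items()}


def schema_diff(
    expected: Dict[str, str], observed: Dict[str, str]
) -> Dict[str, List[str]]:
    """Report fields added, removed, or changed in type versus the expectation."""
    shared = expected.keys() & observed.keys()
    return {
        "added": sorted(observed.keys() - expected.keys()),
        "removed": sorted(expected.keys() - observed.keys()),
        "type_changed": sorted(k for k in shared if expected[k] != observed[k]),
    }
```

A pipeline might log the result of `schema_diff` on each run and raise an alert whenever any of the three lists is non-empty, while the tolerant `from_payload` keeps the load itself running when a field is merely added.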
Incremental Extraction
Full-extraction integrations that pull all historical records on every run are inefficient and often prohibited by API rate limits for large data sets. Incremental extraction pulls only records created or updated since the last run.
For polling-based incremental extraction, the API must support a filter by creation or update timestamp; some systems can instead push changes as they happen. Common patterns:
- 'created_since' or 'updated_after' parameter: filter records where timestamp > high_water_mark
- Webhook delivery: the source system pushes change events to a registered endpoint rather than being polled
Incremental extraction requires tracking the high-water mark of the last successful run in persistent storage. On the next run, pass the high-water mark as the filter parameter. Update the high-water mark only after the records have been successfully loaded — if the load fails, the high-water mark stays at the previous value and the records are retrieved again on the next run.
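The ordering rule (advance the high-water mark only after a successful load) can be sketched as follows. This is an illustration under assumptions: the state lives in a local JSON file, `extract` and `load` are hypothetical callables, and records carry an `updated_at` ISO 8601 string in a single consistent format and timezone, so that string comparison matches chronological order.

```python
import json
import os
import tempfile
from typing import Callable, List


def read_high_water_mark(path: str, default: str) -> str:
    """Return the stored high-water mark, or `default` on the first run."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)["high_water_mark"]
    except FileNotFoundError:
        return default


def write_high_water_mark(path: str, value: str) -> None:
    """Write atomically so a crash cannot leave a half-written state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"high_water_mark": value}, fh)
    os.replace(tmp_path, path)


def run_incremental(
    state_path: str,
    extract: Callable[[str], List[dict]],
    load: Callable[[List[dict]], None],
    default_hwm: str = "1970-01-01T00:00:00Z",
) -> str:
    """Extract records newer than the mark, load them, then advance the mark."""
    hwm = read_high_water_mark(state_path, default_hwm)
    records = extract(hwm)
    if not records:
        return hwm
    load(records)  # if this raises, the mark below is never written
    new_hwm = max(record["updated_at"] for record in records)
    write_high_water_mark(state_path, new_hwm)
    return new_hwm
```

Because a failed load leaves the mark untouched, the same records are retrieved again on the next run, so the load step should tolerate receiving a record more than once.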
For APIs that do not support timestamp-based filtering, hash-based change detection can identify new or changed records by comparing hash values of current records against stored hashes, though this requires loading all records to compute hashes — effectively a full extraction with change detection.
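Hash-based change detection might be sketched as below. The function names and the `id` key are assumptions, and the hash is computed over a canonical JSON form so that key order does not affect the result.

```python
import hashlib
import json
from typing import Dict, List, Mapping, Tuple


def record_hash(record: Mapping) -> str:
    """SHA-256 of a canonical JSON form (sorted keys, compact separators)."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(
    records: List[dict],
    stored_hashes: Mapping[str, str],
    key: str = "id",  # assumed primary key field
) -> Tuple[List[dict], List[dict], Dict[str, str]]:
    """Split records into new and changed, and return the hashes to store next."""
    new: List[dict] = []
    changed: List[dict] = []
    current: Dict[str, str] = {}
    for record in records:
        record_key = str(record[key])
        digest = record_hash(record)
        current[record_key] = digest
        if record_key not in stored_hashes:
            new.append(record)
        elif stored_hashes[record_key] != digest:
            changed.append(record)
    return new, changed, current
```

Only the `new` and `changed` records need to be written downstream, which reduces load volume even though the extraction itself still reads everything.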
Handling Transient Failures
API calls fail for reasons unrelated to the request content: network timeouts, intermittent server errors (5xx responses), DNS resolution failures, TLS handshake failures. All of these are transient — retrying the same request after a delay will often succeed.
Implement a retry policy for all API calls with a defined maximum attempt count (3-5 attempts is typical), exponential backoff between retries, and a distinction between transient errors (5xx responses, network timeouts, and 429), which should be retried, and permanent errors (other 4xx client errors), which should not be retried automatically but logged and alerted on. The one exception is a 401, which warrants a single retry after a token refresh.
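A retry policy along these lines could be sketched as follows. The names are hypothetical, `send` is assumed to return `(status, body)`, and network failures are assumed to surface as Python's built-in `ConnectionError` or `TimeoutError`; HTTP client libraries often raise their own exception types, which would need to be added to the `except` clause.

```python
import random
import time
from typing import Callable, Tuple


class PermanentError(Exception):
    """A failure that retrying the same request will not fix."""


def is_transient(status: int) -> bool:
    """429 and 5xx responses are worth retrying; other 4xx are not."""
    return status == 429 or 500 <= status <= 599


def call_with_retries(
    send: Callable[[], Tuple[int, object]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Retry transient failures with exponential backoff; fail fast otherwise."""
    last_failure = "no attempts made"
    for attempt in range(max_attempts):
        try:
            status, body = send()
        except (ConnectionError, TimeoutError) as exc:
            last_failure = repr(exc)  # network-level failure: retry
        else:
            if status < 400:
                return body
            if not is_transient(status):
                raise PermanentError(f"HTTP {status}; not retrying")
            last_failure = f"HTTP {status}"
        if attempt < max_attempts - 1:
            # Delay doubles each attempt, plus jitter of up to base_delay.
            sleep(base_delay * 2 ** attempt + random.uniform(0.0, base_delay))
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last_failure}")
```

Raising a distinct `PermanentError` lets the caller route those failures to logging and alerting instead of treating them like exhausted retries.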