Airbyte is an open-source data integration platform with a library of 300+ connectors for extracting data from APIs, databases, files, and event sources and loading it into data warehouses and data lakes. Unlike a managed service such as Fivetran, self-hosted Airbyte gives your data engineering team full control over the connector runtime, scheduling, and infrastructure, at the cost of operational responsibility for keeping the platform running. This guide covers when self-hosted Airbyte is the right choice and how to run it reliably.
When Self-Hosted Airbyte Makes Sense
The decision between managed data integration (Fivetran) and self-hosted Airbyte is primarily a cost-versus-control trade-off:
Choose self-hosted Airbyte when:
- Volume economics favour self-hosting: Fivetran prices by monthly active rows (MAR), so at high data volumes its bill can exceed the cost of running Airbyte infrastructure.
- Custom connector requirements: Airbyte's Connector Development Kit (CDK) lets you build custom connectors for internal systems or APIs that Fivetran doesn't support.
- Data residency or compliance requirements: some regulated industries require data movement to run on infrastructure they control.
- Budget constraints: open-source Airbyte has no per-row licensing cost; you pay for infrastructure and the engineering time to operate it.
Choose Fivetran (managed) when:
- Operational bandwidth is limited: Fivetran manages connector updates, infrastructure, and reliability.
- Data volume is moderate: Fivetran's pricing is straightforward and predictable at lower volumes.
- Speed of setup: Fivetran connectors can be configured in minutes; Airbyte self-hosting requires infrastructure setup first.
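The cost side of this trade-off can be framed as simple break-even arithmetic. The sketch below is illustrative only: every figure is a placeholder assumption rather than actual Fivetran or Airbyte pricing, and it models the managed price as linear in MAR, whereas real price lists are usually tiered.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    """All figures are hypothetical placeholders; substitute your own quotes."""
    monthly_active_rows: int          # rows inserted or changed per month (MAR)
    managed_price_per_million: float  # assumed managed-service price per million MAR
    infra_cost_per_month: float       # assumed cluster, database, and storage cost
    ops_hours_per_month: float        # assumed engineering time spent operating Airbyte
    hourly_rate: float                # assumed loaded engineering cost per hour


def managed_cost(c: CostInputs) -> float:
    """Monthly cost of the managed service under a linear price assumption."""
    return c.monthly_active_rows / 1_000_000 * c.managed_price_per_million


def self_hosted_cost(c: CostInputs) -> float:
    """Monthly cost of self-hosting: infrastructure plus operating time."""
    return c.infra_cost_per_month + c.ops_hours_per_month * c.hourly_rate


def cheaper_option(c: CostInputs) -> str:
    """Return which option is cheaper for the given inputs."""
    return "self-hosted" if self_hosted_cost(c) < managed_cost(c) else "managed"
```

Including operating hours in the self-hosted figure matters: leaving them out makes self-hosting look cheaper than it is at low volumes.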
Airbyte Architecture
Self-hosted Airbyte consists of several components:
**Server** — handles API requests, connection configuration, and the Airbyte web UI.
**Scheduler** — triggers sync jobs on configured schedules.
**Worker** — runs sync jobs. For each sync, workers launch one container per connector (source and destination) to execute the extraction and loading logic; on Kubernetes, these containers run as pods.
**Config database** — PostgreSQL database storing connection configurations, job history, and state.
**Temporal** — the workflow orchestration engine Airbyte uses internally for managing job execution.
The connector execution model — Docker containers per sync — means each connector runs in isolation with its own dependencies and runtime. This enables a large connector library without dependency conflicts.
Deployment Options
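**Docker Compose (small scale)** — Airbyte has shipped a Docker Compose configuration for deploying all components on a single machine; recent releases steer single-machine installs toward the `abctl` command-line tool instead, so check the current documentation for the supported method. A single-machine deployment suits development, small teams, and low-volume use cases. It is not recommended for production because it has no redundancy.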
**Kubernetes (production)** — Airbyte provides Helm charts for Kubernetes. This deployment model enables horizontal scaling of workers, redundancy for the server and scheduler, and integration with Kubernetes-native monitoring.
The minimum production-viable Airbyte deployment:
- Kubernetes cluster (EKS, GKE, AKS)
- Managed PostgreSQL for the config database (RDS, Cloud SQL)
- Object storage for job logs and sync artifacts (S3, GCS)
- Load balancer for the web UI
Connector Management
Airbyte's connector library is maintained as Docker images. Each connector has a version; updating a connector means pointing to a newer Docker image version.
**Custom connectors** — Airbyte's Connector Development Kit (CDK) provides Python and low-code frameworks for building custom connectors. A custom connector built on the CDK works identically to an official connector in Airbyte's UI and scheduler.
Custom connector use cases:
- Internal systems with proprietary APIs
- Industry-specific data sources not covered by the official library
- Sources where the official connector's schema doesn't match requirements
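To give a feel for what a source connector produces, here is a minimal sketch of the record-emitting part of a source, written in plain Python without the CDK. It wraps rows in Airbyte-protocol-style `RECORD` messages and prints them as JSON lines. The stream name and sample rows are hypothetical, and a real connector also has to implement the spec, check, and discover operations and emit state messages, which the CDK handles for you.

```python
import json
import time
from typing import Any, Dict, Iterable, Iterator


def to_record_message(stream: str, row: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap one source row in an Airbyte-protocol-style RECORD envelope."""
    return {
        "type": "RECORD",
        "record": {
            "stream": stream,
            "data": row,
            "emitted_at": int(time.time() * 1000),  # epoch milliseconds
        },
    }


def read_stream(stream: str, rows: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield one JSON line per row, the way a source writes to stdout."""
    for row in rows:
        yield json.dumps(to_record_message(stream, row))


if __name__ == "__main__":
    # Hypothetical rows standing in for an internal API response.
    sample = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
    for line in read_stream("employees", sample):
        print(line)
```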
**Connector testing** — before deploying a new or updated connector to production, test it against a development Airbyte instance or use Airbyte's connector acceptance test framework to validate the connector's behaviour against its source.
Schema Management in Airbyte
Airbyte handles schema changes differently from Fivetran. When a source adds a column:
**Airbyte's default behaviour** — the new column appears in the next sync and is added to the destination table. This auto-propagation can be enabled or disabled per connection.
**Schema change notifications** — Airbyte can notify via webhook when a source schema changes, allowing engineering teams to review changes before they propagate to the destination. This is preferable to silent auto-propagation for destinations that downstream consumers depend on.
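A review step for schema changes can be as simple as diffing the old and new column sets and deciding which kinds of change need a human. The sketch below assumes schemas are available as plain `{column: type}` mappings; the function names and the review policy (additions are safe, removals and type changes are not) are illustrative assumptions, not Airbyte behaviour.

```python
from typing import Dict, List, NamedTuple


class SchemaDiff(NamedTuple):
    added: List[str]
    removed: List[str]
    type_changed: List[str]


def diff_schema(old: Dict[str, str], new: Dict[str, str]) -> SchemaDiff:
    """Compare two {column: type} mappings and classify the differences."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    type_changed = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return SchemaDiff(added, removed, type_changed)


def needs_review(diff: SchemaDiff) -> bool:
    """Assumed policy: added columns are safe; removals and type changes are not."""
    return bool(diff.removed or diff.type_changed)
```

A webhook receiver could call `needs_review` and only open a ticket when it returns `True`, letting purely additive changes through.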
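**Normalization** — Airbyte's basic normalization maps column types and creates properly typed tables from the raw JSON it loads, which covers most transformation needs at the loading layer. Newer destination connectors handle this typing step themselves, so check how your destination version behaves. For more complex normalisation, use dbt.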
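Conceptually, this loading-layer step turns a raw JSON record into a row with typed columns. The following sketch shows the idea only; it is not Airbyte's implementation, and the column mapping is a hypothetical example.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical column-to-caster mapping for one stream.
COLUMN_TYPES: Dict[str, Callable[[Any], Any]] = {
    "id": int,
    "amount": float,
    "customer": str,
}


def normalize(raw_json: str) -> Dict[str, Any]:
    """Turn one raw JSON record into a row with typed columns.

    Missing or null fields become None; fields not in the mapping are dropped.
    """
    record = json.loads(raw_json)
    row: Dict[str, Any] = {}
    for column, cast in COLUMN_TYPES.items():
        value = record.get(column)
        row[column] = cast(value) if value is not None else None
    return row
```

Anything beyond this kind of casting, such as joins, deduplication rules, or business logic, belongs in a transformation tool rather than the loading layer.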
Operational Considerations
Running Airbyte in production requires attention to:
**Worker scaling** — each sync job runs as connector containers (pods on Kubernetes) launched by workers. High-concurrency scenarios, where many connections sync simultaneously, require adequate worker and cluster capacity. Kubernetes autoscaling helps, but it must be tuned for your peak load patterns.
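One way to estimate peak load is to take the expected start and end times of each scheduled sync and count the maximum overlap. The sketch below uses a standard sweep over start and end events; the time unit (for example, minutes since midnight) and the input windows are assumptions you would derive from your own job history.

```python
from typing import List, Tuple


def peak_concurrency(windows: List[Tuple[int, int]]) -> int:
    """Return the maximum number of overlapping [start, end) windows."""
    events = []
    for start, end in windows:
        events.append((start, 1))   # a sync begins
        events.append((end, -1))    # a sync finishes
    # At the same instant, process finishes (-1) before starts (+1) so that
    # back-to-back syncs are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

Multiplying the peak by the per-sync resource requests gives a rough lower bound for the capacity the cluster needs at its busiest moment.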
**Storage management** — Airbyte stores connector state (bookmarks for incremental syncs) in its config database and temporary sync files in shared storage. Monitor storage growth, especially for high-volume connectors.
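The bookmark idea behind incremental syncs can be sketched in a few lines: remember the highest cursor value seen, and on the next run read only rows beyond it. This is a simplified illustration with hypothetical names, not Airbyte's state format.

```python
from typing import Any, Dict, List, Optional, Tuple


def incremental_read(
    rows: List[Dict[str, Any]],
    cursor_field: str,
    state: Optional[Dict[str, Any]] = None,
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
    """Return rows newer than the saved cursor, plus the updated state."""
    last_seen = (state or {}).get(cursor_field)
    new_rows = [r for r in rows if last_seen is None or r[cursor_field] > last_seen]
    if new_rows:
        last_seen = max(r[cursor_field] for r in new_rows)
    return new_rows, {cursor_field: last_seen}
```

Because every later sync depends on the saved bookmark, losing or corrupting it forces a full re-sync, which is one reason the config database deserves managed backups.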
**Connector version management** — Airbyte's connector versions change frequently. Pinning connector versions prevents unexpected connector behaviour changes; regularly reviewing release notes and updating connectors prevents falling behind on bug fixes.
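A periodic check of pinned versions against the latest known versions makes the review routine concrete. The sketch below assumes plain `MAJOR.MINOR.PATCH` image tags and that you maintain both mappings yourself; the helper names are hypothetical and the version numbers in any usage are made up.

```python
from typing import Dict, List, Tuple


def parse_version(tag: str) -> Tuple[int, ...]:
    """Parse a 'MAJOR.MINOR.PATCH' image tag into a comparable tuple."""
    return tuple(int(part) for part in tag.split("."))


def outdated_connectors(pinned: Dict[str, str], latest: Dict[str, str]) -> List[str]:
    """List connectors whose pinned tag is older than the latest known tag."""
    report = []
    for name, tag in sorted(pinned.items()):
        newest = latest.get(name)
        if newest and parse_version(newest) > parse_version(tag):
            is_major = parse_version(newest)[0] > parse_version(tag)[0]
            kind = "major" if is_major else "minor/patch"
            report.append(f"{name}: {tag} -> {newest} ({kind})")
    return report
```

Flagging major bumps separately is deliberate: they are the ones most likely to carry breaking changes and deserve a test run on a development instance first.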
**Monitoring** — Airbyte can emit platform metrics through OpenTelemetry or Datadog, and an OpenTelemetry collector can expose them to Prometheus. Integrate these metrics into your monitoring infrastructure (Grafana, Datadog) and alert on sync failures, job duration anomalies, and worker capacity issues.
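A job duration anomaly can be defined in many ways. As a minimal sketch, the check below flags a run that takes more than a fixed multiple of the median of recent runs; the factor of 2.0 and the five-sample minimum are arbitrary assumptions to tune against your own job history.

```python
from statistics import median
from typing import List


def is_duration_anomaly(
    history_s: List[float],
    latest_s: float,
    factor: float = 2.0,
    min_samples: int = 5,
) -> bool:
    """Flag a run that is much slower than the median of recent runs."""
    if len(history_s) < min_samples:
        return False  # not enough history to judge
    return latest_s > factor * median(history_s)
```

The median is used instead of the mean so that one earlier slow run does not raise the baseline and hide the next one.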
**Sync failure alerting** — Airbyte's built-in notifications can report sync failures by email or webhook, depending on edition and version. For production environments, integrate Airbyte's webhook notifications with your incident management system (PagerDuty, Opsgenie).
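A small relay service can translate a failure notification into an incident. The sketch below builds a PagerDuty Events API v2 trigger event from a failure payload. The shape of the incoming `failure` dictionary (`connection_id`, `connection_name`) is a hypothetical assumption, so map the fields from whatever your Airbyte version actually sends. The function only builds the event; posting it to the URL with your HTTP client of choice is left out.

```python
from typing import Any, Dict

# PagerDuty Events API v2 endpoint that accepts the event as a JSON POST body.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def to_pagerduty_event(failure: Dict[str, Any], routing_key: str) -> Dict[str, Any]:
    """Build a PagerDuty 'trigger' event from an assumed failure payload."""
    connection = failure.get("connection_name", "unknown connection")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Same key for repeated failures of one connection, so they group
        # into a single incident instead of paging once per retry.
        "dedup_key": f"airbyte-sync-{failure.get('connection_id', connection)}",
        "payload": {
            "summary": f"Airbyte sync failed: {connection}",
            "source": "airbyte",
            "severity": "error",
            "custom_details": failure,
        },
    }
```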
Airbyte Cloud vs Self-Hosted
Airbyte also offers Airbyte Cloud — a managed version of the platform. Airbyte Cloud eliminates the operational overhead while retaining the connector library advantages over Fivetran. For teams that want Airbyte's connector flexibility without infrastructure management, Airbyte Cloud is worth evaluating.
The decision tree: if Fivetran covers your connectors and the pricing is acceptable, use Fivetran. If you need connectors Fivetran doesn't offer but don't want to manage infrastructure, evaluate Airbyte Cloud. If cost, control, or compliance requirements favour self-hosting, deploy Airbyte on Kubernetes.
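The same decision tree, written out as a function. The parameter names are illustrative, and checking the self-hosting requirement first is an assumption about precedence: a hard compliance or cost constraint is treated as overriding connector coverage.

```python
def recommend_platform(
    fivetran_covers_connectors: bool,
    fivetran_pricing_acceptable: bool,
    self_hosting_required: bool,
) -> str:
    """Map the three questions in the decision tree to a recommendation."""
    if self_hosting_required:
        # Cost, control, or compliance requirements favour self-hosting.
        return "Self-hosted Airbyte on Kubernetes"
    if fivetran_covers_connectors and fivetran_pricing_acceptable:
        return "Fivetran"
    # Missing connectors or unacceptable pricing, but no wish to run infrastructure.
    return "Airbyte Cloud"
```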