# AWS Glue for ETL: Serverless Data Integration on AWS

AWS Glue is the managed ETL and data integration service on AWS — serverless Spark execution, a built-in data catalog, schema crawlers, and connectors to S3, RDS, Redshift, and hundreds of SaaS sources. This guide covers Glue job types, the Glue Data Catalog, DynamicFrames vs DataFrames, and when Glue is the right choice versus self-managed Spark or Airflow.

## What AWS Glue Is

AWS Glue is a managed extract-transform-load (ETL) service on AWS — serverless Spark execution, a built-in metadata catalogue, automatic schema detection via crawlers, and connectors to S3, RDS, Redshift, DynamoDB, Kafka, and hundreds of SaaS sources.

For teams building data pipelines on AWS, Glue eliminates cluster provisioning and management. You write transformation code; Glue handles the compute. The Data Catalog integrates with Athena, Redshift Spectrum, and EMR, providing a shared schema layer across query engines.

## Glue Components

**Data Catalog**: A managed Hive Metastore-compatible metadata repository. Stores database and table definitions — schema, partitions, storage location, SerDe properties. When you create a Glue table pointing at S3 Parquet files, Athena, Redshift Spectrum, and EMR can all query those files using the Catalog schema without duplicating definitions.

**Crawlers**: Automated schema detection. Point a crawler at an S3 prefix, JDBC database, or DynamoDB table; the crawler samples data, infers schemas, detects partitions, and creates or updates Catalog table definitions. Useful for onboarding new data sources without manual schema authoring. Schedule crawlers after each data load to detect schema changes.
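As a rough sketch of the crawler setup described above, the request for boto3's Glue `create_crawler` call could be assembled like this. The crawler name, IAM role ARN, database name, S3 path, and schedule are hypothetical placeholders, and the schema change policy shown is one reasonable choice rather than a recommendation from this guide.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for the boto3 Glue client's create_crawler call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update table definitions in place when the schema changes, and only
        # log (rather than delete) tables whose underlying data has gone.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }
    if schedule is not None:
        request["Schedule"] = schedule  # AWS cron syntax, six fields
    return request


def create_and_start_crawler(request):
    """Create the crawler and run it once. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical example values, not taken from this guide.
request = build_crawler_request(
    name="orders-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="raw",
    s3_path="s3://example-bucket/orders/",
    schedule="cron(0 3 * * ? *)",
)
```

Keeping the request builder separate from the AWS call makes the configuration easy to review and unit-test before anything is created in the account.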

**Jobs**: The ETL execution unit. A Glue job runs your script (Python or Scala for Spark jobs, plain Python for Python Shell jobs) on compute that Glue provisions, runs, and tears down. The three job types covered here:

- Spark: standard distributed Spark execution for large-scale transformation

- Streaming: continuous Spark Structured Streaming for real-time data processing from Kinesis or Kafka

- Python Shell: single-node Python execution for lightweight scripts, small datasets, API calls

**Connections**: Encrypted credential stores for JDBC sources, Kafka, and other external systems. Referenced by crawlers and jobs without embedding credentials in code.

**Triggers**: Scheduling and event-based job execution. Scheduled triggers run jobs on cron schedules; event-based triggers fire when another job succeeds or fails; on-demand triggers fire via API.
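The scheduled and job-outcome triggers described above map onto boto3's Glue `create_trigger` call roughly as follows. This is a sketch under assumptions: the trigger and job names are hypothetical, and only the success case of a conditional trigger is shown.

```python
def scheduled_trigger(name, job_name, cron):
    """Request for a trigger that starts job_name on an AWS cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def on_success_trigger(name, upstream_job, downstream_job):
    """Request for a trigger that starts downstream_job after upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


def create_trigger(request):
    """Send the request to Glue. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders work without boto3

    boto3.client("glue").create_trigger(**request)


# Hypothetical names: run the load at 02:00 UTC, then aggregate on success.
nightly = scheduled_trigger("nightly-load", "orders-load", "0 2 * * ? *")
chained = on_success_trigger("after-load", "orders-load", "orders-aggregate")
```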

**Workflows**: Visualised multi-job pipelines with conditional execution based on job outcomes. Alternative to Airflow for simpler pipeline DAGs that stay within the Glue console.

## Glue Job Types in Practice

### Spark Jobs

Standard Glue Spark jobs run on managed Spark clusters. You choose the worker type (G.1X, G.2X, G.4X, G.8X) and number of workers. Glue starts a cluster, runs your script, and terminates the cluster — no idle cost between runs.

Glue Spark scripts use the GlueContext API, which wraps SparkContext and adds Glue-specific functionality.
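A minimal sketch of how a Glue Spark script typically sets up GlueContext is shown here. It assumes the job is started with the standard `JOB_NAME` argument, and the `awsglue` and `pyspark` modules are only available inside the Glue runtime (or a local Glue development image), so they are imported inside the function.

```python
import sys


def main(argv):
    """Entry point for a Glue Spark job script."""
    # Provided by the Glue runtime, so imported here rather than at module level.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Glue passes job arguments on the command line as --KEY value pairs.
    args = getResolvedOptions(argv, ["JOB_NAME"])

    # GlueContext wraps the SparkContext and adds DynamicFrame readers and writers.
    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session  # plain SparkSession for DataFrame or SQL work

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # ... read, transform, and write here ...

    job.commit()


# In the script deployed to Glue, finish with:
#     main(sys.argv)
```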

### DynamicFrames

Glue introduces DynamicFrame as an alternative to Spark DataFrames. DynamicFrames handle schema inconsistencies — inconsistent types, missing fields, unexpected nulls — without failing. Each row carries its own schema metadata.

DynamicFrames are useful for messy source data where you want to resolve inconsistencies before imposing a strict schema. Use DynamicFrame.toDF() to convert to a standard Spark DataFrame for transformations that require a fixed schema.

Key DynamicFrame operations:

- glue_context.create_dynamic_frame.from_catalog(database, table_name) — read from Data Catalog

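- dynamic_frame.apply_mapping([("source_col", "string", "target_col", "long")]) — column mapping and type casting (applyMapping is the Scala spelling; ApplyMapping.apply(frame=..., mappings=...) is the equivalent transform class in Python)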

- dynamic_frame.resolveChoice(choice="make_struct") — handle ambiguous types

- dynamic_frame.filter(fn) — row filtering
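The operations listed above can be chained roughly as in the sketch below. The database, table, and column names are hypothetical. In PySpark the mapping method is spelled `apply_mapping`, and `resolveChoice` is shown with a per-column cast rather than `make_struct` so that the later mapping sees a single type per column.

```python
# Mapping tuples: (source column, source type, target column, target type).
ORDER_MAPPINGS = [
    ("order_id", "string", "order_id", "long"),
    ("amount", "double", "amount", "double"),
    ("status", "string", "status", "string"),
]


def is_completed(record):
    """Row predicate for DynamicFrame.filter; records support dict-style access."""
    return record["status"] == "completed"


def clean_orders(glue_context, database="raw", table_name="orders"):
    """Read a Catalog table, settle types, filter rows, return a Spark DataFrame."""
    frame = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )
    # If "amount" arrived as a mix of types, force it to double first.
    # choice="make_struct" would instead keep every observed type in a struct.
    frame = frame.resolveChoice(specs=[("amount", "cast:double")])
    frame = frame.apply_mapping(ORDER_MAPPINGS)
    frame = frame.filter(is_completed)
    # Hand over to a fixed-schema DataFrame for joins, windows, SQL, and so on.
    return frame.toDF()
```

`clean_orders` needs a live GlueContext, so it only runs inside a Glue job or a Glue development environment. The mapping list and the predicate are plain Python and can be unit-tested anywhere.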

### Reading from and Writing to S3

Read Parquet from S3:

    datasource = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://bucket/prefix/"], "recurse": True},
        format="parquet"
    )

Write to S3 as Parquet partitioned by date:

    glue_context.write_dynamic_frame.from_options(
        frame=output_frame,
        connection_type="s3",
        connection_options={"path": "s3://output-bucket/table/", "partitionKeys": ["date"]},
        format="parquet"
    )

### JDBC Sources

Connect to RDS PostgreSQL, Aurora, SQL Server:

    datasource = glue_context.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options={"url": jdbc_url, "dbtable": "schema.tablename", "user": user, "password": password}
    )

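For large tables, use hash partitioning to parallelise JDBC reads: set hashfield to the name of the partition column, or set hashexpression to a SQL expression that returns a whole number, and set hashpartitions to the number of parallel reads. hashfield and hashexpression are alternatives, so specify one of them rather than both.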
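A sketch of a parallel JDBC read might look like the following. The helper names, JDBC URL, table, and column are hypothetical, and `hashpartitions` is passed as a string, which is the form used in AWS examples.

```python
def jdbc_parallel_options(url, table, user, password, hashfield, hashpartitions=8):
    """Connection options for a JDBC read split into hashpartitions parallel queries."""
    if hashpartitions < 1:
        raise ValueError("hashpartitions must be at least 1")
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        # Column whose values decide which partition each row is read in.
        "hashfield": hashfield,
        # Number of parallel reads against the database.
        "hashpartitions": str(hashpartitions),
    }


def read_table_in_parallel(glue_context, options):
    """Run the read inside a Glue job; glue_context is a live GlueContext."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options=options,
    )


# Hypothetical values for illustration only.
options = jdbc_parallel_options(
    url="jdbc:postgresql://example-host:5432/shop",
    table="public.orders",
    user="etl_user",
    password="example-password",
    hashfield="customer_id",
    hashpartitions=8,
)
```

Each parallel read opens its own database connection, so the partition count is worth sizing against what the source database can tolerate, not only against the number of Glue workers.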

## Glue Data Catalog as a Shared Schema Layer

The Data Catalog's primary value is acting as the schema source of truth for multiple query engines:

**Athena**: Queries Glue Catalog tables directly — no schema duplication needed. Define a table once in the Catalog, query it in Athena immediately.

**Redshift Spectrum**: References Glue Catalog external tables via CREATE EXTERNAL SCHEMA ... FROM DATA CATALOG. Redshift joins internal managed tables with S3 external tables in a single query.
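To make the Redshift Spectrum statement concrete, here is a small sketch that renders it from Python. The schema name, Catalog database, and IAM role ARN are hypothetical placeholders, and the helper is illustrative rather than part of any AWS library.

```python
def external_schema_sql(schema, catalog_database, iam_role_arn):
    """Render a Redshift statement exposing a Glue Catalog database as an external schema."""
    # Values are interpolated into SQL text, so reject anything suspicious.
    if not schema.isidentifier():
        raise ValueError(f"not a simple SQL identifier: {schema!r}")
    for value in (catalog_database, iam_role_arn):
        if "'" in value:
            raise ValueError(f"unexpected quote in {value!r}")
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


# Hypothetical names for illustration only.
sql = external_schema_sql(
    "spectrum_raw",
    "raw",
    "arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)
print(sql)
```

After this statement runs in Redshift, a Catalog table such as the hypothetical `raw.orders` would be addressable as `spectrum_raw.orders` and could be joined with internal tables.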

**EMR**: Uses Glue as its Hive Metastore — EMR Spark and Hive jobs read and write Catalog table definitions.

**Glue Studio**: The visual ETL interface generates and runs Glue Spark code from a visual canvas — source, transform, and target nodes configured through UI. Good for simple transformation patterns; complex logic still requires code.

## Glue vs Alternatives

**Glue vs self-managed Spark (EMR)**: Glue eliminates cluster management — no provisioning, patching, or right-sizing. EMR gives more control: choose Spark version, tune JVM flags, run long-lived clusters for interactive notebooks. For batch ETL jobs without complex tuning requirements, Glue is simpler. For complex jobs requiring specific configurations or persistent clusters for interactive work, EMR is better.

**Glue vs Airflow on MWAA**: Glue handles compute and execution; Airflow handles orchestration. Many teams use both: Airflow DAGs invoke Glue jobs via the GlueJobOperator. Glue Workflows can replace Airflow for simple sequential pipelines within AWS, but Airflow handles cross-service orchestration and complex dependency patterns better.
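A sketch of the Airflow-plus-Glue pattern is shown here, assuming Airflow 2.4 or later with the `apache-airflow-providers-amazon` package installed. The DAG id, Glue job name, and the `--run_date` argument are hypothetical, and the Glue job is assumed to exist already.

```python
from datetime import datetime

# Arguments passed to the Glue script; "{{ ds }}" is rendered by Airflow
# to the logical run date (YYYY-MM-DD) when the task executes.
SCRIPT_ARGS = {"--run_date": "{{ ds }}"}


def build_dag():
    """Build a daily DAG with a single task that runs an existing Glue job."""
    # Imported lazily so this module can be loaded without Airflow installed.
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

    with DAG(
        dag_id="orders_daily",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        GlueJobOperator(
            task_id="run_orders_etl",
            job_name="orders-etl",
            script_args=SCRIPT_ARGS,
            wait_for_completion=True,  # task succeeds or fails with the Glue run
        )
    return dag


# In a real DAG file, expose the DAG at module level so the scheduler finds it:
#     dag = build_dag()
```

Airflow owns the schedule, retries, and cross-service dependencies in this arrangement, while Glue only runs the Spark work.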

**Glue vs Lambda for ETL**: Lambda suits lightweight event-driven processing: single files, small payloads, and work that finishes within Lambda's 15-minute execution limit. Glue is for distributed processing of large datasets. Use Lambda for S3 event triggers on individual files; use Glue for dataset-level batch aggregation and transformation.

**Glue vs dbt**: These are complementary. Glue runs Python/Spark for large-scale data processing and ingestion. dbt runs SQL transformations in the warehouse. A common pattern: Glue lands raw data in S3 and the Data Catalog; dbt transforms data in Redshift or Athena using Catalog schema definitions.

## Cost Considerations

Glue bills per DPU-hour, where a DPU is a Data Processing Unit. One G.1X worker = 1 DPU; one G.2X = 2 DPUs. Jobs are billed for actual run time, not provisioning time, with a 1-minute minimum per run.
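The billing arithmetic can be sketched as below. The default rate of 0.44 USD per DPU-hour is an assumption for illustration; actual rates vary by region and job type, so check current AWS pricing before relying on the numbers.

```python
def glue_job_cost(
    workers,
    dpus_per_worker,
    runtime_seconds,
    price_per_dpu_hour=0.44,  # assumed rate, not taken from this guide
    minimum_seconds=60,
):
    """Estimate the cost of one Glue job run in the currency of the rate."""
    # Runs shorter than the minimum are billed as if they ran for the minimum.
    billed_seconds = max(runtime_seconds, minimum_seconds)
    dpu_hours = workers * dpus_per_worker * billed_seconds / 3600
    return dpu_hours * price_per_dpu_hour


# Ten G.1X workers (1 DPU each) running for 15 minutes.
quarter_hour_run = glue_job_cost(workers=10, dpus_per_worker=1, runtime_seconds=900)

# The same cluster finishing in 30 seconds is still billed for a full minute.
short_run = glue_job_cost(workers=10, dpus_per_worker=1, runtime_seconds=30)
```

In the first example, ten workers for a quarter of an hour come to 2.5 DPU-hours, or about 1.10 USD at the assumed rate.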

Cost optimisation:

- Right-size worker count — profile job data volume and parallelism requirements; over-provisioned workers waste money

- Use job bookmarks to process only new data on incremental runs rather than full reprocessing

- Schedule jobs during off-peak windows for consistent capacity

- Evaluate Python Shell for small data jobs that don't need distributed Spark — Python Shell is significantly cheaper per execution

For large, frequent jobs running many hours daily, managed EMR clusters or self-hosted Spark may be more cost-effective than Glue's per-DPU-hour pricing.

## Glue Job Bookmarks

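Bookmarks track which data has already been processed, enabling incremental processing without custom state management. Enable them with the job parameter --job-bookmark-option job-bookmark-enable (or the matching setting in the console). The script must also call job.init() at the start and job.commit() at the end, and each source needs a transformation_ctx so Glue can store state for it.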

Glue tracks state per job: for S3 sources, it records which files have been processed; for JDBC sources, it tracks the last processed value of the bookmark key columns (the primary key by default). On the next job run, Glue reads only data that arrived since the last bookmark.
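A sketch of a bookmark-aware job follows. The database, table, and output path are hypothetical, and the job is assumed to be started with the bookmark parameter shown in `BOOKMARK_ARGS`.

```python
import sys

# Job parameter that switches bookmarks on (set on the job or per run).
BOOKMARK_ARGS = {"--job-bookmark-option": "job-bookmark-enable"}


def main(argv):
    """Incremental copy of a Catalog table to S3 using job bookmarks."""
    # Provided by the Glue runtime, so imported here rather than at module level.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # loads bookmark state from the previous run

    new_rows = glue_context.create_dynamic_frame.from_catalog(
        database="raw",
        table_name="orders",
        # Bookmark state is stored per transformation_ctx; keep it stable across runs.
        transformation_ctx="read_orders",
    )

    glue_context.write_dynamic_frame.from_options(
        frame=new_rows,
        connection_type="s3",
        connection_options={"path": "s3://example-output-bucket/orders/"},
        format="parquet",
        transformation_ctx="write_orders",
    )

    # Persists the new bookmark. If this is skipped, the next run starts
    # again from the previous bookmark and reprocesses the same data.
    job.commit()


# In the script deployed to Glue, finish with:
#     main(sys.argv)
```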

Bookmarks are most reliable with S3 sources and monotonically increasing data. For complex incremental logic — CDC, date-range partitions, merge operations — implement custom state management using DynamoDB or SSM Parameter Store rather than relying on bookmarks.

Our data architecture practice designs and builds AWS data pipelines using Glue, dbt, and Redshift — contact us to discuss your AWS data stack.
