BlogData Engineering

Python for Data Engineering: The Libraries, Patterns, and Practices That Actually Matter

Austin Duncan
Austin Duncan
Managing Director & Principal Data Architect
·March 14, 202712 min read

Python is the dominant language for data engineering — but the gap between Python that works and Python that runs reliably in production is significant. This guide covers the libraries, architectural patterns, and operational practices that separate production-grade data pipelines from scripts that work on the author's laptop.

Python became the dominant data engineering language for practical reasons: excellent library ecosystem, readable syntax, strong community, and compatibility with the tools the field converged on (Airflow, dbt, Spark, pandas, SQLAlchemy). But "Python for data engineering" covers a large surface area — the gap between a data engineering script and a production data pipeline is significant.

This guide focuses on the libraries, patterns, and practices that are actually important for building reliable production data pipelines, not introductory syntax.

Core Libraries

**pandas** is the standard for in-memory data manipulation: DataFrames, Series, groupby, merge, pivot. For datasets that fit in memory (typically under a few hundred MB as a CSV), pandas handles ETL transformations reliably. For larger datasets, pandas becomes problematic — it loads the entire dataset into RAM, which creates memory pressure on typical pipeline worker nodes.

**Polars** is the successor most data engineers are migrating to for medium-to-large in-memory processing. It is written in Rust, uses lazy evaluation, and is 5–20x faster than pandas for most operations. Its API is similar enough to pandas that the learning curve is low. For new pipelines where performance matters, Polars is often the better default.

**SQLAlchemy** is the standard database connection library. It provides both an ORM (for applications) and a Core expression language (for data engineering). In data engineering contexts, use SQLAlchemy Core with the text() function for raw SQL or the select/insert/update constructs for parameterised queries. SQLAlchemy handles connection pooling, dialect differences between databases, and parameterised queries safely (preventing SQL injection).

**Pydantic** is the standard for data validation at pipeline boundaries. Define a Pydantic model for the expected schema of data entering your pipeline. Validate incoming records against the model. Records that fail validation are either rejected with a clear error or routed to a quarantine table. Pydantic's validation errors are structured and specific — 'field revenue: value is not a valid float' rather than 'TypeError: unsupported operand type(s) for +'.

**httpx** or **requests** for HTTP calls to external APIs. httpx is preferred for new code: it supports async natively, handles HTTP/2, and has a cleaner API than requests for complex use cases. For simple synchronous API calls, requests is still fine.

**boto3** for AWS services. If your data pipeline interacts with S3, SQS, DynamoDB, Secrets Manager, or other AWS services, boto3 is the interface. For Google Cloud, the google-cloud-* client libraries provide equivalent functionality.

Environment and Dependency Management

Every production data pipeline runs in an isolated environment. The two main approaches:

**Virtual environments + pip + requirements.txt** is the simplest approach and still works fine for straightforward pipelines. Pin all dependencies to exact versions in requirements.txt (pip freeze > requirements.txt). Unpinned dependencies mean that a new package release can silently change pipeline behaviour between runs.

**Poetry** or **uv** manage dependencies with lock files that capture the complete resolved dependency tree including transitive dependencies. This is stricter than requirements.txt and avoids the "works on my machine" class of dependency problems. uv is significantly faster than pip for dependency resolution and installation — for CI/CD pipelines that install dependencies frequently, the speed difference is material.

Never install data engineering dependencies into the system Python. Always use an isolated virtual environment, a Docker container, or a conda environment. Data engineering libraries have complex dependency trees that conflict with each other and with system packages.

Connection Management Patterns

Database connections in pipelines have two common failure modes: not closing connections (resource leak that eventually exhausts the connection pool) and not pooling connections (overhead from establishing a new connection for each query).

Use SQLAlchemy's connection pool for database connections. The create_engine() function returns an engine that manages a connection pool automatically. Acquire connections using the engine's context manager:

with engine.connect() as conn:

result = conn.execute(text("SELECT * FROM orders WHERE date = :date"), {"date": target_date})

rows = result.fetchall()

The 'with' block ensures the connection is returned to the pool when the block exits, even if an exception occurs. Do not call conn.close() manually — let the context manager handle it.

For external API connections, respect rate limits. Most APIs return a 429 Too Many Requests response when the client exceeds the rate limit. Implement exponential backoff: on a 429, wait 2 seconds, retry; if it fails again, wait 4 seconds; then 8, 16, up to a maximum. The tenacity library provides a retry decorator that implements backoff with a few lines of configuration.

Idempotency

A production data pipeline must be safe to rerun. If a pipeline run fails halfway through and you restart it, the output should be the same as if it had run successfully the first time. This property — idempotency — is non-negotiable for reliable pipelines.

The most reliable idempotency pattern for SQL targets: delete-then-insert. Before inserting new data for a given date partition, delete all existing records for that partition:

with engine.begin() as conn:

conn.execute(text("DELETE FROM fact_orders WHERE order_date = :date"), {"date": target_date})

conn.executemany(insert_sql, rows_to_insert)

Both operations are in a single transaction (engine.begin() starts a transaction; it commits on context manager exit and rolls back on exception). If the insert fails, the delete is also rolled back — the table retains its original state and the pipeline can safely retry.

Avoid INSERT IGNORE, INSERT OR REPLACE, or UPSERT patterns as your primary idempotency mechanism unless you have carefully verified their behaviour with your specific database and schema. These patterns have subtle edge cases around generated columns, triggers, and multi-row operations that can produce incorrect results.

Secrets Management

Credentials do not belong in code. Not in source files, not in environment variable strings hardcoded in configuration files, not in Airflow connection definitions stored in plaintext. There are only two acceptable places for production credentials: a secrets manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) or environment variables injected at runtime by your deployment platform.

Access secrets from Python:

import boto3

import json

def get_secret(secret_name):

client = boto3.client('secretsmanager', region_name='us-east-1')

response = client.get_secret_value(SecretId=secret_name)

return json.loads(response['SecretString'])

db_credentials = get_secret('prod/pipeline/database-credentials')

connection_string = f"postgresql://{db_credentials['username']}:{db_credentials['password']}@{db_credentials['host']}/{db_credentials['database']}"

Cache the secret for the duration of the pipeline run. Do not call the secrets manager on every database connection — there are API rate limits and latency costs.

Logging

Data pipeline logs are operational tools. When something goes wrong at 3 AM, the on-call engineer reads the logs. Good logs make diagnose-and-fix take 10 minutes; bad logs make it take 2 hours.

Use Python's built-in logging module, not print(). Configure it at pipeline startup with a structured format:

import logging

logging.basicConfig(

level=logging.INFO,

format='%(asctime)s %(levelname)s %(name)s %(message)s'

)

logger = logging.getLogger(__name__)

Log at the right level: DEBUG for diagnostic detail that is too verbose for production but useful when debugging; INFO for normal pipeline progress events (started processing batch, loaded N rows, completed); WARNING for unexpected but handled conditions (retrying after 429, skipping record with null key); ERROR for failures that interrupt processing.

Include context in log messages. "Error processing record" is useless. "Error processing record: order_id=12345, source=shopify_orders, error=KeyError: 'line_items'" gives the on-call engineer something to work with.

Testing Data Pipelines

Unit tests for data engineering pipelines focus on transformation logic, not integration with external systems. Mock the database connection and test the transformation function with known inputs and expected outputs.

Integration tests use a real (but non-production) database and a representative sample of source data. They are slower to run but catch the class of bugs that unit tests cannot — incorrect SQL, schema mismatches, type conversion issues between Python types and database types.

Use pytest for both unit and integration tests. Parametrize tests to cover edge cases: empty input, single-row input, records with null values in optional columns, records that exceed maximum field lengths.

Run the full test suite in CI before any merge to main. Data pipeline bugs that reach production are expensive — bad data propagates into dashboards and reports before anyone notices.

Our data engineering practice builds production-grade Python data pipelines — contact us to discuss your pipeline architecture and engineering requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →