BlogData Engineering

What Is Data Profiling? Understanding Your Data Before You Build

James Okafor
James Okafor
Senior Data Engineer
·June 1, 20289 min read

Data profiling is the process of examining datasets to understand their structure, content, completeness, and quality. This guide explains what data profiling reveals, how it fits into data engineering workflows, and why skipping it is one of the most expensive mistakes in analytics projects.

Data profiling is the process of examining a dataset to understand its structure, content, quality, and relationships — before building pipelines, data models, or analytics on top of it. The goal is to surface what the data actually contains, as opposed to what documentation says it should contain. In practice, these two descriptions rarely agree.

Skipping data profiling is one of the most expensive decisions in analytics engineering. Projects that skip it discover data quality issues late — during development, or worse, after deployment when users find incorrect numbers in dashboards. Fixing a data quality issue discovered during profiling takes hours. Fixing the same issue after a data warehouse migration has already been designed around it takes months.

What Data Profiling Reveals

**Column-level statistics** — for each column: count of non-null values, count of nulls, percentage null, count of distinct values, min, max, mean, standard deviation (for numeric fields), most frequent values, least frequent values. These statistics immediately reveal anomalies: a column that should be an ID but has only three distinct values; a date column where 40% of values are null; a numeric field with a maximum value that is clearly an error.

**Format and pattern distribution** — for string fields, the distribution of formats in the data. A phone number field containing values in 12 different formats, some with country codes and some without. An email field where 8% of values do not contain "@". A postal code field containing both 5-digit and 9-digit values.

**Referential integrity** — whether foreign key relationships actually hold in the data. A customer_id field in an orders table that references customer IDs in the customer table — do all customer_ids in the orders table exist in the customer table? In production systems, referential integrity constraints are often disabled for performance reasons, meaning orphaned records accumulate silently.

**Duplicate detection** — whether the data contains duplicate records, and what the duplication pattern is. Full duplicates (every field identical) versus partial duplicates (same business key, different attributes). A customer table with 2% duplicate email addresses creates double-counting errors in any metric that aggregates by customer.

**Value distribution and outliers** — whether the distribution of values in key fields is consistent with expectations. Revenue values where 0.1% of records account for 60% of total revenue may indicate legitimate whale customers, or may indicate data entry errors. Date fields where all transactions cluster around month-end may indicate batch processing, or may indicate a data quality issue.

**Cross-column relationships** — whether values in one column are consistent with values in another. Order dates that precede customer creation dates. Shipment dates that precede order dates. End dates that precede start dates. These logical inconsistencies are invisible to column-level profiling but surface immediately in cross-column analysis.

When to Profile

**Before a new data source integration.** When connecting a new source system to a data warehouse — a CRM, an ERP, a third-party data feed — profiling the source data before designing the integration reveals the quality issues that will become pipeline failures or silent errors later.

**Before a data warehouse migration.** Migrating from one warehouse technology to another is an opportunity to assess the quality of existing data models. Profiling reveals which datasets have undocumented assumptions baked in (date fields that assume a specific timezone; ID fields that assume all values are integers).

**After a significant change to source systems.** When an upstream system changes — a Salesforce field is repurposed, a transaction system is upgraded, a new product line changes the schema — profiling the changed data reveals the downstream impacts before they cascade into broken dashboards.

**As part of ongoing data quality monitoring.** Profiling is not a one-time exercise. Production data quality degrades over time as source systems change, user behavior changes, and edge cases accumulate. Continuous profiling — scheduled column-level statistics and anomaly detection — is the surveillance mechanism that catches degradation before users do.

Tools and Approaches

**Manual SQL profiling** — writing SQL queries to compute column statistics, check referential integrity, and identify duplicates. Fast to set up, flexible, but requires consistent effort to maintain as schemas change.

**dbt tests** — dbt's built-in testing framework (not_null, unique, accepted_values, relationships) operationalizes profiling as automated tests that run on every pipeline execution. When a dbt test fails, the pipeline fails — which is the correct behavior for data quality issues that would otherwise propagate silently.

**Great Expectations** — an open-source Python library that allows engineers to define "expectations" about data (this column should never be null, this field should always be positive, this column should contain only these values) and validate them automatically. Integrates with data pipeline orchestration tools.

**Monte Carlo, Bigeye, Soda** — commercial data observability platforms that automate anomaly detection across warehouse tables. Rather than requiring engineers to define every expectation explicitly, these tools learn the baseline distribution of each column and alert when the distribution changes meaningfully. Particularly useful for large warehouse environments where manual expectation definition is impractical at scale.

**dbt-profiler** — a dbt package that generates column-level statistics as dbt models, making profiling output queryable within the warehouse and comparable across runs.

Data Profiling vs. Data Quality Management

Data profiling is an assessment activity — it tells you what the data contains. Data quality management is an operational discipline that uses profiling as one input among several (including source system controls, pipeline monitoring, user feedback, and business rule validation) to maintain data quality over time.

Profiling without follow-through — discovering quality issues and not acting on them — is worse than not profiling, because it creates documented evidence that known problems were not addressed. The output of profiling should feed directly into a data quality issue backlog with owners and remediation timelines.

Our data architecture practice designs data quality frameworks that incorporate profiling into the development and operational lifecycle of data pipelines. Contact us to discuss your data quality challenges.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →