Privacy engineering translates legal requirements such as GDPR, CCPA, HIPAA, and LGPD into technical implementations that are sustainable, auditable, and proportionate. Where the law says "do not retain data longer than necessary," privacy engineering turns that into retention policies in your data warehouse, automated deletion pipelines, and audit trails proving compliance. This guide covers the key techniques (anonymisation, pseudonymisation, tokenisation, and differential privacy) and the data architecture decisions that make privacy compliance operationally manageable.
Anonymisation vs pseudonymisation
These two terms are frequently confused and have different legal implications under GDPR.
**Anonymisation** produces data from which the data subject cannot be identified, directly or indirectly, by any means reasonably likely to be used. True anonymisation takes data outside the scope of GDPR because the data is no longer personal data. However, genuine anonymisation is very difficult to achieve: many datasets that appear anonymous can be re-identified when combined with other data sources. The Article 29 Working Party's Opinion 05/2014 on Anonymisation Techniques sets a high bar for what counts as truly anonymised.
**Pseudonymisation** replaces identifying information with pseudonyms — tokens, hashes, or artificial identifiers — while retaining the ability to re-identify the individual with access to the mapping table. Pseudonymised data is still personal data under GDPR and requires all the same protections, but pseudonymisation is recognised as a technical measure that reduces privacy risk and is factored positively into compliance assessments.
For most enterprise data engineering contexts, you are implementing pseudonymisation rather than true anonymisation.
Pseudonymisation techniques
**Tokenisation**: Replace a sensitive value (email address, SSN, customer ID) with a non-sensitive token. The mapping from token to original value is maintained in a secure token vault. Analytics systems work with tokens; the vault is accessible only to systems that require the original value for operations. Tokenisation is reversible (with vault access) by design.
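The tokenisation flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `TokenVault`, the `tok_` prefix, and the in-memory dictionaries are all hypothetical stand-ins for a hardened, access-controlled vault service.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a secure token vault (hypothetical, not production-grade)."""

    def __init__(self):
        self._token_by_value = {}  # original value -> token
        self._value_by_token = {}  # token -> original value

    def tokenise(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same
        # token, which keeps joins and distinct counts consistent downstream.
        if value in self._token_by_value:
            return self._token_by_value[value]
        # The token is random, so it carries no information about the value.
        token = "tok_" + secrets.token_hex(16)
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenise(self, token: str) -> str:
        # A real vault would enforce access control and audit logging here.
        return self._value_by_token[token]


vault = TokenVault()
row = {"email": "ada@example.com", "order_total": 42.50}
# The analytics copy of the row carries the token instead of the email address.
safe_row = {**row, "email": vault.tokenise(row["email"])}
```

Because the token is random rather than derived from the value, someone holding only `safe_row` cannot work backwards to the email address; reversal depends entirely on access to the vault.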
**Cryptographic hashing**: Replace the sensitive value with a one-way hash (SHA-256, SHA-3). A hash cannot be inverted mathematically, but an attacker can still recover the input by hashing candidate values and comparing the results (a dictionary or rainbow-table attack). Hashing suits analytics use cases where re-identification is not required: if two systems hash email addresses the same way, they can join on the hashed email and correlate records without either system handling the actual address.
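Limitation: plain hashes of low-entropy inputs are susceptible to dictionary attacks. A US SSN has at most 10^9 possible values, so an attacker can hash every candidate and compare. For inputs with limited cardinality, mix a secret key into the hash (a keyed hash such as HMAC, sometimes called a secret salt or pepper), and share that key only with the systems that need to join on the value.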
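One common way to mix a secret into the hash is a keyed hash (HMAC), available in Python's standard library. The sketch below is illustrative only: the function name `pseudonymise`, the normalisation rule, and the placeholder key are assumptions, and in practice the key would come from a secrets manager rather than source code.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest (HMAC) of a normalised identifier."""
    # Normalise first so "Ada@Example.com " and "ada@example.com" still join.
    normalised = value.strip().lower()
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical placeholder key, shown inline only for illustration.
SECRET_KEY = b"replace-with-a-key-from-your-secrets-manager"

# Two systems that share the key derive the same join key independently.
crm_id = pseudonymise("Ada@Example.com ", SECRET_KEY)
web_id = pseudonymise("ada@example.com", SECRET_KEY)
```

Without the key, an attacker cannot precompute digests for every possible SSN or email address, which is what defeats the dictionary attack. The trade-off is that the key itself becomes sensitive: anyone who holds it can run the same attack again.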
**Encryption**: Encrypt sensitive columns with AES-256 or equivalent. Decryption requires the key. Key management (where the key is stored, who has access, how the key rotates) is the primary operational complexity. Snowflake, BigQuery, and Redshift all support customer-managed encryption keys for data at rest; check each platform's documentation for how those keys apply at column level, which often relies on separate encryption functions or on encrypting values before load.
Differential privacy
Differential privacy is a mathematical framework that adds calibrated noise to query results so that the output changes very little whether or not any one individual's data is in the dataset, which limits what can be inferred about that individual. The strength of the guarantee is set by a privacy parameter, usually written ε (epsilon): smaller values mean more noise and stronger privacy. It is used when you need to publish aggregate statistics (a cohort average, a demographic breakdown) while preventing re-identification from the statistical output.
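To make "calibrated noise" concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query, using only the standard library. The noise scale is the query's sensitivity divided by the privacy parameter epsilon. The function names, the sample data, and the choice of epsilon are assumptions for illustration; naive floating-point sampling like this has known weaknesses, so a published release should rely on a vetted library such as OpenDP.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of records matching `predicate` (Laplace mechanism)."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(7)  # seeded only so the example is repeatable
patients = [{"age": a} for a in (34, 51, 67, 72, 45, 80, 29, 66)]
# Smaller epsilon means more noise and a stronger privacy guarantee.
noisy = dp_count(patients, lambda p: p["age"] >= 65, epsilon=0.5, rng=rng)
```

Each released answer spends part of a privacy budget: repeating the query and averaging the results would cancel the noise, so real deployments also track the cumulative epsilon across queries.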
Practical implementations: Apple and Google use differential privacy for aggregate telemetry collection. Apple's implementation uses local differential privacy, adding noise on-device before data is sent. OpenDP and Google's differential privacy library provide open-source implementations. For organisations publishing aggregate reports from sensitive data (healthcare statistics, salary surveys), differential privacy provides quantifiable privacy guarantees.
Differential privacy is mathematically rigorous but operationally complex. For most enterprise analytics contexts, pseudonymisation and access controls are the primary technical measures; differential privacy is warranted for published reports where re-identification risk from aggregates is a concern.
Data retention and deletion
GDPR's storage limitation principle (Article 5(1)(e)) requires that personal data is not retained longer than necessary for the purpose it was collected. Engineering this requires:
**Retention policy definition**: Define retention periods for each data category (e.g., customer transaction data retained for 7 years for tax compliance; marketing event data retained for 2 years; support chat logs retained for 1 year after ticket closure).
**Automated deletion pipelines**: Implement scheduled jobs that delete records older than the retention period. For data warehouses, this is a statement of the form DELETE FROM <table> WHERE created_at < <retention_cutoff>. For data lakes, use partition expiration on object storage. For backup copies, coordinate deletion across backup retention cycles.
**Right to erasure (Article 17)**: A data subject can request deletion of their personal data. Engineering this requires: knowing where all instances of a data subject's data exist across all systems (data lineage and mapping), the ability to delete or anonymise those records, and a process for responding to deletion requests without undue delay and within one month of receipt (Article 12(3)), extendable by two further months for complex or numerous requests.
Data subject deletion is operationally complex because personal data exists in many places: the operational database, the warehouse, data lake snapshots, backup copies, BI tool extracts, and data shared with third parties. A data map (which systems hold personal data for which categories of subjects) is a prerequisite for executing erasure efficiently.
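The retention policy and automated deletion steps above can be sketched as a small scheduled job. The example uses SQLite from the standard library as a stand-in for a warehouse; the table names, the `created_at` column, and the `RETENTION_DAYS` mapping are hypothetical, with periods borrowed from the examples in the retention policy paragraph.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per table, in days.
RETENTION_DAYS = {
    "marketing_events": 2 * 365,
    "transactions": 7 * 365,
}


def purge_expired(conn: sqlite3.Connection, now: datetime) -> dict:
    """Delete rows past each table's retention period; return rows deleted per table."""
    deleted = {}
    for table, days in RETENTION_DAYS.items():
        cutoff = (now - timedelta(days=days)).isoformat()
        # Table names come from the fixed mapping above, never from user input.
        cur = conn.execute(f"DELETE FROM {table} WHERE created_at < ?", (cutoff,))
        deleted[table] = cur.rowcount
    conn.commit()
    return deleted


conn = sqlite3.connect(":memory:")
for table in RETENTION_DAYS:
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, created_at TEXT)")

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
three_years_ago = (now - timedelta(days=3 * 365)).isoformat()
conn.execute("INSERT INTO marketing_events (created_at) VALUES (?)", (three_years_ago,))
conn.execute("INSERT INTO transactions (created_at) VALUES (?)", (three_years_ago,))

# The per-table counts can be written to an audit log as evidence of each run.
audit = purge_expired(conn, now)
```

In this run the three-year-old marketing event is past its two-year limit and is deleted, while the transaction of the same age stays because its limit is seven years. Timestamps are stored as ISO 8601 strings in a single timezone so that string comparison matches chronological order.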
Data minimisation
GDPR's data minimisation principle (Article 5(1)(c)) requires collecting only data necessary for the specified purpose. Privacy engineering translates this into:
**Column-level access controls**: Apply column-level masking to sensitive columns in Snowflake, BigQuery, and Databricks so that analysts who do not need the raw value see a masked or tokenised version. Snowflake Dynamic Data Masking, BigQuery column-level security and data masking (via policy tags), and Databricks column masks implement this at the warehouse layer.
**Purpose-limited data sources**: Build analytical data sources that include only the columns needed for the analytics use case. A marketing analytics data source should not include health data collected for a clinical purpose. The data model enforces purpose limitation by excluding irrelevant columns.
**Consent enforcement**: If data was collected with specific consent (opted in to marketing communications), analytical systems should respect that consent — exclude customers who have not consented from marketing analytics and models.
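Purpose limitation and consent enforcement can be combined in the step that builds an analytical data source. The sketch below is a simplified illustration in plain Python; the column names, the `ALLOWED_COLUMNS` allow-list, and the `marketing_consent` flag are assumptions, and a warehouse would typically express the same logic as a view or a transformation model.

```python
# Hypothetical allow-list: the only columns each purpose may see.
ALLOWED_COLUMNS = {
    "marketing_analytics": {"customer_token", "region", "last_purchase_date"},
}


def build_marketing_dataset(customers):
    """Project to the allowed columns and drop customers without marketing consent."""
    allowed = ALLOWED_COLUMNS["marketing_analytics"]
    return [
        {col: value for col, value in row.items() if col in allowed}
        for row in customers
        if row.get("marketing_consent") is True  # missing or False means excluded
    ]


customers = [
    {"customer_token": "tok_a1", "region": "EU", "last_purchase_date": "2024-11-02",
     "diagnosis_code": "E11", "marketing_consent": True},
    {"customer_token": "tok_b2", "region": "EU", "last_purchase_date": "2024-10-15",
     "diagnosis_code": "I10", "marketing_consent": False},
]
dataset = build_marketing_dataset(customers)
```

Using an allow-list rather than a deny-list means a newly added sensitive column stays out of the marketing data source until someone explicitly decides it belongs there.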
GDPR compliance architecture patterns
**Pseudonymisation at ingestion**: Apply tokenisation or hashing at the point data enters your data platform, before it reaches the warehouse. The warehouse never sees raw PII; only the token vault system (which is separate and access-controlled) stores the PII-to-token mapping.
**Privacy vault pattern**: A central pseudonymisation service receives PII, stores it in an access-controlled vault, and returns tokens. All other systems receive only tokens. Re-identification is possible only via the vault, with appropriate access controls and audit logging.
**Data subject registry**: Maintain a mapping from data subject identifiers (customer IDs) to all systems that hold their data. Used to execute right-of-access and right-to-erasure requests efficiently.
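A data subject registry can start as a simple mapping from subject identifier to the systems holding that subject's data, plus one erasure routine per system. The sketch below is a hypothetical in-memory version: the class name, the method names, and the two example systems are assumptions, and real erasers would call each system's own deletion API.

```python
from collections import defaultdict


class DataSubjectRegistry:
    """Tracks which systems hold data for each subject (hypothetical in-memory sketch)."""

    def __init__(self):
        self._systems_by_subject = defaultdict(set)
        self._erasers = {}  # system name -> callable that erases one subject

    def register_system(self, system: str, eraser) -> None:
        self._erasers[system] = eraser

    def record(self, subject_id: str, system: str) -> None:
        self._systems_by_subject[subject_id].add(system)

    def erase(self, subject_id: str) -> dict:
        """Call each holding system's eraser; return per-system results for the audit log."""
        results = {}
        for system in sorted(self._systems_by_subject.get(subject_id, ())):
            results[system] = self._erasers[system](subject_id)
        # Forget the subject only once every system has confirmed deletion.
        if all(results.values()):
            self._systems_by_subject.pop(subject_id, None)
        return results


# Two dictionaries stand in for real systems in this example.
warehouse = {"cust_1": {"orders": 3}, "cust_2": {"orders": 1}}
crm = {"cust_1": {"tier": "gold"}}

registry = DataSubjectRegistry()
registry.register_system("warehouse", lambda sid: warehouse.pop(sid, None) is not None)
registry.register_system("crm", lambda sid: crm.pop(sid, None) is not None)
registry.record("cust_1", "warehouse")
registry.record("cust_1", "crm")
registry.record("cust_2", "warehouse")

outcome = registry.erase("cust_1")
```

Returning a per-system result instead of a single boolean makes partial failures visible, so a request that succeeded in the warehouse but failed in the CRM can be retried and evidenced separately.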
For the broader governance context, see data governance framework. Our data architecture consulting practice designs privacy-by-design data architectures for organisations with GDPR, CCPA, and HIPAA obligations — book a free architecture review.