Healthcare data architecture must navigate HIPAA compliance, HL7/FHIR integration standards, clinical system complexity (EHR, PACS, lab systems), and the need to integrate disparate data sources for population health and clinical analytics.
## The quick answer
Healthcare data architecture must solve problems that most enterprise data architectures do not face: HIPAA compliance with its strict PHI handling requirements, clinical system integration across disparate EMR/EHR systems using HL7 and FHIR standards, near-real-time requirements for clinical decision support, and the challenge of linking patient data across systems that use different patient identifiers. The architecture patterns used in financial services and retail do not transfer directly — healthcare has specific requirements that shape every architectural decision.
This guide covers the regulatory requirements, integration standards, architecture patterns, and common failure modes specific to healthcare data environments.
## HIPAA requirements and their architectural implications
HIPAA (Health Insurance Portability and Accountability Act) governs the handling of Protected Health Information (PHI) — individually identifiable health information. Any architecture that stores, processes, or transmits PHI must comply with the HIPAA Security Rule (technical safeguards for electronic PHI), Privacy Rule (use and disclosure constraints), and Breach Notification Rule (incident response requirements).
**PHI identification**: HIPAA lists 18 identifier types: names, geographic data below state level, dates more specific than the year (including date of birth), phone and fax numbers, email addresses, SSN, medical record numbers, health plan beneficiary numbers, account numbers, certificate/license numbers, vehicle identifiers, device identifiers, web URLs, IP addresses, biometric identifiers, full-face photographs, and any other unique identifying number or code. Any dataset containing any of these elements combined with health information is PHI and subject to HIPAA.
**De-identification**: the two HIPAA de-identification methods are Expert Determination (a statistician certifies that the risk of re-identification is very small) and Safe Harbor (all 18 PHI identifiers are removed or generalised). De-identified data is not subject to HIPAA and can be used more freely for analytics. However, de-identification reduces analytical utility — date of birth becomes age range, geographic data becomes region.
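The generalisation step can be sketched in a few lines. This is a minimal illustration in the spirit of Safe Harbor, not a compliant implementation: the field names (`birth_date`, `zip`, `encounter_date`) and the set of dropped identifiers are assumptions made for the example, and the population check that Safe Harbor attaches to three-digit ZIP codes is left out.

```python
from datetime import date

# Direct identifiers dropped outright (illustrative subset of the 18 types).
DIRECT_IDENTIFIERS = {"name", "ssn", "mrn", "phone", "email", "street_address"}


def age_band(birth_date: date, as_of: date) -> str:
    """Generalise a date of birth to a 10-year band; ages 90+ share one band."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    age = as_of.year - birth_date.year - (0 if had_birthday else 1)
    if age >= 90:
        return "90+"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def deidentify(record: dict, as_of: date) -> dict:
    """Return a generalised copy of `record` (hypothetical field names)."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["age_band"] = age_band(out.pop("birth_date"), as_of)
    # Safe Harbor permits the first three ZIP digits only for sufficiently
    # populous areas; that population check is omitted in this sketch.
    out["zip3"] = out.pop("zip")[:3]
    out["encounter_year"] = out.pop("encounter_date").year
    return out
```

The loss of utility is visible in the output: an exact birth date collapses to a decade band and an encounter date to a year, which is enough for cohort counts but not for time-to-event analysis.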
**Access controls**: HIPAA requires access controls that limit each user to the PHI needed for their work (the minimum necessary principle); role-based access control is the usual implementation. Typical technical controls are column-level masking so that PHI fields are readable only by authorised roles, row-level security on patient data (clinicians see only their own patients), and audit logging of all PHI access.
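In a warehouse these controls are normally declared as masking and row-access policies, but the logic is easy to see in plain code. The sketch below is illustrative only: the two roles, the `PHI_COLUMNS` set, and the idea of a clinician "panel" of patient IDs are assumptions for the example, not a feature of any particular platform.

```python
PHI_COLUMNS = {"name", "ssn", "birth_date"}  # hypothetical PHI column set
MASK = "***"


def apply_policy(rows: list, role: str, panel: frozenset = frozenset()) -> list:
    """Apply row-level and column-level rules for a given role.

    clinician: sees unmasked rows, but only for patients in their panel.
    analyst:   sees every row, with PHI columns masked.
    Any other role is denied.
    """
    visible = []
    for row in rows:
        if role == "clinician":
            if row["patient_id"] in panel:  # row-level security
                visible.append(dict(row))
        elif role == "analyst":
            # column-level masking
            visible.append({k: (MASK if k in PHI_COLUMNS else v) for k, v in row.items()})
        else:
            raise PermissionError(f"role {role!r} has no access to patient data")
    return visible
```

Denying unknown roles by default, rather than returning masked data, keeps the policy aligned with the minimum necessary principle.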
**Encryption**: PHI should be encrypted in transit (TLS 1.2+) and at rest. The Security Rule classes encryption as an addressable safeguard rather than an absolute mandate, but in practice it is treated as required. Snowflake, BigQuery, and Azure Synapse all provide AES-256 encryption at rest by default. Field-level encryption for particularly sensitive fields (SSN, MRN) adds a protection layer beyond storage encryption.
**Business Associate Agreements (BAAs)**: every cloud service provider, SaaS vendor, or consulting firm that handles PHI must sign a Business Associate Agreement with the covered entity. Verify that your cloud data warehouse provider, ETL tool, BI platform, and any analytics consulting firm have signed BAAs before transmitting PHI to those services. Major cloud providers (AWS, Azure, GCP) and most enterprise data tools offer BAAs; smaller vendors may not.
**Audit logging**: every access to PHI must be logged: who accessed what patient data, when, and for what purpose. The audit log must be tamper-evident and retained for at least 6 years, the HIPAA documentation retention period; many contracts require longer. Snowflake's Query History, BigQuery's Data Access logs, and warehouse-level audit logging typically satisfy this requirement; verify against your compliance team's interpretation.
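One common way to make a log tamper-evident is to chain entries by hash, so that altering any earlier entry invalidates every later one. The following is a minimal sketch of that idea, assuming an in-memory list and hypothetical field names; a production system would also need write-once storage and protected key material.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(entry: dict, prev_hash: str) -> str:
    """Hash the entry together with the previous entry's hash."""
    payload = json.dumps(entry, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, user: str, patient_id: str, action: str,
                 purpose: str, ts: str) -> None:
    """Append one PHI access record, linked to the previous record."""
    entry = {"user": user, "patient_id": patient_id, "action": action,
             "purpose": purpose, "ts": ts}
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({**entry, "prev_hash": prev_hash, "hash": _digest(entry, prev_hash)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited or removed entry breaks it."""
    prev = GENESIS
    for rec in log:
        entry = {k: v for k, v in rec.items() if k not in ("hash", "prev_hash")}
        if rec["prev_hash"] != prev or rec["hash"] != _digest(entry, prev):
            return False
        prev = rec["hash"]
    return True
```

Running `verify` on a schedule, and storing the latest hash somewhere the log writer cannot modify, is what turns the chain into usable evidence.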
## Clinical system integration: HL7 and FHIR
Healthcare data arrives from a diverse set of source systems — Electronic Health Records (EHR/EMR), Picture Archiving and Communication Systems (PACS) for medical imaging, laboratory information systems, pharmacy systems, billing and revenue cycle systems, and patient-generated health data. Integration between these systems is governed by two standards:
**HL7 v2** is the legacy message-based standard. Clinical systems exchange ADT (Admit, Discharge, Transfer) messages, lab results (ORU messages), orders (ORM), and patient demographics in HL7 v2 format: pipe-delimited ASCII messages. Most established EHR systems (Epic, Cerner, Meditech) produce HL7 v2 messages. Ingesting HL7 v2 requires a parser that understands the message format and maps it to a relational or JSON structure. Tools: Mirth Connect (open-source integration engine) for parsing and routing, with Microsoft Azure API for FHIR or AWS HealthLake as FHIR-native stores once messages have been converted to FHIR.
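To make the message format concrete, here is a minimal sketch that parses a synthetic ADT^A01 message into a flat row. It assumes the default delimiters (`|` for fields, `^` for components) instead of reading them from the MSH segment, and it ignores repetitions, escape sequences, and subcomponents, all of which an integration engine handles. The sample message and the output field names are made up for the example.

```python
SAMPLE_ADT = "\r".join([
    r"MSH|^~\&|ADT1|HOSP|DW|HOSP|202401151230||ADT^A01|MSG0001|P|2.5",
    "PID|1||12345^^^HOSP^MR||DOE^JANE||19800615|F",
    "PV1|1|I|ICU^101^A",
])


def parse_segments(message: str) -> dict:
    """Split a message into segments keyed by segment ID (MSH, PID, ...)."""
    segments = {}
    for line in message.replace("\n", "\r").split("\r"):
        if line:
            fields = line.split("|")
            segments.setdefault(fields[0], []).append(fields)
    return segments


def get_field(segments: dict, seg: str, index: int, component: int = 1) -> str:
    """Return SEG-index.component using HL7's 1-based numbering."""
    fields = segments[seg][0]
    # In MSH the field separator itself counts as MSH-1, so positions shift by one.
    pos = index - 1 if seg == "MSH" else index
    value = fields[pos] if pos < len(fields) else ""
    parts = value.split("^")
    return parts[component - 1] if component - 1 < len(parts) else ""


def adt_to_row(message: str) -> dict:
    """Map the handful of fields an analytics load usually needs."""
    s = parse_segments(message)
    return {
        "message_type": get_field(s, "MSH", 9, 1),
        "trigger_event": get_field(s, "MSH", 9, 2),
        "mrn": get_field(s, "PID", 3, 1),
        "family_name": get_field(s, "PID", 5, 1),
        "given_name": get_field(s, "PID", 5, 2),
        "birth_date": get_field(s, "PID", 7),
        "sex": get_field(s, "PID", 8),
        "patient_class": get_field(s, "PV1", 2),
    }
```

The off-by-one handling for MSH is the detail that most hand-rolled parsers get wrong, which is a good reason to use an established engine for production feeds.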
**HL7 FHIR** (Fast Healthcare Interoperability Resources) is the modern standard. FHIR uses REST APIs and JSON/XML resource representations. FHIR resources map clinical concepts to a defined schema: Patient, Observation, Condition, Medication, Procedure, Encounter, Organization. Modern EHR systems (Epic on FHIR, Cerner Millennium) expose FHIR APIs. FHIR enables programmatic access to clinical data without requiring HL7 v2 message parsing.
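As a contrast with HL7 v2 parsing, the sketch below flattens a FHIR Patient resource into a warehouse-style row using only JSON handling. The resource is synthetic, the identifier system `urn:example:mrn` is a made-up URI, and the output column names are assumptions; real resources carry more variation (multiple names, missing fields) than this handles.

```python
import json

PATIENT_JSON = """
{
  "resourceType": "Patient",
  "id": "example-1",
  "identifier": [{"system": "urn:example:mrn", "value": "12345"}],
  "name": [{"use": "official", "family": "Doe", "given": ["Jane", "Q"]}],
  "gender": "female",
  "birthDate": "1980-06-15"
}
"""


def flatten_patient(resource: dict, mrn_system: str) -> dict:
    """Flatten a FHIR Patient resource into a single row."""
    if resource.get("resourceType") != "Patient":
        raise ValueError("expected a Patient resource")
    # Pick the identifier issued by the (hypothetical) MRN system, if any.
    mrn = next(
        (i.get("value") for i in resource.get("identifier", [])
         if i.get("system") == mrn_system),
        None,
    )
    # Prefer the official name, falling back to the first one listed.
    names = resource.get("name", [])
    name = next((n for n in names if n.get("use") == "official"),
                names[0] if names else {})
    return {
        "fhir_id": resource.get("id"),
        "mrn": mrn,
        "family_name": name.get("family"),
        "given_names": " ".join(name.get("given", [])),
        "gender": resource.get("gender"),
        "birth_date": resource.get("birthDate"),
    }


row = flatten_patient(json.loads(PATIENT_JSON), mrn_system="urn:example:mrn")
```

In practice the resource would come from an HTTP GET against the EHR's FHIR endpoint; the flattening logic is the same once the JSON is in hand.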
For new data platform builds, FHIR is the preferred integration standard — it is queryable via standard HTTP, well-documented, and increasingly required by regulation (US CMS Interoperability Rule mandates FHIR APIs for health plan data). For legacy systems that do not support FHIR, HL7 v2 integration via an integration engine (Mirth, Azure FHIR Service, Health Catalyst) remains necessary.
**DICOM** (Digital Imaging and Communications in Medicine) is the standard for medical imaging data — CT scans, MRI, X-ray, ultrasound. DICOM images and metadata are stored in PACS systems. Integrating DICOM data into an analytics platform requires DICOM metadata extraction (patient demographics, study date, modality, anatomical region) into structured tables; the raw image data is typically stored in a separate medical imaging archive rather than a data warehouse.
## Healthcare data domains and their integration challenges
**Patient master data**: the core challenge of healthcare analytics is linking patient records across systems. A patient in the EHR, the billing system, and the lab system may have different identifiers. Master Patient Index (MPI) management — entity resolution to link records representing the same patient across systems — is a prerequisite for accurate patient-centric analytics. MPI systems use probabilistic matching on name, date of birth, SSN (where available), address, and phone number to link records.
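Probabilistic matching is often described in Fellegi-Sunter terms: each compared field adds a positive weight when it agrees and a negative weight when it disagrees, and the total decides whether two records link. The sketch below shows that scoring under stated assumptions. The m and u probabilities, the thresholds, and the field names are illustrative values chosen for the example, not calibrated figures, and real MPI systems add fuzzy comparison (nicknames, typos, transposed dates).

```python
import math

# field -> (m, u): probability the field agrees for true matches (m)
# and for non-matches (u). Illustrative values only.
WEIGHTS = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "zip": (0.90, 0.05),
    "phone": (0.80, 0.001),
}


def _normalise(value) -> str:
    return str(value).strip().lower()


def match_score(a: dict, b: dict) -> float:
    """Sum of log2 agreement/disagreement weights over the compared fields."""
    score = 0.0
    for field, (m, u) in WEIGHTS.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue  # a missing value contributes no evidence either way
        if _normalise(va) == _normalise(vb):
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


def link_decision(score: float, upper: float = 10.0, lower: float = 0.0) -> str:
    """Classify a pair; scores between the thresholds go to manual review."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "review"
```

The middle "review" band matters in healthcare: a false link merges two patients' clinical histories, so borderline pairs are usually routed to a data steward instead of being auto-linked.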
**Clinical data**: encounter data (visit dates, provider, facility, encounter type), diagnosis codes (ICD-10), procedure codes (CPT, HCPCS), lab results (LOINC-coded), medication administrations, and vital signs. Clinical data typically arrives in HL7 v2 or FHIR format and requires significant transformation to be analytically useful — ICD-10 codes must be grouped into clinical conditions, LOINC codes must be mapped to readable lab names, medication codes must be normalised to drug names and classes.
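The code-grouping step can be as simple as a lookup on the three-character ICD-10 category. The mapping below is a tiny hand-written illustration, not a published grouper; production pipelines would load a maintained reference table instead of hard-coding one.

```python
# ICD-10 three-character category -> condition group (illustrative subset).
CONDITION_GROUPS = {
    "E10": "Diabetes",       # Type 1 diabetes mellitus
    "E11": "Diabetes",       # Type 2 diabetes mellitus
    "I10": "Hypertension",   # Essential (primary) hypertension
    "I50": "Heart failure",
    "J45": "Asthma",
}


def condition_group(icd10_code: str) -> str:
    """Roll a full ICD-10 code up to a condition group via its category."""
    category = icd10_code.replace(".", "").strip().upper()[:3]
    return CONDITION_GROUPS.get(category, "Other")


def group_counts(codes: list) -> dict:
    """Count diagnoses per condition group."""
    counts = {}
    for code in codes:
        group = condition_group(code)
        counts[group] = counts.get(group, 0) + 1
    return counts
```

Normalising case and punctuation before the lookup is deliberate: source systems disagree on whether codes carry the decimal point.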
**Claims data**: insurance claims data provides a different view of patient care: claims submitted to insurers for reimbursement, with revenue cycle timing (claim submitted, adjudicated, paid). Claims data is coded in ICD-10 (diagnoses) and CPT (procedures) and is often used for population health analytics because it covers care from every provider that bills the payer, not just care delivered at a single health system.
**EHR-specific data**: each EHR system (Epic, Cerner, Allscripts) has its own data model. Epic's Clarity database (a relational reporting database, typically on SQL Server or Oracle) and its Caboodle data warehouse, both part of the Cogito analytics suite, are the standard Epic reporting infrastructure; pulling data from these requires understanding Epic's complex schema. Cerner's FHIR API and HealtheIntent platform offer more standardised access. For health systems that have not standardised on a single EHR, multi-EHR integration is a significant architecture challenge.
## Architecture patterns for healthcare analytics
### Clinical data warehouse
The healthcare clinical data warehouse follows similar principles to other enterprise data warehouses — a central repository integrating data from multiple source systems — with healthcare-specific considerations:
The source layer ingests from EHR/EMR (via HL7 v2 or FHIR), claims systems, lab systems, pharmacy systems, and device data. Integration engines (Mirth Connect, Azure FHIR Service) normalise the HL7/FHIR messages before loading to the warehouse. The integration layer standardises clinical codes (mapping ICD-10 codes, LOINC codes, CPT codes to canonical references), resolves patient identity via MPI, and produces the Silver layer of standardised clinical entities.
The analytics layer (Gold) produces clinical fact and dimension tables: fact_encounters (one row per patient encounter), fact_labs (one row per lab result), dim_patients, dim_providers, dim_facilities, dim_diagnoses (ICD-10 with rollups to condition categories), dim_procedures (CPT with service category groups).
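A cut-down version of this star schema can be sketched with SQLite to show how the fact and dimension tables relate. The column choices and sample rows are assumptions for illustration; a real Gold layer would carry many more attributes and live in the warehouse, not in memory.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dim_patients (
    patient_key INTEGER PRIMARY KEY,
    age_band    TEXT,
    sex         TEXT
);
CREATE TABLE dim_diagnoses (
    diagnosis_key      INTEGER PRIMARY KEY,
    icd10_code         TEXT,
    condition_category TEXT   -- rollup used for reporting
);
CREATE TABLE fact_encounters (
    encounter_key         INTEGER PRIMARY KEY,  -- one row per encounter
    patient_key           INTEGER REFERENCES dim_patients(patient_key),
    primary_diagnosis_key INTEGER REFERENCES dim_diagnoses(diagnosis_key),
    encounter_type        TEXT,
    admit_date            TEXT
);
"""


def build_example_warehouse() -> sqlite3.Connection:
    """Create the tables in memory and load a few synthetic rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dim_patients VALUES (?, ?, ?)",
                     [(1, "40-49", "F"), (2, "60-69", "M")])
    conn.executemany("INSERT INTO dim_diagnoses VALUES (?, ?, ?)",
                     [(1, "E11.9", "Diabetes"), (2, "I10", "Hypertension")])
    conn.executemany("INSERT INTO fact_encounters VALUES (?, ?, ?, ?, ?)",
                     [(1, 1, 1, "outpatient", "2024-01-10"),
                      (2, 1, 1, "inpatient", "2024-02-03"),
                      (3, 2, 2, "outpatient", "2024-02-11")])
    return conn


def encounters_by_condition(conn: sqlite3.Connection) -> list:
    """Count encounters per condition category via the diagnosis dimension."""
    return conn.execute(
        """
        SELECT d.condition_category, COUNT(*) AS encounters
        FROM fact_encounters AS f
        JOIN dim_diagnoses AS d ON d.diagnosis_key = f.primary_diagnosis_key
        GROUP BY d.condition_category
        ORDER BY d.condition_category
        """
    ).fetchall()
```

Keeping the condition rollup in `dim_diagnoses` means a change to the grouping logic is a dimension update, with no need to reload the fact table.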
**OMOP CDM** (Observational Medical Outcomes Partnership Common Data Model) is a standardised data model for clinical data, maintained by OHDSI (Observational Health Data Sciences and Informatics). Mapping to OMOP CDM enables use of OHDSI's analytics tools (ATLAS, Achilles) and facilitates multi-site research collaboration where contributing health systems share OMOP-formatted data without sharing raw PHI. OMOP CDM is increasingly adopted by academic medical centres and health systems participating in research networks.
### Population health platform
Population health analytics identifies and manages high-risk patient populations to improve outcomes and reduce cost. The architecture extends the clinical data warehouse with:
**Risk stratification models**: ML models that score each patient's risk of hospitalisation, readmission, disease progression, or care gap. Risk models require time-series clinical features (vital sign trends, lab result trends, medication adherence), structured clinical data (diagnoses, procedures), and social determinants of health (SDOH) data (address-based socioeconomic indicators from sources like the USDA Food Access Research Atlas and the Area Deprivation Index).
**Care gap identification**: registry-based analytics identifying patients who are due for preventive care (mammography, colonoscopy, A1c monitoring for diabetic patients). Care gap lists are distributed to care managers for outreach.
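A registry-style care gap query reduces to "who is in the registry and has no qualifying result inside the window". The sketch below applies that to A1c monitoring. The 180-day window is an illustrative assumption, not a clinical recommendation, the record layout is hypothetical, and LOINC 4548-4 is used as the A1c code while a real measure would use a full value set.

```python
from datetime import date, timedelta

A1C_LOINC = "4548-4"  # Hemoglobin A1c; real measures use a broader value set


def a1c_care_gaps(diabetic_patient_ids: set, lab_results: list,
                  as_of: date, max_age_days: int = 180) -> list:
    """Return registry patients with no A1c result inside the window."""
    cutoff = as_of - timedelta(days=max_age_days)
    latest = {}  # patient_id -> date of most recent A1c result
    for result in lab_results:
        if result["loinc"] != A1C_LOINC:
            continue
        pid = result["patient_id"]
        if pid not in latest or result["result_date"] > latest[pid]:
            latest[pid] = result["result_date"]
    return sorted(
        pid for pid in diabetic_patient_ids
        if pid not in latest or latest[pid] < cutoff
    )
```

The sorted list of patient IDs is the kind of output that would feed a care manager's outreach worklist.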
**Outcome measurement**: population-level quality and outcome reporting, including HEDIS measures (Healthcare Effectiveness Data and Information Set), hospital-specific quality metrics, and CMS Star Ratings components.
### Real-time clinical decision support (CDS)
Clinical decision support (alerts and recommendations delivered to clinicians at the point of care) requires sub-second response times and must integrate with EHR workflows. The architecture has four parts: an event-driven trigger (HL7 v2 messages, FHIR Subscriptions, or CDS Hooks calls initiate CDS evaluation), low-latency feature retrieval (an online feature store holding pre-computed patient risk features), an inference endpoint (ML model or rules engine), and EHR integration (SMART on FHIR apps for modern EHRs, HL7 CDS Hooks for standards-based integration).
Real-time CDS is architecturally distinct from population health analytics — it requires streaming infrastructure, low-latency feature serving, and EHR integration that most data platforms do not have. It is typically built as a separate system from the analytical data warehouse, though sharing feature definitions with the analytics platform is best practice.
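The request path for a CDS Hooks style service can be sketched as a single function: read the hook and patient from the request, look up a pre-computed feature, apply a rule, and return cards. The `hook`, `context.patientId`, `cards`, `summary`, `indicator`, and `source.label` fields follow the CDS Hooks request and response shape; the in-memory feature store, the `readmission_risk` feature, and the 0.8 threshold are assumptions for the example.

```python
# Stand-in for an online feature store keyed by patient ID (hypothetical data).
FEATURE_STORE = {
    "pat-1": {"readmission_risk": 0.86},
    "pat-2": {"readmission_risk": 0.12},
}


def handle_hook(request: dict, features: dict = FEATURE_STORE,
                threshold: float = 0.8) -> dict:
    """Evaluate one hook call and return a CDS Hooks style response."""
    no_cards = {"cards": []}
    if request.get("hook") != "patient-view":
        return no_cards
    patient_id = request.get("context", {}).get("patientId")
    # Low-latency lookup of a pre-computed feature; no model runs inline here.
    risk = features.get(patient_id, {}).get("readmission_risk")
    if risk is None or risk < threshold:
        return no_cards
    return {
        "cards": [{
            "summary": f"High readmission risk ({risk:.0%})",
            "indicator": "warning",
            "source": {"label": "Example risk service"},
        }]
    }
```

Returning an empty card list for unknown patients or low scores keeps the service quiet by default, which helps limit alert fatigue.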
## Cloud platform considerations for healthcare
AWS has the broadest healthcare-specific managed services: AWS HealthLake (FHIR-native data store with natural language processing), Amazon Comprehend Medical (NLP for clinical notes), and AWS HealthImaging (DICOM-optimised storage). AWS HIPAA compliance documentation and available BAAs are mature.
Azure has Azure Health Data Services (FHIR Service, DICOM Service, MedTech Service for IoT device data) and integrates with Azure Synapse, Databricks, and Power BI for the analytics layer. For organisations already on Azure, Health Data Services reduces integration friction.
GCP has Cloud Healthcare API (FHIR, DICOM, HL7 v2 APIs), integration with BigQuery (streaming clinical data to BigQuery for analysis), and Vertex AI for ML model development. GCP's healthcare portfolio has strengthened significantly but is less mature than AWS and Azure for enterprise healthcare deployments.
All major cloud providers offer HIPAA BAAs and the compliance certifications required for healthcare data. The platform decision is typically driven by existing cloud investment and EHR vendor relationships rather than healthcare-specific platform differentiators.
For the data governance context that is especially critical in healthcare (PHI access control, lineage for regulatory requirements, quality standards for clinical data), see how to build a data governance framework and data lineage. For the financial services comparison, see data architecture for financial services.
Our data architecture consulting practice has designed healthcare data platforms for health systems, health plans, and digital health companies. If you are designing a healthcare data architecture or evaluating your current platform against modern patterns, book a free 30-minute audit.