HIPAA Compliance for AI: What Healthcare AI Builders Need to Know
HIPAA compliance for AI is the practice of ensuring that artificial intelligence systems—including model training pipelines, inference services, and data workflows—meet the requirements of the Health Insurance Portability and Accountability Act. Specifically, this means applying HIPAA’s Privacy Rule, Security Rule, and de-identification standards to every stage of AI development that touches Protected Health Information (PHI).


You're building AI that can transform healthcare: diagnostic assistants, clinical decision support, workflow automation. But the moment you add real patient data to that model, you've entered regulated territory. HIPAA compliance for AI isn't optional; it's a legal obligation, and civil penalties can reach roughly $1.5 million per violation category, per year (the statutory caps are adjusted for inflation, so check the current figures).
The challenge? HIPAA compliance rules were written before modern AI existed. They don't mention machine learning pipelines, large language models, or the specific risks that come with training AI on sensitive health data. That's left healthcare AI builders in a bind: follow 1990s compliance guidance with 2020s technology, or risk substantial penalties.
This guide walks you through what HIPAA actually requires for AI systems, how to interpret those requirements for modern workflows, and practical steps to build compliant AI without sacrificing functionality or performance.
What this article covers:
- How HIPAA applies to AI systems and training data
- De-identification standards that reduce risk
- The two paths to HIPAA compliance: Safe Harbor and Expert Determination
- Building compliant AI pipelines from data intake to model deployment
- Common pitfalls and how to avoid them
Understanding HIPAA and Protected Health Information
What HIPAA actually regulates
The Health Insurance Portability and Accountability Act (HIPAA) doesn't regulate health data in the abstract; it regulates health data that is linked to individuals and held by the organizations listed below. Once health data is identifiable, meaning someone could reasonably figure out whose information it is, and one of those organizations creates or handles it, it becomes Protected Health Information (PHI). Once it's PHI, HIPAA applies.
The law covers three groups:
- Covered entities: healthcare providers (doctors, hospitals, clinics), health plans (insurance companies), and healthcare clearinghouses
- Business associates: any organization that processes, stores, or handles PHI on behalf of a covered entity
- Subcontractors: vendors hired by business associates
If you're building AI for a hospital, you're likely a business associate or a subcontractor of one. That means HIPAA compliance is your responsibility, not just theirs.
What counts as PHI in AI contexts
PHI is health information tied to anything that could identify a patient. The identifiers include names, medical record numbers, and dates of birth; the health information includes diagnoses, medications, appointment notes, lab results, imaging descriptions, and clinical narratives. In unstructured data (free-text notes, call transcripts, audio recordings), PHI is often buried inside natural language, where standard tools miss it.
Here's the practical reality: a hospital's AI team trains a model on clinical notes to predict patient readmission. Those notes contain PHI. If the notes aren't de-identified first, every copy of the training set, every development environment, and every person with access falls under HIPAA's use, disclosure, and safeguard rules, and any slip is a potential violation. The exposure begins during training, before any deployment.
Why AI creates new compliance risks
Traditional healthcare IT systems are controlled environments: data flows through audited systems, access is logged, and retention periods are defined. AI systems add complexity:
- Data multiplication: Training data is copied to development, staging, and production environments—each a potential exposure point
- Model leakage: Trained models can memorize and regurgitate training data, including PHI, especially when fine-tuned on sensitive datasets
- Inference risks: When an AI system processes live patient data (even if trained on de-identified data), it generates new records that may be identifiable
- Auditability: It's harder to prove how a model arrived at a decision, making forensic investigation of breaches more difficult
HIPAA compliance for AI isn't just about de-identifying training data—it's about controlling data flow across the entire lifecycle.
The two paths to HIPAA compliance: Safe Harbor and Expert Determination
HIPAA gives you two ways to transform PHI into de-identified information that falls outside HIPAA's scope entirely. Understanding the difference is critical because each path has different technical and resource requirements.
Safe Harbor method
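Safe Harbor is a checklist approach: remove 18 specific identifiers from a dataset, and HIPAA considers the data de-identified, provided you have no actual knowledge that the remaining information could identify an individual. No statistical analysis is needed.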
The 18 HIPAA identifiers to remove:
- Names
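- Geographic subdivisions smaller than a state, including street address, city, county, and zip code (the first three digits of a zip code may be kept when the area they cover contains more than 20,000 people)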
- Dates directly related to an individual (except year)—including birth date, admission and discharge dates, and dates of death; for individuals 90 years or older, all dates including year
- Phone numbers
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate and license numbers
- Vehicle identifiers and serial numbers, including license plate numbers
- Device identifiers and serial numbers
- Web URLs
- IP addresses
- Biometric identifiers, including fingerprints and voiceprints
- Full-face photographs and comparable images
- Any other unique identifying number, characteristic, or code
When Safe Harbor works for AI:
Safe Harbor is straightforward for structured data. A hospital dataset with columns for patient ID, diagnosis, medication, lab value, and date? Drop the patient ID, reduce the date to a year, and you're done. Safe Harbor applies.
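For structured records, the checklist can be sketched as a field-level transformation. This is a minimal illustration rather than a complete Safe Harbor implementation: the column names are hypothetical, it drops zip codes entirely instead of applying the three-digit rule, and it covers only a handful of the 18 identifier categories.

```python
from datetime import date

# Hypothetical column names; adapt these sets to your own schema.
DIRECT_IDENTIFIERS = {"name", "mrn", "ssn", "phone", "email", "street_address", "zip"}
DATE_FIELDS = {"birth_date", "admission_date", "discharge_date"}


def safe_harbor_record(record: dict) -> dict:
    """Return a copy of `record` with Safe Harbor-style field rules applied."""
    is_90_plus = record.get("age", 0) >= 90
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # drop direct identifiers entirely
        if field in DATE_FIELDS:
            if field == "birth_date" and is_90_plus:
                continue  # a birth year would reveal an age over 89
            # Keep only the year; month and day must go.
            out[field.replace("_date", "_year")] = value.year
        elif field == "age":
            # Ages over 89 collapse into a single "90 or older" category.
            out[field] = "90+" if is_90_plus else value
        else:
            out[field] = value
    return out


example = {
    "name": "Jane Doe",
    "mrn": "12345",
    "birth_date": date(1941, 3, 2),
    "admission_date": date(2023, 2, 10),
    "age": 82,
    "diagnosis": "alcohol withdrawal",
}
print(safe_harbor_record(example))
```

Free-text fields are deliberately out of scope here; they need the kind of review described in the sections that follow.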
When Safe Harbor doesn't work:
Safe Harbor breaks down on unstructured data—exactly where AI systems live. Consider this clinical note:
"Patient presented on 7/8/2023 with acute confusion and tremors. Born March 1941 in rural Montana. Works as cardiologist at St. Claire Regional Hospital in Billings. Admitted February 2023 for alcohol withdrawal. Wife reported last drink 48 hours prior. Patient: Robert Kelleher."
Safe Harbor requires removing:
- Dates (present: 7/8/2023, plus the months in "March 1941" and "February 2023"; the years alone may stay)
- Name (present)
- Geographic detail smaller than a state (present: "Billings"; "Montana" on its own is state-level and may stay)
- Institution name (present: "St. Claire Regional Hospital" is not a named category on the list, but it pins the patient to a specific employer and city)
After removing all Safe Harbor identifiers, what's left? "Patient presented in 2023 with acute confusion and tremors. Born 1941 in rural Montana. Works as a cardiologist. Admitted 2023 for alcohol withdrawal. Wife reported last drink 48 hours prior."
The clinical utility has degraded significantly: the timing between the earlier admission and the current presentation is gone. More problematically, Safe Harbor's catch-all category ("any other unique identifying number, characteristic, or code") and its actual-knowledge condition mean that text suggesting an identity also matters in unstructured data. A person's job title, if rare, can identify them. A cardiologist in rural Montana. A researcher at a niche lab. These contextual identifiers aren't named on the 18-item checklist, but they can re-identify patients in combination with other data.
That's why de-identification for AI typically requires Expert Determination instead.
Expert Determination method
Expert Determination takes a different approach: instead of a checklist, you hire a qualified expert to evaluate whether the risk of re-identifying anyone in the dataset is very small. If the expert signs off and documents the analysis, HIPAA considers the data de-identified, with no fixed list of fields that must be removed.
An expert must have expertise in:
- Statistical de-identification methods and re-identification risk assessment
- Health data and healthcare systems (or the specific context of the data)
- Regulations and compliance requirements
The expert's job is not simply to remove identifiers carefully. It is to apply statistical and scientific methods to assess re-identification risk and to document the methods and results behind their professional judgment.
Why Expert Determination is better for AI:
Expert Determination allows selective removal of identifiers while preserving clinical nuance. Instead of removing all dates, you might shift dates by a consistent offset—3/15/2023 becomes 6/18/2023, but the relationships between events are preserved. Instead of removing institutional names, you might replace them with generic terms ("teaching hospital" instead of "Boston Medical Center") that retain context without identifying the institution.
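A consistent date shift of this kind can be sketched as follows. In the example above, 3/15/2023 to 6/18/2023 corresponds to an offset of 95 days. The sketch assumes one offset per patient, derived from a secret key with HMAC so that the same patient always gets the same shift; the function names, the key handling, and the one-year window are illustrative assumptions, not a method HIPAA prescribes.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset_days(patient_key: str, secret: bytes, max_days: int = 365) -> int:
    """Derive a stable per-patient offset in the range [-max_days, max_days]."""
    digest = hmac.new(secret, patient_key.encode("utf-8"), hashlib.sha256).digest()
    n = int.from_bytes(digest[:4], "big")
    return n % (2 * max_days + 1) - max_days


def shift_date(d: date, patient_key: str, secret: bytes) -> date:
    """Shift a date by the patient's offset; intervals between events are preserved."""
    return d + timedelta(days=patient_offset_days(patient_key, secret))


secret = b"store-this-key-outside-the-dataset"  # assumption: kept in a secrets manager
admitted = shift_date(date(2023, 2, 10), "patient-1", secret)
presented = shift_date(date(2023, 7, 8), "patient-1", secret)
print(presented - admitted)  # same interval as between the original dates
```

Because the offset depends only on the patient key and the secret, every event for a patient moves together. The secret has to stay inside the restricted PHI environment; anyone holding it can reverse the shift.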
This is why AI teams building on clinical data almost always use Expert Determination: you keep the data quality needed for model performance while meeting HIPAA requirements.
The tradeoff: Expert Determination requires hiring a qualified expert, documenting their methodology, and obtaining a formal report. Safe Harbor needs no outside expert and can be applied as soon as the removal rules are written. Expert Determination costs money and takes time. But if you're building AI on unstructured clinical data, the cost is hard to avoid, because Safe Harbor won't preserve enough data quality.
Safe Harbor vs. Expert Determination: At a glance
| | Safe Harbor | Expert Determination |
|---|---|---|
| Approach | Remove 18 specific identifiers from the dataset | Qualified expert statistically assesses re-identification probability |
| Best for | Structured data with clearly defined fields | Unstructured data, AI/ML training sets, complex clinical narratives |
| Clinical utility | Lower—removes dates, institution names, and contextual detail | Higher—preserves nuance through date-shifting and generalization |
| Cost | No outside expert required | Requires qualified expert; allow roughly 3–7 weeks (methodology development plus data processing) |
| Documentation | De-identification specification listing fields removed | Formal expert report documenting methodology and risk assessment |
| Key limitation | Fails on unstructured text; does not assess real-world re-ID risk | Higher cost and time; expert credentials must be verifiable |
Building a HIPAA-compliant AI pipeline: Step by step
Now that you understand the de-identification standards, let's look at how to structure an AI pipeline that meets them.
Step 1: Data ingestion and initial handling
When you receive PHI—clinical notes, imaging reports, call transcripts—it's immediately identifiable. From that point forward, HIPAA requires you to treat it as if it's a patient's actual health record. That means:
- Limit access: Only team members who need to handle raw PHI can access it
- Encrypt in transit: Use TLS 1.2+ for any data movement
- Encrypt at rest: Any storage of raw PHI must be encrypted
- Audit logs: Log every access to raw PHI, including who accessed it, when, and what they did
- Secure environment: Store raw PHI only in environments you control—ideally on-premises or in a VPC where you manage access controls
Do not send raw PHI to cloud storage buckets that aren't covered by a BAA and locked down. Do not email it. Do not share it with external vendors unless they've signed a Business Associate Agreement (BAA).
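The audit-log requirement above (who accessed raw PHI, when, and what they did) can be sketched as an append-only log of JSON lines. This is a minimal sketch: the field names and file path are assumptions, and a production system would normally write to tamper-evident, centrally managed logging rather than a local file.

```python
import json
from datetime import datetime, timezone


def audit_entry(user: str, action: str, resource: str) -> str:
    """Build one JSON audit line recording who did what to which resource, and when."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "resource": resource,  # a dataset or file reference, never the PHI itself
    }
    return json.dumps(entry, sort_keys=True)


def append_audit_entry(log_path: str, user: str, action: str, resource: str) -> None:
    """Append the entry to a log file opened in append mode."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(audit_entry(user, action, resource) + "\n")


print(audit_entry("alice", "read", "notes/batch-01"))
```

Note that the log records a reference to the data, not the data. Copying note text into the log would turn the log itself into another store of PHI.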
Step 2: De-identification preparation
Before de-identification, decide which approach you'll take:
If Safe Harbor is viable (which is rare for AI datasets):
- Document which of the 18 identifiers are present
- Plan removal or redaction of each
- Create a de-identification specification document listing every field and rule
- Plan for manual review of unstructured data (Safe Harbor on text requires human judgment)
If Expert Determination is needed (which is typical):
- Engage a qualified de-identification expert early—don't wait until you have data
- Discuss your specific data types (clinical notes, call transcripts, EHR exports) and use cases (AI training, analytics, research)
- Work with the expert to design a de-identification approach that balances utility and risk
- Establish the timeline and cost (allow 2–4 weeks for methodology development and 1–3 weeks for data processing)
Step 3: Execute de-identification
De-identification happens in a controlled environment—a dedicated system where raw PHI is processed under strict controls.
For unstructured data (clinical notes, transcripts):
Use de-identification software that handles NLP-based PII detection. General-purpose tools built for credit card and SSN redaction miss healthcare-specific identifiers—clinician names, patient initials embedded in narrative text, rare diseases that identify patients. A healthcare-focused tool should:
- Detect 50+ entity types covering PII, PHI, and PCI, with PHI coverage that includes healthcare-specific identifiers
- Handle multiple languages (if needed)
- Provide accuracy benchmarks on healthcare data (look for tools with 95%+ recall on real clinical data, not just test sets)
- Support Expert Determination by substituting meaningful replacements (generic terms for names, shifted dates for dates) rather than blanking identifiers out
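The difference between replacing and blanking can be illustrated with a toy redactor. It uses a few regular expressions as a stand-in for a real NLP detector, so it will miss names, initials, and most of the identifiers discussed above; the patterns and the placeholder format are assumptions chosen for illustration only.

```python
import re

# Toy patterns standing in for a real healthcare NLP detector.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def replace_with_placeholders(text: str):
    """Replace each detected value with a typed, numbered placeholder.

    The same value always maps to the same placeholder, so repeated
    mentions stay linked after redaction.
    """
    counters = {}
    mapping = {}

    def make_sub(label):
        def _sub(match):
            value = match.group(0)
            if value not in mapping:
                counters[label] = counters.get(label, 0) + 1
                mapping[value] = f"[{label}_{counters[label]}]"
            return mapping[value]
        return _sub

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_sub(label), text)
    return text, mapping


note = "Seen on 7/8/2023. Call 406-555-0142. Follow-up on 7/8/2023."
redacted, mapping = replace_with_placeholders(note)
print(redacted)  # Seen on [DATE_1]. Call [PHONE_1]. Follow-up on [DATE_1].
```

The returned mapping links placeholders back to the original values, so it is PHI and belongs in the restricted environment (or should be discarded), never alongside the redacted text.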
For structured data (EHR databases, claims):
- Remove or transform the 18 HIPAA identifiers
- Aggregate or suppress rare values that could re-identify (a patient with a rare diagnosis at a small hospital is identifiable, even if their name is removed)
- Document the transformation rules for the Expert Determination report
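Suppressing rare values can be sketched as a frequency threshold on a single column. The threshold of 5 and the "OTHER" placeholder are assumptions for illustration; the appropriate cutoff is something the expert would set for your data.

```python
from collections import Counter


def suppress_rare_values(records, field, min_count=5, placeholder="OTHER"):
    """Replace values of `field` that occur fewer than `min_count` times."""
    counts = Counter(r[field] for r in records)
    return [
        {**r, field: r[field] if counts[r[field]] >= min_count else placeholder}
        for r in records
    ]


records = [{"diagnosis": "type 2 diabetes"}] * 5 + [{"diagnosis": "rare cancer"}]
cleaned = suppress_rare_values(records, "diagnosis")
print(cleaned[-1])  # {'diagnosis': 'OTHER'}
```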
Validation:
- Conduct manual spot-checking: randomly select 50–100 records and verify no PHI remains visible
- Run the de-identified data through a PII detector to catch edge cases
- For Expert Determination, the expert should validate the de-identification before signing off
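The first two validation steps can be sketched together: draw a reproducible random sample for human review, and scan the de-identified output for residual patterns. The regular expressions below are a simple stand-in for a real PII detector, and the function names are hypothetical.

```python
import random
import re

# Simple residual patterns; a real check would use a dedicated PII detector.
RESIDUAL_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def sample_for_review(records, k=50, seed=0):
    """Pick up to k records at random; a fixed seed makes the sample reproducible."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def residual_hits(records):
    """Return (record index, pattern label) pairs for anything that still looks like PHI."""
    hits = []
    for idx, text in enumerate(records):
        for label, pattern in RESIDUAL_PATTERNS.items():
            if pattern.search(text):
                hits.append((idx, label))
    return hits


deidentified = ["Patient seen [DATE_1]", "Call 406-555-0142", "SSN 123-45-6789"]
print(residual_hits(deidentified))  # [(1, 'PHONE'), (2, 'SSN')]
```

Recording the seed alongside the review results lets an auditor reproduce exactly which records were checked.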
Step 4: Move de-identified data to AI environment
Once data is de-identified according to your chosen method (Safe Harbor or Expert Determination), it's no longer PHI. You can now:
- Copy it freely to development, staging, and production environments
- Share it with team members without PHI-level access restrictions, subject to any conditions in the expert's determination
- Use it for model training, validation, and testing
- Store it in standard cloud environments (S3, Azure Storage, etc.)
This is the critical transition point: raw PHI is locked down; de-identified data is accessible.
However: Keep PHI and de-identified data separate. Don't store raw PHI in the same environment as de-identified data. Don't mix datasets. If de-identified data becomes re-linked with identifiers (through a join or linkage), it reverts to PHI status and HIPAA restrictions apply again.
Step 5: Model training and validation
Once you're working with de-identified training data, HIPAA restrictions on the data itself are lifted—but you inherit new responsibilities. For a full breakdown of de-identification strategies specific to AI model development, see our guide on de-identification for AI training data.
Prevent data leakage in the model:
- Don't use identifier fields (names, MRNs, dates) as features, even in surrogate or shifted form, unless the model genuinely needs them
- Before release, verify the trained model doesn't memorize or regurgitate training data. This is especially a concern for generative models: LLMs fine-tuned on patient notes can reproduce specific training examples verbatim
- Use differential privacy or federated learning if you're training on particularly sensitive data. Differential privacy gives a mathematical bound on how much the model can reveal about any individual record; federated learning keeps raw data at its source but offers no such guarantee on its own
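One simple way to probe for regurgitation is to look for long word sequences that a model's output shares verbatim with the training text. The sketch below is a crude check under stated assumptions (whitespace tokenization, an 8-word window, exact matches only); it will not catch paraphrased leakage, and passing it is not proof that a model is safe.

```python
def ngrams(tokens, n):
    """Return the set of all n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(generated, training_texts, n=8):
    """List the n-word sequences in `generated` that appear verbatim in training text."""
    gen = ngrams(generated.lower().split(), n)
    leaked = set()
    for doc in training_texts:
        leaked |= gen & ngrams(doc.lower().split(), n)
    return sorted(" ".join(g) for g in leaked)


training = ["Wife reported last drink 48 hours prior to admission for withdrawal"]
output = "The model said wife reported last drink 48 hours prior to admission today"
for phrase in verbatim_overlap(output, training):
    print(phrase)
```

Run a check like this inside the restricted environment if the training text could still contain residual PHI, since the matched phrases are copies of that text.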
Document your training process:
- Record what de-identification method was used
- Maintain the Expert Determination report (if applicable) as part of your compliance documentation
- Track data lineage: where the training data came from, what transformations were applied, who had access
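A lineage record can be as simple as a small structured document stored next to the training set. The sketch below fingerprints the de-identified records with SHA-256 so you can later show which exact dataset a model was trained on; the class name and fields are hypothetical and would need to match your own documentation requirements.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class LineageRecord:
    dataset_name: str
    source: str
    deid_method: str        # "safe_harbor" or "expert_determination"
    transformations: list   # e.g. ["date_shift", "name_replacement"]
    content_sha256: str


def fingerprint(records) -> str:
    """Hash the de-identified records in order, one per line."""
    h = hashlib.sha256()
    for r in records:
        h.update(r.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def to_json(record: LineageRecord) -> str:
    """Serialize the lineage record for storage with compliance documentation."""
    return json.dumps(asdict(record), sort_keys=True)


notes = ["Patient seen [DATE_1].", "Follow-up on [DATE_2]."]
record = LineageRecord(
    "readmission-v1", "ehr-export", "expert_determination",
    ["date_shift", "name_replacement"], fingerprint(notes),
)
print(to_json(record))
```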
Step 6: Deployment and inference
When your AI system goes live and processes new patient data, that data is PHI again—even if it's only flowing through your model for inference.
For real-time or batch inference:
- If the system processes identifiable patient data (names, MRNs), limit access to authorized users
- Encrypt data in transit
- Log inference requests
- Don't store raw inputs unnecessarily—delete inference data according to your retention policy
- If results are logged, don't log them together with identifiers (don't create a table with patient name + model output; keep them separate)
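Keeping identifiers out of the results log can be sketched with a keyed token: the log stores an HMAC of the medical record number instead of the number itself, and the key lives elsewhere. The function names are hypothetical. Because the token is derived from an identifier, treat the log as PHI-linked data with its own access controls; this separates the data, it does not de-identify it.

```python
import hashlib
import hmac


def patient_token(mrn: str, secret: bytes) -> str:
    """Derive a stable token from an MRN using a secret key held outside the log store."""
    return hmac.new(secret, mrn.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def inference_log_row(mrn: str, model_output: str, secret: bytes) -> dict:
    """Build a log row containing the token and the output, never the raw identifier."""
    return {"patient_token": patient_token(mrn, secret), "output": model_output}


secret = b"kept-in-a-secrets-manager"  # assumption: not stored alongside the logs
row = inference_log_row("MRN-0012345", "readmission risk: high", secret)
print(row)
```

Only a service holding the key can link a token back to a patient, which keeps the name-plus-output table described above from ever being created.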
For model outputs:
- Document what information the model outputs and who can see it
- If outputs are stored, encrypt them
- Apply the same access controls as any other patient data
This is where many organizations slip up: they de-identify training data carefully, build a great model, and then deploy it into an environment that's wide open to the entire hospital. The model itself may carry little privacy risk, but the deployment environment does.
Step 7: Ongoing monitoring and documentation
HIPAA requires you to document your compliance approach. Create and maintain:
- De-identification methodology document: How you de-identified the training data (Safe Harbor rules applied, or Expert Determination approach)
- Expert Determination report (if applicable): The expert's written assessment of re-identification risk
- Data handling procedures: How raw PHI is ingested, where it's stored, who can access it, how long it's retained
- Access logs: Proof that you're monitoring who touches sensitive data
- Breach response plan: What you'll do if PHI is compromised (HIPAA requires notifying affected individuals without unreasonable delay and no later than 60 days after the breach is discovered)
These documents won't prevent violations—but they demonstrate you took reasonable steps to comply, which matters in enforcement actions.
Common pitfalls and how to avoid them
Pitfall 1: Assuming cloud AI services handle HIPAA compliance
Major cloud providers (AWS, Azure, Google Cloud) offer HIPAA-eligible services, but they only support compliance if you sign a BAA with the provider and configure the services correctly. HIPAA compliance is a shared responsibility:
- The cloud provider secures the infrastructure
- You secure your data and configuration
Many teams upload PHI to a standard S3 bucket or Vertex AI dataset, expecting the cloud provider to handle compliance. The provider doesn't know what's in your bucket or dataset and won't configure it for you; compliance is your job.
How to avoid it: Sign a Business Associate Agreement (BAA) with the provider and use only its HIPAA-eligible services when working with identifiable data. For AWS, this means an S3 bucket covered by a BAA with encryption, restricted access, and access logging, not a default bucket. For Azure, it means HIPAA-eligible services with proper configuration. Or, deploy your de-identification and AI pipeline on-premises or in a private VPC.
Pitfall 2: De-identifying at the wrong stage
Some teams strip data too aggressively right after ingestion, before they know what the model needs, and lose valuable signal. Others de-identify too late (after training), exposing PHI throughout development.
The right approach:
- Accept raw PHI only in a restricted environment
- De-identify immediately—before sharing with development teams
- Share de-identified data freely
De-identification is not something to put off. Do it as early as possible.
Pitfall 3: Not considering future re-identification risk
Safe Harbor removes 18 identifiers, but that doesn't prevent re-identification. A dataset with age 45, female, diagnosed with rare cancer in rural Montana might be re-identifiable through linkage with public databases. Safe Harbor assumes you'll use the de-identified data in isolation—if you plan to link it with other databases or publish it, re-identification risk increases.
Expert Determination explicitly accounts for this. The expert assesses the risk of re-identification given the data's intended use.
How to avoid it: If you're building on clinical data, assume Expert Determination is needed. Document the intended use of the data (training a model, publishing research, etc.) and let the expert assess risk accordingly.
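A rough first look at linkage risk is to count how many records share each combination of quasi-identifiers, in the spirit of k-anonymity. This is a screening sketch, not a substitute for the expert's assessment: the choice of quasi-identifier columns and the threshold k=5 are assumptions.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def records_below_k(records, quasi_identifiers, k=5):
    """Return records whose quasi-identifier combination is shared by fewer than k records."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records if sizes[tuple(r[q] for q in quasi_identifiers)] < k]


common = {"age_band": "40-49", "sex": "F", "diagnosis": "asthma"}
rare = {"age_band": "40-49", "sex": "F", "diagnosis": "rare cancer"}
dataset = [common] * 5 + [rare]
print(records_below_k(dataset, ["age_band", "sex", "diagnosis"]))
```

The single rare-cancer record stands alone once diagnosis is included, which is exactly the situation described above: no listed identifier remains, yet the combination is unique.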
Pitfall 4: Confusing de-identification with anonymization
HIPAA defines de-identification but not anonymization, and the two terms are often used interchangeably when they shouldn't be:
- De-identified data: Information that meets Safe Harbor or Expert Determination. It falls outside HIPAA's scope, but it could theoretically be re-identified with additional effort or data, and it becomes PHI again if it is re-linked to individuals
- Anonymized data: A stricter notion, used in frameworks such as GDPR, in which individuals can no longer be identified by any means reasonably likely to be used
Safe Harbor and Expert Determination both produce de-identified data. Neither requires that re-identification be impossible; Expert Determination only requires that the risk be very small. Most healthcare de-identification is therefore not anonymization: it's de-identification that reverts to HIPAA-regulated PHI if re-linked.
For AI purposes, this distinction matters: if your training data is de-identified (not anonymized), you must still document the de-identification and maintain it in your compliance records.
Pitfall 5: Focusing only on data, ignoring the model
The industry tends to focus on de-identifying training data and then forget that models can leak information. A language model fine-tuned on de-identified clinical notes can, under certain conditions, regurgitate training data—including elements the de-identification process didn't catch.
Mitigation strategies:
- Use smaller, more specific models rather than large general-purpose models when possible
- Add differential privacy to training to provide mathematical privacy guarantees
- Avoid publicly deploying models trained on healthcare data without additional safeguards
- For generative models, be cautious—fine-tuning on sensitive data carries higher re-identification risk
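The differential privacy mitigation usually means DP-SGD-style training: clip each example's gradient, sum, add Gaussian noise scaled to the clipping bound, then average. The sketch below shows only that aggregation step in plain Python, with hypothetical parameter names. It does no privacy accounting, so it does not tell you what guarantee you end up with; in practice you would rely on a maintained library for both.

```python
import math
import random


def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(grad):
            total[i] += x * scale
    # Noise standard deviation is tied to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(per_example_grads) for x in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)
print(dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single patient record can move the model, and the noise masks what remains. That is the source of the guarantee, and also of the accuracy cost.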
Pitfall 6: Not involving legal and compliance early
De-identification methodology, Safe Harbor vs. Expert Determination, whether your organization counts as a business associate under HIPAA: these are partly legal questions. Many AI teams solve them technically without legal input, then discover months later that the approach doesn't hold up to compliance standards.
How to avoid it: Involve your legal or compliance team during planning, not after. They should sign off on your de-identification approach before you process data. If your organization operates across the US and EU, also review how these requirements differ from GDPR obligations—our HIPAA vs. GDPR comparison breaks down the key practical differences for global organizations.
Practical checklist: Building compliant AI
Use this checklist to validate your compliance approach before going live:
- Identified all sources of PHI (clinical notes, EHR exports, audio, etc.)
- Determined whether Safe Harbor or Expert Determination applies
- If Expert Determination, engaged a qualified expert and documented their methodology
- Set up a restricted, encrypted environment for raw PHI ingestion
- Implemented de-identification before sharing data with development teams
- Validated de-identified data (spot-checked manually, ran through PII detection)
- Documented de-identification rules and obtained Expert Determination report (if applicable)
- Configured cloud services with HIPAA-compliant controls (encryption, VPC, BAAs if needed)
- Prevented data leakage in model training (no raw identifiers as features)
- Deployed the model in a controlled environment with access logs
- Maintained compliance documentation and audit trails
- Established a data retention policy and breach notification plan
- Reviewed the approach with legal/compliance before going live
Start building compliant AI
HIPAA compliance for AI isn't a box to check—it's an ongoing responsibility embedded in your data handling, de-identification, model training, and deployment processes. Get it right, and you unlock healthcare data's enormous potential without exposing your organization to breach, penalties, and loss of patient trust.
The roadmap is clear:
- De-identify PHI appropriately (Safe Harbor or Expert Determination)
- Control access to raw data
- Build models on clean, de-identified data
- Deploy carefully and audit thoroughly
- Document everything
Ready to move forward? Get a demo of Limina's de-identification platform to see how healthcare organizations streamline HIPAA compliance for AI at scale.
For comprehensive coverage of compliance methods and regulations, read our complete HIPAA de-identification guide.
Frequently Asked Questions
What’s the difference between a business associate and a covered entity?
A covered entity is a healthcare provider, health plan, or clearinghouse—regulated directly by HIPAA. A business associate is any organization that processes PHI on behalf of a covered entity. If you’re building AI for a hospital, you’re likely a business associate, which means HIPAA applies to you. The hospital (covered entity) must have a Business Associate Agreement (BAA) with you that specifies how you’ll handle PHI.
Can we use Safe Harbor if the clinical notes contain rare diseases?
Technically yes, if you remove all 18 identifiers. But rare diseases can re-identify patients in combination with other available data. Safe Harbor legally de-identifies the data within HIPAA, but the data may still be re-identifiable in practice. For AI, Expert Determination is more appropriate because the expert assesses real-world re-identification risk specific to your use case.
How long can we keep the raw PHI during de-identification?
Keep it only for as long as the de-identification process requires, and document the timeline. Once de-identification is complete, return or delete the raw PHI as your BAA specifies, unless you have a documented retention reason (like backup or recovery). HIPAA itself does not set a retention period for medical records (state law and your contracts usually do), so don’t keep raw PHI longer than necessary.
If we use a cloud de-identification service, who’s responsible for HIPAA compliance?
You are, although the vendor carries its own obligations as a business associate. You’re responsible for ensuring the service is configured correctly, encrypted, and auditable. Require the vendor to sign a BAA and demonstrate HIPAA compliance. If the vendor stores data in a shared environment, ensure your data is encrypted with keys that other tenants can’t access.
Can we add differential privacy to our model instead of de-identifying training data?
Differential privacy provides mathematical privacy guarantees: a model trained with differential privacy limits, by a provable bound, how much anyone can infer about whether a given individual was in the training set. This is powerful, but it is not one of HIPAA’s two recognized de-identification methods. You can use differential privacy as an additional safeguard, but don’t rely on it alone to meet HIPAA requirements.
What happens if we accidentally expose a model trained on de-identified data?
The de-identification status doesn’t change just because the model is exposed. If the model was trained on properly de-identified data (Safe Harbor or Expert Determination), publicly releasing the model doesn’t violate HIPAA. However, security best practice suggests keeping models confidential anyway—they can be reverse-engineered in some cases. More importantly, if the model inference pipeline processes live patient data, that’s PHI and must be protected.
How often do we need to refresh the Expert Determination report?
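HIPAA sets no expiration date for a determination, so if your de-identification methodology, data, and use case don’t change, the report can remain valid. In practice, many experts time-limit their determinations because the outside data available for re-identification changes over time. If you change the data source, the de-identification approach, or how you plan to use the data, obtain a new report. Annual reviews are a best practice even if nothing has changed, because they document ongoing diligence.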

