May 31, 2026

HIPAA Compliance for AI: What Healthcare AI Builders Need to Know

HIPAA compliance for AI is the practice of ensuring that artificial intelligence systems—including model training pipelines, inference services, and data workflows—meet the requirements of the Health Insurance Portability and Accountability Act. Specifically, this means applying HIPAA’s Privacy Rule, Security Rule, and de-identification standards to every stage of AI development that touches Protected Health Information (PHI).

Limina

You're building AI that can transform healthcare: diagnostic assistants, clinical decision support, workflow automation. But the moment you add real patient data to that model, you've entered regulated territory. HIPAA compliance for AI isn't optional; it's a legal obligation, and civil penalties can reach $1.5 million or more per violation category, per year (the caps are adjusted annually for inflation).

The challenge? HIPAA compliance rules were written before modern AI existed. They don't mention machine learning pipelines, large language models, or the specific risks that come with training AI on sensitive health data. That's left healthcare AI builders in a bind: follow 1990s compliance guidance with 2020s technology, or risk substantial penalties.

This guide walks you through what HIPAA actually requires for AI systems, how to interpret those requirements for modern workflows, and practical steps to build compliant AI without sacrificing functionality or performance.

What this article covers:

How HIPAA applies to AI systems and training data
De-identification standards that reduce risk
The two paths to HIPAA compliance: Safe Harbor and Expert Determination
Building compliant AI pipelines from data intake to model deployment
Common pitfalls and how to avoid them

Understanding HIPAA and Protected Health Information

What HIPAA actually regulates

The Health Insurance Portability and Accountability Act (HIPAA) doesn't restrict what you do with health data—it restricts what happens when that data is linked to individuals. Once health data is identifiable—meaning someone could reasonably figure out whose information it is—it becomes Protected Health Information (PHI). Once it's PHI, HIPAA applies.

The law covers three groups:

Covered entities: healthcare providers (doctors, hospitals, clinics), health plans (insurance companies), and healthcare clearinghouses
Business associates: any organization that processes, stores, or handles PHI on behalf of a covered entity
Subcontractors: vendors hired by business associates

If you're building AI for a hospital, you're likely a business associate or contractor. That means HIPAA compliance is your responsibility, not just theirs.

What counts as PHI in AI contexts

PHI is any health information that can be tied to a specific patient. That covers direct identifiers (names, medical record numbers, dates of birth) and the clinical content linked to them (diagnoses, medications, appointment notes, lab results, imaging descriptions, and clinical narratives). In unstructured data such as free-text notes, call transcripts, and audio recordings, PHI is often buried inside natural language, where standard tools miss it.

Here's the practical reality: a hospital's AI team trains a model on clinical notes to predict patient readmission. Those notes contain PHI. If the notes aren't de-identified first and no other HIPAA permission covers the use, the hospital risks a violation simply by creating a training dataset where patients are identifiable. The exposure begins during training, before any deployment.

Why AI creates new compliance risks

Traditional healthcare IT systems are controlled environments: data flows through audited systems, access is logged, and retention periods are defined. AI systems add complexity:

Data multiplication: Training data is copied to development, staging, and production environments—each a potential exposure point
Model leakage: Trained models can memorize and regurgitate training data, including PHI, especially when fine-tuned on sensitive datasets
Inference risks: When an AI system processes live patient data (even if trained on de-identified data), it generates new records that may be identifiable
Auditability: It's harder to prove how a model arrived at a decision, making forensic investigation of breaches more difficult

HIPAA compliance for AI isn't just about de-identifying training data—it's about controlling data flow across the entire lifecycle.

The two paths to HIPAA compliance: Safe Harbor and Expert Determination

HIPAA gives you two ways to transform PHI into de-identified information that falls outside HIPAA's scope entirely. Understanding the difference is critical because each path has different technical and resource requirements.

Safe Harbor method

Safe Harbor is a checklist approach: remove 18 specific identifiers from a dataset, confirm you have no actual knowledge that what remains could identify a patient, and HIPAA considers the data de-identified. No statistical analysis needed.

The 18 HIPAA identifiers to remove:

Names
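Geographic subdivisions smaller than a state, including street address, city, county, and zip code (the first three digits of a zip code may stay if that three-digit area has more than 20,000 residents)
Dates directly related to an individual (except year), including birth date, admission and discharge dates, and dates of death; ages over 89, and any date element (including year) that reveals such an age, must be grouped into a single "90 or older" category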
Phone numbers
Fax numbers
Email addresses
Social Security numbers
Medical record numbers
Health plan beneficiary numbers
Account numbers
Certificate and license numbers
Vehicle identifiers and serial numbers, including license plate numbers
Device identifiers and serial numbers
Web URLs
IP addresses
Biometric identifiers, including fingerprints and voiceprints
Full-face photographs and comparable images
Any other unique identifying number, characteristic, or code

When Safe Harbor works for AI:

Safe Harbor is straightforward for structured data. A hospital dataset with columns for patient ID, diagnosis, medication, lab value, and date? Reduce each date to its year, drop the patient IDs or replace them with codes that aren't derived from patient information, and you're done. Safe Harbor applies.
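For a structured table, the Safe Harbor rules can be sketched as a per-row transform. This is a minimal sketch, not a complete implementation: the column names (`name`, `mrn`, `zip`, `birth_date`, `age`) are hypothetical, only a few of the 18 identifier types are handled, and the set of low-population zip prefixes is an illustrative placeholder that would need to be populated from current Census data.

```python
from datetime import date

# Hypothetical column names; adjust to your own schema.
DIRECT_IDENTIFIERS = {"name", "mrn", "ssn", "phone", "email"}

# Illustrative placeholder. Three-digit zip prefixes covering 20,000 or
# fewer people must be replaced with "000"; fill this from Census data.
LOW_POPULATION_ZIP3 = {"036", "059"}


def safe_harbor_row(row: dict) -> dict:
    """Return a copy of `row` with Safe Harbor-style transforms applied."""
    age = row.get("age")
    over_89 = isinstance(age, int) and age >= 90
    out = {}
    for key, value in row.items():
        if key in DIRECT_IDENTIFIERS:
            continue  # drop direct identifiers entirely
        if isinstance(value, date):
            # Keep only the year, except a birth year that reveals age 90+.
            out[key] = None if (over_89 and key == "birth_date") else value.year
        elif key == "zip":
            zip3 = str(value)[:3]
            out[key] = "000" if zip3 in LOW_POPULATION_ZIP3 else zip3
        elif key == "age":
            out[key] = "90+" if over_89 else value  # top-code ages over 89
        else:
            out[key] = value
    return out
```

Running every row through one function like this also gives you a single place to document the rules for your de-identification specification.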

When Safe Harbor doesn't work:

Safe Harbor breaks down on unstructured data—exactly where AI systems live. Consider this clinical note:

"Patient presented on 7/8/2023 with acute confusion and tremors. Born March 1941 in rural Montana. Works as cardiologist at St. Claire Regional Hospital in Billings. Admitted February 2023 for alcohol withdrawal. Wife reported last drink 48 hours prior. Patient: Robert Kelleher."

Safe Harbor requires removing:

Dates more specific than a year (present: 7/8/2023, March 1941, and February 2023 must each be reduced to the year)
Name (present: Robert Kelleher)
Employer name (present: St. Claire Regional Hospital; the identifier list also covers the patient's relatives, employers, and household members)
Geographic detail smaller than a state (present: "Billings"; "Montana" alone may stay)

After removing all Safe Harbor identifiers, what's left? "Patient presented in 2023 with acute confusion and tremors. Born 1941 in rural Montana. Works as cardiologist. Admitted 2023 for alcohol withdrawal. Wife reported last drink 48 hours prior."

The clinical utility has degraded significantly: the five-month gap between the February admission and the July presentation, exactly what a readmission model needs, is gone. More problematically, what remains can still point to one person. Safe Harbor also requires that you have no actual knowledge that the remaining information could identify the patient, and a rare job title can do exactly that. A doctor at a small hospital. A researcher at a niche lab. These contextual identifiers aren't named on the 18-item checklist, but they can re-identify patients in combination with other data.

That's why de-identification for AI typically requires Expert Determination instead.

Expert Determination method

Expert Determination takes a different approach: instead of a checklist, you hire a qualified expert to evaluate whether the risk of re-identifying anyone in the dataset is very small. If the expert signs off and documents the analysis, HIPAA considers the data de-identified, with no fixed list of fields to remove.

An expert must have expertise in:

Statistical de-identification methods and re-identification risk assessment
Health data and healthcare systems (or the specific context of the data)
Regulations and compliance requirements

This doesn't mean the expert removes identifiers carefully—it means the expert applies statistical methods to assess re-identification risk and documents their professional judgment.

Why Expert Determination is better for AI:

Expert Determination allows selective removal of identifiers while preserving clinical nuance. Instead of removing all dates, you might shift dates by a consistent offset—3/15/2023 becomes 6/18/2023, but the relationships between events are preserved. Instead of removing institutional names, you might replace them with generic terms ("teaching hospital" instead of "Boston Medical Center") that retain context without identifying the institution.
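Consistent date shifting can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names, the 365-day window, and the use of a keyed hash to derive a stable per-patient offset are illustrative choices, not a method prescribed by HIPAA. Whether shifting is acceptable for your data, and over what window, is for your expert to decide.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset(patient_id: str, secret: bytes, max_days: int = 365) -> timedelta:
    """Derive a stable, non-zero shift in [-max_days, +max_days] for one patient."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    n = int.from_bytes(digest[:8], "big") % (2 * max_days)  # 0 .. 2*max_days - 1
    days = n - max_days                                      # -max_days .. max_days - 1
    if days >= 0:
        days += 1                                            # skip zero: 1 .. max_days
    return timedelta(days=days)


def shift_dates(dates, patient_id: str, secret: bytes):
    """Shift every date for one patient by the same offset."""
    offset = patient_offset(patient_id, secret)
    return [d + offset for d in dates]


# Example: the interval between two events survives the shift.
admitted, presented = date(2023, 2, 10), date(2023, 7, 8)
shifted = shift_dates([admitted, presented], "patient-001", b"demo-only-secret")
```

Because every date for a patient moves by the same amount, intervals such as time to readmission are preserved. The secret must stay inside the restricted PHI environment, since anyone who holds it can reverse the shift.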

This is why AI teams building on clinical data almost always use Expert Determination: you keep the data quality needed for model performance while meeting HIPAA requirements.

The tradeoff: Expert Determination requires hiring a qualified expert, documenting their methodology, and obtaining a formal report. Safe Harbor is free and instant. Expert Determination costs money and takes time. But if you're building AI on unstructured clinical data, the cost is unavoidable—Safe Harbor won't preserve enough data quality.

Safe Harbor vs. Expert Determination: At a glance

	Safe Harbor	Expert Determination
Approach	Remove 18 specific identifiers from the dataset	Qualified expert statistically assesses re-identification probability
Best for	Structured data with clearly defined fields	Unstructured data, AI/ML training sets, complex clinical narratives
Clinical utility	Lower—removes dates, institution names, and contextual detail	Higher—preserves nuance through date-shifting and generalization
Cost	Free; no expert required	Requires qualified expert; allow 2–6 weeks
Documentation	De-identification specification listing fields removed	Formal expert report documenting methodology and risk assessment
Key limitation	Fails on unstructured text; does not assess real-world re-ID risk	Higher cost and time; expert credentials must be verifiable

Building a HIPAA-compliant AI pipeline: Step by step

Now that you understand the de-identification standards, let's look at how to structure an AI pipeline that meets them.

Step 1: Data ingestion and initial handling

When you receive PHI—clinical notes, imaging reports, call transcripts—it's immediately identifiable. From that point forward, HIPAA requires you to treat it as if it's a patient's actual health record. That means:

Limit access: Only team members who need to handle raw PHI can access it
Encrypt in transit: Use TLS 1.2+ for any data movement
Encrypt at rest: Any storage of raw PHI must be encrypted
Audit logs: Log every access to raw PHI, including who accessed it, when, and what they did
Secure environment: Store raw PHI only in environments you control—ideally on-premises or in a VPC where you manage access controls

Do not send raw PHI to cloud storage that isn't covered by a BAA and locked down. Do not email it. Do not share it with external vendors unless they've signed a business associate agreement.
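The access-limit and audit-log controls above can be sketched as a wrapper around any function that touches raw PHI. This is a minimal sketch: `AUDIT_LOG`, `AUTHORIZED`, and `read_raw_note` are hypothetical names, and a real system would write to an append-only store and check permissions against your identity provider rather than an in-memory set.

```python
import functools
import json
import time

AUDIT_LOG = []            # stand-in for an append-only audit store
AUTHORIZED = {"alice"}    # hypothetical allow-list of users cleared for raw PHI


def audited(action):
    """Record who did what to which record, whether or not the call succeeds."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(user, record_id, *args, **kwargs):
            entry = {"ts": time.time(), "user": user, "action": action,
                     "record_id": record_id, "ok": False}
            try:
                result = fn(user, record_id, *args, **kwargs)
                entry["ok"] = True
                return result
            finally:
                # Runs on success and on failure, so denied attempts are logged too.
                AUDIT_LOG.append(json.dumps(entry))
        return inner
    return wrap


@audited("read_raw_note")
def read_raw_note(user, record_id):
    if user not in AUTHORIZED:
        raise PermissionError(f"{user} is not cleared for raw PHI")
    return f"<raw note {record_id}>"  # placeholder for a real lookup
```

Logging in a `finally` block means failed and denied attempts leave a record as well, which is usually what an investigator wants to see first.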

Step 2: De-identification preparation

Before de-identification, decide which approach you'll take:

If Safe Harbor is viable (which is rare for AI datasets):

Document which of the 18 identifiers are present
Plan removal or redaction of each
Create a de-identification specification document listing every field and rule
Plan for manual review of unstructured data (Safe Harbor on text requires human judgment)

If Expert Determination is needed (which is typical):

Engage a qualified de-identification expert early—don't wait until you have data
Discuss your specific data types (clinical notes, call transcripts, EHR exports) and use cases (AI training, analytics, research)
Work with the expert to design a de-identification approach that balances utility and risk
Establish the timeline and cost (allow 2–4 weeks for methodology development and 1–3 weeks for data processing)

Step 3: Execute de-identification

De-identification happens in a controlled environment—a dedicated system where raw PHI is processed under strict controls.

For unstructured data (clinical notes, transcripts):

Use de-identification software that handles NLP-based PII detection. General-purpose tools built for credit card and SSN redaction miss healthcare-specific identifiers—clinician names, patient initials embedded in narrative text, rare diseases that identify patients. A healthcare-focused tool should:

Detect 50+ entity types covering PII, PHI, and PCI, including healthcare-specific identifiers
Handle multiple languages (if needed)
Provide accuracy benchmarks on healthcare data (look for tools with 95%+ recall on real clinical data, not just test sets)
Support Expert Determination by replacing identifiers with realistic surrogates (generic terms for names, shifted dates for dates) rather than blanking them out

For structured data (EHR databases, claims):

Remove or transform the 18 HIPAA identifiers
Aggregate or suppress rare values that could re-identify (a patient with a rare diagnosis at a small hospital is identifiable, even if their name is removed)
Document the transformation rules for the Expert Determination report
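Suppressing rare values can be sketched as follows. This is a minimal sketch: the threshold of 5 and the `OTHER` placeholder are assumptions for illustration, and the right threshold for your dataset is something the expert should set.

```python
from collections import Counter


def suppress_rare(rows, column, min_count=5, placeholder="OTHER"):
    """Replace values of `column` that occur fewer than `min_count` times."""
    counts = Counter(row[column] for row in rows)
    return [
        {**row, column: row[column] if counts[row[column]] >= min_count else placeholder}
        for row in rows
    ]


# Example: one rare diagnosis among common ones is generalized.
rows = [{"dx": "type 2 diabetes"}] * 5 + [{"dx": "fibrodysplasia ossificans progressiva"}]
cleaned = suppress_rare(rows, "dx")
```

The function returns new rows instead of editing in place, so the original table stays untouched in the restricted environment.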

Validation:

Conduct manual spot-checking: randomly select 50–100 records and verify no PHI remains visible
Run the de-identified data through a PII detector to catch edge cases
For Expert Determination, the expert should validate the de-identification before signing off
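The spot-check and the automated re-scan can be sketched together. This is a minimal sketch: the four regular expressions are illustrative and catch only obviously formatted identifiers, so they supplement a healthcare-focused detector and human review instead of replacing either.

```python
import random
import re

# Illustrative patterns only; names and free-text identifiers need NLP-based detection.
RESIDUAL_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def scan(text):
    """Return the sorted names of residual patterns found in `text`."""
    return sorted(name for name, pattern in RESIDUAL_PATTERNS.items()
                  if pattern.search(text))


def sample_for_review(records, k=50, seed=0):
    """Pick up to `k` records at random for manual review (seeded for audits)."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))
```

Seeding the sampler lets you reproduce exactly which records a reviewer saw, which is useful when the validation step itself is audited.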

Step 4: Move de-identified data to AI environment

Once data is de-identified according to your chosen method (Safe Harbor or Expert Determination), it's no longer PHI. You can now:

Copy it freely to development, staging, and production environments
Share it with team members without access restrictions
Use it for model training, validation, and testing
Store it in standard cloud environments (S3, Azure Storage, etc.)

This is the critical transition point: raw PHI is locked down; de-identified data is accessible.

However: Keep PHI and de-identified data separate. Don't store raw PHI in the same environment as de-identified data. Don't mix datasets. If de-identified data becomes re-linked with identifiers (through a join or linkage), it reverts to PHI status and HIPAA restrictions apply again.

Step 5: Model training and validation

Once you're working with de-identified training data, HIPAA restrictions on the data itself are lifted—but you inherit new responsibilities. For a full breakdown of de-identification strategies specific to AI model development, see our guide on de-identification for AI training data.

Prevent data leakage in the model:

Don't use direct identifiers (names, MRNs, exact dates) as features, even if some slipped through de-identification
After training and before release, test whether the model memorizes or regurgitates training data. This is especially problematic for generative models (LLMs fine-tuned on patient notes can recall specific training examples verbatim)
Consider differential privacy or federated learning if you're training on particularly sensitive data. Differential privacy adds a mathematical bound on how much the model can leak about any individual record; federated learning keeps raw data at its source but offers no such guarantee on its own
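One simple way to test for verbatim regurgitation is to measure how many word n-grams in a model output also appear in the training set. This is a minimal sketch: the function names and the 8-token window are assumptions, and a low score does not rule out paraphrased or partial leakage.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(training_texts, n=8):
    """Collect every n-gram that appears anywhere in the training texts."""
    index = set()
    for text in training_texts:
        index |= ngrams(text, n)
    return index


def verbatim_overlap(output, index, n=8):
    """Fraction of the output's n-grams that also occur in the training index."""
    grams = ngrams(output, n)
    if not grams:
        return 0.0  # output shorter than n tokens
    return len(grams & index) / len(grams)
```

In practice you would run this over many sampled generations and investigate any output whose overlap is well above what ordinary clinical phrasing would explain.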

Document your training process:

Record what de-identification method was used
Maintain the Expert Determination report (if applicable) as part of your compliance documentation
Track data lineage: where the training data came from, what transformations were applied, who had access

Step 6: Deployment and inference

When your AI system goes live and processes new patient data, that data is PHI again—even if it's only flowing through your model for inference.

For real-time or batch inference:

If the system processes identifiable patient data (names, MRNs), limit access to authorized users
Encrypt data in transit
Log inference requests
Don't store raw inputs unnecessarily—delete inference data according to your retention policy
If results are logged, don't log them together with identifiers (don't create a table with patient name + model output; keep them separate)
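Keeping identifiers out of the output log can be sketched with a keyed pseudonym. This is a minimal sketch: `log_inference`, the 16-character token, and the in-memory structures are hypothetical. Because the token is derived from the MRN, the log is not de-identified by this step; it just stops carrying a direct identifier, and the link table and key belong in the restricted environment.

```python
import hashlib
import hmac


def pseudonym(mrn: str, key: bytes) -> str:
    """Derive a stable token for an MRN using a secret key."""
    return hmac.new(key, mrn.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def log_inference(mrn, score, key, output_log, link_table):
    """Store the model output under a token; keep the token-to-MRN link separately."""
    token = pseudonym(mrn, key)
    link_table[token] = mrn  # restricted PHI environment only
    output_log.append({"patient_token": token, "score": score})
    return token
```

Analysts can then work with the output log using tokens alone, while only authorized staff with access to the link table can resolve a token back to a patient.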

For model outputs:

Document what information the model outputs and who can see it
If outputs are stored, encrypt them
Apply the same access controls as any other patient data

This is where many organizations slip up: they de-identify training data carefully, build a great model, and then deploy it into an environment that's wide open to the entire hospital. Even when the model itself is low-risk, the deployment environment handles live PHI and needs the same controls as any other clinical system.

Step 7: Ongoing monitoring and documentation

HIPAA requires you to document your compliance approach. Create and maintain:

De-identification methodology document: How you de-identified the training data (Safe Harbor rules applied, or Expert Determination approach)
Expert Determination report (if applicable): The expert's written assessment of re-identification risk
Data handling procedures: How raw PHI is ingested, where it's stored, who can access it, how long it's retained
Access logs: Proof that you're monitoring who touches sensitive data
Breach response plan: What you'll do if PHI is compromised (HIPAA requires notifying affected individuals without unreasonable delay, and no later than 60 days after the breach is discovered)
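The 60-day outer limit for individual notification is simple date arithmetic that a breach response runbook can encode. This is a minimal sketch with hypothetical function names; it computes only the latest permissible date, counted from discovery, and does not model the separate expectation to notify without unreasonable delay.

```python
from datetime import date, timedelta

NOTIFICATION_WINDOW = timedelta(days=60)


def notification_deadline(discovered: date) -> date:
    """Latest date for notifying affected individuals, counted from discovery."""
    return discovered + NOTIFICATION_WINDOW


def days_remaining(discovered: date, today: date) -> int:
    """Days left before the deadline; negative once it has passed."""
    return (notification_deadline(discovered) - today).days
```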

These documents won't prevent violations—but they demonstrate you took reasonable steps to comply, which matters in enforcement actions.

Common pitfalls and how to avoid them

Pitfall 1: Assuming cloud AI services handle HIPAA compliance

Major cloud providers (AWS, Azure, Google Cloud) offer HIPAA-eligible services, but those services only support compliance if you sign a BAA with the provider and configure them correctly. HIPAA compliance is a shared responsibility:

The cloud provider secures the infrastructure
You secure your data and configuration

Many teams upload PHI to a standard S3 bucket or Vertex AI dataset, expecting the cloud provider to handle compliance. The cloud provider can't see what's in your bucket or dataset—compliance is your job.

How to avoid it: Sign a Business Associate Agreement (BAA) with the provider and use only the services it designates as HIPAA-eligible when working with identifiable data. For AWS, this means an S3 bucket covered by your BAA with encryption, restricted access, and access logging enabled, not a default bucket. For Azure, it means HIPAA-eligible services with proper configuration. Or, deploy your de-identification and AI pipeline on-premises or in a private VPC.

Pitfall 2: De-identifying at the wrong stage

Some teams de-identify too aggressively at ingestion (blanket redaction of every date, name, and location), losing valuable data. Others de-identify too late (after training), exposing PHI throughout development.

The right approach:

Accept raw PHI only in a restricted environment
De-identify immediately—before sharing with development teams
Share de-identified data freely

De-identification is not something to put off. Do it as early as possible.

Pitfall 3: Not considering future re-identification risk

Safe Harbor removes 18 identifiers, but that doesn't prevent re-identification. A dataset with age 45, female, diagnosed with rare cancer in rural Montana might be re-identifiable through linkage with public databases. Safe Harbor assumes you'll use the de-identified data in isolation—if you plan to link it with other databases or publish it, re-identification risk increases.

Expert Determination explicitly accounts for this. The expert assesses the risk of re-identification given the data's intended use.

How to avoid it: If you're building on clinical data, assume Expert Determination is needed. Document the intended use of the data (training a model, publishing research, etc.) and let the expert assess risk accordingly.
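Linkage risk of this kind is often screened by counting how many records share each combination of quasi-identifiers. This is a minimal sketch of that k-anonymity style check: the column names and the choice of k are assumptions, and it is a screening aid for the expert's analysis, not a substitute for it.

```python
from collections import Counter


def group_sizes(rows, quasi_identifiers):
    """Count records for each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def rows_below_k(rows, quasi_identifiers, k):
    """Return records whose quasi-identifier combination is shared by fewer than k rows."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [row for row in rows
            if sizes[tuple(row[q] for q in quasi_identifiers)] < k]


# Example: one record is unique on (age band, sex, state) and gets flagged.
rows = [
    {"age_band": "40-49", "sex": "F", "state": "MT"},
    {"age_band": "40-49", "sex": "F", "state": "MT"},
    {"age_band": "40-49", "sex": "F", "state": "WY"},
]
risky = rows_below_k(rows, ["age_band", "sex", "state"], k=2)
```

Flagged records are candidates for generalization (wider age bands, larger regions) or suppression before the data leaves the restricted environment.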

Pitfall 4: Confusing de-identification with anonymization

HIPAA defines de-identification; it does not define anonymization. The two terms are often used interchangeably, but they set different bars:

De-identified data: Information that meets Safe Harbor or Expert Determination. It falls outside HIPAA's scope, but it could theoretically be re-identified with additional effort or data
Anonymized data: Information that cannot be re-identified, even with external data sources. This is the stricter standard associated with other privacy regimes such as GDPR

Safe Harbor and Expert Determination both produce de-identified data, not anonymized data. Expert Determination requires that re-identification risk be very small, not that it be impossible. If de-identified data is re-linked to identifiers, it becomes PHI again and HIPAA applies in full.

For AI purposes, this distinction matters: if your training data is de-identified (not anonymized), you must still document the de-identification, maintain it in your compliance records, and guard against re-linkage.

Pitfall 5: Focusing only on data, ignoring the model

The industry tends to focus on de-identifying training data and then forget that models can leak information. A language model fine-tuned on de-identified clinical notes can, under certain conditions, regurgitate training data—including elements the de-identification process didn't catch.

Mitigation strategies:

Use smaller, more specific models rather than large general-purpose models when possible
Add differential privacy to training to provide mathematical privacy guarantees
Avoid publicly deploying models trained on healthcare data without additional safeguards
For generative models, be cautious—fine-tuning on sensitive data carries higher re-identification risk
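The core step of differentially private training (the DP-SGD approach) is to clip each example's gradient and add calibrated noise before averaging. This is a minimal sketch in plain Python: the function names and parameters are illustrative, gradients are plain lists of floats, and the privacy accounting that turns a noise multiplier into an actual privacy budget is omitted. A production system would use an established library instead.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Example: two per-example gradients, one of which exceeds the clipping bound.
update = private_mean_gradient([[3.0, 4.0], [0.0, 0.5]], max_norm=1.0,
                               noise_multiplier=1.1, rng=random.Random(0))
```

Clipping bounds how much any single patient record can move the model, and the noise hides what remains of that influence; both steps are needed for the guarantee to hold.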

Pitfall 6: Not involving legal and compliance early

De-identification methodology, Safe Harbor vs. Expert Determination, whether your organization counts as a covered entity or a business associate under HIPAA: these are partly legal questions. Many AI teams solve them technically without legal input, then discover months later that the approach doesn't hold up to compliance standards.

How to avoid it: Involve your legal or compliance team during planning, not after. They should sign off on your de-identification approach before you process data. If your organization operates across the US and EU, also review how these requirements differ from GDPR obligations—our HIPAA vs. GDPR comparison breaks down the key practical differences for global organizations.

Practical checklist: Building compliant AI

Use this checklist to validate your compliance approach before going live:

Identified all sources of PHI (clinical notes, EHR exports, audio, etc.)
Determined whether Safe Harbor or Expert Determination applies
If Expert Determination, engaged a qualified expert and documented their methodology
Set up a restricted, encrypted environment for raw PHI ingestion
Implemented de-identification before sharing data with development teams
Validated de-identified data (spot-checked manually, ran through PII detection)
Documented de-identification rules and obtained Expert Determination report (if applicable)
Configured cloud services with HIPAA-compliant controls (encryption, VPC, BAAs if needed)
Prevented data leakage in model training (no raw identifiers as features)
Deployed the model in a controlled environment with access logs
Maintained compliance documentation and audit trails
Established a data retention policy and breach notification plan
Reviewed the approach with legal/compliance before going live

Start building compliant AI

HIPAA compliance for AI isn't a box to check—it's an ongoing responsibility embedded in your data handling, de-identification, model training, and deployment processes. Get it right, and you unlock healthcare data's enormous potential without exposing your organization to breach, penalties, and loss of patient trust.

The roadmap is clear:

De-identify PHI appropriately (Safe Harbor or Expert Determination)
Control access to raw data
Build models on clean, de-identified data
Deploy carefully and audit thoroughly
Document everything

Ready to move forward? Get a demo of Limina's de-identification platform to see how healthcare organizations streamline HIPAA compliance for AI at scale.

For comprehensive coverage of compliance methods and regulations, read our complete HIPAA de-identification guide.

Frequently Asked Questions

What’s the difference between a business associate and a covered entity?

A covered entity is a healthcare provider, health plan, or clearinghouse—regulated directly by HIPAA. A business associate is any organization that processes PHI on behalf of a covered entity. If you’re building AI for a hospital, you’re likely a business associate, which means HIPAA applies to you. The hospital (covered entity) must have a Business Associate Agreement (BAA) with you that specifies how you’ll handle PHI.

Can we use Safe Harbor if the clinical notes contain rare diseases?

Technically yes, if you remove all 18 identifiers. But rare diseases can re-identify patients in combination with other available data. Safe Harbor legally de-identifies the data within HIPAA, but the data may still be re-identifiable in practice. For AI, Expert Determination is more appropriate because the expert assesses real-world re-identification risk specific to your use case.

How long can we keep the raw PHI during de-identification?

Keep it for as long as the de-identification process requires, and document a timeline. Once de-identification is complete, delete the raw PHI unless you have a specific retention reason (such as backup, recovery, or a contractual or legal retention obligation). HIPAA itself doesn’t set a retention period for patient records; state law and your BAA usually do. Don’t keep raw PHI longer than those obligations require.

If we use a cloud de-identification service, who’s responsible for HIPAA compliance?

You are. The service is a vendor and acts as your business associate. You’re responsible for ensuring it’s configured correctly, encrypted, and auditable. Require the vendor to sign a BAA and demonstrate HIPAA compliance. If the vendor stores data in a shared environment, ensure your data is encrypted so that the vendor can’t read it.

Can we add differential privacy to our model instead of de-identifying training data?

Differential privacy provides mathematical privacy guarantees: it bounds how much a trained model can reveal about whether any individual was in the training set. This is powerful, but HIPAA recognizes only Safe Harbor and Expert Determination as de-identification methods, so training with differential privacy on raw PHI leaves that data fully regulated. You can use differential privacy as an additional safeguard, but don’t rely on it alone to meet HIPAA requirements.

What happens if we accidentally expose a model trained on de-identified data?

The de-identification status doesn’t change just because the model is exposed. If the model was trained on properly de-identified data (Safe Harbor or Expert Determination), publicly releasing the model doesn’t violate HIPAA. However, security best practice suggests keeping models confidential anyway—they can be reverse-engineered in some cases. More importantly, if the model inference pipeline processes live patient data, that’s PHI and must be protected.

How often do we need to refresh the Expert Determination report?

If your de-identification methodology and use case don’t change, the report remains valid indefinitely. If you change the data source, the de-identification approach, or how you plan to use the data, obtain a new report. Annual reviews are a best practice even if nothing has changed—they document ongoing diligence.