May 28, 2026

HIPAA Safe Harbor Method: A Complete Step-by-Step Guide

The HIPAA Safe Harbor method is one of two de-identification standards established under 45 CFR §164.514(b) of the HIPAA Privacy Rule. It requires the removal of 18 specific categories of identifiers from Protected Health Information (PHI)—combined with a verification that the covered entity has no actual knowledge that the remaining data could identify an individual. Data that satisfies both conditions is legally de-identified and no longer subject to the Privacy Rule.

Limina

A single unmasked zip code or leftover admission date can be enough to re-identify a patient from a dataset that looks, on its surface, perfectly clean. Privacy researcher Latanya Sweeney estimated that 87 percent of Americans could be uniquely identified using just three data points: five-digit zip code, full date of birth and sex. For compliance teams and data engineers working with health data, the HIPAA Safe Harbor method provides a concrete, regulator-defined standard for what de-identification actually requires.

This guide covers every element of the Safe Harbor standard: the full 18-identifier list, the special rules for geographic data and dates, the "no actual knowledge" requirement, a step-by-step process for applying it and a side-by-side comparison with Expert Determination to help you choose the right approach for your use case.

Who the Safe Harbor method applies to

The Safe Harbor method applies to covered entities and their business associates that handle PHI. A covered entity is a health plan, healthcare clearinghouse or healthcare provider that transmits health information electronically. Business associates—organizations that process PHI on behalf of covered entities—are also bound by these requirements under the HIPAA Omnibus Rule.

If you're building AI models, conducting research, sharing data with analytics partners or enabling secondary use of health records, you need a legally defensible de-identification method before any PHI leaves your controlled environment. The HIPAA Safe Harbor method is the more prescriptive of the two available options: follow the 18-identifier checklist, confirm no actual knowledge of re-identification risk and the data is de-identified by regulatory definition.

The 18 identifier categories you must remove

Under 45 CFR §164.514(b)(2), the following identifier categories must be removed from PHI before data can be considered de-identified under the Safe Harbor standard. This list is drawn directly from HHS guidance on de-identification.

#	Identifier category	Common examples
1	Names	First, last and maiden names; initials linked to other data
2	Geographic subdivisions smaller than state	Street address, city, county, zip code (exceptions apply—see below)
3	Dates related to an individual (except year)	Birth date, admission date, discharge date, date of death
4	Telephone numbers	Mobile, home and office numbers
5	Fax numbers	All fax numbers
6	Email addresses	Personal or work email
7	Social Security numbers	Full or partial SSNs
8	Medical record numbers	EHR record IDs
9	Health plan beneficiary numbers	Insurance member IDs
10	Account numbers	Financial or provider account IDs
11	Certificate and license numbers	Medical licenses, DEA registration numbers
12	Vehicle identifiers and serial numbers	License plate numbers, VINs
13	Device identifiers and serial numbers	Pacemaker IDs, implant serial numbers
14	Web URLs	Any personal web address
15	IP addresses	Static or dynamic IPs associated with an individual
16	Biometric identifiers	Fingerprints, voiceprints, retinal scans
17	Full-face photographs and comparable images	Patient photos, facial scans
18	Any other unique identifying number, characteristic or code	Custom patient codes or any field not in categories 1–17

A note on Identifier 18: This catch-all clause requires judgment. Any field—even one that doesn't appear in categories 1 through 17—must be removed or generalized if it could uniquely identify an individual. Domain knowledge and contextual review are essential; automated tooling alone is not sufficient to catch every instance.

Special rules for geographic data

Geographic data is one of the most operationally nuanced categories in the Safe Harbor standard. The rule does not require removal of state-level geography—but it does require removal of all geographic subdivisions smaller than a state, unless a specific exception applies.

The three-digit zip code exception

You may retain the first three digits of a zip code if the geographic unit formed by all zip codes sharing that three-digit prefix contains more than 20,000 people, according to current U.S. Census Bureau data. If that three-digit region has a population of 20,000 or fewer, you must replace the first three digits with 000.

In practice:

Most major metropolitan zip codes can retain their first three digits.
Rural and low-population zip codes often cannot—the prefix must become 000.
Population thresholds must be verified against current Census data, not legacy figures or estimates.

This distinction carries real compliance weight for healthcare organizations serving rural populations, where even partial geographic data can narrow the field to a handful of individuals.
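
The three-digit rule above can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the `PREFIX_POPULATION` table, its figures and the `generalize_zip` function are hypothetical, and a real pipeline would load prefix populations from current Census Bureau data.

```python
# Hypothetical population lookup keyed by three-digit zip prefix.
# The figures are made-up placeholders, NOT Census data.
PREFIX_POPULATION = {
    "100": 1_500_000,
    "036": 18_000,
}

THRESHOLD = 20_000  # a prefix is kept only if its population exceeds this


def generalize_zip(zip_code: str, populations: dict) -> str:
    """Return the three-digit prefix if the exception applies, else "000"."""
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) < 5:
        return "000"  # malformed input: suppress rather than guess
    prefix = digits[:3]
    # Unknown prefixes are treated as low-population (conservative default).
    if populations.get(prefix, 0) > THRESHOLD:
        return prefix
    return "000"


print(generalize_zip("10001", PREFIX_POPULATION))       # kept as a prefix
print(generalize_zip("03601-1234", PREFIX_POPULATION))  # suppressed
```

Defaulting unknown prefixes to 000 is a design choice in this sketch: it errs toward suppression when the population cannot be verified.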

Special rules for dates and ages

Dates are a high-risk category and a frequent source of inadvertent compliance failures. Under Safe Harbor, you must remove all elements of dates—except the year—that are directly related to an individual. This includes birth dates, admission dates, discharge dates and dates of death.

Age over 89

If a patient is over 89 years old, you cannot retain their year of birth or any dates that would reveal their precise age. Instead, that individual must be placed into a single aggregate category: age 90 or older. The regulation requires this because the oldest patients in any dataset are among the most re-identifiable—there are simply fewer of them, and their ages combined with other retained fields can make them individually distinguishable.

One practical implication: if you're building a dataset for longitudinal research, your data pipeline must apply age aggregation consistently across all records, not just remove explicit date fields from a structured schema.
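
As a rough sketch of the date and age rules, the Python below reduces a birth date to its year and aggregates ages over 89. The function names and the choice of a reference date (`as_of`) are assumptions made for illustration; how you define the reference date for computing age is a decision your own process needs to document.

```python
from datetime import date


def year_only(d: date) -> int:
    """Drop month and day; Safe Harbor allows the year to be retained."""
    return d.year


def age_on(birth: date, as_of: date) -> int:
    """Age in completed years on the reference date."""
    had_birthday = (as_of.month, as_of.day) >= (birth.month, birth.day)
    return as_of.year - birth.year - (0 if had_birthday else 1)


def generalize_birth(birth: date, as_of: date) -> dict:
    """Return birth year and an age label, aggregating ages over 89."""
    age = age_on(birth, as_of)
    if age > 89:
        # The birth year would reveal an age over 89, so it is suppressed too.
        return {"birth_year": None, "age": "90 or older"}
    return {"birth_year": year_only(birth), "age": str(age)}


print(generalize_birth(date(1930, 5, 1), date(2024, 1, 1)))
print(generalize_birth(date(1956, 3, 10), date(2024, 3, 9)))
```

This only covers structured date fields. Dates and ages written into free text need separate detection.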

The "no actual knowledge" requirement

Removing the 18 identifiers is necessary but not always sufficient. The regulation also requires that the covered entity—or its business associate—has no actual knowledge that the remaining information could be used, alone or in combination with other information, to identify an individual.

This is not a statistical re-identification risk test. It's a practical, human-level confirmation: based on what you and your team know about this dataset, do you have reason to believe it could still identify someone?

If a researcher at your organization knows that a specific patient's rare condition makes them uniquely identifiable even after all 18 identifiers are removed, Safe Harbor is not satisfied—regardless of how thoroughly the structured fields have been processed.

The "no actual knowledge" requirement puts an obligation on the people closest to the data, not just the systems processing it. It must be part of your formal sign-off process, not an afterthought.

How to apply the HIPAA Safe Harbor method: step by step

The following process applies to both structured and unstructured health data. For unstructured sources—clinical notes, call transcripts, EHR free-text fields, discharge summaries—you'll need automated tools capable of detecting identifiers in natural language, not just labeled database columns.

Inventory your data sources. Map all systems and file types that contain PHI. This includes structured databases (EHR tables, claims files, insurance records), unstructured documents (clinical notes, referral letters, emails) and audio or image files. You cannot de-identify what you haven't located.
Classify identifiers by source. For each data source, identify which of the 18 categories are present. Don't assume structured field names tell the full story—free-text fields routinely contain names, dates, addresses and account numbers embedded in narrative text.
Apply geographic data rules. For zip codes, verify population thresholds using current Census Bureau data. Retain first three digits only where the exception permits; replace with 000 where it does not. Document your source data for each threshold decision.
Apply date rules. Remove all date elements except year. Identify all individuals over 89 and replace their birth year and any indicative dates with the "age 90 or older" aggregate label. Apply this consistently across all records, including free-text fields.
Handle the catch-all (Identifier 18). Review all custom codes, identifiers or data fields not captured in categories 1 through 17. Remove or generalize anything that could uniquely identify an individual, even if it doesn't resemble a conventional identifier.
Apply re-identification codes only if necessary. If you need to re-link records internally, you may assign a code under 45 CFR §164.514(c), but only if three conditions hold: the code is not derived from or related to the individual's information and cannot otherwise be translated to identify the individual; you do not use or disclose the code for any other purpose; and you do not disclose the mechanism for re-identification.
Confirm no actual knowledge. Have the team members most familiar with the dataset review the de-identified output and confirm that no remaining information—alone or in combination—could identify an individual based on what they know. Document who performed this review and when.
Document the entire process. Maintain records of your methodology, the data sources and fields processed, the tools applied, the date of review and the individuals who confirmed the "no actual knowledge" requirement. This is your audit trail if your process is ever reviewed by HHS or examined during a breach investigation.
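
For the re-identification code step above, one possible approach is to assign each record a random token and keep the crosswalk internal. The `ReidentificationCodebook` class below is a hypothetical in-memory sketch; a real system would need durable, access-controlled storage for the mapping. The point it illustrates is that the code comes from a random generator rather than from a hash or transformation of the medical record number, SSN or any other field.

```python
import secrets
from typing import Optional


class ReidentificationCodebook:
    """Hypothetical in-memory crosswalk between record IDs and random codes."""

    def __init__(self) -> None:
        self._code_by_record = {}

    def code_for(self, record_id: str) -> str:
        """Return a stable random code for a record, creating one if needed."""
        if record_id not in self._code_by_record:
            # Random token: not computed from any field of the record.
            self._code_by_record[record_id] = secrets.token_hex(16)
        return self._code_by_record[record_id]

    def record_for(self, code: str) -> Optional[str]:
        """Reverse lookup; only the holder of the codebook can do this."""
        for record_id, existing in self._code_by_record.items():
            if existing == code:
                return record_id
        return None


codebook = ReidentificationCodebook()
code = codebook.code_for("internal-record-001")
print(len(code), codebook.record_for(code))
```

The codebook itself is the re-identification mechanism, so in this sketch it stays inside the organization and is never shipped with the de-identified dataset.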

Safe Harbor vs. Expert Determination: which method fits your use case?

Both methods produce legally de-identified data under the HIPAA Privacy Rule, but they differ significantly in process, flexibility and the kinds of datasets they serve best. Expert Determination—the alternative to Safe Harbor—applies statistical modeling to assess re-identification risk, giving organizations more flexibility when datasets require retaining granular geographic or temporal data that Safe Harbor would otherwise require removing.

	Safe Harbor	Expert Determination
Legal basis	45 CFR §164.514(b)(2)	45 CFR §164.514(b)(1)
Approach	Remove 18 specific identifier categories	Statistical analysis by a qualified expert
Flexibility	Prescriptive—fewer discretionary decisions	Flexible—more data can be retained if risk is demonstrably low
Documentation	Process documentation and sign-off	Formal expert determination report
Relative cost	Lower—can be systematized at scale	Higher—requires qualified expert engagement
Best for	Standard data sharing, analytics, AI training datasets	Rare conditions, small populations, research requiring geographic or temporal granularity
HHS standing	Satisfies de-identification by definition if applied correctly	Satisfies de-identification when performed by a qualified expert

If your dataset contains rare diagnoses or small patient subgroups, or if you need to retain granular geographic or date information for research validity, Expert Determination may allow you to preserve more data while still achieving compliance. For most standard operational and analytics use cases, Safe Harbor is faster to implement and easier to scale.
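
A simple screening pass can help flag datasets where small subgroups are a concern. The sketch below counts how many records share each combination of retained field values and reports the combinations that fall under a chosen size. The `small_groups` function, the field names and the `min_size` value are all illustrative assumptions; this is a heuristic for deciding when to look closer, not a substitute for Expert Determination and not a threshold defined by the regulation.

```python
from collections import Counter


def small_groups(records, fields, min_size):
    """Return value combinations shared by fewer than min_size records."""
    counts = Counter(tuple(r.get(f) for f in fields) for r in records)
    return {combo: n for combo, n in counts.items() if n < min_size}


# Made-up records with fields that Safe Harbor would allow you to retain.
records = [
    {"zip3": "100", "year": 2021, "dx": "asthma"},
    {"zip3": "100", "year": 2021, "dx": "asthma"},
    {"zip3": "100", "year": 2021, "dx": "asthma"},
    {"zip3": "100", "year": 2021, "dx": "rare-condition-x"},
]

print(small_groups(records, ["zip3", "dx"], min_size=2))
```

Any combination that appears in the result is a candidate for the kind of review the "no actual knowledge" check and Expert Determination are meant to provide.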

Common mistakes that compromise Safe Harbor compliance

Knowing the 18 identifiers doesn't prevent every compliance failure. These are the errors most likely to undermine an otherwise well-structured de-identification process.

Overlooking unstructured data. PHI doesn't only live in labeled database columns. Names, dates and medical record numbers appear constantly in clinical notes, referral letters, discharge summaries and support transcripts. Structured database de-identification tools miss these entirely.
Misapplying the zip code rule. Retaining a three-digit zip code without verifying the current Census population threshold is a common shortcut—and one that may not survive regulatory scrutiny.
Leaving date elements in free text. Removing a date-of-birth column from a database doesn't help if a clinical note reads "patient presented on their 68th birthday." Automated NLP-based detection is essential for unstructured sources.
Skipping the "no actual knowledge" check. The 18-identifier removal is process-level compliance. The actual knowledge check is human-level compliance. Both are required—and the second is the one most organizations skip.
Using re-identification codes derived from PHI. If your internal code is based on the patient's Social Security number, medical record number or any other identifier, it fails the derivation condition in 45 CFR §164.514(c) and the dataset is not de-identified, regardless of how obscured the code appears.
Failing to document the process. Without written records of your methodology, tool choices and sign-off, you have no defensible audit trail—even if your de-identification was technically correct.
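
To make the free-text problem concrete, here is a minimal pattern-matching sketch. The patterns and the `find_candidates` function are illustrative assumptions and cover only a few well-formatted identifier types. The second example shows the limit of this approach: a phrase like "68th birthday" matches no pattern, which is why the list above calls for NLP-based detection on unstructured sources.

```python
import re

# Illustrative patterns only; real notes contain many more formats.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def find_candidates(text: str) -> list:
    """Return (label, matched_text) pairs for every pattern hit in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


note = "Seen 03/14/2023. Call 555-867-5309 or jane.doe@example.org. SSN 123-45-6789."
print(find_candidates(note))

# No pattern fires here, even though the sentence reveals a birth date.
print(find_candidates("patient presented on their 68th birthday"))
```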

Ready to de-identify PHI at scale?

Applying the HIPAA Safe Harbor method manually is error-prone and difficult to maintain as data volumes grow—especially when PHI is embedded in clinical notes, call transcripts, EHR free-text and other unstructured sources. Limina's de-identification platform identifies and removes all 18 Safe Harbor identifier categories across structured and unstructured data.

Frequently Asked Questions

Does removing the 18 identifiers automatically make data HIPAA-compliant for sharing?

Removing the 18 identifiers satisfies the technical component of Safe Harbor de-identification, but not the full standard. The regulation also requires that the covered entity has no actual knowledge that the remaining information could be used to identify an individual. Both conditions must be met. Once they are, the data is no longer PHI under the Privacy Rule and can be shared or used without HIPAA restrictions applying to that data specifically.

Can you use Safe Harbor for AI and machine learning?

Yes. De-identified data produced under the HIPAA Safe Harbor method can be used for AI model training, analytics and secondary research without triggering HIPAA restrictions, because the data is no longer PHI. However, the accuracy and completeness of de-identification matter enormously for AI use cases. Tools that miss identifiers in unstructured text can leave PHI in what appears to be a clean training dataset, creating both a compliance failure and a model integrity problem that is difficult to detect after the fact.

What’s the difference between Safe Harbor and anonymization under GDPR?

The HIPAA Safe Harbor standard and GDPR anonymization are distinct legal tests that do not automatically satisfy each other. GDPR’s anonymization standard—which exempts data from GDPR obligations entirely—requires that re-identification be reasonably unlikely given all available means, including third-party data. HIPAA’s Safe Harbor is a prescriptive checklist. An organization operating in both the US and EU must assess its data against both frameworks independently.

Is the Safe Harbor method suitable for rare conditions or small patient populations?

It can be problematic in these cases. For datasets containing rare diagnoses or very small patient subgroups, the combination of even non-obvious retained fields can make individuals re-identifiable after Safe Harbor removal. In these situations, Expert Determination—which applies statistical re-identification risk modeling—is typically more appropriate. A qualified expert can assess risk against the specific characteristics of your dataset and produce a formal report documenting the methodology and findings.

Do business associates need to apply Safe Harbor, or only covered entities?

Both covered entities and their business associates are subject to HIPAA’s de-identification requirements when handling PHI. If a business associate is sharing or using PHI beyond what is required to perform its contracted services, that data must be properly de-identified first. The same two Safe Harbor conditions apply regardless of which entity performs the de-identification.

What documentation should you keep after applying Safe Harbor?

You should document the specific methodology applied, the data sources and fields processed, the tools or procedures used for each identifier category, the date the process was completed and the individuals who confirmed the “no actual knowledge” requirement. This documentation is your audit trail if your de-identification process is ever reviewed by HHS, questioned by a business partner or examined during a breach investigation.

How does Safe Harbor relate to broader de-identification approaches like pseudonymization?

Safe Harbor is a prescriptive removal standard: it focuses on eliminating or suppressing identifiers. Pseudonymization, by contrast, replaces identifiers with substitutes that can be reversed with a key. Under HIPAA, pseudonymized data is not de-identified if the recipient can use the key to re-identify individuals. Safe Harbor permits a re-identification code only when the code is not derived from the individual's information and the mechanism for re-identification is not disclosed (45 CFR §164.514(c)).
