HIPAA Safe Harbor Method: A Complete Step-by-Step Guide
A single unmasked zip code or leftover admission date can be enough to re-identify a patient from a dataset that looks, on its surface, perfectly clean. Privacy researcher Latanya Sweeney estimated that 87 percent of the U.S. population could be uniquely identified from just three data points: five-digit zip code, birth date and sex. For compliance teams and data engineers working with health data, the HIPAA Safe Harbor method provides a concrete, regulator-defined standard for what de-identification actually requires.
This guide covers every element of the Safe Harbor standard: the full 18-identifier list, the special rules for geographic data and dates, the "no actual knowledge" requirement, a step-by-step process for applying it and a side-by-side comparison with Expert Determination to help you choose the right approach for your use case.
Who the Safe Harbor method applies to
The Safe Harbor method applies to covered entities and their business associates that handle PHI. A covered entity is a health plan, healthcare clearinghouse or healthcare provider that transmits health information electronically. Business associates—organizations that process PHI on behalf of covered entities—are also bound by these requirements under the HIPAA Omnibus Rule.
If you're building AI models, conducting research, sharing data with analytics partners or enabling secondary use of health records, you need a legally defensible de-identification method before any PHI leaves your controlled environment. The HIPAA Safe Harbor method is the more prescriptive of the two available options: remove the 18 identifier categories, confirm you have no actual knowledge that the remaining information could identify an individual, and the data is de-identified by regulatory definition.
The 18 identifier categories you must remove
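Under 45 CFR §164.514(b)(2), the following identifier categories must be removed from PHI before data can be considered de-identified under the Safe Harbor standard. The requirement covers identifiers of the individual and of the individual's relatives, employers and household members. This list is drawn directly from HHS guidance on de-identification.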
| # | Identifier category | Common examples |
| --- | --- | --- |
| 1 | Names | First, last and maiden names; initials linked to other data |
| 2 | Geographic subdivisions smaller than state | Street address, city, county, zip code (exceptions apply; see below) |
| 3 | Dates related to an individual (except year) | Birth date, admission date, discharge date, date of death |
| 4 | Telephone numbers | Mobile, home and office numbers |
| 5 | Fax numbers | All fax numbers |
| 6 | Email addresses | Personal or work email |
| 7 | Social Security numbers | Full or partial SSNs |
| 8 | Medical record numbers | EHR record IDs |
| 9 | Health plan beneficiary numbers | Insurance member IDs |
| 10 | Account numbers | Financial or provider account IDs |
| 11 | Certificate and license numbers | Medical licenses, DEA registration numbers |
| 12 | Vehicle identifiers and serial numbers | License plate numbers, VINs |
| 13 | Device identifiers and serial numbers | Pacemaker IDs, implant serial numbers |
| 14 | Web URLs | Any personal web address |
| 15 | IP addresses | Static or dynamic IPs associated with an individual |
| 16 | Biometric identifiers | Fingerprints, voiceprints, retinal scans |
| 17 | Full-face photographs and comparable images | Patient photos, facial scans |
| 18 | Any other unique identifying number, characteristic or code | Custom patient codes or any field not in categories 1–17 |
A note on Identifier 18: This catch-all clause requires judgment. Any field—even one that doesn't appear in categories 1 through 17—must be removed or generalized if it could uniquely identify an individual. Domain knowledge and contextual review are essential; automated tooling alone is not sufficient to catch every instance.
Special rules for geographic data
Geographic data is one of the most operationally nuanced categories in the Safe Harbor standard. The rule does not require removal of state-level geography—but it does require removal of all geographic subdivisions smaller than a state, unless a specific exception applies.
The three-digit zip code exception
You may retain the first three digits of a zip code if the geographic unit formed by all zip codes sharing that three-digit prefix contains more than 20,000 people, according to current U.S. Census Bureau data. If that three-digit region has a population of 20,000 or fewer, you must replace the first three digits with 000.
In practice:
- Most major metropolitan zip codes can retain their first three digits.
- Rural and low-population zip codes often cannot—the prefix must become 000.
- Population thresholds must be verified against current Census data, not legacy figures or estimates.
This distinction carries real compliance weight for healthcare organizations serving rural populations, where even partial geographic data can narrow the field to a handful of individuals.
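The zip code rule can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the function name `generalize_zip` and the lookup table `EXAMPLE_PREFIX_POPULATION` are hypothetical, and the population figures are invented placeholders that you would replace with current Census Bureau data.

```python
from typing import Mapping

# Hypothetical lookup: three-digit prefix -> population of the combined area.
# These numbers are made-up placeholders, NOT real Census figures.
EXAMPLE_PREFIX_POPULATION = {"100": 1_500_000, "036": 18_000}

POPULATION_THRESHOLD = 20_000


def generalize_zip(zip_code: str, prefix_population: Mapping[str, int]) -> str:
    """Return the three-digit prefix if its area has more than 20,000 people, else "000"."""
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) < 5:
        # Malformed or missing zip code: fall back to the most restrictive value.
        return "000"
    prefix = digits[:3]
    # Prefixes missing from the lookup count as population 0, so they are suppressed too.
    if prefix_population.get(prefix, 0) > POPULATION_THRESHOLD:
        return prefix
    return "000"


print(generalize_zip("10001-1234", EXAMPLE_PREFIX_POPULATION))
print(generalize_zip("03601", EXAMPLE_PREFIX_POPULATION))
```

Defaulting unknown or malformed values to 000 is a deliberately conservative design choice: a gap in the lookup table suppresses data rather than leaking it.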
Special rules for dates and ages
Dates are a high-risk category and a frequent source of inadvertent compliance failures. Under Safe Harbor, you must remove all elements of dates—except the year—that are directly related to an individual. This includes birth dates, admission dates, discharge dates and dates of death.
Age over 89
If a patient is over 89 years old, you cannot retain their year of birth or any dates that would reveal their precise age. Instead, that individual must be placed into a single aggregate category: age 90 or older. The regulation requires this because the oldest patients in any dataset are among the most re-identifiable—there are simply fewer of them, and their ages combined with other retained fields can make them individually distinguishable.
One practical implication: if you're building a dataset for longitudinal research, your data pipeline must apply age aggregation consistently across all records, not just remove explicit date fields from a structured schema.
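As a rough sketch of how the date and age rules might look in a pipeline, the Python below reduces dates to a year and collapses ages of 90 and above into one category. The function names, the `"90+"` label and the `as_of` reference date are assumptions made for illustration; your pipeline will need its own definition of the date on which age is measured.

```python
from datetime import date
from typing import Optional

AGE_CAP = 90
AGE_CAP_LABEL = "90+"  # hypothetical label for the "age 90 or older" category


def age_on(birth_date: date, as_of: date) -> int:
    """Age in completed years on the reference date."""
    years = as_of.year - birth_date.year
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1  # birthday has not happened yet in the reference year
    return years


def safe_harbor_age(birth_date: date, as_of: date) -> str:
    """Exact age for patients 89 and under, the aggregate label otherwise."""
    age = age_on(birth_date, as_of)
    return AGE_CAP_LABEL if age >= AGE_CAP else str(age)


def safe_harbor_birth_year(birth_date: date, as_of: date) -> Optional[int]:
    """Birth year, or None when the year would reveal an age over 89."""
    if age_on(birth_date, as_of) >= AGE_CAP:
        return None
    return birth_date.year


def year_only(event_date: date) -> int:
    """Keep only the year of an admission, discharge or similar date."""
    return event_date.year
```

Note that this only covers structured date fields. Ages and dates written into narrative text need separate detection.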
The "no actual knowledge" requirement
Removing the 18 identifiers is necessary but not always sufficient. The regulation also requires that the covered entity—or its business associate—has no actual knowledge that the remaining information could be used, alone or in combination with other information, to identify an individual.
This is not a statistical re-identification risk test. It's a practical, human-level confirmation: based on what you and your team know about this dataset, do you have reason to believe it could still identify someone?
If a researcher at your organization knows that a specific patient's rare condition makes them uniquely identifiable even after all 18 identifiers are removed, Safe Harbor is not satisfied—regardless of how thoroughly the structured fields have been processed.
The "no actual knowledge" requirement puts an obligation on the people closest to the data, not just the systems processing it. It must be part of your formal sign-off process, not an afterthought.
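One way to make that sign-off hard to skip is to record it as data and gate the release on it. The sketch below is only an illustration of that idea: the `Attestation` fields and the `release_allowed` rule are hypothetical, not something the regulation prescribes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence


@dataclass(frozen=True)
class Attestation:
    """A reviewer's statement about one de-identified dataset (hypothetical schema)."""
    dataset_id: str
    reviewer: str
    reviewed_on: date
    no_actual_knowledge: bool  # True = reviewer knows of no way to identify anyone
    notes: str = ""


def release_allowed(dataset_id: str, attestations: Sequence[Attestation]) -> bool:
    """Allow release only if at least one review exists and every review is clean."""
    relevant = [a for a in attestations if a.dataset_id == dataset_id]
    return bool(relevant) and all(a.no_actual_knowledge for a in relevant)
```

A single reviewer who flags a concern blocks the release under this rule, which mirrors the point above: one person's actual knowledge is enough to defeat Safe Harbor.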
How to apply the HIPAA Safe Harbor method: step by step
The following process applies to both structured and unstructured health data. For unstructured sources—clinical notes, call transcripts, EHR free-text fields, discharge summaries—you'll need automated tools capable of detecting identifiers in natural language, not just labeled database columns.
- Inventory your data sources. Map all systems and file types that contain PHI. This includes structured databases (EHR tables, claims files, insurance records), unstructured documents (clinical notes, referral letters, emails) and audio or image files. You cannot de-identify what you haven't located.
- Classify identifiers by source. For each data source, identify which of the 18 categories are present. Don't assume structured field names tell the full story—free-text fields routinely contain names, dates, addresses and account numbers embedded in narrative text.
- Apply geographic data rules. For zip codes, verify population thresholds using current Census Bureau data. Retain first three digits only where the exception permits; replace with 000 where it does not. Document your source data for each threshold decision.
- Apply date rules. Remove all date elements except year. Identify all individuals over 89 and replace their birth year and any indicative dates with the "age 90 or older" aggregate label. Apply this consistently across all records, including free-text fields.
- Handle the catch-all (Identifier 18). Review all custom codes, identifiers or data fields not captured in categories 1 through 17. Remove or generalize anything that could uniquely identify an individual, even if it doesn't resemble a conventional identifier.
- Apply re-identification codes only if necessary. If you need to re-link records later, you may assign a code under 45 CFR §164.514(c), but only if the code is not derived from or related to information about the individual, cannot otherwise be translated to identify the individual, is not used or disclosed for any other purpose, and the mechanism for re-identification is not disclosed.
- Confirm no actual knowledge. Have the team members most familiar with the dataset review the de-identified output and confirm that no remaining information—alone or in combination—could identify an individual based on what they know. Document who performed this review and when.
- Document the entire process. Maintain records of your methodology, the data sources and fields processed, the tools applied, the date of review and the individuals who confirmed the actual knowledge requirement. This is your audit trail if your process is ever reviewed by HHS or examined during a breach investigation.
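The inventory and classification steps above depend on finding identifiers inside free text. The Python sketch below shows the simplest possible version, a handful of regular expressions for pattern-like identifiers. The patterns and the `scan_text` helper are illustrative assumptions: they will produce false positives (a dotted version number looks like an IP address) and cannot find names, addresses or context-dependent identifiers, which is why the article points to NLP-based tooling for real workloads.

```python
import re
from typing import Dict, List

# Illustrative patterns only; each covers one common U.S.-style format.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "url": re.compile(r"https?://\S+"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def scan_text(text: str) -> Dict[str, List[str]]:
    """Return the matches found for each pattern, omitting patterns with no hits."""
    hits: Dict[str, List[str]] = {}
    for label, pattern in PATTERNS.items():
        found = [match.group(0) for match in pattern.finditer(text)]
        if found:
            hits[label] = found
    return hits


note = "Pt called from 603-555-0142 on 3/14/2021; email jane.doe@example.org, SSN 123-45-6789."
for label, values in scan_text(note).items():
    print(label, values)
```

Running a scan like this per data source gives you a first-pass map of which categories appear where, which you can then refine with manual review.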
Safe Harbor vs. Expert Determination: which method fits your use case?
Both methods produce legally de-identified data under the HIPAA Privacy Rule, but they differ significantly in process, flexibility and the kinds of datasets they serve best. Expert Determination—the alternative to Safe Harbor—applies statistical modeling to assess re-identification risk, giving organizations more flexibility when datasets require retaining granular geographic or temporal data that Safe Harbor would otherwise require removing.
| | Safe Harbor | Expert Determination |
| --- | --- | --- |
| Legal basis | 45 CFR §164.514(b)(2) | 45 CFR §164.514(b)(1) |
| Approach | Remove 18 specific identifier categories | Statistical analysis by a qualified expert |
| Flexibility | Prescriptive: fewer discretionary decisions | Flexible: more data can be retained if risk is demonstrably low |
| Documentation | Process documentation and sign-off | Formal expert determination report |
| Relative cost | Lower: can be systematized at scale | Higher: requires qualified expert engagement |
| Best for | Standard data sharing, analytics, AI training datasets | Rare conditions, small populations, research requiring geographic or temporal granularity |
| HHS standing | Satisfies de-identification by definition if applied correctly | Satisfies de-identification when performed by a qualified expert |
If your dataset contains rare diagnoses, small patient subgroups or you need to retain granular geographic or date information for research validity, Expert Determination may allow you to preserve more data while still achieving compliance. For most standard operational and analytics use cases, Safe Harbor is faster to implement and easier to scale.
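A quick way to spot the small subgroups mentioned above is to count how many records share each combination of retained fields. The Python sketch below does only that. It is a rough screening aid under stated assumptions, not an Expert Determination and not part of the Safe Harbor rule; the field names and the minimum group size are arbitrary examples.

```python
from collections import Counter
from typing import Any, Dict, Mapping, Sequence, Tuple


def group_sizes(records: Sequence[Mapping[str, Any]], fields: Sequence[str]) -> Counter:
    """Count records per combination of values in the chosen fields."""
    return Counter(tuple(record.get(field) for field in fields) for record in records)


def small_groups(
    records: Sequence[Mapping[str, Any]], fields: Sequence[str], min_size: int
) -> Dict[Tuple[Any, ...], int]:
    """Return the value combinations shared by fewer than min_size records."""
    return {
        combo: count
        for combo, count in group_sizes(records, fields).items()
        if count < min_size
    }


# Hypothetical de-identified rows.
rows = [
    {"zip3": "100", "birth_year": 1980, "dx": "A"},
    {"zip3": "100", "birth_year": 1980, "dx": "A"},
    {"zip3": "000", "birth_year": 1950, "dx": "B"},
]
print(small_groups(rows, ("zip3", "birth_year", "dx"), min_size=2))
```

If many combinations come back with only one or two records, that is a signal to discuss the dataset with a qualified expert rather than rely on the checklist alone.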
Common mistakes that compromise Safe Harbor compliance
Knowing the 18 identifiers doesn't prevent every compliance failure. These are the errors most likely to undermine an otherwise well-structured de-identification process.
- Overlooking unstructured data. PHI doesn't only live in labeled database columns. Names, dates and medical record numbers appear constantly in clinical notes, referral letters, discharge summaries and support transcripts. Structured database de-identification tools miss these entirely.
- Misapplying the zip code rule. Retaining a three-digit zip code without verifying the current Census population threshold is a common shortcut—and one that may not survive regulatory scrutiny.
- Leaving date elements in free text. Removing a date-of-birth column from a database doesn't help if a clinical note reads "patient presented on their 68th birthday." Automated NLP-based detection is essential for unstructured sources.
- Skipping the "no actual knowledge" check. The 18-identifier removal is process-level compliance. The actual knowledge check is human-level compliance. Both are required—and the second is the one most organizations skip.
- Using re-identification codes derived from PHI. If your internal code is based on the patient's Social Security number, medical record number or any other identifier, it violates the Privacy Rule regardless of how obscured it appears.
- Failing to document the process. Without written records of your methodology, tool choices and sign-off, you have no defensible audit trail—even if your de-identification was technically correct.
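To illustrate the re-identification code mistake, the Python sketch below assigns each record a random code instead of transforming an existing identifier. The `assign_codes` helper and the 16-character code length are assumptions for illustration. The crosswalk it returns is the re-identification mechanism, so it would need to be stored securely and kept out of anything you disclose.

```python
import secrets
from typing import Dict, Iterable


def assign_codes(record_ids: Iterable[str]) -> Dict[str, str]:
    """Map each internal record ID to a random code that is not derived from it."""
    crosswalk: Dict[str, str] = {}
    used = set()
    for record_id in record_ids:
        if record_id in crosswalk:
            continue  # keep one stable code per record
        code = secrets.token_hex(8)  # 16 hex characters from the OS random source
        while code in used:  # collisions are very unlikely, but cheap to rule out
            code = secrets.token_hex(8)
        used.add(code)
        crosswalk[record_id] = code
    return crosswalk


crosswalk = assign_codes(["MRN-0001", "MRN-0002", "MRN-0001"])
print(len(crosswalk))
```

Because the codes come from a random source, nothing about the code can be traced back to the patient without the crosswalk, which is the property the rule is asking for.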
Ready to de-identify PHI at scale?
Applying the HIPAA Safe Harbor method manually is error-prone and difficult to maintain as data volumes grow—especially when PHI is embedded in clinical notes, call transcripts, EHR free-text and other unstructured sources. Limina's de-identification platform identifies and removes all 18 Safe Harbor identifier categories across structured and unstructured data.