Your research team has spent months building a dataset from patient records across five hospitals. The IRB application is drafted, the FDA submission is three weeks out, and your legal team has just flagged a problem: Safe Harbor de-identification alone may not satisfy the evidentiary standard your reviewers will expect.
This is not a hypothetical. FDA submissions, IRB protocols, and multi-site research data sharing agreements are each governed by distinct—and overlapping—requirements for how Protected Health Information (PHI) must be handled before it can be used in research. In many of these scenarios, HIPAA expert determination is not optional. It is the expected standard.
What is expert determination in a research context? Under 45 CFR §164.514(b)(1), HIPAA expert determination is a de-identification method in which a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods (in practice, usually a statistician) assesses the risk that the information could be used, alone or in combination with other reasonably available information, to identify an individual. The expert must determine that this risk is "very small" and document the methods and results that justify the determination. For research data, reviewers often expect this method rather than merely prefer it, because Safe Harbor's blunt removal of 18 identifiers frequently destroys the clinical specificity that makes research data valuable.
This article walks through the specific requirements that FDA, IRBs, and HIPAA impose on research data de-identification, and explains when expert determination is the right—or only—path forward.
When FDA requires statistical de-identification validation
The FDA does not have a single, uniform rule that requires expert determination by name. However, two regulatory frameworks create strong practical pressure toward it.
21 CFR Part 11 and data integrity requirements
21 CFR Part 11 governs electronic records and electronic signatures in FDA-regulated research. It requires that systems and processes producing research data demonstrate auditability, accuracy, and integrity. When de-identification is part of a data preparation pipeline for a clinical study or drug approval submission, Part 11's documentation requirements effectively require a demonstrable, auditable methodology—the kind that a well-structured expert determination report provides. Safe Harbor, which is process-based rather than evidence-based, cannot produce the statistical documentation that Part 11's audit trail expectations imply.
Real-World Evidence guidance and de-identification standards
FDA's Real-World Evidence (RWE) framework, outlined in guidance documents issued under the 21st Century Cures Act, introduces specific expectations for how real-world data (RWD)—including EHR data, insurance claims, and patient registries—is prepared for regulatory review. The guidance emphasizes the need for "fit-for-purpose" data quality and calls for documented methodology when sensitive patient data is included in RWE studies.
In practice, FDA reviewers evaluating RWE submissions increasingly expect applicants to demonstrate that de-identification was performed rigorously and that re-identification risk has been quantitatively assessed. Safe Harbor's categorical removal of 18 identifiers satisfies HIPAA technically, but it does not produce the statistical artifact that demonstrates risk quantification. Expert determination does.
Key requirements for FDA-facing research data de-identification:
- Documented, reproducible methodology
- Quantitative assessment of re-identification risk
- Independent expert certification with stated credentials
- Audit-ready output tied to the specific dataset and time period
- Written report retained as part of the regulatory submission package
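One way to make a report "tied to the specific dataset and time period" is to record a cryptographic fingerprint of the exact extract the expert reviewed. The sketch below is illustrative only: the function names, manifest fields, and sample values are assumptions made for this example, not an FDA-specified format.

```python
import hashlib
from datetime import date


def build_manifest(data: bytes, method: str, period_start: date,
                   period_end: date, expert: str) -> dict:
    """Tie a de-identification record to one exact dataset extract."""
    return {
        # SHA-256 of the exact bytes the expert reviewed
        "sha256": hashlib.sha256(data).hexdigest(),
        "method": method,
        "period_start": period_start.isoformat(),
        "period_end": period_end.isoformat(),
        "expert": expert,
    }


def manifest_matches(manifest: dict, data: bytes) -> bool:
    """Check whether a dataset is byte-for-byte the one the manifest describes."""
    return manifest["sha256"] == hashlib.sha256(data).hexdigest()


# Hypothetical extract and expert name, for illustration only.
extract = b"patient_key,age_band,dx\nA1,40-49,E11\n"
manifest = build_manifest(
    extract,
    method="expert determination",
    period_start=date(2023, 1, 1),
    period_end=date(2023, 12, 31),
    expert="J. Doe, PhD (hypothetical)",
)
```

If the extract changes after the report is signed, even by one row, `manifest_matches` returns `False`, which signals that the determination may no longer cover the data being submitted.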
IRB requirements for de-identified research data
Institutional Review Boards (IRBs) are governed by the Common Rule (45 CFR Part 46) and, for HIPAA-covered entities, by the Privacy Rule. When a researcher seeks an IRB waiver of authorization, the standard mechanism for using PHI in research without each patient's signed authorization, the quality of de-identification is a central consideration.
When an IRB waiver depends on de-identification quality
Under 45 CFR §164.512(i), a covered entity may use or disclose PHI for research without individual authorization if an IRB or Privacy Board has granted a waiver. One criterion for that waiver is that the use or disclosure involves no more than minimal risk to individuals' privacy. The strength of your de-identification methodology directly influences whether an IRB will agree that privacy risk is minimal.
IRBs vary significantly in how rigorous their review is. Smaller academic IRBs may accept a Safe Harbor attestation. Large research hospitals, NIH-funded programs, and multi-site studies increasingly expect documented statistical validation. A well-constructed expert determination report—covering methodology, risk quantification, and expert credentials—strengthens any IRB application and reduces the likelihood of a request for additional information, a delay that can cost months.
What IRBs look for in de-identification documentation
| IRB review criterion | Safe Harbor satisfies? | Expert determination satisfies? |
| --- | --- | --- |
| De-identification method documented | Yes: method is defined by the HIPAA Privacy Rule | Yes, plus quantitative risk analysis |
| Re-identification risk quantified | No: categorical removal only | Yes: core deliverable of the report |
| Independent expert certification | Not required | Yes: an expert's documented determination is required by definition |
| Data utility preserved for research | Often not: removal is blunt | Yes: statistical approach preserves usable fields |
| Suitable for longitudinal or rare disease data | Rarely | Yes: designed for complex datasets |
| Audit-ready report available for IRB file | No formal report | Yes: written report is the deliverable |
HIPAA expert determination for multi-site research
Multi-site research—studies that aggregate patient data from multiple covered entities—creates additional de-identification complexity that Safe Harbor is not designed to handle.
The aggregation problem in multi-site datasets
Safe Harbor removes 18 specific identifiers. But when data from multiple institutions is combined, seemingly innocuous fields (rare diagnoses, unusual procedure codes, geographic data generalized only to state level) can become identifying in combination. The statistical re-identification risk in a merged multi-site dataset can be higher than in any single institution's data, even after Safe Harbor removal.
Expert determination addresses this directly. A qualified statistician assesses re-identification risk in the combined dataset, accounting for population rarity, data richness, and the specific combination of fields, and documents whether risk is very small. This is why data sharing agreements in multi-site research often specify expert determination for sensitive datasets.
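The "identifying in combination" effect can be illustrated with a simple equivalence-class count: group records by their quasi-identifier values and look at the smallest group. The sketch below is a deliberately simplified, sample-level metric with made-up records and hypothetical function names. A real expert determination would also account for the underlying population, linkable external data, and the data recipient, so treat this as an illustration of the idea rather than the method an expert would use.

```python
from collections import Counter


def class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def max_risk(records, quasi_identifiers):
    """Worst-case sample risk: 1 divided by the size of the smallest group."""
    return 1 / min(class_sizes(records, quasi_identifiers).values())


def unique_fraction(records, quasi_identifiers):
    """Share of records that are the only member of their group."""
    sizes = class_sizes(records, quasi_identifiers)
    return sum(1 for n in sizes.values() if n == 1) / len(records)


# Made-up merged records from two hypothetical sites; no direct identifiers remain.
records = [
    {"state": "OH", "diagnosis": "E11", "procedure": "A"},
    {"state": "OH", "diagnosis": "E11", "procedure": "B"},
    {"state": "OH", "diagnosis": "Q87", "procedure": "A"},
    {"state": "MI", "diagnosis": "E11", "procedure": "A"},
    {"state": "MI", "diagnosis": "Q87", "procedure": "B"},
    {"state": "MI", "diagnosis": "Q87", "procedure": "B"},
]

single_field_risk = max_risk(records, ["state"])
combined_risk = max_risk(records, ["state", "diagnosis", "procedure"])
combined_unique = unique_fraction(records, ["state", "diagnosis", "procedure"])
```

In this toy data every individual field is shared by at least three records, yet four of the six records become unique once the three fields are considered together. That gap between per-field and combined risk is what the statistician's analysis is meant to measure.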
Data use agreements and covered entity obligations
When a covered entity shares data under a Data Use Agreement (DUA) for research, it retains responsibility for ensuring the data was de-identified in compliance with HIPAA before sharing. If the receiving party later uses the data in a way that causes a breach or re-identification event, the originating covered entity may face OCR scrutiny of its de-identification methodology.
Expert determination—with a documented, signed report from a qualified independent statistician—provides a defensible record that the covered entity met its HIPAA obligations. A Safe Harbor checklist does not offer the same evidentiary weight in an OCR investigation or litigation context.
Safe Harbor vs expert determination for research data
Both methods are HIPAA-compliant. The question is which is appropriate for your specific research use case. The answer depends on what you're doing with the data and who's reviewing it.
| Factor | Safe Harbor | Expert determination |
| --- | --- | --- |
| How it works | Remove all 18 HIPAA identifiers | Statistician assesses re-identification risk quantitatively |
| Who performs it | Your team, following the Privacy Rule's Safe Harbor list (45 CFR §164.514(b)(2)) | Qualified independent statistician |
| Output | Compliant dataset + process documentation | Written report with methodology, risk assessment, certification |
| Data utility | Often significantly reduced | Preserves more data, structured around what's safe to keep |
| Suitable for FDA submissions | May not meet evidentiary standard | Yes: provides audit-ready statistical validation |
| Suitable for IRB waiver applications | Depends on IRB rigor | Strongest possible documentation for IRB review |
| Multi-site aggregated data | Insufficient for combined datasets | Designed for complex, aggregated datasets |
| Rare disease or longitudinal studies | Often destroys clinical value | Preserves longitudinal and rare-population data |
| Audit defensibility | Limited | High: signed expert report is a legal artifact |
The general principle: Safe Harbor is appropriate for routine operational uses of de-identified data—analytics, training AI models on common disease populations, reporting. Expert determination is appropriate—and often required—when the data will be reviewed by a regulatory body, shared under a DUA, or used in longitudinal or rare-population research where data specificity matters.
What researchers need from a de-identification platform
Expert determination begins with clean, well-documented de-identification. The quality of the statistician's analysis depends entirely on the quality of the inputs. A de-identification platform used in research contexts must meet four requirements:
- High accuracy on clinical data. General-purpose cloud tools detect 60–70 percent of PHI in real clinical datasets. That miss rate is not acceptable in research or regulatory contexts. Limina's purpose-built models achieve 99.5 percent or higher accuracy on real healthcare data—meaning the de-identified output the expert receives is genuinely clean, not nominally clean.
- Complete audit trails. Every de-identification run must produce a documented record of what was detected, what was redacted, and what methodology governed the process. This documentation becomes part of the expert's supporting evidence and may be required by the IRB or FDA.
- Expert-ready output structure. The de-identified dataset must be formatted in a way that allows the statistician to perform their re-identification risk analysis efficiently. This includes preserved metadata, field-level documentation, and output formats the expert's statistical tools can consume.
- Deployment that protects data. Research data—especially multi-site PHI—cannot leave institutional infrastructure. The de-identification platform must deploy within your environment (in-VPC or on-premises), not via a cloud API that routes data to a third-party server.
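As a rough illustration of the audit-trail requirement above, the sketch below redacts a note and emits a record of what was detected, what was redacted, and which methodology applied. It is a minimal example under stated assumptions: `Detection`, `redact_with_log`, and the `methodology` label are hypothetical names invented here, not Limina's API, and the detections are supplied by hand rather than by a real PHI detector. The log stores entity types and character offsets only, so the audit record itself does not repeat the PHI.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Detection:
    entity_type: str  # e.g. "NAME" or "DATE"
    start: int        # inclusive character offset in the original text
    end: int          # exclusive character offset in the original text


def redact_with_log(text, detections, methodology):
    """Replace each detected span with a type label and return an audit record.

    Assumes detections do not overlap.
    """
    redacted = text
    # Apply replacements right to left so earlier offsets stay valid.
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        redacted = redacted[:d.start] + f"[{d.entity_type}]" + redacted[d.end:]
    record = {
        "methodology": methodology,
        "detections": [asdict(d) for d in sorted(detections, key=lambda d: d.start)],
        "redacted_count": len(detections),
    }
    return redacted, record


# Hand-built example; offsets are computed from the text rather than hard-coded.
text = "Seen by Dr. Lee on 2021-03-04."
name_start = text.find("Lee")
date_start = text.find("2021-03-04")
detections = [
    Detection("NAME", name_start, name_start + len("Lee")),
    Detection("DATE", date_start, date_start + len("2021-03-04")),
]
redacted, record = redact_with_log(text, detections, methodology="demo-rules-v0")
```

Here `redacted` is `"Seen by Dr. [NAME] on [DATE]."`, and `record` lists both detections with their offsets, which is the kind of per-run evidence a statistician or reviewer could inspect later.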
How Limina supports research data de-identification
For pharma and life sciences organizations and academic medical centers preparing data for FDA submissions or IRB review, Limina's data de-identification platform handles the de-identification layer that precedes the expert's statistical analysis—producing the clean, structured, audit-ready output the report depends on.
Limina deploys in-VPC or on-premises, ensuring that research data never leaves your controlled infrastructure during the de-identification process. It detects and redacts PHI across unstructured data formats—clinical notes, EHR exports, research transcripts, PDF reports—with accuracy that meets the evidentiary standard expert determination requires.
The platform produces structured, audit-ready outputs—including field-level detection logs, redaction reports, and dataset summaries—in the format independent statisticians need to perform re-identification risk analysis efficiently. Limina also works with a partner network of qualified independent experts who produce expert determination reports specifically structured for FDA, IRB, and HIPAA audit review.
Ready to prepare your research data for FDA, IRB, and HIPAA review?
Limina's data de-identification platform produces audit-ready outputs structured for expert determination—deployed within your infrastructure, built for the accuracy standards that research and regulatory contexts demand.
Whether you're preparing a Real-World Evidence submission, an IRB application, or a multi-site data sharing package, Limina provides the de-identified input your independent expert needs to certify re-identification risk with confidence.
Get a demo—talk to us about your specific dataset and research use case.