Healthtech · PII Detection · Benchmark Analysis · 2026

PII Detection for Healthcare AI: What the Benchmarks Show

A systematic comparison of five PII detection platforms on accuracy, HIPAA coverage, multilingual performance, and deployment architecture, benchmarked on the ai4privacy 500k dataset.

Limina Research
10 min read

Deploying AI in healthcare, whether for patient communication, clinical documentation, or revenue cycle workflows, means every piece of data must pass through a security and compliance layer before touching any model or log. The PII detection tool you choose determines your exposure surface, your regulatory posture, and your ability to operate across markets.
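A minimal sketch of that layering, assuming a hypothetical `detect_pii` function standing in for whichever detection tool sits in the compliance layer (the regexes here are illustrative only, nothing like a production detector): text is scrubbed before it can reach a model client or a log line.

```python
import re

# Hypothetical stand-in for a real PII detection tool: returns (start, end)
# character spans for anything that looks like an SSN or a US phone number.
def detect_pii(text: str) -> list[tuple[int, int]]:
    patterns = [r"\b\d{3}-\d{2}-\d{4}\b",   # SSN-like
                r"\b\d{3}-\d{3}-\d{4}\b"]   # US-phone-like
    spans = []
    for p in patterns:
        spans.extend(m.span() for m in re.finditer(p, text))
    return sorted(spans)

def redact(text: str, token: str = "[REDACTED]") -> str:
    # Replace spans right-to-left so earlier offsets stay valid.
    for start, end in reversed(detect_pii(text)):
        text = text[:start] + token + text[end:]
    return text

message = "Patient SSN 123-45-6789, callback 555-867-5309."
safe = redact(message)
# Only `safe` may be logged or forwarded to a model.
```

The point of the pattern is architectural, not the regexes: the raw string never crosses the boundary, so detection quality alone determines exposure.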

This analysis evaluates Limina against four commonly deployed alternatives, including OpenAI’s foray into the category: AWS Comprehend, Azure Language Services, Microsoft Presidio, and the OpenAI Privacy Filter. All tests use the ai4privacy 500k dataset, a multi-domain corpus spanning finance, healthcare, and legal text, with labels mapped to a common schema. No tool in this analysis was trained on any portion of the evaluation dataset.

Dataset
ai4privacy 500k
WHY THIS MATTERS FOR HEALTHTECH

The Compliance Gap Most Tools Leave Open

Healthcare organizations building AI products face requirements that general-purpose PII tools were not designed for. HIPAA Safe Harbor requires removing all 18 defined identifiers. Expert Determination—a stronger standard—requires a qualified statistician to certify that re-identification risk is statistically very small. Most tools don’t document which standard they meet. Beyond the regulatory question, three practical failure modes matter in production: recall gaps (entities that slip through undetected), language coverage (patients don’t communicate only in English), and deployment architecture (whether data ever leaves your environment at all).

The benchmark priority for healthtech: Recall, the share of PII actually caught, matters more than precision. A false positive is an inconvenience. A false negative is a reportable incident.
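The trade-off reduces to simple arithmetic over true positives (TP), false positives (FP), and false negatives (FN); the counts below are invented for illustration, not drawn from the benchmark:

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# 1,000 true entities: tool A over-flags, tool B under-detects.
a = f1(tp=990, fp=110, fn=10)    # recall 0.99, precision 0.90
b = f1(tp=900, fp=10, fn=100)    # recall 0.90, precision ~0.99
```

The two F1 scores come out nearly identical (~0.94), but tool B let 100 entities slip through: near-equal F1, very different incident exposure. That is why recall is the headline metric here.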

HIPAA SAFE HARBOR

18 Identifiers

Names, geodata, dates, SSN, MRN, device IDs, biometrics

TOOLS WITH FULL COVERAGE

1 of 5 Tested

AWS & Azure undocumented. Presidio & OpenAI PF partial.

EXPERT DETERMINATION

99.96%

Statistician-certified re-id. Higher bar than Safe Harbor.

OPENAI PF COVERAGE

8 of 18

Less than half of Safe Harbor. No PCI coverage.
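One way to make partial coverage auditable is to diff a tool's entity labels against the 18 Safe Harbor categories from 45 CFR 164.514(b)(2). The category names and the tool's label set below are illustrative placeholders, not any vendor's actual taxonomy:

```python
# The 18 HIPAA Safe Harbor identifier categories (45 CFR 164.514(b)(2)),
# under illustrative snake_case names.
SAFE_HARBOR = {
    "name", "geographic_subdivision", "date", "phone", "fax", "email",
    "ssn", "medical_record_number", "health_plan_number", "account_number",
    "certificate_license_number", "vehicle_identifier", "device_identifier",
    "url", "ip_address", "biometric_identifier", "full_face_photo",
    "other_unique_identifier",
}

def coverage_gap(supported: set[str]) -> set[str]:
    """Safe Harbor categories the tool does not detect."""
    return SAFE_HARBOR - supported

# Hypothetical label set for a tool covering 8 of the 18 categories.
tool_labels = {"name", "date", "phone", "email", "ssn", "url",
               "ip_address", "account_number"}
missing = coverage_gap(tool_labels)
# 10 categories uncovered -- each one an undetected-identifier risk.
```

Running this against a vendor's documented entity list is a quick first-pass compliance check before any accuracy benchmarking.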

DETECTION ACCURACY

Head-to-head benchmark results

Results compare each tool’s recall, adjusted recall, and F1 against Limina’s on the ai4privacy 500k dataset. Adjusted recall measures entities detected regardless of exact label—a useful proxy for real-world coverage gaps.

Tools compared: AWS Comprehend · Azure Language · MS Presidio · OpenAI PF

Metric          | Limina’s margin  | What this means
Recall          | +10.4% (+0.0854) | More PII caught before it reaches any log or model
Adjusted Recall | +12.1% (+0.1037) | Fewer coverage gaps, even where label taxonomies differ
F1 (overall)    | +5.5% (+0.0473)  | Better balance of precision and recall overall
Custom privacy-focused dataset:

                | Precision | Recall | Adj. Recall | F1
Limina          | 0.940     | 0.939  | 0.961       | 0.938
AWS Comprehend  | 0.904     | 0.698  | 0.764       | 0.777

Azure Language, MS Presidio, and OpenAI PF results appear in the full methodology report.
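Adjusted recall as used above, where a detection counts if its span matches even when the predicted label differs, can be sketched over (start, end, label) annotations; the annotations below are made up for illustration:

```python
def recall_pair(gold, pred):
    """Strict recall (span + label must match) vs. adjusted recall
    (a span match alone counts), over (start, end, label) tuples."""
    pred_exact = set(pred)
    pred_spans = {(s, e) for s, e, _ in pred}
    strict = sum(g in pred_exact for g in gold) / len(gold)
    adjusted = sum((s, e) in pred_spans for s, e, _ in gold) / len(gold)
    return strict, adjusted

gold = [(0, 4, "NAME"), (10, 19, "SSN"), (25, 37, "PHONE")]
pred = [(0, 4, "NAME"), (10, 19, "ID_NUMBER")]   # SSN found but mislabeled
strict, adjusted = recall_pair(gold, pred)
# strict = 1/3 (the mislabeled SSN does not count); adjusted = 2/3 (it does)
```

For redaction purposes the mislabeled SSN is still removed, which is why adjusted recall is the better proxy for real-world coverage.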

DEPLOYMENT & COMPLIANCE

How the five tools compare beyond accuracy

Deployment architecture, language coverage, and regulatory scope side by side. Cells marked “-” were not reported.

Metric                                   | Limina   | AWS Comprehend | Azure Language | MS Presidio | OpenAI PF
On-prem / VPC deploy: data leaves env.   | Never    | Yes            | Config-dep.    | Never       | No
Languages (async)                        | 54       | 2              | 3 + preview    | 1 native    | ~10
All 18 HIPAA identifiers                 | 18 of 18 | Unknown        | Partial        | Partial     | 8 of 18
Full PCI coverage                        | Full     | Unknown        | Partial        | Partial     | No
Deterministic output                     | Partial  | -              | -              | -           | -
Independent audit                        | Yes      | -              | -              | -           | -
INDEPENDENT VERIFICATION

Two Kinds of External Validation

Self-reported benchmarks and independently audited results answer different questions. For compliance teams evaluating vendors, the distinction matters.

Model Accuracy Audit

99.5%+

Independently validated by Armilla AI

Third-party audit of the detection engine itself — not a self-reported figure. Armilla AI warranted accuracy independently.

Regulatory — HIPAA Expert Determination

99.96%

Validated by Aetion / Datavant

Requires a qualified statistician to certify re-identification risk is statistically very small. Higher bar than HIPAA Safe Harbor.

54

Languages in active deployment

99.5%+

Independently validated accuracy

99.96%

HIPAA Expert Determination

18/18

HIPAA identifiers covered

FULL BENCHMARK REPORT

What’s in the complete benchmark report

Detailed methodology, per-entity-type breakdowns, extended multilingual results across 8+ languages, and scoring documentation for compliance and security review.

Per-entity breakdowns
8-language recall
HIPAA identifier mapping
Scoring methodology
Dataset

Benchmarks: ai4privacy 500k dataset, labels mapped to a common schema, Limina v4.3.0-GPU, April 2026. No tool evaluated was trained on any portion of the evaluation dataset. AWS Comprehend, Azure Language Services, Microsoft Presidio, and OpenAI Privacy Filter are products of their respective owners.