PII Detection for Healthcare AI: What the Benchmarks Show
A systematic comparison of five PII detection platforms on accuracy, HIPAA coverage, multilingual performance, and deployment architecture, benchmarked on the ai4privacy 500k dataset.
Deploying AI in healthcare, whether for patient communication, clinical documentation, or revenue cycle workflows, means every piece of data must pass through a security and compliance layer before touching any model or log. The PII detection tool you choose determines your exposure surface, your regulatory posture, and your ability to operate across markets.
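As an illustration of that layer, the sketch below shows the general shape of a pre-model gate: raw text is scanned, detected spans are replaced with typed placeholders, and only the redacted text ever reaches the model or application logs. The detect_pii/redact interface and the Finding type are hypothetical placeholders for whichever detection engine you deploy, not any vendor's actual API.

```python
# Minimal sketch of a pre-model compliance gate.
# detect_pii, redact, and Finding are hypothetical, not a vendor API.
from dataclasses import dataclass


@dataclass
class Finding:
    entity_type: str   # e.g. "NAME", "MRN", "DATE"
    start: int         # character offsets of the detected span
    end: int


def detect_pii(text: str) -> list[Finding]:
    """Placeholder for whichever detection engine you deploy."""
    raise NotImplementedError


def redact(text: str, findings: list[Finding]) -> str:
    """Replace each detected span with a typed placeholder token."""
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[:f.start] + f"[{f.entity_type}]" + text[f.end:]
    return text


def safe_model_call(raw_text: str, call_model) -> str:
    """Only redacted text reaches the model or the logs."""
    findings = detect_pii(raw_text)
    return call_model(redact(raw_text, findings))
```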
This analysis evaluates five commonly deployed options, including OpenAI’s foray into the category: AWS Comprehend, Azure Language Services, Microsoft Presidio, the OpenAI Privacy Filter, and Limina. All tests use the ai4privacy 500k dataset, a multi-domain corpus spanning finance, healthcare, and legal text, with labels mapped to a common schema. No tool in this analysis was trained on any portion of the evaluation dataset.
The Compliance Gap Most Tools Leave Open
Healthcare organizations building AI products face requirements that general-purpose PII tools were not designed for. HIPAA Safe Harbor requires removing all 18 defined identifiers. Expert Determination—a stronger standard—requires a qualified statistician to certify that re-identification risk is statistically very small. Most tools don’t document which standard they meet. Beyond the regulatory question, three practical failure modes matter in production: recall gaps (entities that slip through undetected), language coverage (patients don’t communicate only in English), and deployment architecture (whether data ever leaves your environment at all).
The benchmark priority for healthtech: Recall, the share of PII actually caught, matters more than precision. A false positive is an inconvenience. A false negative is a reportable incident.
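To make that tradeoff concrete, here is a back-of-the-envelope calculation. The counts are invented for illustration, not taken from the benchmark: a tool with near-perfect precision but modest recall still leaks thousands of identifiers, while a slightly noisier tool with high recall leaks far fewer.

```python
# Illustrative only: counts are invented for the example, not benchmark data.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Suppose a corpus contains 10,000 true PII entities.
# Tool A: very high precision, lower recall -> 3,000 entities slip through.
p, r, f1 = precision_recall_f1(tp=7_000, fp=100, fn=3_000)
print(f"Tool A  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}  missed=3000")

# Tool B: slightly noisier, far fewer misses -> 600 entities slip through.
p, r, f1 = precision_recall_f1(tp=9_400, fp=700, fn=600)
print(f"Tool B  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}  missed=600")
```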
18 identifiers: the full HIPAA Safe Harbor list, including names, geodata, dates, SSN, MRN, device IDs, and biometrics.
1 of 5 tools tested documents full coverage: AWS & Azure undocumented; Presidio & OpenAI PF partial.
99.96%: statistician-certified re-identification standard (Expert Determination), a higher bar than Safe Harbor.
8 of 18: less than half of the Safe Harbor identifiers, with no PCI coverage.
Head-to-head benchmark results
Results compare each tool’s precision, recall, adjusted recall, and F1 against Limina’s on the ai4privacy 500k dataset. Adjusted recall counts entities detected regardless of exact label, a useful proxy for real-world coverage gaps; a scoring sketch follows the table.
                 Precision   Recall   Adj. Recall   F1
Limina           0.940       0.939    0.961         0.938
AWS Comprehend   0.904       0.698    0.764         0.777
Azure Language Services, Microsoft Presidio, and OpenAI Privacy Filter results are available in the full methodology report.
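For readers reproducing the scoring, the sketch below shows one way to compute strict versus adjusted recall over labeled spans. The matching rules here are assumptions for illustration (a strict match requires both span and label; an adjusted match credits any detection over the gold span regardless of label); the full methodology report documents the exact scoring used in the benchmark.

```python
# Sketch of strict vs. adjusted recall over labeled spans.
# Matching rules are assumptions for illustration, not the report's exact scoring.
from typing import NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str


def recall_scores(gold: list[Span], predicted: list[Span]) -> tuple[float, float]:
    pred_spans = {(p.start, p.end): p.label for p in predicted}
    # Strict: span and label must both match the gold annotation.
    strict = sum(1 for g in gold if pred_spans.get((g.start, g.end)) == g.label)
    # Adjusted: any detection over the gold span counts, whatever the label.
    adjusted = sum(1 for g in gold if (g.start, g.end) in pred_spans)
    return strict / len(gold), adjusted / len(gold)


gold = [Span(0, 8, "NAME"), Span(15, 27, "MRN"), Span(40, 50, "DATE")]
pred = [Span(0, 8, "NAME"), Span(15, 27, "ACCOUNT_ID")]   # right span, wrong label

strict, adjusted = recall_scores(gold, pred)
print(f"strict recall={strict:.2f}  adjusted recall={adjusted:.2f}")  # 0.33 vs 0.67
```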
Two Kinds of External Validation
Self-reported benchmarks and independently audited results answer different questions. For compliance teams evaluating vendors, the distinction matters.
99.5%+ accuracy: a third-party audit of the detection engine itself, not a self-reported figure; Armilla AI independently warranted the accuracy.
99.96% Expert Determination: requires a qualified statistician to certify that re-identification risk is statistically very small, a higher bar than HIPAA Safe Harbor.
54 languages in active deployment
99.5%+ independently validated accuracy
99.96% HIPAA Expert Determination
18/18 HIPAA identifiers covered
What’s in the complete benchmark report
Detailed methodology, per-entity-type breakdowns, extended multilingual results across 8+ languages, and scoring documentation for compliance and security review.
Benchmarks: ai4privacy 500k dataset, labels mapped to common schema, Limina v4.3.0-GPU, April 2026. No tool evaluated was trained on any portion of the evaluation dataset. AWS Comprehend, Azure Language Services, Microsoft Presidio, and OpenAI Privacy Filter are products of their respective owners.