PII Detection for Healthcare AI: What the Benchmarks Show
A systematic comparison of five PII detection platforms on accuracy, HIPAA coverage, multilingual performance, and deployment architecture, benchmarked on the ai4privacy 500k dataset.
Deploying AI in healthcare, whether for patient communication, clinical documentation, or revenue cycle workflows, means every piece of data must pass through a security and compliance layer before touching any model or log. The PII detection tool you choose determines your exposure surface, your regulatory posture, and your ability to operate across markets.
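As an illustration of that layer, the sketch below shows the general shape of a pre-model gate: raw text is scanned, detected spans are replaced with typed placeholders, and only the redacted text ever reaches the model or application logs. The detect_pii/redact interface and the Finding type are hypothetical placeholders for whichever detection engine you deploy, not any vendor's actual API.

```python
# Minimal sketch of a pre-model compliance gate.
# detect_pii, redact, and Finding are hypothetical, not a vendor API.
from dataclasses import dataclass


@dataclass
class Finding:
    entity_type: str   # e.g. "NAME", "MRN", "DATE"
    start: int         # character offsets of the detected span
    end: int


def detect_pii(text: str) -> list[Finding]:
    """Placeholder for whichever detection engine you deploy."""
    raise NotImplementedError


def redact(text: str, findings: list[Finding]) -> str:
    """Replace each detected span with a typed placeholder token."""
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[:f.start] + f"[{f.entity_type}]" + text[f.end:]
    return text


def safe_model_call(raw_text: str, call_model) -> str:
    """Only redacted text reaches the model or the logs."""
    findings = detect_pii(raw_text)
    return call_model(redact(raw_text, findings))
```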
This analysis evaluates five commonly deployed options, including OpenAI’s foray into the category: AWS Comprehend, Azure Language Services, Microsoft Presidio, the OpenAI Privacy Filter, and Limina. All tests use the ai4privacy 500k dataset, a multi-domain corpus spanning finance, healthcare, and legal text, with labels mapped to a common schema. No tool in this analysis was trained on any portion of the evaluation dataset.
The Compliance Gap Most Tools Leave Open
Healthcare organizations building AI products face requirements that general-purpose PII tools were not designed for. HIPAA Safe Harbor requires removing all 18 defined identifiers. Expert Determination—a stronger standard—requires a qualified statistician to certify that re-identification risk is statistically very small. Most tools don’t document which standard they meet. Beyond the regulatory question, three practical failure modes matter in production: recall gaps (entities that slip through undetected), language coverage (patients don’t communicate only in English), and deployment architecture (whether data ever leaves your environment at all).
The benchmark priority for healthtech: Recall, the share of PII actually caught, matters more than precision. A false positive is an inconvenience. A false negative is a reportable incident.
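To make that tradeoff concrete, here is a back-of-the-envelope calculation. The counts are invented for illustration, not taken from the benchmark: a tool with near-perfect precision but modest recall still leaks thousands of identifiers, while a slightly noisier tool with high recall leaks far fewer.

```python
# Illustrative only: counts are invented for the example, not benchmark data.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Suppose a corpus contains 10,000 true PII entities.
# Tool A: very high precision, lower recall -> 3,000 entities slip through.
p, r, f1 = precision_recall_f1(tp=7_000, fp=100, fn=3_000)
print(f"Tool A  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}  missed=3000")

# Tool B: slightly noisier, far fewer misses -> 600 entities slip through.
p, r, f1 = precision_recall_f1(tp=9_400, fp=700, fn=600)
print(f"Tool B  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}  missed=600")
```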
18 identifiers: the full HIPAA Safe Harbor list, including names, geodata, dates, SSN, MRN, device IDs, and biometrics.
1 of 5 tools tested documents full coverage: AWS & Azure undocumented; Presidio & OpenAI PF partial.
99.96%: statistician-certified re-identification standard (Expert Determination), a higher bar than Safe Harbor.
8 of 18: less than half of the Safe Harbor identifiers, with no PCI coverage.
Head-to-head benchmark results
Results compare each tool’s precision, recall, adjusted recall, and F1 against Limina’s on the ai4privacy 500k dataset. Adjusted recall counts entities detected regardless of exact label, a useful proxy for real-world coverage gaps; a scoring sketch follows the table.
                 Precision   Recall   Adj. Recall   F1
Limina           0.940       0.939    0.961         0.938
AWS Comprehend   0.904       0.698    0.764         0.777
Azure Language Services, Microsoft Presidio, and OpenAI Privacy Filter results are available in the full methodology report.
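For readers reproducing the scoring, the sketch below shows one way to compute strict versus adjusted recall over labeled spans. The matching rules here are assumptions for illustration (a strict match requires both span and label; an adjusted match credits any detection over the gold span regardless of label); the full methodology report documents the exact scoring used in the benchmark.

```python
# Sketch of strict vs. adjusted recall over labeled spans.
# Matching rules are assumptions for illustration, not the report's exact scoring.
from typing import NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str


def recall_scores(gold: list[Span], predicted: list[Span]) -> tuple[float, float]:
    pred_spans = {(p.start, p.end): p.label for p in predicted}
    # Strict: span and label must both match the gold annotation.
    strict = sum(1 for g in gold if pred_spans.get((g.start, g.end)) == g.label)
    # Adjusted: any detection over the gold span counts, whatever the label.
    adjusted = sum(1 for g in gold if (g.start, g.end) in pred_spans)
    return strict / len(gold), adjusted / len(gold)


gold = [Span(0, 8, "NAME"), Span(15, 27, "MRN"), Span(40, 50, "DATE")]
pred = [Span(0, 8, "NAME"), Span(15, 27, "ACCOUNT_ID")]   # right span, wrong label

strict, adjusted = recall_scores(gold, pred)
print(f"strict recall={strict:.2f}  adjusted recall={adjusted:.2f}")  # 0.33 vs 0.67
```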
Two Kinds of External Validation
Self-reported benchmarks and independently audited results answer different questions. For compliance teams evaluating vendors, the distinction matters.
99.5%+ accuracy: a third-party audit of the detection engine itself, not a self-reported figure; Armilla AI independently warranted the accuracy.
99.96% Expert Determination: requires a qualified statistician to certify that re-identification risk is statistically very small, a higher bar than HIPAA Safe Harbor.
54 languages in active deployment
99.5%+ independently validated accuracy
99.96% HIPAA Expert Determination
18/18 HIPAA identifiers covered
What’s in the complete benchmark report
Detailed methodology, per-entity-type breakdowns, extended multilingual results across 8+ languages, and scoring documentation for compliance and security review.
Benchmarks: ai4privacy 500k dataset, labels mapped to common schema, Limina v4.3.0-GPU, April 2026. No tool evaluated was trained on any portion of the evaluation dataset. AWS Comprehend, Azure Language Services, Microsoft Presidio, and OpenAI Privacy Filter are products of their respective owners.