When you evaluate a PII redaction tool, you'll almost always see an accuracy claim in the marketing material: 90 percent, 95 percent, even higher. Not all accuracy claims are equal, however. The problem is not the number itself but the absence of context behind it: what data type was tested, on whose dataset, and using what methodology. A figure can be both technically accurate and deeply misleading if it was measured on clean, generic text rather than the clinical notes, call transcripts, or financial documents your organization actually processes. When general-purpose tools are tested on that kind of real-world enterprise data (clinical transcripts, financial documents, contact center recordings, legal filings), independent benchmarks and production experience tell a different story.
The 70 percent figure used throughout this article comes from Limina's own benchmark evaluation of four major cloud PII detection platforms—AWS Comprehend, Azure Cognitive Services, Google DLP, and Microsoft Presidio—tested across 45,000 words of real-world enterprise data, including call transcripts, medical notes, chat logs, and emails. In that evaluation, aggregate recall for general-purpose tools ranged from 57 to 73 percent, with AWS Comprehend achieving 73 percent overall and 70 percent specifically on call transcript data. Full methodology and results are available in the Limina PII Detection Whitepaper. Because redaction can only remove what detection first identifies, detection recall and redaction recall are equivalent metrics for compliance purposes: if a tool misses PII at the detection stage, that PII will remain in your data after redaction.
For compliance-critical environments—healthcare, financial services, legal—a 70 percent recall rate is not a minor shortcoming. It's a systematic compliance failure. This article explains what drives the accuracy gap, how to measure it yourself, what the industry benchmarks actually show, and what a genuine commitment to high-accuracy redaction looks like.
How PII redaction accuracy is measured
Accuracy in PII detection is evaluated using standard information retrieval metrics. Understanding these metrics is essential for interpreting vendor claims and running your own evaluation.
| Metric | What it measures | Why it matters for redaction |
| --- | --- | --- |
| Recall (sensitivity) | Percentage of actual PII instances the tool correctly detected | The most critical metric—missed PII is compliance exposure. Low recall = high risk. |
| Precision | Percentage of detected items that are actually PII (not false positives) | Matters for data utility—over-redaction removes content that wasn't PII, degrading downstream use. |
| F1 score | Harmonic mean of recall and precision—balanced performance measure | Useful for comparing tools across both dimensions, but recall should be weighted more heavily in compliance contexts. |
| Entity-level accuracy | Accuracy broken down by PII type (names, SSNs, dates, etc.) | Aggregate accuracy can hide weak performance on specific entity types. A tool with 95% overall may have 60% recall on medical record numbers. |
| Dataset-specific accuracy | Performance measured on your actual data type | Accuracy on generic benchmarks does not predict performance on your specific domain. Always test on representative samples. |
The single most important metric for compliance is recall. You can tolerate some over-redaction (false positives reduce data utility but don't create compliance exposure). You cannot tolerate under-redaction—false negatives leave PII in the dataset and expose individuals and your organization to harm. See our detailed guide for a deeper comparison of manual vs. automated PII redaction approaches and how each affects recall.
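These metrics are straightforward to compute once you have annotated ground truth. The sketch below shows one common convention: PII instances represented as (start, end, entity_type) character spans, with an exact span match counted as a true positive. The spans and entity labels are illustrative, not from Limina's benchmark.

```python
# Minimal sketch of PII detection scoring under an exact-span-match
# convention. Each instance is a (start, end, entity_type) tuple.

def evaluate(ground_truth, predicted):
    gt, pred = set(ground_truth), set(predicted)
    true_pos = len(gt & pred)
    recall = true_pos / len(gt) if gt else 1.0       # missed PII lowers this
    precision = true_pos / len(pred) if pred else 1.0  # false positives lower this
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1}

# Example: 4 actual PII spans; the tool found 3 of them plus 1 false positive.
truth = [(0, 8, "NAME"), (15, 26, "SSN"), (40, 50, "DOB"), (60, 72, "MRN")]
found = [(0, 8, "NAME"), (15, 26, "SSN"), (40, 50, "DOB"), (80, 90, "NAME")]
print(evaluate(truth, found))  # recall 0.75, precision 0.75, f1 0.75
```

Real evaluations often also allow partial-overlap matching, but exact-span scoring is the stricter and simpler baseline.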
Why general-purpose tools achieve only 60–70 percent recall
The accuracy gap between general-purpose PII tools and purpose-built platforms is well-documented. Independent evaluations of cloud PII detection APIs on real-world clinical and enterprise data consistently find recall rates in the 60 to 70 percent range. Here's why.
Training data mismatch
General-purpose NLP models—including the PII detection APIs offered by major cloud providers—are trained primarily on mainstream text: news articles, Wikipedia, web pages, and lightly curated public datasets. Real-world enterprise PII appears in very different contexts: clinical notes written in abbreviated medical shorthand, call transcripts full of ASR errors, financial documents with industry-specific terminology, and multilingual customer communications.
A model that has never seen a medical record number, a pharmacy NPI, a SWIFT code, or a clinical diagnosis expressed in ICD-10 notation will not reliably detect them—not because the model is poorly designed, but because it was never trained on the data it's now being asked to process.
Entity type coverage gaps
Most general-purpose PII APIs detect a limited set of entity types—typically 10 to 20: names, email addresses, phone numbers, SSNs, credit card numbers, and a handful of others. Domain-specific PII extends far beyond this list. In healthcare alone, Protected Health Information (PHI) includes medical record numbers, health plan beneficiary numbers, device serial numbers, certificate and license numbers, and provider NPI numbers—none of which appear in standard PII API entity lists.
When an entity type isn't in the model's detection scope, it will never be caught. A tool that detects 18 entity types will have structural coverage gaps in any domain where PII routinely appears in additional forms.
Context-dependent PII is missed by rule-based detection
Pattern matching—regex-based detection—is effective for PII that follows a predictable format: 9-digit SSNs, 16-digit card numbers, email addresses. It fails completely on context-dependent PII: a person's name embedded in a sentence, a date of birth expressed conversationally, a patient's home city mentioned in a clinical note. Detecting contextual PII requires understanding language—NLP models trained for Named Entity Recognition (NER)—not just matching patterns.
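The limitation is easy to demonstrate. The toy patterns below are simplified illustrations (not production regexes, and not any vendor's actual rules): they catch formatted identifiers reliably but have no way to flag a name or a conversational date.

```python
import re

# Simplified illustrative patterns -- formatted PII only.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

text = "Spoke with Maria Delgado (SSN 123-45-6789) about her March 14 visit."

print(SSN_RE.findall(text))    # ['123-45-6789'] -- formatted PII is caught
print(EMAIL_RE.findall(text))  # [] -- no email present

# The name "Maria Delgado" and the date "March 14" match no pattern,
# so a purely rule-based pipeline would leave both in the output.
# Catching them requires an NER model that understands context.
```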
General-purpose tools that rely heavily on pattern matching for efficiency will have systematically low recall on contextual and conversational PII, which is precisely the PII that appears most frequently in unstructured enterprise data.
Poor performance on domain-specific text formats
Clinical notes, financial filings, call transcripts, and legal documents each have distinctive text structures that differ significantly from the clean prose in generic training data. A clinical note might read: "Pt. J.D., DOB 3/14/72, adm. 4/19/26 w/ Dx: T2DM, HTN. Rx: Metformin 500mg." A general-purpose model trained on news text has never encountered this structure and will miss multiple PHI elements that a healthcare-specialized model would catch immediately.
The real cost of missed PII
At 70 percent recall, the PII left in your datasets is not random noise—it's systematically the PII that's hardest to detect. Names embedded in context. Domain-specific identifiers. Records in minority languages. Entities that appear once, off-pattern.
Compliance exposure scenario: A healthcare organization uses a general-purpose PII API to de-identify 100,000 clinical notes before using them for AI model training. At 70 percent recall, approximately 30,000 PII instances remain in the dataset. The training data is distributed to a vendor's cloud environment for model development. Each unredacted instance is a potential HIPAA violation. The investigation cost alone for a HIPAA breach of this scale—independent of penalties—can reach millions of dollars.
The compliance exposure is only part of the cost. Missed PII in AI training data creates models that learn to associate certain individuals with medical conditions, financial situations, or behavioral patterns—re-identification risks embedded in the model itself. The downstream risk isn't just the training data; it's every inference the model makes.
For data sharing and research use cases, missed PII creates re-identification risk in published datasets. Research published with insufficiently de-identified data has been repeatedly shown to be re-identifiable through combination attacks—where quasi-identifiers like age, ZIP code, and gender can uniquely identify individuals even when direct identifiers are removed.
What does 99.5 percent accuracy actually mean?
Limina's 99.5 percent accuracy claim on physician conversations is a specific, domain-tested benchmark—not a marketing figure derived from a favorable dataset. Independent PII detection benchmark testing consistently demonstrates the performance gap between purpose-built models and general-purpose APIs. Here's what the numbers mean in practice.
99.5 percent recall on 1,000 PII instances means five instances are missed. At 70 percent recall, 300 instances are missed. The difference between these two numbers is the difference between a compliant de-identification pipeline and a systematic exposure risk.
The 99.5 percent figure reflects Limina's model performance on physician conversation data—one of the most challenging PII detection environments in enterprise use. Physician conversations are fast, heavily abbreviated, use domain-specific terminology, and often contain quasi-identifiers expressed conversationally. A tool that achieves 99.5 percent accuracy on this data type will perform as well or better on most other enterprise data types.
| Data volume | PII instances missed at 70% recall | PII instances missed at 99.5% recall | Difference |
| --- | --- | --- | --- |
| 10,000 documents | 3,000 PII exposed | 50 PII exposed | 2,950 fewer exposures |
| 100,000 documents | 30,000 PII exposed | 500 PII exposed | 29,500 fewer exposures |
| 1,000,000 documents | 300,000 PII exposed | 5,000 PII exposed | 295,000 fewer exposures |
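The exposure figures above follow from simple arithmetic, assuming (for illustration) an average of one PII instance per document; multiply by your own measured PII density to adapt them.

```python
# Expected missed PII = documents x PII density x (1 - recall).
# pii_per_doc = 1 is an illustrative assumption, not a measured value.

def missed_pii(documents, pii_per_doc, recall):
    return round(documents * pii_per_doc * (1 - recall))

for docs in (10_000, 100_000, 1_000_000):
    at_70 = missed_pii(docs, 1, 0.70)
    at_995 = missed_pii(docs, 1, 0.995)
    print(f"{docs:>9,} docs: {at_70:,} missed at 70% vs {at_995:,} at 99.5%")
```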
How to evaluate PII redaction accuracy for your use case
Vendor accuracy claims are a starting point, not a conclusion. Before committing to a redaction tool, run your own accuracy evaluation on a representative sample of your actual data. Here's how.
- Build a test dataset. Select 100 to 500 documents representative of your production data: the same document types, the same domain vocabulary, the same mix of structured and unstructured content, the same languages.
- Annotate ground truth. Have a subject matter expert manually identify and label every PII instance in the test dataset. This annotation is your ground truth. It's time-consuming but essential.
- Run candidate tools. Process the same annotated test dataset through each tool you're evaluating. Record the output: what was detected, what type, and where.
- Calculate recall by entity type. For each PII category (names, dates, SSNs, medical identifiers, etc.), calculate the percentage that was correctly detected. Aggregate accuracy hides entity-level weaknesses.
- Calculate precision. Review detections for false positives: content flagged as PII that isn't. High false positive rates degrade data utility and create excessive manual review workload.
- Test edge cases. Include documents with rare or unusual PII presentations: names in non-Latin scripts, dates in multiple formats, industry-specific identifiers, low-context quasi-identifiers.
- Compare total exposure. Multiply your estimated PII density (instances per document) by your expected document volume, then apply each tool's recall rate. The absolute number of missed PII instances tells a clearer story than a percentage.
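The recall-by-entity-type calculation above is where aggregate numbers most often fall apart, so it is worth sketching. The example below uses hypothetical annotations in a (doc_id, span, entity_type) format of my choosing; any representation that lets you match tool output against ground truth works.

```python
from collections import defaultdict

# Per-entity-type recall: reveals weaknesses that an aggregate score hides.

def recall_by_type(ground_truth, predicted):
    pred = set(predicted)
    hits, totals = defaultdict(int), defaultdict(int)
    for item in ground_truth:
        entity_type = item[2]
        totals[entity_type] += 1
        if item in pred:
            hits[entity_type] += 1
    return {t: hits[t] / totals[t] for t in totals}

# Hypothetical annotated ground truth and tool output.
truth = [
    ("doc1", (0, 8), "NAME"), ("doc1", (20, 31), "SSN"),
    ("doc2", (5, 13), "NAME"), ("doc2", (40, 49), "MRN"),
]
found = [
    ("doc1", (0, 8), "NAME"), ("doc1", (20, 31), "SSN"),
    ("doc2", (5, 13), "NAME"),
]
print(recall_by_type(truth, found))
# NAME and SSN at 1.0, MRN at 0.0 -- the aggregate 75% hides the gap
```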
Accuracy benchmarks by data type
PII detection accuracy varies significantly by data type. The table below provides context for evaluating tools and interpreting vendor claims. On the related question of how de-identification affects AI model accuracy across data types, the research evidence is consistently reassuring.
| Data type | Typical accuracy of general tools | Challenges | Limina performance |
| --- | --- | --- | --- |
| Typed, structured documents (forms, templates) | 80–90% | Labeled fields usually detected; free-text fields missed | 99%+ |
| Unstructured email and documents | 70–80% | Contextual PII, embedded names, quasi-identifiers | 99%+ |
| Clinical notes and medical records | 60–75% | Medical abbreviations, domain-specific identifiers, PHI in context | 99.5%+ |
| Call transcripts and ASR output | 55–70% | ASR errors, spoken digit patterns, phonetic name variants | 99.5%+ |
| Multilingual documents | 50–70% | Model performance degrades for non-English languages | Supports 52 languages |
| Scanned PDFs (after OCR) | 60–75% | OCR errors compound detection difficulty | 99%+ with integrated OCR pipeline |
The accuracy gap is a compliance gap
Limina's platform achieves 99.5 percent or higher accuracy on real-world healthcare and enterprise data—not because of a single technique, but because of models trained specifically on the data types, languages, and PII patterns found in production environments. Coupled with in-VPC deployment and 50+ entity type coverage across 52 languages, it's the platform built for compliance-critical accuracy requirements.
Get a demo at getlimina.ai/en/contact-us