April 14, 2026

PII Redaction Accuracy: Why 70% Is Not Good Enough

PII redaction accuracy measures how reliably a tool detects and removes all personally identifiable information from a dataset. In practice, it's expressed as recall—the percentage of PII instances in a document that the tool successfully identifies. A tool with 70 percent recall misses 30 percent of PII. On a dataset with 1,000 PII instances, that's 300 instances left exposed.


When you evaluate a PII redaction tool, you'll almost always see an accuracy claim in the marketing material—90 percent, 95 percent, even higher. Not all accuracy claims are equal, however. The problem is not the number itself, but the absence of context behind it: what data type was tested, on whose dataset, and using what methodology. A figure can be both technically accurate and deeply misleading if it was measured on clean, generic text rather than the clinical notes, call transcripts, or financial documents your organization actually processes. When those same tools are tested on real-world enterprise data—clinical transcripts, financial documents, contact center recordings, legal filings—independent benchmarks and production experience tell a very different story.

The 70 percent figure used throughout this article comes from Limina's own benchmark evaluation of four major cloud PII detection platforms (AWS Comprehend, Azure Cognitive Services, Google DLP, and Microsoft Presidio), tested across 45,000 words of real-world enterprise data including call transcripts, medical notes, chat logs, and emails. In that evaluation, aggregate recall for general-purpose tools ranged from 57 to 73 percent, with AWS Comprehend achieving 73 percent overall and 70 percent specifically on call transcript data. Full methodology and results are available in the Limina PII Detection Whitepaper. Because redaction can only remove what detection first identifies, detection recall and redaction recall are equivalent metrics for compliance purposes—if a tool misses PII at the detection stage, that PII will remain in your data after redaction.

For compliance-critical environments—healthcare, financial services, legal—a 70 percent recall rate is not a minor shortcoming. It's a systematic compliance failure. This article explains what drives the accuracy gap, how to measure it yourself, what the industry benchmarks actually show, and what a genuine commitment to high-accuracy redaction looks like.

How PII redaction accuracy is measured

Accuracy in PII detection is evaluated using standard information retrieval metrics. Understanding these metrics is essential for interpreting vendor claims and running your own evaluation.

| Metric | What it measures | Why it matters for redaction |
| --- | --- | --- |
| Recall (sensitivity) | Percentage of actual PII instances the tool correctly detected | The most critical metric—missed PII is compliance exposure. Low recall = high risk. |
| Precision | Percentage of detected items that are actually PII (not false positives) | Matters for data utility—over-redaction removes content that wasn't PII, degrading downstream use. |
| F1 score | Harmonic mean of recall and precision—balanced performance measure | Useful for comparing tools across both dimensions, but recall should be weighted more heavily in compliance contexts. |
| Entity-level accuracy | Accuracy broken down by PII type (names, SSNs, dates, etc.) | Aggregate accuracy can hide weak performance on specific entity types. A tool with 95% overall may have 60% recall on medical record numbers. |
| Dataset-specific accuracy | Performance measured on your actual data type | Accuracy on generic benchmarks does not predict performance on your specific domain. Always test on representative samples. |

The single most important metric for compliance is recall. You can tolerate some over-redaction (false positives reduce data utility but don't create compliance exposure). You cannot tolerate under-redaction—false negatives leave PII in the dataset and expose individuals and your organization to harm. See our detailed guide for a deeper comparison of manual vs. automated PII redaction approaches and how each affects recall.
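To make the metrics concrete, here is a small scoring sketch. It assumes PII annotations are represented as character spans and scores exact-span matches only; the example spans are hypothetical, and a production evaluation would also need partial-match handling.

```python
# A minimal sketch of recall, precision, and F1 for span-level PII detection.
# Ground truth and predictions are invented example spans, not real data.
def score(ground_truth: set[tuple[int, int]], predicted: set[tuple[int, int]]):
    """Recall, precision, and F1 for exact-span PII matches."""
    true_positives = len(ground_truth & predicted)
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return recall, precision, f1

truth = {(0, 8), (21, 32), (40, 51)}   # three annotated PII spans
found = {(0, 8), (21, 32), (60, 65)}   # tool caught two of them plus one false positive
print(score(truth, found))             # -> (0.667, 0.667, 0.667)
```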

Why general-purpose tools achieve only 60–70 percent recall

The accuracy gap between general-purpose PII tools and purpose-built platforms is well-documented. Independent evaluations of cloud PII detection APIs on real-world clinical and enterprise data consistently find recall rates in the 60 to 70 percent range. Here's why.

Training data mismatch

General-purpose NLP models—including the PII detection APIs offered by major cloud providers—are trained primarily on mainstream text: news articles, Wikipedia, web pages, and lightly curated public datasets. Real-world enterprise PII appears in very different contexts: clinical notes written in abbreviated medical shorthand, call transcripts full of ASR errors, financial documents with industry-specific terminology, and multilingual customer communications.

A model that has never seen a medical record number, a pharmacy NPI, a SWIFT code, or a clinical diagnosis expressed in ICD-10 notation will not reliably detect them—not because the model is poorly designed, but because it was never trained on the data it's now being asked to process.

Entity type coverage gaps

Most general-purpose PII APIs detect a limited set of entity types—typically 10 to 20: names, email addresses, phone numbers, SSNs, credit card numbers, and a handful of others. Domain-specific PII extends far beyond this list. In healthcare alone, Protected Health Information (PHI) includes medical record numbers, health plan beneficiary numbers, device serial numbers, certificate and license numbers, and provider NPI numbers—none of which appear in standard PII API entity lists.

When an entity type isn't in the model's detection scope, it will never be caught. A tool that detects 18 entity types will have structural coverage gaps in any domain where PII routinely appears in additional forms.
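To illustrate what "detection scope" means in practice, the sketch below uses Microsoft Presidio, one of the open-source engines in the benchmark above. Presidio will not flag an entity type that no registered recognizer covers, but it does let you register custom pattern recognizers; the MRN format shown is a hypothetical example, and the snippet assumes presidio-analyzer and its default spaCy model are installed.

```python
# A minimal sketch: out of the box, Presidio has no notion of a medical record
# number, so we register a custom pattern recognizer for a hypothetical
# "MRN-########" format. Real MRN formats vary by institution.
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

mrn_recognizer = PatternRecognizer(
    supported_entity="MEDICAL_RECORD_NUMBER",
    patterns=[Pattern(name="mrn", regex=r"\bMRN[- ]?\d{8}\b", score=0.6)],
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(mrn_recognizer)

text = "Patient MRN-00482913 seen by Dr. Alvarez on 03/14/2026."
for result in analyzer.analyze(text=text, language="en"):
    print(result.entity_type, text[result.start:result.end], result.score)
```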

Context-dependent PII is missed by rule-based detection

Pattern matching—regex-based detection—is effective for PII that follows a predictable format: 9-digit SSNs, 16-digit card numbers, email addresses. It fails completely on context-dependent PII: a person's name embedded in a sentence, a date of birth expressed conversationally, a patient's home city mentioned in a clinical note. Detecting contextual PII requires understanding language—NLP models trained for Named Entity Recognition (NER)—not just matching patterns.

General-purpose tools that rely heavily on pattern matching for efficiency will have systematically low recall on contextual and conversational PII, which is precisely the PII that appears most frequently in unstructured enterprise data.
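A short illustration of the gap, using an invented transcript snippet: a format-based rule finds the SSN immediately, but nothing in a pattern-only pipeline will flag the name or the conversationally expressed birth date.

```python
# A minimal sketch of why pattern matching alone under-detects. The regex
# fires on the formatted identifier; the contextual PII goes undetected.
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

transcript = (
    "Sure, my social is 219-09-9999. My daughter Maria set up the account "
    "for me right after my sixty-fifth birthday last March."
)

print(SSN_PATTERN.findall(transcript))   # ['219-09-9999'] -- the formatted PII
# 'Maria', the family relationship, and the approximate birth date remain
# undetected; catching them requires a model that understands context,
# not just character patterns.
```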

Poor performance on domain-specific text formats

Clinical notes, financial filings, call transcripts, and legal documents each have distinctive text structures that differ significantly from the clean prose in generic training data. A clinical note might read: "Pt. J.D., DOB 3/14/72, adm. 4/19/26 w/ Dx: T2DM, HTN. Rx: Metformin 500mg." A general-purpose model trained on news text has never encountered this structure and will miss multiple PHI elements that a healthcare-specialized model would catch immediately.
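You can see the mismatch for yourself with a general-purpose NER model. The sketch below assumes spaCy's small English model (en_core_web_sm) is installed; compare whatever entities it surfaces on the shorthand note against a manual annotation and count what a redaction pipeline built on it would leave behind.

```python
# A minimal sketch: run a news-trained, general-purpose NER model over
# clinical shorthand and inspect what it actually labels.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
note = "Pt. J.D., DOB 3/14/72, adm. 4/19/26 w/ Dx: T2DM, HTN. Rx: Metformin 500mg."

for ent in nlp(note).ents:
    print(ent.text, ent.label_)
# The patient's initials and both dates are identifiers that must be removed;
# any of them missing from this output is PII a redaction pipeline built on
# this model would leave in place.
```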

The real cost of missed PII

At 70 percent recall, the PII left in your datasets is not random noise—it's systematically the PII that's hardest to detect. Names embedded in context. Domain-specific identifiers. Records in minority languages. Entities that appear once, off-pattern.

Compliance exposure scenario: A healthcare organization uses a general-purpose PII API to de-identify 100,000 clinical notes before using them for AI model training. At 70 percent recall, and assuming even a single PII instance per note, approximately 30,000 PII instances remain in the dataset. The training data is distributed to a vendor's cloud environment for model development. Each unredacted instance is a potential HIPAA violation. The investigation cost alone for a HIPAA breach of this scale—independent of penalties—can reach millions of dollars.

The compliance exposure is only part of the cost. Missed PII in AI training data creates models that learn to associate certain individuals with medical conditions, financial situations, or behavioral patterns—re-identification risks embedded in the model itself. The downstream risk isn't just the training data; it's every inference the model makes.

For data sharing and research use cases, missed PII creates re-identification risk in published datasets. Research published with insufficiently de-identified data has been repeatedly shown to be re-identifiable through combination attacks—where quasi-identifiers like age, ZIP code, and gender can uniquely identify individuals even when direct identifiers are removed.
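A toy example of a combination attack, using invented records: even with every direct identifier removed, unique combinations of quasi-identifiers can pin a row to a single person.

```python
# A minimal sketch of re-identification via quasi-identifiers. No names,
# SSNs, or other direct identifiers remain, yet some rows are still unique.
from collections import Counter

records = [
    {"age": 34, "zip": "94110", "gender": "F", "dx": "T2DM"},
    {"age": 34, "zip": "94110", "gender": "M", "dx": "HTN"},
    {"age": 67, "zip": "02139", "gender": "F", "dx": "CHF"},
    {"age": 67, "zip": "02139", "gender": "F", "dx": "COPD"},
]

combos = Counter((r["age"], r["zip"], r["gender"]) for r in records)
unique = [r for r in records if combos[(r["age"], r["zip"], r["gender"])] == 1]
print(f"{len(unique)} of {len(records)} records are unique on (age, zip, gender)")
# Anyone who already knows a target's age, ZIP, and gender can link those
# unique rows back to an individual -- and to their diagnosis.
```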

What does 99.5 percent accuracy actually mean?

Limina's 99.5 percent accuracy claim on physician conversations is a specific, domain-tested benchmark—not a marketing figure derived from a favorable dataset. Independent PII detection benchmark testing consistently demonstrates the performance gap between purpose-built models and general-purpose APIs. Here's what the numbers mean in practice.

99.5 percent recall on 1,000 PII instances means five instances are missed. At 70 percent recall, 300 instances are missed. The difference between these two numbers is the difference between a compliant de-identification pipeline and a systematic exposure risk.

The 99.5 percent figure reflects Limina's model performance on physician conversation data—one of the most challenging PII detection environments in enterprise use. Physician conversations are fast, heavily abbreviated, use domain-specific terminology, and often contain quasi-identifiers expressed conversationally. A tool that achieves 99.5 percent accuracy on this data type will perform at least as well on most other enterprise data types.

| Data volume (assuming ~1 PII instance per document) | PII instances missed at 70% recall | PII instances missed at 99.5% recall | Difference |
| --- | --- | --- | --- |
| 10,000 documents | 3,000 PII exposed | 50 PII exposed | 2,950 fewer exposures |
| 100,000 documents | 30,000 PII exposed | 500 PII exposed | 29,500 fewer exposures |
| 1,000,000 documents | 300,000 PII exposed | 5,000 PII exposed | 295,000 fewer exposures |
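The table's arithmetic is simple to reproduce for your own volumes. The sketch below assumes roughly one PII instance per document; substitute the density you measure in your own data.

```python
# A minimal sketch of the exposure arithmetic behind the table above.
def exposed_instances(documents: int, recall: float, pii_per_doc: float = 1.0) -> int:
    """Estimate how many PII instances a tool with the given recall leaves unredacted."""
    total_pii = documents * pii_per_doc
    return round(total_pii * (1 - recall))

for docs in (10_000, 100_000, 1_000_000):
    print(docs, exposed_instances(docs, 0.70), exposed_instances(docs, 0.995))
```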

How to evaluate PII redaction accuracy for your use case

Vendor accuracy claims are a starting point, not a conclusion. Before committing to a redaction tool, run your own accuracy evaluation on a representative sample of your actual data. Here's how.

  • Build a test dataset. Select 100 to 500 documents representative of your production data: the same document types, the same domain vocabulary, the same mix of structured and unstructured content, the same languages.
  • Annotate ground truth. Have a subject matter expert manually identify and label every PII instance in the test dataset. This annotation is your ground truth. It's time-consuming but essential.
  • Run candidate tools. Process the same annotated test dataset through each tool you're evaluating. Record the output: what was detected, what type, and where.
  • Calculate recall by entity type. For each PII category (names, dates, SSNs, medical identifiers, etc.), calculate the percentage that was correctly detected. Aggregate accuracy hides entity-level weaknesses.
  • Calculate precision. Review detections for false positives: content flagged as PII that isn't. High false positive rates degrade data utility and create excessive manual review workload.
  • Test edge cases. Include documents with rare or unusual PII presentations: names in non-Latin scripts, dates in multiple formats, industry-specific identifiers, low-context quasi-identifiers.
  • Compare total exposure. Multiply your estimated PII density (instances per document) by your expected document volume, then apply each tool's recall rate. The absolute number of missed PII instances tells a clearer story than a percentage; a minimal scoring sketch follows this list.
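Here is that sketch, covering steps 4 and 7 above. It assumes a simple tuple-based annotation format (document ID, entity type, character span); the example records are hypothetical, and a real evaluation would use fuzzier span matching.

```python
# A minimal sketch: entity-level recall from an annotated ground truth,
# plus a projection of absolute exposure at the measured aggregate recall.
from collections import defaultdict

ground_truth = [  # (doc_id, entity_type, start, end) from manual annotation
    ("doc1", "NAME", 10, 18), ("doc1", "SSN", 40, 51),
    ("doc2", "NAME", 5, 12), ("doc2", "MRN", 30, 41),
]
detected = {("doc1", "NAME", 10, 18), ("doc1", "SSN", 40, 51), ("doc2", "NAME", 5, 12)}

totals, hits = defaultdict(int), defaultdict(int)
for record in ground_truth:
    entity_type = record[1]
    totals[entity_type] += 1
    if record in detected:
        hits[entity_type] += 1

for entity_type in totals:
    print(f"{entity_type}: recall {hits[entity_type] / totals[entity_type]:.0%}")  # MRN comes out at 0%

# Step 7: project absolute exposure at the measured aggregate recall.
aggregate_recall = sum(hits.values()) / len(ground_truth)
expected_volume, pii_per_doc = 100_000, len(ground_truth) / 2  # 2 annotated docs in this toy set
print(round(expected_volume * pii_per_doc * (1 - aggregate_recall)), "instances likely exposed")
```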

Accuracy benchmarks by data type

PII detection accuracy varies significantly by data type. The table below provides context when evaluating tools and interpreting vendor claims. The research evidence on how de-identification affects AI model accuracy across different data types is, by contrast, consistently reassuring.

| Data type | Typical accuracy of general tools | Challenges | Limina performance |
| --- | --- | --- | --- |
| Typed, structured documents (forms, templates) | 80–90% | Labeled fields usually detected; free-text fields missed | 99%+ |
| Unstructured email and documents | 70–80% | Contextual PII, embedded names, quasi-identifiers | 99%+ |
| Clinical notes and medical records | 60–75% | Medical abbreviations, domain-specific identifiers, PHI in context | 99.5%+ |
| Call transcripts and ASR output | 55–70% | ASR errors, spoken digit patterns, phonetic name variants | 99.5%+ |
| Multilingual documents | 50–70% | Model performance degrades for non-English languages | Supports 52 languages |
| Scanned PDFs (after OCR) | 60–75% | OCR errors compound detection difficulty | 99%+ with integrated OCR pipeline |

The accuracy gap is a compliance gap

Limina's platform achieves 99.5 percent or higher accuracy on real-world healthcare and enterprise data—not because of a single technique, but because of models trained specifically on the data types, languages, and PII patterns found in production environments. Coupled with in-VPC deployment and 50+ entity type coverage across 52 languages, it's the platform built for compliance-critical accuracy requirements.

Get a demo at getlimina.ai/en/contact-us


Frequently Asked Questions

What is a good recall rate for PII redaction?

For compliance-critical use cases—healthcare, financial services, legal—a recall rate below 95 percent is generally insufficient. HIPAA's de-identification standard requires that re-identification risk be 'very small.' At 70 percent recall, hundreds of PHI instances per thousand may remain in a dataset, making 'very small' re-identification risk impossible to claim. For enterprise production use, a recall rate of 99 percent or higher on your specific data type is the right target. Limina achieves 99.5 percent on physician conversations.

What's the difference between recall and accuracy in PII detection?

'Accuracy' is a general term that can refer to different metrics depending on how it's used. In PII detection, recall (also called sensitivity) measures the percentage of all actual PII instances that the system correctly detected. Precision measures the percentage of detected items that are genuinely PII. F1 is the harmonic mean of both. For compliance purposes, recall is the primary metric—missed PII creates regulatory exposure regardless of how high precision is.

How do vendor accuracy benchmarks compare to real-world performance?

Vendor benchmarks are typically measured on curated, clean benchmark datasets that are more structured and easier to process than real production data. Performance on your actual data—with its domain-specific terminology, mixed formats, and edge cases—is almost always lower than the benchmark figure. This is why independent testing on a representative sample of your own data is essential before committing to a tool.

Does high accuracy mean high false positives?

Not necessarily—and this is a common misconception. High recall (catching more PII) does not inherently mean high false positive rates (flagging non-PII as PII). A well-designed model achieves high recall while maintaining manageable precision through better contextual understanding. The goal is high recall and high precision simultaneously. Tools that achieve high recall only by over-redacting everything are not genuinely high-accuracy—they're shifting the problem from compliance risk to data utility loss.

What types of PII are most commonly missed by general-purpose tools?

The most consistently missed PII categories include: domain-specific identifiers (medical record numbers, health plan IDs, NPI numbers), contextual names (names embedded in narrative text without surrounding context), quasi-identifiers (ZIP code, birth date, and gender in combination), dates expressed conversationally, foreign-language PII, and ASR-transcribed PII with phonetic errors. These are precisely the categories that require domain-trained models rather than general-purpose pattern matching.