September 22, 2025

How to Properly Benchmark PII Detection Solutions: A Research-Based Methodology

Choosing the wrong PII detection vendor can expose your organization to serious compliance and security risks. This post outlines Limina's research-based benchmarking methodology, covering datasets, precision and recall metrics, and real-world testing, so you can evaluate solutions with confidence.

Patricia Thaine

Founder, Chairwoman, Thought Leader

Most organizations that evaluate PII detection vendors do so without a rigorous, standardized method for comparing accuracy. They rely on vendor-supplied demos, sales benchmarks, or limited internal testing that rarely reflects the complexity of real-world data environments. The result is poor vendor selection that can expose organizations to significant compliance failures, data breaches, and regulatory penalties.

At Limina, we have seen this problem firsthand. It motivated us to develop a transparent, research-grade benchmarking methodology that organizations can use to evaluate PII detection solutions on a fair and consistent basis. We introduced the problem in depth in The Hidden PII Detection Crisis: Why Traditional Methods Are Failing Your Business, but the question that follows is: how do you prove it scientifically? This article outlines exactly how we do it.

Whether you are evaluating vendors for a healthcare compliance program, a financial services data governance initiative, or any other context where sensitive data is in play, this methodology gives you the tools to make an informed decision.

What makes PII detection benchmarking so difficult?

The challenge with benchmarking PII detection is not just technical; it is structural. Different vendors support different entity types. Some services detect a broad range of identifiers such as names, addresses, social security numbers, dates of birth, and account numbers. Others cover a narrower set. When you compare two systems that do not share identical entity type coverage, a naive comparison will favor the service that covers fewer categories, simply because it has fewer opportunities to make mistakes.

There is also the problem of how entity types are defined. One service might label only the numeric value in the phrase "61 years old" as an AGE entity, while another captures the full phrase including the word "years." These definitional differences are design choices rather than errors, but they produce results that cannot be fairly compared without careful normalization.

These are the exact problems Limina's benchmarking methodology is designed to address.

What datasets does Limina use for benchmarking?

The foundation of any meaningful benchmark is the quality and breadth of the test data. Limina's whitepaper details comparisons performed on a test dataset of approximately 45,000 words of English text that were manually annotated and verified at least three times. Manual annotation with multiple rounds of verification is not a shortcut; it is the standard required to produce ground-truth labels reliable enough to serve as a benchmark baseline.

Beyond the general dataset, the comparison was also performed on targeted subsets of domain-specific data. These domain-specific datasets were chosen because they represent some of the most challenging real-world scenarios for PII detection. Clinical notes, financial records, and customer service transcripts each contain PII in forms that general-purpose models frequently miss: informal abbreviations, partial identifiers, context-dependent references, and domain-specific terminology that looks like ordinary language to an untrained system.

This matters enormously for organizations in regulated industries. A pharma and life sciences company processing patient-reported outcomes data faces a fundamentally different detection challenge than a general enterprise running HR documents through a redaction pipeline. Benchmarking against domain-specific data surfaces the gaps that matter most for your use case, and it is why Limina makes its evaluation toolkit available to customers and serious prospects upon request.

How does Limina's method of comparison work?

The core principle of Limina's comparison methodology is apples-to-apples measurement. Rather than comparing every entity type across every service regardless of what each service supports, Limina evaluates only the entity types common to both Limina and the competing service.

For example, when comparing Limina against AWS Comprehend, the comparison only considers entity types that both services support. If an entity type is supported by only one service, it is omitted from the comparison metrics entirely. AWS Comprehend has no ACCOUNT_NUMBER entity type, so that class is excluded from any comparison between the two systems. This prevents the comparison from unfairly penalizing a more capable service for attempting to detect entity types the competing service does not even try to identify.
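
The restriction to shared entity types can be sketched in a few lines of Python. This is a minimal illustration, not Limina's actual tooling: the label sets, the function names, and the choice to mask unshared types as the non-PII tag "O" are all assumptions made for the example.

```python
# Hypothetical label sets for two services; real services expose many more types.
SERVICE_A_TYPES = {"NAME", "AGE", "DATE_TIME", "SSN", "ACCOUNT_NUMBER"}
SERVICE_B_TYPES = {"NAME", "AGE", "DATE_TIME", "SSN"}


def common_entity_types(types_a, types_b):
    """Return the entity types that both services support."""
    return set(types_a) & set(types_b)


def restrict_labels(labels, allowed, outside="O"):
    """Replace any label outside the shared set with the non-PII tag.

    Masking is one possible way to omit a type from the metrics (an assumption).
    """
    return [label if label in allowed else outside for label in labels]


shared = common_entity_types(SERVICE_A_TYPES, SERVICE_B_TYPES)

# One label per word; ACCOUNT_NUMBER is supported by service A only.
gold = ["NAME", "O", "ACCOUNT_NUMBER", "O", "SSN"]
pred = ["NAME", "O", "ACCOUNT_NUMBER", "O", "O"]

gold_shared = restrict_labels(gold, shared)
pred_shared = restrict_labels(pred, shared)
```

After masking, the ACCOUNT_NUMBER word contributes nothing to either service's score, while the missed SSN still counts as a false negative.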

How does entity type mapping work in the benchmark?

Entity type mapping addresses a related problem: what happens when one service uses a single broad category where another uses several more granular ones?

Limina handles this by mapping the more fine-grained entity types of one service to the corresponding single entity type of the other. For instance, AWS Comprehend uses a single DATE_TIME entity type. Limina's system, by contrast, supports separate entity types for DATE, DATE_INTERVAL, DOB (date of birth), and TIME. When comparing the two, all predictions across these four Limina categories are mapped to the unified DATE_TIME category in the comparison metrics. This normalization ensures that Limina's finer-grained approach to entity classification does not introduce artificial differences in the performance measurement.
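
The mapping step might look like the following sketch. The four fine-grained type names and the DATE_TIME target come from the example above; the dictionary and function names are hypothetical.

```python
# Mapping from the example above: four fine-grained types collapse into
# the single DATE_TIME type used by the other service.
FINE_TO_COARSE = {
    "DATE": "DATE_TIME",
    "DATE_INTERVAL": "DATE_TIME",
    "DOB": "DATE_TIME",
    "TIME": "DATE_TIME",
}


def normalize_labels(labels, mapping=FINE_TO_COARSE):
    """Map fine-grained labels to their coarse equivalent; leave others unchanged."""
    return [mapping.get(label, label) for label in labels]


predicted = ["NAME", "O", "DOB", "O", "DATE", "TIME"]
normalized = normalize_labels(predicted)
```

Applying the same normalization to both the gold labels and the predictions before scoring keeps the comparison symmetric.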

It is also worth noting that definitional differences between services can affect results in ways that cannot be fully resolved through mapping. In cases where two services define the boundaries of an entity differently, the comparison results should be interpreted with appropriate nuance. However, this caveat does not apply to more precisely defined or heavily formatted entity types such as credit card numbers, bank account numbers, and social security numbers, where the boundaries are unambiguous. To further reduce the impact of boundary discrepancies, Limina computes all metrics at the word level rather than the character or span level.
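
To illustrate what word-level scoring involves, the sketch below projects character-offset spans onto words. Whitespace tokenization, the (start, end, label) span format, and the function name are assumptions for this example; the source does not specify how words are delimited.

```python
import re


def word_labels(text, spans, outside="O"):
    """Project character-offset spans onto whitespace-delimited words.

    spans: list of (start, end, label) tuples with an exclusive end offset.
    A word takes the label of the first span that overlaps it.
    """
    labels = []
    for match in re.finditer(r"\S+", text):
        label = outside
        for start, end, span_label in spans:
            if match.start() < end and start < match.end():
                label = span_label
                break
        labels.append(label)
    return labels


text = "Patient is 61 years old."

# One service tags only the number; another tags the full phrase.
narrow = word_labels(text, [(11, 13, "AGE")])
wide = word_labels(text, [(11, 23, "AGE")])

# A span that also swallows the trailing period yields the same word labels.
wide_with_period = word_labels(text, [(11, 24, "AGE")])
```

Small boundary differences, such as whether trailing punctuation is included, disappear at the word level. Larger definitional differences, such as tagging "61" versus "61 years old", remain visible and still call for careful interpretation.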

What evaluation metrics does Limina use?

Choosing the right metrics is as important as choosing the right dataset. Limina uses four primary evaluation metrics for PII detection benchmarking, each of which captures a different aspect of system performance.

Precision measures the fraction of all entities of a given type predicted by the model that are actually that entity type, as confirmed by the labeled test set. For a NAME entity type, precision answers the question: of all the things the model called a name, how many actually were names? High precision means the model is not generating excessive false positives.

Recall measures the fraction of all real entities of a given type that were correctly identified by the model. For a NAME entity type, recall answers the question: of all the names that actually appear in the text, how many did the model find? High recall means the model is not missing real PII.

F1 is the harmonic mean of precision and recall. It provides a single score that balances both dimensions, penalizing systems that perform well on one at the expense of the other.

Support refers to the number of instances of a given entity type that appear in the test set. Support matters for interpretation: a perfect recall score on an entity type with only three instances in the dataset means something very different from perfect recall on an entity type with 3,000 instances.
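
The four metrics can be computed from word-level labels as in the sketch below. It assumes one label per word and counts support in words; the function name and the toy data are illustrative only.

```python
def per_type_metrics(gold, pred, entity_type):
    """Word-level precision, recall, F1, and support for one entity type."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have one label per word")
    pairs = list(zip(gold, pred))
    tp = sum(g == entity_type and p == entity_type for g, p in pairs)
    fp = sum(g != entity_type and p == entity_type for g, p in pairs)
    fn = sum(g == entity_type and p != entity_type for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Support: how many gold words carry this entity type.
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


gold = ["NAME", "NAME", "O", "O", "NAME", "O"]
pred = ["NAME", "O", "NAME", "O", "NAME", "O"]
scores = per_type_metrics(gold, pred, "NAME")
```

With a support of only three words, as in this toy example, a single error moves recall by a third, which is why support should be read alongside every score.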

What do "PII missed" metrics measure?

Limina also tracks two additional metrics that capture the real-world impact of missed detections.

PII missed as a percentage of all words measures the number of PII words incorrectly predicted as non-PII, divided by the total number of words in the test set. This metric is useful for understanding what proportion of any given document might be leaking sensitive information, but it can obscure the model's true error rate in documents with low PII density.

PII missed as a percentage of PII entities measures the number of PII words incorrectly predicted as non-PII, divided by the total number of PII words in the test set. This metric is more representative of the actual error rate within the PII content itself. It tells you how often the model is failing on the entities it is specifically designed to detect.
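
The difference between the two denominators is easiest to see with numbers. The sketch below computes both ratios over word-level labels, in line with the word-level evaluation described earlier; the function name and the 100-word example are hypothetical.

```python
def pii_missed(gold, pred, outside="O"):
    """Return (missed / all words, missed / PII words) for word-level labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have one label per word")
    missed = sum(g != outside and p == outside for g, p in zip(gold, pred))
    total_words = len(gold)
    pii_words = sum(g != outside for g in gold)
    of_all = missed / total_words if total_words else 0.0
    of_pii = missed / pii_words if pii_words else 0.0
    return of_all, of_pii


# Hypothetical 100-word document with 5 PII words, one of which is missed.
gold = ["NAME"] * 5 + ["O"] * 95
pred = ["NAME"] * 4 + ["O"] * 96
of_all, of_pii = pii_missed(gold, pred)
```

Here the same single miss is 1% of all words but 20% of the PII words, which shows how low PII density can make the first figure look reassuring.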

Why does recall matter more than precision for PII detection?

For most machine learning tasks, precision and recall are treated as roughly equivalent concerns and F1 is used as a neutral compromise. PII detection is different.

At Limina, recall is the critical metric. The reason is asymmetric risk. When a PII detection system produces a false positive, it might redact a word that was not sensitive. That is a minor inconvenience: the document is slightly over-redacted, but no sensitive information escapes. When a PII detection system produces a false negative, it misses a real piece of PII. That missed entity can cause a data breach, enable identity theft, or trigger serious legal consequences under regulations like HIPAA, GDPR, or CCPA.

This asymmetry defines how Limina builds and evaluates its data de-identification solution. Minimizing false negatives is not merely a product choice; it is a compliance imperative. Organizations that choose a PII detection vendor based on overall accuracy or F1 scores without examining recall specifically are accepting unknown levels of missed PII exposure.
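
A short worked example shows why F1 alone can hide this asymmetry. The two systems and all of their scores below are invented for illustration and do not describe any real vendor.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def missed_pii_words(recall, pii_words):
    """Expected number of PII words left undetected at a given recall."""
    return round((1 - recall) * pii_words)


# Hypothetical systems with mirrored precision and recall.
system_a = {"precision": 0.99, "recall": 0.81}
system_b = {"precision": 0.81, "recall": 0.99}

f1_a = f1_score(**system_a)
f1_b = f1_score(**system_b)

# On a corpus containing 1,000 PII words:
missed_a = missed_pii_words(system_a["recall"], 1000)
missed_b = missed_pii_words(system_b["recall"], 1000)
```

Both systems have the same F1 of about 0.89, yet system A leaves 190 PII words exposed while system B leaves 10. Ranking vendors by F1 would treat them as equivalent.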

If your organization handles protected health information, financial records, or any other category of personally identifiable data, this distinction should be at the center of your vendor evaluation process. Reach out to Limina's team to discuss how recall-focused PII detection can reduce your organization's compliance risk.

How should you apply this methodology to vendor selection?

The methodology described here is not merely academic. Limina uses it continuously as part of its own product development cycle. The full test dataset used to measure performance before each new release is larger and more varied than the published benchmark set, includes significant amounts of multilingual data, and is continually expanded and rotated. The rotation guards against overfitting: a system that is tuned to perform well on a static benchmark may fail in production, while one evaluated against a diverse and evolving dataset demonstrates genuine generalization.

When evaluating any PII detection vendor, the questions you should be asking are grounded in this framework. What datasets were used to generate their accuracy claims? Were those datasets manually annotated and independently verified? Do their benchmark comparisons restrict analysis to entity types common to both systems? What is their recall on domain-specific data relevant to your industry? How do they define entity type boundaries, and how does that affect cross-service comparisons?

For organizations in contact centers and insurance, where high call and document volumes create enormous surface areas for PII exposure, these questions are not optional due diligence. They are essential to understanding whether the system you deploy will actually protect your customers and your organization.

Limina's evaluation toolkit and the datasets annotated for this benchmarking work are available to customers and serious prospects upon request. If you want to see how Limina performs on data that looks like yours, not on a vendor-curated demo, contact us to get access.

Frequently Asked Questions

What is the most important metric for evaluating PII detection accuracy?

Recall is the most important metric for PII detection. It measures how many real instances of sensitive data the system successfully identifies. A high recall means fewer missed entities, which directly reduces the risk of data leakage, compliance violations, and breach incidents. While precision and F1 scores are useful for general model evaluation, the asymmetric consequences of missing PII make recall the primary metric for any organization handling regulated or sensitive data.

How does Limina ensure fair comparisons between PII detection services?

Limina's benchmarking methodology restricts comparisons to entity types supported by both services being evaluated. If a competing system does not support a given entity type, that type is excluded from the comparison metrics entirely. When one service uses broader entity categories than another, Limina maps fine-grained entity types to their equivalent broader categories before computing scores. Metrics are also calculated at the word level to reduce the impact of boundary definition differences between systems.

What datasets are used in Limina's PII detection benchmark?

Limina's published benchmark comparisons are based on approximately 45,000 words of English text data that were manually annotated and verified at least three times, as well as targeted subsets of domain-specific data. The full internal test dataset used for ongoing product evaluation is larger, more varied, and includes substantial multilingual content. The dataset rotates regularly to prevent overfitting and ensure results reflect real-world generalization rather than benchmark tuning.

Why is domain-specific data important for PII detection benchmarking?

PII appears differently across domains. Clinical notes, financial records, customer service transcripts, and legal documents each contain sensitive information in formats and contexts that differ substantially from general English text. A system that performs well on a general benchmark may miss critical identifiers in domain-specific content. Benchmarking against data that reflects your actual use case, whether that is healthcare, financial services, pharma, insurance, or contact centers, provides a more accurate picture of how a system will perform in production.

What is the difference between "PII missed as a percentage of all words" and "PII missed as a percentage of PII entities"?

PII missed as a percentage of all words measures how much sensitive content goes undetected relative to the total volume of text in the document. This metric reflects potential leakage risk from a document-level perspective. PII missed as a percentage of PII entities measures the same missed detections relative only to the total PII content in the document. This second metric is a more direct measure of the model's accuracy on the specific task it is designed to perform, since it is not diluted by the proportion of non-PII text in the dataset.
