February 14, 2025

EDPB's Pseudonymization Guidelines and the Challenge of Unstructured Data

The EDPB's Guidelines 01/2025 on Pseudonymisation offer detailed guidance on GDPR-compliant pseudonymization, but they sidestep one of the hardest practical challenges: reliably identifying and replacing personal identifiers in unstructured data. This article unpacks what the guidelines say, what they leave out, and how organizations can close the gap.

Patricia Graciano

The European Data Protection Board (EDPB) released its comprehensive Guidelines 01/2025 on Pseudonymisation as a detailed framework for organizations working to apply pseudonymization correctly under the General Data Protection Regulation (GDPR). The document covers not only what pseudonymization is, but also how it intersects with adjacent data protection principles including data minimization, purpose limitation, and privacy by design and by default.

The guidelines are a meaningful step forward for organizations that have been navigating GDPR compliance with limited prescriptive guidance on technical implementation. Pseudonymization, as defined under GDPR Article 4(5), refers to the processing of personal data in such a manner that the data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and subject to technical and organizational measures. What the EDPB guidelines make clear is that achieving this standard requires more than simply swapping out names and ID numbers. It demands a coordinated set of technical and organizational safeguards, and it begins with a step that the guidelines take for granted but that is far from trivial: accurately detecting and replacing every personal identifier in the data.

This article highlights one instructive example from the guidelines, surfaces a challenge the document underexplores, and explains how organizations can meet that challenge in practice, particularly when dealing with unstructured data formats.

How does pseudonymization work under GDPR?

Pseudonymization under GDPR is not a binary state. It is a risk-reduction technique that, when properly implemented, reduces the likelihood that data can be re-identified. It does not render data anonymous, which means it does not place data outside the scope of GDPR. Rather, it reduces the risks associated with processing and enables certain flexibilities under the regulation, such as supporting further processing for compatible purposes or qualifying as a safeguard in data protection impact assessments.

Achieving genuine pseudonymization requires controllers to define their pseudonymization domain: who should be prevented from linking the pseudonymized data back to an individual, and what threats exist from actors both inside and outside that domain. From there, appropriate technical and organizational measures must be deployed. These include securely managing any lookup tables that link pseudonyms to real identities, enforcing strict data flow controls, and applying cryptographic or tokenization techniques appropriate to the identified risks.
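As a rough illustration of the cryptographic techniques mentioned above, the following Python sketch derives pseudonyms with a keyed hash (HMAC-SHA256). It is not taken from the EDPB guidelines. The function name pseudonymize, the demo key, and the 16-character truncation are assumptions made for this example. The key plays the role of the "additional information" and would have to be stored separately from the pseudonymized data.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable pseudonym from an identifier using HMAC-SHA256.

    The same identifier and key always give the same pseudonym, so records
    stay linkable inside the domain. Without the key, the pseudonym cannot
    be recomputed from a guessed identifier.
    """
    # Normalize so that trivial formatting differences do not split one person
    # into several pseudonyms (an illustrative choice, not a requirement).
    normalized = identifier.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical demo key; a real key would come from a separately managed store.
demo_key = b"demo-key-held-separately"

p1 = pseudonymize("Jane Doe", demo_key)
p2 = pseudonymize("  jane doe ", demo_key)       # same person, different formatting
p3 = pseudonymize("Jane Doe", b"another-key")    # same person, different domain key
```

Here p1 and p2 are identical, while p3 differs because a different key is used. Using one key per pseudonymization domain is one way to keep datasets in separate domains from being linked through shared pseudonyms.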

The EDPB guidelines walk through all of this in commendable depth. What they assume, however, is that organizations have already solved the upstream problem: finding the personal identifiers in the data before the pseudonymization process can begin. In structured data environments, such as relational databases with clearly defined fields, this assumption is relatively safe. In unstructured data environments, it is not.

What does Example 3 of the EDPB guidelines illustrate?

Among the ten worked examples in the EDPB guidelines, Example 3 is particularly instructive for healthcare and life sciences organizations. It describes a dental implant register designed to collect patient data for purposes of quality monitoring, practitioner feedback, and continuity of care. The multi-step pseudonymization procedure it describes is genuinely well-designed.

In this scenario, dentists collect patient data including identifiers, implant details, and medical information, then transmit it to the Register tagged with temporary pseudonyms. A Trust Centre receives the data, replaces the temporary pseudonyms with permanent ones, and manages the lookup table that connects those pseudonyms to patient identities. The Register uses the pseudonymized data to analyze implant quality and generate aggregated feedback for practices, while patient identities remain protected throughout. When subsequent caregivers require access to a patient's implant history, they can request it through controlled channels governed by the Trust Centre.
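The role separation in this flow can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the procedure as specified by the EDPB: the class TrustCentre, the function submit_to_register, the "tmp-" and "perm-" prefixes, and the idea that the Trust Centre issues the temporary pseudonyms are all hypothetical choices made for illustration.

```python
import secrets


class TrustCentre:
    """Holds the only tables that link pseudonyms to patient identities."""

    def __init__(self):
        self._temporary_to_patient = {}  # temporary pseudonym -> patient ID
        self._patient_to_permanent = {}  # patient ID -> permanent pseudonym

    def issue_temporary(self, patient_id: str) -> str:
        """Create a one-off pseudonym that tags a single submission."""
        temporary = "tmp-" + secrets.token_hex(8)
        self._temporary_to_patient[temporary] = patient_id
        return temporary

    def to_permanent(self, temporary: str) -> str:
        """Swap a temporary pseudonym for the patient's permanent one."""
        # pop() makes each temporary pseudonym single-use.
        patient_id = self._temporary_to_patient.pop(temporary)
        if patient_id not in self._patient_to_permanent:
            self._patient_to_permanent[patient_id] = "perm-" + secrets.token_hex(8)
        return self._patient_to_permanent[patient_id]


def submit_to_register(trust_centre, register, temporary, clinical_data):
    """Store clinical data in the register under the permanent pseudonym only."""
    record = {"pseudonym": trust_centre.to_permanent(temporary), **clinical_data}
    register.append(record)
    return record


trust_centre = TrustCentre()
register = []  # stands in for the Register's data store

first = submit_to_register(
    trust_centre, register,
    trust_centre.issue_temporary("patient-001"),
    {"implant": "model A", "tooth": 36},
)
second = submit_to_register(
    trust_centre, register,
    trust_centre.issue_temporary("patient-001"),
    {"implant": "model A", "finding": "stable at follow-up"},
)
```

Both submissions end up under the same permanent pseudonym, so the register can follow an implant over time, while the patient ID never appears in the register's records.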

It is an elegantly orchestrated process, and the organizational architecture the EDPB describes, with its separation of roles between the Register and the Trust Centre, reflects a mature approach to managing re-identification risk. Organizations in healthcare, pharma, and life sciences will recognize this pattern as applicable to clinical data registries, real-world evidence programs, and post-market surveillance systems.

However, Example 3 quietly assumes that the personal identifiers in the dental data can be reliably located and replaced before any of this organizational machinery comes into play. That assumption deserves scrutiny.

Why is unstructured data a challenge for pseudonymization?

The practical difficulty the EDPB guidelines leave largely unaddressed is the nature of the data itself. A meaningful portion of the data flowing through clinical registers, patient intake systems, and medical records environments is unstructured. It does not arrive in neatly labeled database fields. It arrives in free-text clinical notes, scanned referral letters, handwritten treatment summaries, PDF attachments, and email correspondence between practitioners.

Unstructured data does not follow predefined formats, and personal identifiers within it can appear in unexpected contexts. A patient's name might appear mid-sentence in a dentist's narrative notes. A national health insurance number might be embedded in a scanned document image. A date of birth might be referenced incidentally in a description of a patient's medical history. In multilingual documents or those with non-standard formatting, the challenge compounds further.

This is not a minor operational detail. If an identifier is missed during the pseudonymization step, the entire downstream process, however well-designed, cannot compensate. A lookup table managed by a Trust Centre offers no protection against a re-identification risk that was never removed from the source data in the first place. Overlooking even a single identifier can weaken the pseudonymization process and increase the risk of non-compliance.

This is the gap the EDPB guidelines do not fill, and it is where organizations most commonly encounter difficulty in real-world implementation.
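To make the gap concrete, the sketch below runs a purely pattern-based detector over a made-up clinical note. The note, the two regular expressions, and the assumed insurance number format (one letter followed by nine digits) are invented for illustration and do not describe any particular tool.

```python
import re

# Illustrative patterns only; real identifier formats vary by country and system.
PATTERNS = {
    "INSURANCE_NUMBER": re.compile(r"\b[A-Z]\d{9}\b"),
    "DATE": re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"),
}


def detect_with_patterns(text):
    """Return (start, end, label) for every regex match, in reading order."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((match.start(), match.end(), label))
    return sorted(findings)


note = (
    "Maria Keller came back today, insurance no. A123456789, "
    "born 04.03.1981. Maria says the crown on 36 feels loose."
)

findings = detect_with_patterns(note)
found_labels = {label for _, _, label in findings}

# The patient's name has no fixed format, so no pattern covers it.
missed_name = all("Maria" not in note[start:end] for start, end, _ in findings)
```

The detector finds the number and the date because they have a predictable shape, but the name passes through untouched. A record pseudonymized on the basis of these findings alone would still identify the patient.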

If your organization is working through these challenges now, speak with a Limina expert to understand how automated de-identification can support your pseudonymization program.

How does Limina address personal identifier detection in unstructured data?

Limina's data de-identification technology is built specifically to handle this problem. Rather than relying on pattern matching alone, Limina's solution is built by linguists and trained to understand language in context, which means it can identify personal identifiers that rule-based tools frequently miss in messy, real-world data.

The technology operates across three core capabilities relevant to pseudonymization workflows.

The first is detection. Limina's machine learning models identify personal data across a wide range of entity types, including names, health insurance numbers, payment information, dates of birth, contact details, and other direct identifiers, even when they appear in free-text fields, scanned documents, handwritten notes, or multilingual content. The context-aware architecture means that a name embedded in a narrative sentence is treated differently from a name in a structured field, and both are reliably flagged.

The second capability is reporting. Before any replacement occurs, Limina can generate a detailed inventory of the personal data found within a dataset. This reporting function is not only operationally useful but also directly relevant to GDPR compliance. The EDPB guidelines make clear that controllers must assess the risks associated with their data before determining what technical safeguards are appropriate. A thorough PII report supports that assessment, and it can include indirect identifiers, such as medical history details or diagnostic information, that should be documented but not necessarily pseudonymized.
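What such an inventory might contain can be sketched generically. The code below does not use Limina's product or API. The entity dictionaries, the type names, and the split into direct and indirect identifiers are assumptions chosen for illustration.

```python
from collections import Counter

# Assumed classification for this example; a real program would set this
# according to its own risk assessment.
DIRECT_TYPES = {"NAME", "INSURANCE_NUMBER", "DATE_OF_BIRTH"}


def build_inventory(entities):
    """Summarize detected entities by type, count, and identifier category."""
    counts = Counter(entity["type"] for entity in entities)
    return [
        {
            "type": entity_type,
            "count": count,
            "category": "direct" if entity_type in DIRECT_TYPES else "indirect",
        }
        for entity_type, count in sorted(counts.items())
    ]


# Hypothetical output of a detection step.
detected = [
    {"type": "NAME", "text": "Maria Keller"},
    {"type": "NAME", "text": "Dr. Olsen"},
    {"type": "DATE_OF_BIRTH", "text": "04.03.1981"},
    {"type": "CONDITION", "text": "peri-implantitis"},
]

inventory = build_inventory(detected)
```

The resulting list has one row per entity type, which gives a reviewer a quick view of what the dataset contains and which types are candidates for replacement versus documentation only.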

The third capability is pseudonymization itself. Once identifiers are detected and the appropriate scope has been determined, Limina replaces or removes them with pseudonyms or placeholders. This constitutes the critical first step in a compliant pseudonymization workflow, the step that must be completed before any organizational measures, lookup tables, or access controls can function as intended.
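The replacement step itself can be sketched independently of any detection engine. The following example assumes that detection has already produced non-overlapping character spans; the helper names find_all and replace_spans and the placeholder format are hypothetical.

```python
import re


def find_all(text, surface, label):
    """Stand-in for a detector: locate every literal occurrence of a string."""
    return [(m.start(), m.end(), label) for m in re.finditer(re.escape(surface), text)]


def replace_spans(text, spans):
    """Replace (start, end, label) spans with numbered placeholders.

    The same surface string always gets the same placeholder, so references
    to one person remain linkable within the document.
    """
    counters = {}
    assigned = {}
    replacements = []
    for start, end, label in sorted(spans):
        key = (label, text[start:end])
        if key not in assigned:
            counters[label] = counters.get(label, 0) + 1
            assigned[key] = f"[{label}_{counters[label]}]"
        replacements.append((start, end, assigned[key]))
    # Apply from right to left so earlier offsets stay valid.
    for start, end, placeholder in reversed(replacements):
        text = text[:start] + placeholder + text[end:]
    return text


source = "Maria Keller called. Dr. Olsen saw Maria Keller on 04.03.2024."
spans = (
    find_all(source, "Maria Keller", "NAME")
    + find_all(source, "Olsen", "NAME")
    + find_all(source, "04.03.2024", "DATE")
)

redacted = replace_spans(source, spans)
# redacted == "[NAME_1] called. Dr. [NAME_2] saw [NAME_1] on [DATE_1]."
```

Consistent placeholders keep the text readable and analyzable. The mapping from placeholder to original value is exactly the kind of additional information that would then need to be held separately and protected.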

For organizations in financial services, contact centers, and insurance, the same challenge applies to different data types: customer correspondence, call transcripts, claims documents, and underwriting files all contain unstructured text where personal identifiers can be embedded in ways that structured-data tools are not designed to handle.

What other measures are required for GDPR-compliant pseudonymization?

Automated identifier detection addresses a critical bottleneck, but it is one component of a broader compliance program. The EDPB guidelines are explicit that pseudonymization is not a single technical action; it is a set of coordinated measures that must be tailored to the pseudonymization domain.

Controllers must begin by defining that domain clearly: determining who within and outside the organization could potentially re-identify individuals from the pseudonymized data, and assessing what risks they pose. This threat model informs every subsequent decision about which technical safeguards to deploy.

From there, the guidelines recommend a range of organizational and technical measures. Lookup tables linking pseudonyms to real identities must be managed securely, with access strictly limited and cryptographic protections applied. Data flow controls must prevent pseudonymized data from being recombined with additional information in ways that could enable re-identification. Role separation, as illustrated in the Trust Centre model from Example 3, is a proven structural approach to enforcing these controls at scale.
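One way to picture restricted access to a lookup table is a thin wrapper that checks the caller's role and records every re-identification attempt. This is a minimal in-memory sketch under assumptions: the class GuardedLookup, the role names, and the log fields are hypothetical, and a production system would add encryption at rest and proper authentication.

```python
from datetime import datetime, timezone


class GuardedLookup:
    """Lookup table that only authorized roles may use, with an audit trail."""

    def __init__(self, table, authorized_roles):
        self._table = dict(table)  # pseudonym -> real identity
        self._authorized_roles = frozenset(authorized_roles)
        self.audit_log = []

    def reidentify(self, pseudonym, role, reason):
        allowed = role in self._authorized_roles
        # Log every attempt, including refused ones.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "role": role,
            "pseudonym": pseudonym,
            "reason": reason,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"role {role!r} may not re-identify records")
        return self._table[pseudonym]


lookup = GuardedLookup(
    table={"perm-7f3a": "patient-001"},
    authorized_roles={"trust_centre_officer"},
)

identity = lookup.reidentify("perm-7f3a", "trust_centre_officer", "continuity of care")

try:
    lookup.reidentify("perm-7f3a", "register_analyst", "curiosity")
    refused = False
except PermissionError:
    refused = True
```

The analyst role is refused and the refusal is still logged, which supports both the access restriction and the accountability that role separation is meant to provide.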

The EDPB guidelines themselves are an excellent reference for the full scope of these requirements. What they do not resolve, and what Limina directly addresses, is the starting point: ensuring that the data has actually been stripped of its personal identifiers before any of these measures are applied.

Organizations that skip or underinvest in that first step will find that even the most sophisticated pseudonymization architecture cannot compensate for identifiers that were never removed.

To see how Limina can fit into your organization's data protection program, contact the Limina team to discuss your specific data environment and compliance requirements.

Frequently Asked Questions

What is pseudonymization under GDPR?

Pseudonymization under GDPR is the processing of personal data in such a way that it can no longer be attributed to a specific individual without the use of additional information. That additional information must be kept separately and protected by technical and organizational measures. Pseudonymized data is still considered personal data under GDPR, so the regulation continues to apply, but pseudonymization can reduce compliance risk and support certain data processing activities.

What is the difference between pseudonymization and anonymization?

Anonymization means that individuals can no longer be identified by any means reasonably likely to be used, which places the data outside the scope of GDPR. Pseudonymization replaces or removes direct identifiers but retains the possibility of re-identification using separately held additional information. Because re-identification remains possible with pseudonymized data, it does not enjoy the same regulatory treatment as truly anonymous data.

What are the EDPB Guidelines 01/2025 on Pseudonymisation?

The EDPB Guidelines 01/2025 on Pseudonymisation are a document published by the European Data Protection Board providing detailed, practical guidance on how organizations should implement pseudonymization under GDPR. The guidelines cover the relationship between pseudonymization and other data protection principles, worked examples across different sectors, and recommendations for the technical and organizational measures that should accompany a pseudonymization program.

Why is unstructured data a problem for pseudonymization?

Unstructured data, such as free-text clinical notes, emails, scanned documents, and call transcripts, does not organize personal identifiers into predictable, labeled fields. This makes automated detection more difficult. Identifiers can appear in unexpected places and in contextually embedded forms that simple pattern-matching tools miss. If an identifier is overlooked during the detection step, the downstream pseudonymization process cannot compensate, and the data remains at risk of re-identification.

How does Limina help with pseudonymization of unstructured data?

Limina's data de-identification technology uses context-aware machine learning models, built by linguists, to detect and replace personal identifiers in unstructured data. This includes free-text documents, scanned records, multilingual content, and other formats where identifiers can be difficult to locate reliably. Limina can also generate a PII inventory report prior to replacement, supporting the risk assessment that GDPR and the EDPB guidelines require before pseudonymization measures are finalized.

What sectors most commonly need to pseudonymize unstructured data?

Healthcare organizations, pharmaceutical and life sciences companies, financial services firms, insurance providers, and contact centers all handle significant volumes of unstructured data containing personal identifiers. Clinical notes, patient records, loan applications, claims documents, and call transcripts are common examples. Each of these sectors is subject to regulatory obligations that require robust data protection measures, making accurate pseudonymization of unstructured data a critical compliance requirement.
