December 20, 2024

How Limina Facilitates GDPR Compliance for AI Models: Insights from the EDPB's Latest Opinion

The EDPB's Opinion 28/2024 raises the stakes for GDPR compliance in AI development. Learn what the guidance means for your organization and how Limina helps you meet its requirements at every stage of the AI lifecycle.

Kathrin Gardhouse

The European Data Protection Board (EDPB) has issued some of the most consequential guidance yet on how organizations must approach GDPR compliance when developing and deploying AI models. Opinion 28/2024 addresses a set of questions that AI developers and downstream users have been grappling with since large language models entered the mainstream: when is an AI model truly anonymous, what constitutes a valid legal basis for processing personal data in training, and what does unlawfully processed training data mean for the organizations that rely on those models in production?

The stakes became impossible to ignore in December 2024, when Italy's data protection authority, the Garante, imposed a €15 million fine on OpenAI for failing to establish a valid legal basis for processing personal data as part of its generative AI model development. That decision is not an isolated event. It signals a period of active enforcement, and organizations that have not examined their AI pipelines through a GDPR lens need to do so now.

This article breaks down the EDPB's core findings, explains what they mean in practice, and shows how Limina's data de-identification platform helps organizations embed privacy into the AI lifecycle from the start.

What does EDPB Opinion 28/2024 actually say?

The Opinion is organized around three central questions that sit at the heart of GDPR compliance for AI. Understanding each of them is essential before thinking about solutions.

When can an AI model be considered anonymous?

The EDPB makes clear that claiming anonymity for an AI model is not straightforward. For a model to genuinely qualify as anonymous, it must satisfy two conditions simultaneously. First, personal data must not be extractable from the trained model, whether through standard queries or through adversarial attacks mounted with or without direct access to the model's weights. Second, no output generated by the model during inference should be capable of revealing identifiable information about the individuals whose data was used in training.

This matters because an anonymous model sits outside the GDPR's scope entirely, freeing it from data subject access requests, disclosure restrictions, and purpose limitations. But meeting the EDPB's threshold is technically demanding, and organizations should not assume that simply using a de-identification step before training is sufficient. Whether a model qualifies as anonymous is a determination that must be made on a case-by-case basis, ideally with expert input.

What legal basis applies to personal data processing in AI model development?

This is the question at the center of the OpenAI fine, and the EDPB's answer is both clarifying and sobering. Organizations that want to rely on legitimate interest as the legal basis for processing personal data in AI training must satisfy a three-step test: identifying the specific interest being pursued, demonstrating that processing personal data is necessary to achieve it, and showing that the interest is not overridden by the rights and freedoms of the affected data subjects.

The EDPB's guidance here is a double-edged sword. It provides a path forward for organizations that can satisfy the test rigorously, but it also places a substantial documentation and accountability burden on those that choose this route. For many organizations, particularly those without dedicated legal and privacy teams, that burden is significant.

The broader tension here is real. GDPR's core objective is protecting individuals from harm, but one of the regulation's stated purposes is also to support innovation within the EU. Requiring a granular legal basis test for every dataset used in AI training arguably sits in tension with the pace at which AI development moves. The EDPB's Opinion does not resolve this tension, but it makes the compliance requirements clear enough that organizations can no longer operate on ambiguity.

What does unlawful training data processing mean for downstream users?

This is perhaps the most practically urgent question for the many organizations that rely on commercially available large language models rather than training their own. The EDPB's answer is measured but not reassuring. A downstream controller, meaning any organization using an LLM in its own products or workflows, should conduct its own assessment of whether the underlying model was developed lawfully. A regulatory finding against the model developer, such as the Garante's decision regarding OpenAI, is one relevant factor in that assessment.

Critically, the Opinion does not state that using an LLM built on unlawfully processed data automatically makes the downstream user's processing unlawful. The downstream controller's own data processing activities are evaluated on their own merits. But the regulatory precedent now set by the OpenAI fine means that organizations using those models can no longer claim they had no basis for concern.

If you are using an LLM in a regulated environment and have not yet assessed the data provenance of that model, the time to do so is now. Speak with the Limina team to understand how de-identification can reduce your exposure across your AI workflows.

The challenge of GDPR compliance across the AI lifecycle

The development of AI models often depends on vast datasets, many of which contain personal data either because the intended use case requires it or because data was collected at scale from internet sources without rigorous filtering. Even when organizations believe their datasets are clean, the volume and variety of modern unstructured data make that assumption unreliable.

Personal data surfaces in places that pattern-matching tools miss. It appears embedded in clinical notes, buried in customer service transcripts, scattered across financial documents, and woven into the conversational context of support interactions. This is the environment in which AI models are trained, and it is the environment in which the EDPB's requirements must be met.

For organizations operating in regulated sectors, the compliance challenge is compounded by the sensitivity of the data involved. A healthcare organization training a model on patient records, a financial services firm fine-tuning a model on client communications, or an insurer using AI to process claims documents all face the same fundamental tension: the data that makes AI models useful is often exactly the data that GDPR protects most strictly.

How Limina helps organizations meet EDPB requirements

Limina is built by linguists, which means its approach to identifying personal data goes well beyond simple pattern matching. The platform understands language in context, recognizing entity relationships within documents and accounting for the way that personal information is expressed in natural, unstructured text. That capability is precisely what is needed to address the EDPB's requirements at scale.

Minimizing personal data in training datasets

The EDPB's guidance on anonymity and the risk of adversarial extraction places a premium on reducing personal data in training datasets before a model is ever trained. Limina's data de-identification platform detects and redacts more than 50 types of personal identifiers, including sensitive categories such as health data and ethnic origin. This capability allows organizations to clean training datasets at scale before they are used, materially reducing the likelihood that personal data will be retained in model weights in a form that could be extracted through adversarial attacks or reproduced during inference.

Limina also replaces personal data with synthetic placeholders rather than simply removing it. This approach preserves the linguistic and structural richness of the training data while making it substantially harder to identify the individuals whose information originally appeared in the dataset. When some data points remain in their original form despite best efforts, the surrounding synthetic data makes them difficult to distinguish from their neighbors, which reduces the risk of re-identification through attacks such as membership inference or model inversion.

Whether this approach is sufficient for a model to be classified as anonymous is a question that must be answered on a case-by-case basis. But it directly addresses the technical requirements the EDPB identifies, and it provides a strong foundation for the assessment that organizations need to conduct.
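To make the placeholder idea concrete, here is a minimal Python sketch. It is not Limina's implementation or API: the `PATTERNS` table and the `pseudonymize` function are hypothetical names invented for illustration, and the two regular expressions are a deliberately crude stand-in for context-aware detection. The sketch replaces each detected identifier with a typed, numbered placeholder so that repeated mentions of the same value stay consistent.

```python
import re

# Hypothetical, deliberately simple detectors (assumption for illustration).
# Real de-identification needs context-aware detection; these regexes only
# cover two easy identifier types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected identifiers with typed, numbered placeholders.

    Returns the rewritten text and a mapping from placeholder to original
    value. The mapping is itself personal data and must be protected or
    discarded.
    """
    mapping: dict[str, str] = {}  # placeholder -> original value
    reverse: dict[str, str] = {}  # original value -> placeholder

    for label, pattern in PATTERNS.items():

        def substitute(match: re.Match, label: str = label) -> str:
            value = match.group(0)
            if value not in reverse:
                # Number placeholders per identifier type, starting at 1.
                index = sum(1 for p in mapping if p.startswith(f"[{label}_")) + 1
                placeholder = f"[{label}_{index}]"
                reverse[value] = placeholder
                mapping[placeholder] = value
            return reverse[value]

        text = pattern.sub(substitute, text)

    return text, mapping


sample = "Contact Ana at ana@example.com or +44 20 7946 0958. Email ana@example.com again."
clean_text, placeholder_map = pseudonymize(sample)
print(clean_text)
```

On the sample sentence, both mentions of the email address become `[EMAIL_1]` and the phone number becomes `[PHONE_1]`, but the name "Ana" survives because no pattern here recognizes names. That gap is the limitation of pattern matching described earlier, and it is why a sketch like this cannot support an anonymity claim on its own.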

Controlling personal data in model outputs

The same de-identification capability that protects training data can also be applied to the outputs generated by a model in production. This is particularly relevant for organizations using commercially available LLMs that may have been trained on personal data. By applying Limina's redaction layer to model outputs before they are surfaced to users, organizations can keep personal data from reaching the people who use their applications, even when the underlying model may have been exposed to personal data during its development.

This mechanism does not retroactively resolve the question of whether the model developer processed personal data lawfully, but it is a meaningful step that downstream users can take to protect individuals and demonstrate their own commitment to GDPR compliance within their sphere of control.

This capability is particularly relevant for organizations in healthcare, pharma and life sciences, financial services, insurance, and contact centers, where both the sensitivity of the data and the regulatory scrutiny applied to it are highest.
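The pattern of redacting outputs before they are surfaced can be sketched as a thin wrapper around whatever function calls the model. Everything named here is an assumption made for illustration: `with_output_redaction`, `redact`, and `stub_model` are hypothetical, the stub stands in for a real LLM call, and the single email regex is a toy detector rather than Limina's redaction layer.

```python
import re
from typing import Callable

# Toy detector for illustration only (assumption); a production redaction
# layer would use context-aware detection across many identifier types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    """Mask every email address found in the text."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def with_output_redaction(
    generate: Callable[[str], str],
    redactor: Callable[[str], str] = redact,
) -> Callable[[str], str]:
    """Wrap a text-generation function so its output is redacted before use."""

    def guarded(prompt: str) -> str:
        raw_output = generate(prompt)
        # Only the redacted text ever leaves this function.
        return redactor(raw_output)

    return guarded


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a commercially available LLM."""
    return "You can reach Jane at jane.doe@example.org."


safe_generate = with_output_redaction(stub_model)
print(safe_generate("How do I contact Jane?"))
```

Because the redactor is passed in as a parameter, the same wrapper works with any detection backend. The design choice that matters is placement: the check sits between the model and the user, inside the downstream controller's own sphere of control.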

Demonstrating a legal basis for processing

The EDPB's three-step test for legitimate interest requires organizations to know, with precision, what personal data they are processing and why. Limina supports this process by providing the foundational data intelligence that makes a credible legal basis assessment possible.

The first and most essential step is understanding what personal data is actually present in the dataset being used. Without that knowledge, no legitimate interest assessment can be conducted with integrity. Limina's platform performs this identification at scale, across unstructured data formats that are otherwise difficult to audit manually.

From that baseline, organizations can enforce strict data minimization policies, ensuring that only the data genuinely necessary for a given processing purpose is retained. Limina also creates auditable records of how personal data has been handled, which supports the accountability obligations that GDPR imposes on data controllers and provides the documentation needed to respond to inquiries from supervisory authorities.
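As a rough illustration of what an auditable record might contain, the sketch below counts identifier types in a document and stores them alongside a content hash, the stated processing purpose, and a timestamp. The field names, the `audit_record` function, and the regex detectors are all assumptions made for this example; they do not describe Limina's record format.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

# Hypothetical toy detectors (assumption); real inventories need
# context-aware detection across many more identifier types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def audit_record(doc_id: str, text: str, purpose: str) -> dict:
    """Build a record of which identifier types a document contains.

    The record stores counts and a content hash, not the identifiers
    themselves, so the audit trail does not become a second copy of the
    personal data.
    """
    counts = {label: len(p.findall(text)) for label, p in PATTERNS.items()}
    return {
        "doc_id": doc_id,
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "purpose": purpose,
        "identifier_counts": {k: v for k, v in counts.items() if v},
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = audit_record(
    "ticket-001",
    "Refund for bob@example.com, call 555-010-9999",
    "fine-tuning support model",
)
print(json.dumps(record, indent=2))
```

A record like this gives a legitimate interest assessment something concrete to point to: which documents were examined, what categories of personal data they contained, and the purpose declared at the time.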

Simplifying responses to data subject rights

The EDPB's guidance reinforces that data subject rights, including access requests, apply even in the context of AI development. For organizations processing large volumes of unstructured data, responding to subject access requests manually is not feasible at scale. Limina automates the identification of personal data within unstructured datasets, making it possible to generate reports on what data is held and where. This capability transforms a labor-intensive compliance obligation into a manageable operational process.
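A very reduced version of such a report can be sketched as a search over a document collection for the identifiers a requester has supplied. The `locate_subject` function and the sample corpus are hypothetical, and exact substring matching is an assumption chosen for brevity, not a description of how Limina locates personal data.

```python
def locate_subject(
    corpus: dict[str, str], identifiers: list[str]
) -> dict[str, list[str]]:
    """Report which documents mention which of a data subject's identifiers.

    Uses case-insensitive substring matching, which misses abbreviations,
    misspellings, and indirect references.
    """
    report: dict[str, list[str]] = {}
    for doc_id, text in corpus.items():
        lowered = text.lower()
        hits = [ident for ident in identifiers if ident.lower() in lowered]
        if hits:
            report[doc_id] = hits
    return report


# Hypothetical sample data for illustration.
corpus = {
    "ticket-17": "Customer Maria Keller (maria.keller@example.com) asked for a refund.",
    "note-03": "Follow up with M. Keller next week.",
    "log-22": "No customer details here.",
}
report = locate_subject(corpus, ["Maria Keller", "maria.keller@example.com"])
print(report)
```

The sketch finds `ticket-17` but misses `note-03`, where the same person appears as "M. Keller". An access request response built on literal matching would be incomplete, which is why identification in unstructured text needs to account for how people are actually referred to in context.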

What this means for regulated industries

The EDPB's Opinion has particular significance for industries where personal data is both abundant and tightly regulated. In healthcare and life sciences, AI is increasingly being used to analyze patient records, clinical trial data, and research datasets. The sensitive nature of health data places it in a special category under GDPR, and the requirements for processing it lawfully are correspondingly demanding. Limina's context-aware approach to de-identification is designed to handle exactly this kind of complex, sensitive data without sacrificing the linguistic richness that makes the data scientifically valuable.

In financial services, AI models are being deployed for fraud detection, credit assessment, and customer service automation, all of which depend on access to detailed personal and financial information. For contact centers, the challenge is the real-time processing of customer interactions that contain personal identifiers expressed in natural, conversational language. In insurance, claims processing and underwriting both involve detailed personal information across multiple document types.

Across all of these sectors, the EDPB's Opinion creates a clear imperative: organizations cannot rely on AI capabilities that were built without rigorous attention to data protection. The compliance burden falls on every organization in the chain, from model developers to downstream users. Reach out to Limina's team to see how our platform can be integrated into your specific data environment.

Balancing innovation and privacy

The tension between GDPR's innovation objectives and its data protection requirements is real, and the EDPB's Opinion does not eliminate it. Requiring a thorough legal basis assessment for every training dataset imposes a cost on AI development that smaller organizations may find difficult to absorb. The scale of data needed to develop capable AI models creates friction with the minimization principles that GDPR enshrines.

What the Opinion does do is establish a clear framework within which that tension must be managed. Organizations that take privacy seriously from the earliest stages of AI development, rather than treating it as a compliance exercise to be addressed after the fact, are better positioned to navigate that framework. Technical tools that reduce the presence of personal data in training datasets, monitor model outputs for personal information, and generate the documentation needed for accountability assessments are not a substitute for legal expertise, but they are an indispensable foundation for compliance.

Limina's platform is designed to be that foundation. By embedding privacy protection into the AI lifecycle at the data level, it enables organizations to pursue AI development with confidence that their practices can withstand regulatory scrutiny.

Frequently Asked Questions

What is EDPB Opinion 28/2024 and why does it matter for AI?

EDPB Opinion 28/2024 provides authoritative guidance on how GDPR applies to the development and deployment of AI models. It addresses three core questions: when an AI model can be considered anonymous, what legal basis applies to processing personal data in AI training, and what the use of unlawfully processed training data means for downstream users. It matters because it clarifies obligations that were previously ambiguous and signals an era of active enforcement, as demonstrated by the €15 million fine imposed on OpenAI by Italy's data protection authority.

Does GDPR apply to AI model training data?

Yes. If personal data is used during AI model training, GDPR applies. This means organizations must establish a valid legal basis for processing that data, implement data minimization and security measures, and ensure that data subject rights can be honored. The fact that the personal data is used to train a model rather than to directly serve a request does not exempt it from GDPR's scope.

How can organizations demonstrate that an AI model is anonymous under GDPR?

According to the EDPB, an AI model can only be considered anonymous if personal data cannot be extracted from the model, whether through standard queries or adversarial attacks, and if the model's outputs cannot reveal identifiable information about individuals whose data was used in training. Demonstrating this requires rigorous technical assessment and should not be assumed based on de-identification of training data alone. Whether a given model qualifies as anonymous is a case-by-case determination.

What is the legitimate interest test for AI model development under the EDPB's guidance?

The EDPB applies a three-step test to legitimate interest as a legal basis for processing personal data in AI development. Organizations must identify the specific interest being pursued, demonstrate that processing personal data is necessary to achieve it, and show that this interest is not overridden by the rights and freedoms of the individuals whose data is used. Each step must be substantiated with documentation, and the test must be applied before processing begins.

What are the implications for organizations using commercially available LLMs that were developed with personal data?

The EDPB's Opinion indicates that downstream users of LLMs should assess whether the model developer processed personal data lawfully. A regulatory finding against the developer, such as the Garante's decision regarding OpenAI, is a relevant factor in that assessment. However, using an unlawfully developed model does not automatically make the downstream user's own processing unlawful. Downstream users should evaluate their own processing activities on their merits and consider applying technical safeguards, such as output redaction, to limit the risk of personal data being generated in their applications.

How does de-identification help with GDPR compliance for AI?

De-identification reduces the presence of personal data in training datasets, limiting the risk that personal information will be retained in model weights and potentially extracted or reproduced during inference. It also supports the legal basis assessment by providing a clear picture of what data is being processed, enables data minimization, and generates documentation for accountability purposes. When applied to model outputs, it can prevent personal data from being surfaced to end users even when the underlying model may have been exposed to personal data during training.

Which industries face the greatest compliance risk under the EDPB's AI guidance?

Industries that process large volumes of sensitive personal data face the greatest exposure. Healthcare and life sciences organizations using AI to analyze patient records or clinical trial data must navigate GDPR's special category data provisions. Financial services firms deploying AI for fraud detection or customer service handle detailed personal and financial information. Contact centers process personal data in real-time conversational contexts. Insurers use AI across claims and underwriting workflows that contain sensitive personal details. All of these sectors face heightened scrutiny under the EDPB's guidance.
