Unlocking the Power of Retrieval Augmented Generation with Added Privacy: A Comprehensive Guide
RAG is a popular approach that improves the accuracy of LLMs by utilizing a knowledge base. In this blog post, we illustrate how to implement RAG without compromising the privacy of your data.

What is RAG?
Large language models, such as OpenAI’s gpt-4-turbo and Anthropic’s Claude-3, are very powerful assistants that help us carry out tasks like summarization, translation, or answering our most intriguing questions. However, while they sometimes nail the answer to a question, other times they regurgitate random facts from their training data. They may also hallucinate perfectly plausible, yet completely false statements.
Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information.
Using a RAG workflow for an LLM-based application has the following benefits:
- Improved accuracy: By retrieving relevant information from a knowledge base or external sources, the LLM can provide more accurate and factual answers to user queries. This is particularly useful when the LLM's internal knowledge is outdated or lacks specific details.
- Enhanced context awareness: RAG enables the LLM to incorporate context from retrieved information, making it more effective at answering questions that require background knowledge or understanding of specific domains.
- Reduced hallucination: LLMs sometimes generate plausible but incorrect information, a phenomenon known as hallucination. By incorporating retrieved information, RAG can help reduce this issue and provide more reliable answers.
- Scalability: RAG allows for the easy integration of new knowledge sources, making it easier to scale the system as new information becomes available.
- Interpretability: With RAG, you can trace back the source of the information used to generate an answer, which can help improve the interpretability and trustworthiness of the system.
The basic workflow for a RAG pipeline is as follows:
- A user asks a question (we call this the query)
- Given the query, we retrieve the relevant documents and paragraphs from our knowledge base.
- We feed the query and relevant documents to an LLM and ask it to generate a response.
- The response is sent to the user (Optionally, we can also send the relevant documents to the user).
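The workflow above can be sketched in a few lines of Python. This is a toy illustration, not a real pipeline: the `retrieve` and `generate` functions below are stand-ins for a vector-store lookup and an LLM call.

```python
# Minimal sketch of the RAG workflow, with stand-in components.
# KNOWLEDGE_BASE, retrieve, and generate are toy stubs; a real pipeline
# would use a vector store and an LLM provider instead.

KNOWLEDGE_BASE = [
    "The 2022 World Cup final was played at Lusail Stadium.",
    "Argentina won the 2022 World Cup on penalties.",
]

def retrieve(query, top_k=1):
    # Toy keyword-overlap scoring in place of vector similarity.
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(KNOWLEDGE_BASE, key=score, reverse=True)[:top_k]

def generate(query, context):
    # A real implementation would send this prompt to an LLM provider.
    context_block = "\n".join(context)
    prompt = f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"
    return prompt  # stub: echo the prompt instead of calling an LLM

def rag_answer(query):
    docs = retrieve(query)        # step 2: retrieve relevant documents
    return generate(query, docs)  # step 3: feed query + docs to the LLM
```

The same four-step shape carries through the rest of this tutorial; only the stubs get swapped for real components.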

Knowledge base and retrieval
The knowledge base is built by chunking the source data and embedding the chunks into vectors. The embedding is done using an embedding model such as OpenAI’s text-embedding-3-small.
These embeddings are then stored in a vector store where each embedding vector is linked to its chunk. These vector stores allow us to efficiently find the most similar vectors given a query vector.
To retrieve relevant documents when given a query, we do the following:
- Embed the query into a vector.
- Get the most similar vectors to our query vector.
- Retrieve the chunks corresponding to these vectors.
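These three steps boil down to a nearest-neighbor search over the stored vectors. Here is a self-contained sketch using cosine similarity; the hand-made vectors stand in for embeddings a model like text-embedding-3-small would produce.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Toy "vector store": each embedding vector is linked to its chunk.
# The vectors here are hand-made stand-ins for real model embeddings.
vector_store = [
    ([0.9, 0.1, 0.0], "Chunk about football finals"),
    ([0.0, 0.2, 0.9], "Chunk about stadium construction"),
]

def retrieve_chunks(query_vector, top_k=1):
    # Rank stored vectors by similarity to the query vector,
    # then return the chunks linked to the best matches.
    ranked = sorted(
        vector_store,
        key=lambda item: cosine_similarity(query_vector, item[0]),
        reverse=True,
    )
    return [chunk for _, chunk in ranked[:top_k]]
```

Production vector stores implement the same idea with approximate nearest-neighbor indexes so the search stays fast at millions of vectors.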
Where are the privacy risks in RAG?
RAG has proved to be a very effective method to increase the accuracy and robustness of LLM responses. However, it also has its own risks and challenges.
From a data privacy perspective, it has two main risks:
- Source data and embeddings
- Prompt data and LLMs
1. Source data & embeddings
If you’re building a RAG pipeline to answer questions about your company’s internal docs and guides, you’ll need to chunk and embed these documents. Embedding them often means sending the raw text to a third-party embedding provider like OpenAI or Cohere, which poses a significant privacy risk: you’re potentially exposing confidential data such as dates, product names, specifications, source code, or project briefs. These docs may also contain personal information about your clients or employees, such as names, dates of birth, social security numbers, or salaries. Finally, the user query itself needs to be embedded so it can be used to retrieve the relevant chunks, which is another potential data privacy risk.
2. Prompt data sent to LLM
The second main step of a RAG pipeline is utilizing an LLM to generate a response given the query and relevant documents. This means that if we’re using an LLM provider such as OpenAI or Anthropic, we need to send the user query and all retrieved documents as a prompt to the LLM. This is another source of privacy risk in the RAG pipeline.
Setting Up (llama-index)
Note: We have an end-to-end Colab notebook for this tutorial. You can find it here.
For this tutorial, we will use our worldcup data, which was also used in previous tutorials.
Let’s set up a basic RAG workflow using llama-index, a popular framework for building production-grade RAG pipelines.
Our pipeline will have the following steps:
- The documents are chunked, embedded, and stored in our vector store.
- The vector store is used as an index; when a new query comes in, we embed it and search the index for the relevant documents.
- The query and the relevant docs are fed into the LLM which will generate the response.
We will use llama-index to orchestrate all these steps, so the first step is to install it via pip.
Vector Store
We will use ChromaDB as our vector store. ChromaDB is an open-source vector store framework that allows us to efficiently store and query multidimensional embeddings. We can leverage the llama-index connector for ChromaDB to set it up easily. Let’s install the llama-index connector for ChromaDB (this also installs ChromaDB itself).
Let’s set up a ChromaDB instance and create a collection for our data. We will then create our vector store index.
Document Loading and Chunking
Now we need to load and chunk our document to generate nodes that will be embedded and stored in the vector store index.
Embedding Source Documents
Let’s embed these nodes and store them in our index. We will be using OpenAI’s embedding model, so make sure to create a valid API key and store it in the variable OPENAI_API_KEY.
LLM Assistant
Let’s create an LLM assistant that will answer the user’s questions. We will be using an OpenAI model, specifically “gpt-3.5-turbo”. We will choose a low temperature so that the model’s responses are factual and less prone to hallucination.
RAG Pipeline
Let’s now put all these pieces together and create our RAG pipeline. We will implement it by creating a function that takes the user’s query as input and returns the response as its output.
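A sketch of such a function is below. The retriever and LLM are passed in as callables so the same shape works whether you plug in llama-index components or, as in this demo, simple stubs:

```python
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Context:\n{context}\n\n"
    "Question: {query}\n"
    "Answer:"
)

def rag_pipeline(query, retriever, llm):
    """Retrieve relevant chunks, build a prompt, and ask the LLM."""
    chunks = retriever(query)
    prompt = PROMPT_TEMPLATE.format(context="\n".join(chunks), query=query)
    return llm(prompt)

# Usage with stubs; a real pipeline would pass the vector index's retriever
# and a gpt-3.5-turbo completion function here.
answer = rag_pipeline(
    "Who won?",
    retriever=lambda q: ["Argentina won the final."],
    llm=lambda p: p,  # echo stub standing in for the LLM call
)
```

Keeping the pipeline as one function makes it easy to bolt on the privacy steps that follow.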
Let’s now test this RAG pipeline by asking a question and inspecting the answer:
Great! Our RAG pipeline is working as intended. However, we’re sending all our queries and document chunks to a third party provider. This poses a serious privacy risk. One way to handle this is to redact sensitive information. Let’s see how we can do this.
Prompt only privacy
Our first approach is concerned with the prompt sent to the LLM. If you’re using a third-party API provider such as OpenAI or Anthropic, any data included in the prompt will be shared with these providers, which poses a significant risk. One way to mitigate this is to redact the prompt, masking out PII before sending it to the API provider.
Redacting the prompt
PrivateAI is built with ease-of-use in mind. We can conveniently redact the input to the LLM by simply invoking the PrivateAI API to pseudonymize the retrieved context and sending the pseudonymized prompt to the LLM instead.
Let’s first set up our PrivateAI client. If you’re using the hosted public PrivateAI endpoint, make sure to store your API key in the PRIVATEAI_API_KEY variable.
We can customize how the Private AI engine handles entities. For example, we can instruct it to block or allow certain entities based on a regular expression.
Now let’s create a function that redacts a piece of text. In this function, we will use a regex pattern to detect and block percentage values. This can be helpful if, for example, we’re dealing with financial data and the growth numbers are confidential.
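The sketch below is a self-contained stand-in for that function: in a real deployment the redaction would be done by a call to the PrivateAI API with this regex registered as a blocked pattern, but the same shape (redacted text out, plus a placeholder-to-original map for later re-identification) applies either way:

```python
import re

# Stand-in redactor: in production this would be a PrivateAI API call
# with the percentage regex supplied as a blocked entity pattern.
PERCENT_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?%")

def redact(text, entity_map=None):
    """Mask percentage values and record the originals in entity_map,
    which is needed later to re-identify the LLM's response."""
    entity_map = {} if entity_map is None else entity_map

    def replace(match):
        placeholder = f"[NUMERICAL_PII_{len(entity_map) + 1}]"
        entity_map[placeholder] = match.group(0)
        return placeholder

    return PERCENT_PATTERN.sub(replace, text), entity_map
```

For example, "Growth was 12.5% this quarter." becomes "Growth was [NUMERICAL_PII_1] this quarter.", with the original value kept only on our side.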
We will now create a new function that will add a pseudonymization step:
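A sketch of the pseudonymizing pipeline is below. The redactor, retriever, and LLM are injected as callables (the stubs in the usage are hypothetical; a real run would use the PrivateAI-backed redactor and the components built earlier):

```python
def rag_pipeline_redacted(query, redact, retriever, llm):
    """Prompt-only privacy: pseudonymize the retrieved context and the
    query before anything is sent to the LLM provider."""
    entity_map = {}
    chunks = retriever(query)  # retrieval happens locally, on raw text
    redacted_chunks = [redact(c, entity_map) for c in chunks]
    redacted_query = redact(query, entity_map)
    prompt = (
        "Context:\n" + "\n".join(redacted_chunks)
        + f"\n\nQuestion: {redacted_query}\nAnswer:"
    )
    # The provider only ever sees placeholders; keep the map for later.
    return llm(prompt), entity_map
```

Returning the entity map alongside the response is deliberate: it is exactly what the re-identification step in the next section consumes.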
Let’s test our new RAG pipeline:
Great! Our new pipeline is working as expected and is redacting sensitive information, such as organizations, from our prompts.
However, this output isn’t very helpful to the user since they don’t know which teams were involved in a goalless draw. Let’s see how we can handle this.
Re-identifying the prompt
We can go one step further and re-identify the redacted entities in the LLM’s response. This makes the responses much more informative for the user.
To do this, we will add another step in our pipeline that takes the redacted response, sends it to PrivateAI, and retrieves the re-identified text:
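Locally, re-identification amounts to reversing the placeholder map; the sketch below stands in for PrivateAI's re-identification endpoint:

```python
def reidentify(redacted_response, entity_map):
    """Replace placeholders in the LLM's response with the original
    values. Stand-in for the PrivateAI re-identification call."""
    for placeholder, original in entity_map.items():
        redacted_response = redacted_response.replace(placeholder, original)
    return redacted_response
```

Because the entity map never leaves our side, the LLM provider only ever sees placeholders while the end user gets a fully readable answer.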
For more details about re-identifying redacted text, please take a look at our detailed example.
Redacting the source data
Our second approach is concerned with the source data that is chunked and sent to the embedding model. If you’re using a third-party API provider such as OpenAI or Cohere, any data included in the chunks will be shared with these providers, which poses a significant risk. One way to mitigate this is to redact the source data, masking out PII before sending it to the API provider.
We can use the same function to redact the chunks before creating the nodes:
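Concretely, we run every chunk through the redactor with a single shared entity map before the embedding step. The redactor below repeats the earlier percentage-masking stand-in so the snippet is self-contained ("Initech" is a made-up company name):

```python
import re

PERCENT_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?%")

def redact(text, entity_map):
    # Same stand-in redactor as before: mask percentages, record originals.
    def replace(match):
        placeholder = f"[NUMERICAL_PII_{len(entity_map) + 1}]"
        entity_map[placeholder] = match.group(0)
        return placeholder
    return PERCENT_PATTERN.sub(replace, text)

chunks = ["Revenue grew 8% at Initech.", "Headcount grew 3%."]

# One shared entity map across all chunks keeps placeholders unique.
entity_map = {}
redacted_chunks = [redact(chunk, entity_map) for chunk in chunks]
# redacted_chunks now feed node creation and embedding instead of raw text.
```

Only the redacted chunks ever reach the embedding provider; the entity map stays local.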
We have now created a new vector store index where the embeddings are based on the redacted document chunks.
Prompt flow with redacted data
Since our embeddings are based on the redacted source document chunk, we also need to redact the query before embedding it. Otherwise, our vector search might fail to retrieve the relevant documents since there will be a mismatch between entities in the query and entities in the source. In addition, as we mentioned earlier, user queries themselves can contain sensitive information which require redacting.
We can easily achieve this as follows:
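Here is a sketch of the full flow with every dependency injected as a callable, so the PrivateAI-backed redactor and re-identifier, the vector retriever, and the LLM can each be dropped in (the stubs in the test are hypothetical):

```python
def private_rag_pipeline(query, redact, retriever, llm, reidentify):
    """Source-data privacy flow: redact the query so it matches the
    redacted index, then restore entities in the final response."""
    entity_map = {}
    redacted_query = redact(query, entity_map)  # 1. redact before embedding
    chunks = retriever(redacted_query)          # 2. search the redacted index
    prompt = (
        "Context:\n" + "\n".join(chunks)
        + f"\n\nQuestion: {redacted_query}\nAnswer:"
    )
    response = llm(prompt)                      # 3. provider sees redacted text only
    return reidentify(response, entity_map)     # 4. restore entities for the user
```

Redacting the query with the same configuration used for the source chunks is what keeps the query-side and document-side placeholders consistent, so vector search still matches.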
Comparing the two methods
We’ve now covered two approaches in detail. Let’s compare the two methods:
First party data risk
First-party data risk refers to the potential harm or liability that an organization faces when collecting, storing, and using its own customers’, users’, or employees’ personal data. This type of data is typically collected directly from the individual, often through interactions with the organization’s website, mobile app, or other digital platforms when onboarding new users and employees.
There are different forms of first-party data. Examples are:
- Personally identifiable information (PII) such as names, addresses, phone numbers, and email addresses
- Behavioral data, like browsing history, search queries, and purchase history
- Device information, including IP addresses, device IDs, and location data
- Sensitive information, like health data, financial information, or political affiliations
This type of data can be compromised in different ways such as:
- Data breaches: A company's database is hacked, exposing millions of customers' personal data, including credit card numbers and addresses.
- Unintended data sharing: An organization inadvertently shares confidential information such as NDAs, SSNs and salaries with all users.
- Insufficient data anonymization: A company fails to properly anonymize customer data, allowing individuals to be re-identified and compromising their privacy.
For our RAG pipeline, the second and third points are especially important. An employee from the engineering team shouldn’t have access to the salaries or SSNs of other employees. On the other hand, a member of the HR team might need this information in their day-to-day tasks and responsibilities.
A potential solution for the second point, Unintended data sharing, is to create role-based access levels to the vector store. This ensures that the confidential information will be retrieved for queries by the HR team but not for other teams.
For the third point regarding Insufficient data anonymization, redacting the embeddings properly is an effective way to prevent the accidental re-identification of users’ data.
Wrap up and summary
In this comprehensive guide, we explored the concept of Retrieval Augmented Generation (RAG), a powerful approach to improving the accuracy of Large Language Models (LLMs) by grounding them on external sources of knowledge. We demonstrated how to implement RAG without compromising data privacy, a critical concern when working with sensitive information.
We discussed two main privacy risks in RAG pipelines: (1) source data and embeddings, and (2) prompt data sent to LLMs. To mitigate these risks, we presented two approaches: (1) prompt-only privacy, which redacts sensitive information from the input prompt sent to the LLM, and (2) source documents privacy, which redacts sensitive information from the source documents before embedding them.
We implemented a basic RAG pipeline using llama-index, a popular framework for building production-grade RAG pipelines, and demonstrated how to easily and conveniently add privacy features to the pipeline using PrivateAI, a privacy-enhancing technology. We also compared the two approaches, highlighting their differences in terms of operation, support, and use cases.
By following this guide, developers and organizations can unlock the power of RAG while ensuring the privacy and security of their data.
Frequently Asked Questions (FAQ) about RAG and Privacy
What is Retrieval-Augmented Generation (RAG)?
RAG is an AI framework designed to improve the accuracy and reliability of Large Language Models (LLMs). It works by "grounding" the LLM on external sources of knowledge (a knowledge base) to supplement the information the model learned during its original training, which helps reduce hallucinations and provides current, factual answers.
What are the main benefits of using RAG with an LLM application?
RAG provides several benefits, including:
- Improved Accuracy and Factual Answers: By using an external knowledge base.
- Reduced Hallucination: It minimizes the LLM's tendency to generate plausible but incorrect information.
- Enhanced Context Awareness: It allows the LLM to incorporate domain-specific knowledge.
- Interpretability: You can trace the source documents used to generate the response.
Where do the primary data privacy risks occur in a standard RAG pipeline?
The two main privacy risks in a RAG pipeline stem from sharing data with third-party providers (like embedding or LLM APIs):
- Source Data & Embeddings: Sending raw, potentially confidential source documents (like internal company guides or employee PII) to a third-party embedding provider.
- Prompt Data Sent to LLM: Sending the user query and the retrieved confidential document chunks as a single prompt to the LLM provider for final answer generation.
What are the two core approaches to adding privacy to a RAG pipeline?
The two approaches address different parts of the pipeline:
- Prompt-only Privacy: Focuses on redacting sensitive information from the input prompt just before it is sent to the LLM (Large Language Model). This is typically an online, real-time process.
- Source Documents Privacy: Focuses on redacting sensitive information from the source documents before they are chunked and embedded by the third-party embedding model. This can be done in an offline batch process.
Why is re-identifying the prompt useful after redaction?
While redacting the prompt masks sensitive data for the LLM provider, the resulting LLM response may contain placeholders (like [ORGANIZATION_4]), making the answer unhelpful to the end user. Re-identifying takes the LLM's de-identified response and replaces the placeholders with the original sensitive information, providing the user with a complete and informative answer while keeping the data private during transit to the LLM.
