Creating De-Identified Embeddings
Embeddings are an increasingly popular data science tool, used across the industry for tasks like semantic search, recommendations, and document retrieval.

What are Embeddings?
Embeddings are essentially a numerical representation of data, used to determine the relationship (if any) between entities. For more information on embeddings, see here.
One popular use for embeddings is to send content to LLMs that they haven’t been trained on. An entire file is often too large to fit in an LLM’s prompt, so only segments of the file can be sent, along with a question or direction for the LLM. Using embeddings, the most relevant parts of the file can be identified and sent in the prompt, so the LLM gets the best context to answer the question.
In this article, we’ll show you how to use de-identified embeddings to get meaningful context, while adding a layer of privacy to keep your data safe.
Getting Started
To test out how embeddings work, we’ll be using a summary of each group’s performance during the 2022 World Cup, taken from Wikipedia.
In order to create our solution, we first need to set up our development environment and ensure we can connect to all the necessary services.
Private AI Service
If you don’t already have access to the Private AI deidentification service, you have two options:
#1 - See our guide on getting set up with AWS.
#2 - Request a free API key.
OpenAI
An OpenAI API key is needed to capture the text embeddings for our document. You can sign up here.
Python Environment
We’ll be coding this solution in Python. If you don’t have a Python environment set up, see the official Python for Beginners guide to get set up quickly and easily.
Installing Dependencies
We’ll need several Python modules to get started.
In order to receive embeddings for our input, we need access to an embedding model; OpenAI’s text-embedding-ada-002 is a good choice. We’ll install the OpenAI Python client so requests can be made easily.
We’ll also install Private AI’s python client for easy text deidentification.
SciPy will help us determine how related our questions are to the document.
And we’ll store our data in a pandas DataFrame for easy retrieval.
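Assuming pip, the installs might look like the following; the package name for Private AI’s client (`privateai_client`) is an assumption here, so check the client’s own documentation:

```shell
pip install openai privateai_client scipy pandas
```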
Now that the environment is all set up, we’re ready to get coding. Let’s set up an initial script to load the sample data into a pandas DataFrame.
The text from the file needs to be split into chunks so that the script can use only the most relevant sections for the questions being asked. This can be done in a variety of ways (such as counting the tokens in the document), but for simplicity we’re going to split the data on a delimiter. Each group in the sample data is separated by two newline characters, so we’ll use that as the delimiter for creating the chunks.
Now that we have the data from the sample document, we need to get and store the embeddings for each entry. Let’s add a function to get the embeddings from OpenAI and store them in our data frame, next to the associated text.
First we’ll add our embedding function:
Then we’ll update the get_dataframe function to get and store the embeddings for each chunk of text.
Our data is ready to be used! The dataframe contains both the text and embeddings:
At this point we need to be able to ask questions and have them compared to the data to find the most relevant chunks. Let’s add an input loop to the main function, and a function to find the most relevant chunk of data for our question.
The get_related_text function needs several steps:
- Get the embedding for the question being asked
- Compare the embedding of the question to the chunks of text from the file to find the best match
We’ll add one more function to find the numerical relatedness of the chunk and question:
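Taken together, the retrieval step might look like the following sketch: cosine similarity via SciPy measures relatedness, and `get_related_text` embeds the question (using the same embedding helper as before, passed in here as a parameter) and returns the best-matching chunk:

```python
import pandas as pd
from scipy import spatial


def relatedness(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return 1.0 - spatial.distance.cosine(a, b)


def get_related_text(question: str, df: pd.DataFrame, embed_fn) -> str:
    """Return the chunk of text whose embedding best matches the question."""
    # Step 1: get the embedding for the question being asked.
    question_embedding = embed_fn(question)
    # Step 2: compare it to every chunk's embedding and keep the best match.
    scores = df["embedding"].apply(lambda e: relatedness(question_embedding, e))
    return df.loc[scores.idxmax(), "text"]
```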
Great, now the script is ready to run!
Let’s test it out with a question:
Output:
The script has determined a correct group!
Adding Privacy
The script is working as intended, but there’s one major issue: any sensitive information contained in the data is being sent to OpenAI when the embeddings are being obtained. This is where Private AI’s deidentification service comes in! Let’s update our function to keep the data private and secure.
Let’s add a function to handle the deidentification of any PII in the data.
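A sketch of the deidentification helper, using Private AI’s Python client. The connection details and the exact client interface are assumptions here; they vary by client version and deployment, so point them at your own Private AI container or hosted instance:

```python
def deidentify_text(text: str) -> str:
    """Replace any PII in `text` with placeholders before it leaves our environment."""
    # Imported inside the function so the rest of the script can run without the
    # client installed; the constructor arguments below are assumptions.
    from privateai_client import PAIClient, request_objects

    client = PAIClient(url="http://localhost:8080")
    request = request_objects.process_text_obj(text=[text])
    response = client.process_text(request)
    return response.processed_text[0]
```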
And we’ll update the get_dataframe function to deidentify the data before getting the embeddings from OpenAI.
Before we start asking questions, let’s update the main function to see a comparison of deidentified vs. regular embeddings.
And our script is complete! Here’s the full code:
If we test out the embedding accuracy with a question:
We can see that the deidentified embeddings are able to capture the correct context!
