April 8, 2026

Unstructured Data Examples: The Hidden Privacy Risks in Emails, PDFs, and Chat Logs

While most organizations focus on securing database rows, an estimated 80 percent of enterprise data is unstructured: the messy, conversational information found in emails, call transcripts, and PDFs. This guide breaks down why these formats are drawing growing regulatory attention and how to protect them without losing their utility.

Patricia Graciano

Unstructured data is any information that doesn't fit neatly into rows and columns—think emails, clinical notes, PDF forms, call recordings, and support chat logs. Unlike structured database records, it has no fixed schema, which makes automated privacy protection significantly harder.

Your database tables are probably well-protected. Your emails, call recordings, and clinical notes almost certainly aren't.

Unstructured data now accounts for an estimated 80 percent of all enterprise data, and it's growing three times faster than structured data. Yet most data governance programs were built around spreadsheets and database fields—not the messy, context-dependent information that lives in your inboxes, document repositories, and contact center transcripts.

That gap is where regulators are increasingly looking. HIPAA breach investigations, GDPR enforcement actions, and California Privacy Rights Act (CPRA) audits have all surfaced organizations caught off guard by PII they didn't know was hiding in unstructured formats. The fines are substantial; the reputational damage is worse.

This guide walks through the most common unstructured data examples, explains why each format carries specific privacy risks, and shows how automated de-identification closes the gap that manual review leaves open.

What counts as unstructured data?

Structured data lives in tables: it has defined fields, consistent types, and predictable formatting. A database row with columns for patient_id, date_of_birth, and diagnosis_code is structured. You know exactly where each piece of information sits.

Unstructured data is everything else. Natural language, mixed formats, embedded metadata, and context-dependent meaning make it resistant to simple pattern-matching or field-level encryption.

Data Type	Common Sources	Typical PII Present
Emails	Internal communications, customer support inboxes	Names, addresses, account numbers, diagnoses mentioned in context
PDF Documents	Forms, contracts, discharge summaries, invoices	SSNs, DOBs, signatures, financial figures, diagnosis codes
Chat and Messaging Logs	Customer support platforms, clinical chat tools, CRMs	Names, contact details, symptoms, account credentials
Call Recordings and Transcripts	Contact centers, telehealth visits, HR interviews	SSNs read aloud, credit card numbers, PHI discussed verbally
Clinical Notes	EHRs, dictation software, nursing notes	Full PHI: names, DOBs, diagnoses, medications, provider names
Scanned Forms	Intake paperwork, insurance claims, consent forms	Handwritten names, policy numbers, SSNs, signatures
Support Tickets	Helpdesk systems, CRM platforms	Account numbers, device identifiers, partial card numbers

Why unstructured data is harder to protect

Protecting a Social Security number in a structured database is relatively straightforward—you know exactly which field it's in, and you can encrypt or mask it at the column level. Protecting the same SSN when a customer reads it aloud on a call recording is an entirely different problem.
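To make the contrast concrete, here is a minimal Python sketch. The record, its field names, and the sample SSN are made up for illustration, and the masking format is an assumption rather than any particular product's behavior. It shows that column-level masking handles the field it knows about while leaving the same value untouched in a free-text note.

```python
import re

# Matches only well-formed, hyphenated SSNs such as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_ssn(value: str) -> str:
    """Keep only the last four digits of a well-formed SSN."""
    return "***-**-" + value[-4:]


def mask_structured(record: dict) -> dict:
    """Column-level masking: the schema says exactly where the SSN lives."""
    masked = dict(record)
    masked["ssn"] = mask_ssn(record["ssn"])
    return masked


def mask_free_text(text: str) -> str:
    """Without a schema, the whole text has to be scanned for the pattern."""
    return SSN_PATTERN.sub(lambda m: mask_ssn(m.group()), text)


# Hypothetical record: one structured field plus a free-text note.
record = {
    "patient_id": "P-1001",
    "ssn": "123-45-6789",
    "note": "Caller confirmed SSN 123-45-6789 and asked about billing.",
}

masked = mask_structured(record)
print(masked["ssn"])                   # the column is masked
print(masked["note"])                  # the note still contains the full SSN
print(mask_free_text(masked["note"]))  # a text scan is needed to catch it
```

The free-text pass only works here because the SSN happens to be well-formed. A number read aloud on a call, or transcribed with errors, would slip past the same pattern.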

Three characteristics make unstructured data particularly challenging:

No schema: There's no predefined location for sensitive information. PII can appear anywhere in a document, transcript, or email body.
Natural language context: A name alone may not be sensitive, but a name combined with a diagnosis, location, or account number in the same sentence becomes identifying information. Context matters, and context requires language understanding, not just pattern matching.
Format diversity: The same piece of PII might appear in a typed PDF, a handwritten scanned form, a compressed audio file, or a real-time transcription with errors. Each format requires different detection methods.

Independent benchmarking has found that general-purpose cloud PII tools, built primarily for structured data, miss between 13 and 46 percent of entities in real-world unstructured data. That's not a rounding error; it's a compliance gap.
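A small Python sketch shows the limits of pattern matching. The detector below is hypothetical and intentionally naive: it knows three regular expressions and nothing else. It is not a model of any specific cloud tool, only of the pattern-only approach.

```python
import re

# Hypothetical pattern-only detector: regexes for well-formed identifiers.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
}


def find_pattern_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for every regex hit."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits.extend((label, m.group()) for m in pattern.finditer(text))
    return hits


samples = [
    "The patient in Room 4 is Johnson, DOB 03/15/1962.",
    "Patient Johnson's labs came back, see below.",
    "Mr. Johnson in the 2:00 slot",
]

for text in samples:
    print(find_pattern_pii(text), "<-", text)
```

Only the date of birth in the first sample is caught. The patient's name, and the fact that it appears next to clinical details, goes undetected in all three, which is the kind of contextual PII that requires language understanding rather than more regexes.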

Real-world privacy risks by data type

Emails

Enterprise email is one of the most overlooked sources of sensitive data. Employees routinely include PHI in internal emails ("Patient Johnson's labs came back—see below"), share account credentials or partial card numbers in customer threads, or forward documents containing PII without thinking twice.

Emails also survive longer than most organizations realize. Archive policies frequently retain messages for seven to ten years, meaning a 2016 customer service thread may still contain unredacted card numbers sitting in a cold storage bucket.

Key risks: PHI in provider-to-provider communications, account data in customer support threads, HR correspondence containing employee SSNs and salary details.

PDF documents

PDFs present a deceptive challenge. A structured-looking form—hospital discharge summary, insurance claim, mortgage application—appears organized, but the underlying text is unstructured from a machine's perspective. Scanned PDFs add an additional layer: optical character recognition (OCR) must first extract the text before any de-identification can occur.

PDFs also embed metadata that many teams overlook: document properties, revision history, and embedded thumbnails can all carry identifying information even after the visible content is redacted.

Key risks: SSNs and DOBs on intake forms, diagnosis and treatment details in discharge summaries, signatures on consent forms, financial data on invoices.

Chat and messaging logs

Customer support chat logs are a growing compliance surface. When a customer types their date of birth to verify an account, or describes symptoms to a telehealth chatbot, that data is captured in a log—often stored in a third-party CRM or ticketing system with weaker controls than your primary database.

Internal messaging platforms carry similar risk. Slack, Microsoft Teams, and equivalent tools are frequently used to share quick operational details: "The patient in Room 4 is Johnson, DOB 03/15/1962—can you pull her chart?" That message is now PHI, sitting in a chat archive.

Call recordings and ASR transcripts

Voice data is one of the most complex unstructured data challenges. Contact centers record millions of calls annually. Telehealth platforms capture provider-patient conversations. HR teams record interviews.

When those recordings are transcribed using automatic speech recognition (ASR), the transcript inherits not just the content of the conversation but also the errors. ASR systems frequently mishear names, transpose numbers, and introduce noise artifacts. A de-identification tool that relies on exact pattern matching will miss PII that the ASR system transcribed imperfectly.
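As a rough illustration, the Python sketch below compares an exact SSN pattern with a detector that tolerates one narrow kind of transcript noise: digits rendered as words, with irregular separators. The word list, the nine-token rule, and the function names are assumptions made for this example. It does not address misheard names or transposed digits, which need model-based detection.

```python
import re

# Exact pattern: only matches a cleanly formatted SSN.
EXACT_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Spoken forms an ASR system might emit for single digits (assumed list).
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}

# One digit token: a numeral or a digit word.
TOKEN = r"(?:\d|" + "|".join(DIGIT_WORDS) + r")"

# Nine digit tokens in a row, with optional spaces, commas, periods, or hyphens between them.
SPOKEN_SSN = re.compile(rf"\b{TOKEN}(?:[\s,.-]*{TOKEN}){{8}}\b", re.IGNORECASE)


def find_spoken_ssns(transcript: str) -> list[tuple[str, str]]:
    """Return (raw_span, normalized_digits) for each nine-digit run."""
    results = []
    for match in SPOKEN_SSN.finditer(transcript):
        tokens = re.findall(TOKEN, match.group(), re.IGNORECASE)
        digits = "".join(DIGIT_WORDS.get(t.lower(), t) for t in tokens)
        results.append((match.group(), digits))
    return results


transcript = "sure my social is one two three, four five, six seven eight nine"
print(EXACT_SSN.search(transcript))  # the exact pattern finds nothing
print(find_spoken_ssns(transcript))  # the tolerant pattern finds the spoken run
```

A rule like this will also flag other nine-digit sequences, so in practice it would be one signal among several rather than a complete detector.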

Limina is designed specifically to handle ASR transcription errors, recognizing PII in noisy, imperfect transcripts with the same accuracy it achieves on clean text—a critical differentiator for contact centers and healthcare AI teams.

Clinical notes

Clinical notes are the most information-dense unstructured data type in healthcare. A physician's progress note can contain dozens of PHI elements in a single paragraph: patient name, date of birth, diagnosis, medications, provider names, referral destinations, and insurance identifiers—all woven into narrative prose.

Structured EHR fields are relatively easy to de-identify. The notes attached to those records are not. Organizations pursuing HIPAA Safe Harbor de-identification must address clinical notes explicitly—and that requires natural language understanding, not field-level masking.

Regulatory exposure: what HIPAA, GDPR, and CPRA require

All three major privacy frameworks treat unstructured data as in scope. There is no exemption for data simply because it is stored in a messy format.

Regulation	Unstructured Data Obligations	De-identification Standard
HIPAA	PHI in any form—including notes, recordings, and scanned documents—must be protected or de-identified	Safe Harbor (remove 18 identifiers) or Expert Determination
GDPR	Personal data in any format is in scope; "pseudonymization" reduces risk but doesn't remove obligations, whereas truly anonymized data falls outside the regulation	No prescribed method; must be "irreversibly anonymized" to exit scope
CPRA	Personal information regardless of format; includes inferences drawn from unstructured content	No prescribed standard; must not be "reasonably linked" to a consumer

One practical implication: if your organization uses AI tools trained on historical data, and that historical data includes unstructured PHI or PII, you may have created a compliance obligation you haven't yet addressed. De-identified training data isn't just a legal nicety—it's increasingly a requirement for responsible AI deployment.

How automated de-identification solves the problem

Manual review of unstructured data doesn't scale. A team of reviewers reading clinical notes, flagging PII in call transcripts, or scrubbing emails one by one is both expensive and inconsistent. Human reviewers miss things; automated tools miss different things—but at scale, only automation is operationally viable.

Effective automated de-identification for unstructured data requires:

Named entity recognition (NER) tuned for healthcare and financial contexts, not just generic PII types
Contextual understanding—recognizing that "Mr. Johnson in the 2:00 slot" is PHI in a clinical setting even without an explicit patient ID
Multi-format support: plain text, PDF, audio, images, and structured document formats
ASR error tolerance for voice and transcript data
Configurable replacement options: redaction, pseudonymization, synthetic substitution, or tokenization depending on the use case
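The last requirement, configurable replacement, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how Limina or any other product works: the Span type, the fixed surrogate values, and the counter-based token format are all hypothetical, and the spans are assumed to come from an upstream detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A detected entity: character offsets plus a label (assumed detector output)."""
    start: int
    end: int
    label: str


# Hypothetical surrogates; a real system would generate varied, consistent values.
SURROGATES = {"NAME": "Morgan", "DATE": "01/01/1970"}


def span_for(text: str, value: str, label: str) -> Span:
    """Helper for the demo: locate a known value and wrap it as a Span."""
    start = text.index(value)
    return Span(start, start + len(value), label)


def apply_strategy(text: str, spans: list, strategy: str, vault=None) -> str:
    """Replace each span using the chosen strategy, working left to right."""
    pieces, cursor = [], 0
    for span in sorted(spans, key=lambda s: s.start):
        original = text[span.start:span.end]
        if strategy == "redact":
            replacement = f"[{span.label}]"
        elif strategy == "pseudonymize":
            replacement = SURROGATES.get(span.label, f"[{span.label}]")
        elif strategy == "tokenize":
            if vault is None:
                raise ValueError("tokenize requires a vault dict")
            # The vault maps each token back to its value, so this is reversible.
            replacement = f"{span.label}_{len(vault) + 1:04d}"
            vault[replacement] = original
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        pieces.append(text[cursor:span.start])
        pieces.append(replacement)
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Johnson, DOB 03/15/1962, called about her claim."
spans = [span_for(text, "Johnson", "NAME"), span_for(text, "03/15/1962", "DATE")]

vault = {}
print(apply_strategy(text, spans, "redact"))
print(apply_strategy(text, spans, "pseudonymize"))
print(apply_strategy(text, spans, "tokenize", vault))
print(vault)
```

Redaction leaves labeled gaps, pseudonymization keeps the sentence readable for analytics or model training, and tokenization keeps a lookup table so authorized users can restore the original values. The vault then becomes sensitive data in its own right.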

The goal isn't just compliance—it's enabling your data to be used. De-identified unstructured data can feed AI training pipelines, analytics tools, and research programs that would otherwise require prohibitive data access controls.

Ready to find what's hiding in your unstructured data?

Unstructured data is where most enterprise PII risk lives—and where most data governance programs have the widest gaps. Limina's de-identification platform is built specifically to handle the formats, languages, and contextual complexity of real-world unstructured data, including clinical notes, ASR transcripts, PDFs, and email archives, achieving 99.5 percent accuracy on physician conversation data.

See what Limina finds in your data: get a demo at getlimina.ai/en/contact-us

Frequently Asked Questions

What is the most common example of unstructured data containing PII?

Email is likely the most pervasive source of hidden PII in enterprise environments. Customer support emails frequently contain names, account numbers, partial payment information, and—in healthcare contexts—diagnoses or symptoms shared in the course of resolving a service issue. Because email archives are often governed more loosely than primary databases, unredacted sensitive data can accumulate for years.

Does HIPAA apply to clinical notes and physician emails?

Yes. HIPAA’s definition of Protected Health Information (PHI) covers individually identifiable health information in any format or medium, including handwritten notes, typed clinical documentation, emails between providers, and voice recordings. There is no exception for unstructured formats. If the information relates to a patient’s health, treatment, or payment and can be linked to an individual, it is PHI.

Why do standard cloud PII tools underperform on unstructured data?

General-purpose cloud tools are typically trained on structured or semi-structured data and rely on pattern matching for common PII types like SSNs and email addresses. They struggle with contextual PII—where sensitivity depends on surrounding information—and with domain-specific language in healthcare or financial services. Independent benchmarking has found that such tools miss 13 to 46 percent of PII entities in real-world unstructured datasets.

What’s the difference between redaction and de-identification for unstructured data?

Redaction removes or blacks out information, producing a document with gaps. De-identification is broader: it can include redaction, but also pseudonymization (replacing real values with fake but realistic ones), tokenization (replacing values with reversible tokens), and synthetic data generation. For use cases like AI training or analytics where you need complete, coherent text, pseudonymization or synthetic replacement is often preferable to simple redaction.

How do you de-identify audio call recordings?

Audio de-identification typically involves two steps: transcribing the recording using an ASR system, then applying NLP-based de-identification to the transcript. The transcript can then be used for analytics, AI training, or compliance review. Replacing PII in the original audio (by bleeping or substituting) is technically more complex and less common for most compliance use cases. The key challenge is handling ASR transcription errors—de-identification tools must be robust to imperfect transcriptions to avoid missing entities.

Is unstructured data in AI training pipelines a compliance risk?

Yes, and it’s a growing one. If your AI model was trained on historical data that included unredacted clinical notes, customer emails, or call transcripts, you may have inadvertently embedded PHI or PII into the model’s weights. Regulatory guidance under HIPAA and the EU AI Act is evolving to address this directly. De-identifying training data before use is the most defensible approach for organizations in regulated industries.