Unstructured data is any information that doesn't fit neatly into rows and columns—think emails, clinical notes, PDF forms, call recordings, and support chat logs. Unlike structured database records, it has no fixed schema, which makes automated privacy protection significantly harder.
Your database tables are probably well-protected. Your emails, call recordings, and clinical notes almost certainly aren't.
By commonly cited industry estimates, unstructured data now accounts for roughly 80 percent of all enterprise data and is growing about three times faster than structured data. Yet most data governance programs were built around spreadsheets and database fields, not the messy, context-dependent information that lives in your inboxes, document repositories, and contact center transcripts.
That gap is where regulators are increasingly looking. HIPAA breach investigations, GDPR enforcement actions, and California Privacy Rights Act (CPRA) audits have all surfaced organizations caught off guard by PII they didn't know was hiding in unstructured formats. The fines are substantial; the reputational damage is worse.
This guide walks through the most common unstructured data examples, explains why each format carries specific privacy risks, and shows how automated de-identification closes the gap that manual review leaves open.
What counts as unstructured data?
Structured data lives in tables: it has defined fields, consistent types, and predictable formatting. A database row with columns for patient_id, date_of_birth, and diagnosis_code is structured. You know exactly where each piece of information sits.
Unstructured data is everything else. Natural language, mixed formats, embedded metadata, and context-dependent meaning make it resistant to simple pattern-matching or field-level encryption.
| Data Type | Common Sources | Typical PII Present |
| --- | --- | --- |
| Emails | Internal communications, customer support inboxes | Names, addresses, account numbers, diagnoses mentioned in context |
| PDF Documents | Forms, contracts, discharge summaries, invoices | SSNs, DOBs, signatures, financial figures, diagnosis codes |
| Chat and Messaging Logs | Customer support platforms, clinical chat tools, CRMs | Names, contact details, symptoms, account credentials |
| Call Recordings and Transcripts | Contact centers, telehealth visits, HR interviews | SSNs read aloud, credit card numbers, PHI discussed verbally |
| Clinical Notes | EHRs, dictation software, nursing notes | Full PHI: names, DOBs, diagnoses, medications, provider names |
| Scanned Forms | Intake paperwork, insurance claims, consent forms | Handwritten names, policy numbers, SSNs, signatures |
| Support Tickets | Helpdesk systems, CRM platforms | Account numbers, device identifiers, partial card numbers |
Why unstructured data is harder to protect
Protecting a Social Security number in a structured database is relatively straightforward—you know exactly which field it's in, and you can encrypt or mask it at the column level. Protecting the same SSN when a customer reads it aloud on a call recording is an entirely different problem.
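To make the contrast concrete, here is a minimal Python sketch. The record layout, the function names, and the sample strings are all invented for illustration; the regular expression only covers the common written `NNN-NN-NNNN` form and is not a complete SSN detector.

```python
import re

# Matches only the written NNN-NN-NNNN form (an illustrative simplification).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_ssn_column(record: dict) -> dict:
    """Structured case: the SSN lives in a known field, so masking is trivial."""
    masked = dict(record)
    masked["ssn"] = "***-**-" + record["ssn"][-4:]
    return masked


def mask_ssn_text(text: str) -> str:
    """Unstructured case: we can only mask what the pattern happens to match."""
    return SSN_PATTERN.sub("***-**-****", text)


# Hypothetical sample data.
row = {"patient_id": "P-001", "ssn": "123-45-6789"}
typed = "Customer confirmed SSN 123-45-6789 by email."
spoken = "my social is one two three, four five, six seven eight nine"

print(mask_ssn_column(row))   # the ssn field is masked
print(mask_ssn_text(typed))   # the written SSN is masked
print(mask_ssn_text(spoken))  # unchanged: the spoken SSN slips through
```

The column-level function never has to search for anything. The text-level function depends entirely on the SSN appearing in the one format the pattern anticipates, which is exactly what a call transcript does not guarantee.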
Three characteristics make unstructured data particularly challenging:
No schema: There's no predefined location for sensitive information. PII can appear anywhere in a document, transcript, or email body.
Natural language context: A name alone may not be sensitive, but a name combined with a diagnosis, location, or account number in the same sentence becomes identifying information. Context matters, and context requires language understanding, not just pattern matching.
Format diversity: The same piece of PII might appear in a typed PDF, a handwritten scanned form, a compressed audio file, or a real-time transcription with errors. Each format requires different detection methods.
General-purpose cloud PII tools—built primarily for structured data—miss between 13 and 46 percent of entities in real-world unstructured data. That's not a rounding error; it's a compliance gap.
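The text does not say how that miss rate was measured. One common way to compute such a figure is one minus entity-level recall against a human-annotated reference set. The sketch below assumes exact span matching and uses made-up annotations purely for illustration.

```python
def miss_rate(gold: set, predicted: set) -> float:
    """Fraction of annotated (gold) entities that the tool failed to detect.

    Entities are (start, end, label) tuples; a detection counts only if it
    matches a gold entity exactly. This is an assumed, strict scoring rule.
    """
    if not gold:
        return 0.0
    missed = gold - predicted
    return len(missed) / len(gold)


# Hypothetical annotations for a single document.
gold = {(0, 7, "NAME"), (13, 23, "DOB"), (30, 41, "SSN"), (50, 62, "PHONE")}
predicted = {(30, 41, "SSN"), (50, 62, "PHONE"), (70, 75, "NAME")}

rate = miss_rate(gold, predicted)
print(f"Missed {rate:.0%} of annotated entities")
```

Note that the spurious `NAME` detection at offset 70 does not affect the miss rate; false positives are a precision problem, while undetected entities are the ones that create compliance exposure.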
Real-world privacy risks by data type
Emails
Enterprise email is one of the most overlooked sources of sensitive data. Employees routinely include PHI in internal emails ("Patient Johnson's labs came back—see below"), share account credentials or partial card numbers in customer threads, or forward documents containing PII without thinking twice.
Emails also survive longer than most organizations realize. Archive policies frequently retain messages for seven to ten years, meaning a 2016 customer service thread may still contain unredacted card numbers sitting in a cold storage bucket.
Key risks: PHI in provider-to-provider communications, account data in customer support threads, HR correspondence containing employee SSNs and salary details.
PDF documents
PDFs present a deceptive challenge. A structured-looking form (a hospital discharge summary, insurance claim, or mortgage application) appears organized, but the underlying text is unstructured from a machine's perspective. Scanned PDFs add another layer: optical character recognition (OCR) must extract the text before any de-identification can occur.
PDFs also embed metadata that many teams overlook: document properties, revision history, and embedded thumbnails can all carry identifying information even after the visible content is redacted.
Key risks: SSNs and DOBs on intake forms, diagnosis and treatment details in discharge summaries, signatures on consent forms, financial data on invoices.
Chat and messaging logs
Customer support chat logs are a growing compliance surface. When a customer types their date of birth to verify an account, or describes symptoms to a telehealth chatbot, that data is captured in a log—often stored in a third-party CRM or ticketing system with weaker controls than your primary database.
Internal messaging platforms carry similar risk. Slack, Microsoft Teams, and equivalent tools are frequently used to share quick operational details: "The patient in Room 4 is Johnson, DOB 03/15/1962—can you pull her chart?" That message is now PHI, sitting in a chat archive.
Call recordings and ASR transcripts
Voice data is one of the most complex unstructured data challenges. Contact centers record millions of calls annually. Telehealth platforms capture provider-patient conversations. HR teams record interviews.
When those recordings are transcribed using automatic speech recognition (ASR), the transcript inherits not just the content of the conversation but also the errors. ASR systems frequently mishear names, transpose numbers, and introduce noise artifacts. A de-identification tool that relies on exact pattern matching will miss PII that the ASR system transcribed imperfectly.
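As a rough illustration of why exact pattern matching falls short on transcripts, the Python sketch below flags long runs of spoken digit words instead of looking for a formatted number. It is a simplified heuristic written for this article, not a description of how any particular product works: the function name, the `min_digits` threshold, and the `[NUMBER]` placeholder are all assumptions, and it does not handle misheard digits such as "to" for "two".

```python
# Spoken forms an ASR system might emit for digits (illustrative, not exhaustive).
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def redact_spoken_digit_runs(transcript: str, min_digits: int = 9) -> str:
    """Replace runs of at least `min_digits` consecutive digit words."""
    out = []  # tokens of the redacted transcript
    run = []  # current run of consecutive digit-word tokens

    def flush() -> None:
        # Long runs look like an ID or card number; short runs are kept as-is.
        if len(run) >= min_digits:
            out.append("[NUMBER]")
        else:
            out.extend(run)
        run.clear()

    for token in transcript.split():
        word = token.lower().strip(".,;:!?")
        if word in DIGIT_WORDS:
            run.append(token)
        else:
            flush()
            out.append(token)
    flush()
    return " ".join(out)


call = "my social is one two three, four five, six seven eight nine thanks"
print(redact_spoken_digit_runs(call))
print(redact_spoken_digit_runs("I have two cats and one dog"))
```

The first call collapses the nine spoken digits into a single placeholder, while the second is left untouched because its digit words are isolated. Real transcripts need far more than this, including handling of transposed or dropped digits, which is where language-aware models come in.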
Limina is designed specifically to handle ASR transcription errors, recognizing PII in noisy, imperfect transcripts with the same accuracy it achieves on clean text—a critical differentiator for contact centers and healthcare AI teams.
Clinical notes
Clinical notes are the most information-dense unstructured data type in healthcare. A physician's progress note can contain dozens of PHI elements in a single paragraph: patient name, date of birth, diagnosis, medications, provider names, referral destinations, and insurance identifiers—all woven into narrative prose.
Structured EHR fields are relatively easy to de-identify. The notes attached to those records are not. Organizations pursuing HIPAA Safe Harbor de-identification must address clinical notes explicitly—and that requires natural language understanding, not field-level masking.
Regulatory exposure: what HIPAA, GDPR, and CPRA require
All three major privacy frameworks treat unstructured data as in scope. None of them exempts data simply because it sits in a messy format.
| Regulation | Unstructured Data Obligations | De-identification Standard |
| --- | --- | --- |
| HIPAA | PHI in any form, including notes, recordings, and scanned documents, must be protected or de-identified | Safe Harbor (remove 18 identifiers) or Expert Determination |
| GDPR | Personal data in any format is in scope; "pseudonymization" reduces risk but doesn't remove obligations; true anonymization does | No prescribed method; must be "irreversibly anonymized" to exit scope |
| CPRA | Personal information regardless of format; includes inferences drawn from unstructured content | No prescribed standard; must not be "reasonably linked" to a consumer |
One practical implication: if your organization uses AI tools trained on historical data, and that historical data includes unstructured PHI or PII, you may have created a compliance obligation you haven't yet addressed. De-identified training data isn't just a legal nicety—it's increasingly a requirement for responsible AI deployment.
How automated de-identification solves the problem
Manual review of unstructured data doesn't scale. A team of reviewers reading clinical notes, flagging PII in call transcripts, or scrubbing emails one by one is both expensive and inconsistent. Human reviewers miss things; automated tools miss different things—but at scale, only automation is operationally viable.
Effective automated de-identification for unstructured data requires:
Named entity recognition (NER) tuned for healthcare and financial contexts, not just generic PII types
Contextual understanding—recognizing that "Mr. Johnson in the 2:00 slot" is PHI in a clinical setting even without an explicit patient ID
Multi-format support: plain text, PDF, audio, images, and structured document formats
ASR error tolerance for voice and transcript data
Configurable replacement options: redaction, pseudonymization, synthetic substitution, or tokenization depending on the use case
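The replacement options in the last item can be sketched in a few lines of Python. This assumes an upstream detector has already produced entity spans as `(start, end, label)` tuples; the function names, the `TokenVault` class, the salt, and the sample sentence are all hypothetical. Synthetic substitution is omitted because it needs a generator of realistic fake values.

```python
import hashlib


def redact(value: str, label: str) -> str:
    """Redaction: drop the value, keep only its type."""
    return f"[{label}]"


def pseudonymize(value: str, label: str, salt: str = "demo-salt") -> str:
    """Pseudonymization: the same input always maps to the same stand-in."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:8]
    return f"{label}_{digest}"


class TokenVault:
    """Tokenization: reversible, but only for whoever holds the vault."""

    def __init__(self) -> None:
        self._forward = {}
        self._reverse = {}

    def tokenize(self, value: str, label: str) -> str:
        if value not in self._forward:
            token = f"<{label}:{len(self._forward) + 1}>"
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token: str) -> str:
        return self._reverse[token]


def apply_replacements(text: str, entities: list, replace) -> str:
    """Apply `replace` to each span, working right to left so offsets stay valid."""
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + replace(text[start:end], label) + text[end:]
    return text


# Hypothetical detector output for a short note.
note = "Johnson, DOB 03/15/1962, called about her chart."
entities = [(0, 7, "NAME"), (13, 23, "DOB")]

vault = TokenVault()
print(apply_replacements(note, entities, redact))
print(apply_replacements(note, entities, pseudonymize))
print(apply_replacements(note, entities, vault.tokenize))
```

Which strategy fits depends on the use case: redaction is the simplest and least reversible, pseudonymization preserves the ability to link records about the same person, and tokenization allows authorized re-identification. A salted hash of a low-entropy value such as a date of birth can be brute-forced, so a production pseudonymization scheme would need stronger key management than this sketch shows.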
The goal isn't just compliance—it's enabling your data to be used. De-identified unstructured data can feed AI training pipelines, analytics tools, and research programs that would otherwise require prohibitive data access controls.
Ready to find what's hiding in your unstructured data?
Unstructured data is where most enterprise PII risk lives, and where most data governance programs have the widest gaps. Limina's de-identification platform is built specifically to handle the formats, languages, and contextual complexity of real-world unstructured data, including clinical notes, ASR transcripts, PDFs, and email archives. On physician conversation data, it achieves 99.5 percent accuracy.
See what Limina finds in your data: get a demo at getlimina.ai/en/contact-us