Deep Dive: How to Permanently Remove Sensitive Text, Metadata, and Hidden Data From a PDF

A technical walkthrough of every surface inside a PDF file that can carry sensitive data — text streams, metadata dictionaries, bookmarks, annotations, form fields, attachments, JavaScript, and more. Learn how to find them, how to inspect them, and how to actually remove them.

Published April 10, 2026 · 16 min read
RedactVault Support

Most guides on PDF redaction focus on the text you can see on the page. Mark it, black it out, export. That part matters, obviously. But a PDF file is not just the visible pages. It is a structured container that can hold dozens of different data types, many of which never appear on screen, and most of which survive a standard redaction pass completely untouched.

This post is a deep dive into what is actually inside a PDF file, where sensitive data hides, and how to get rid of it. By the end, you will understand the file format well enough to look at any PDF and know which surfaces you need to check — not because someone told you a checklist, but because you understand why each surface exists in the first place.

What a PDF actually is, structurally

A PDF is not a picture. It is not a single blob of data. It is a collection of numbered objects — typically hundreds or thousands of them — connected by an internal reference system. Think of it like a small database packaged into a single file. Each object has a type and a purpose: one object might define a font, another might hold the text for page three, another might store a thumbnail image, and another might contain the document title.

When you open a PDF, your reader does not start at the top and read downwards. It starts at the end of the file, where a startxref entry gives the byte offset of the cross-reference table (called the xref table), and uses that table to look up every object by number. The xref table is essentially an index. It says "object 14 starts at byte 28403, object 15 starts at byte 29110" and so on. Your reader jumps to whichever objects it needs for the page you are looking at.
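The lookup from the end of the file can be sketched in a few lines of Python. This is a minimal illustration, assuming a classic cross-reference table rather than a cross-reference stream; the sample bytes and the function name find_startxref are hypothetical and the sample is not a complete, valid PDF.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    # Readers look near the end of the file, so search from the right.
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


# Hypothetical, heavily simplified file body (illustration only).
SAMPLE = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
)
# Record where the xref keyword sits, then append the pointer to it.
XREF_OFFSET = SAMPLE.index(b"xref")
SAMPLE += b"startxref\n" + str(XREF_OFFSET).encode("ascii") + b"\n%%EOF\n"

offset = find_startxref(SAMPLE)
print(offset, SAMPLE[offset:offset + 4])  # the bytes at that offset are the xref keyword
```

The point of the sketch is the direction of travel: the reader trusts the last pointer in the file, so anything an older pointer referenced can still be present without ever being displayed.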

This architecture has a consequence that matters enormously for redaction: deleting something from the visible page does not necessarily delete the object that contained it. The object might still be sitting in the file, unreferenced but intact. And the cross-reference table might still point to it, or a previous version of the cross-reference table might. The data is gone from the screen. It is not gone from the file.

The content stream: where visible text lives

Each page in a PDF has at least one content stream — a sequence of drawing instructions that tells the reader what to render. These instructions are written in a simple postfix language. A typical sequence might look something like this internally:

BT /F1 12 Tf 72 720 Td (Jane Smith) Tj ET

That reads: begin a text block, set the font to F1 at 12 points, move to position (72, 720) on the page, draw the string "Jane Smith", end the text block. Every piece of visible text on every page is produced by instructions like this. When a proper redaction tool removes text, it needs to find and modify or remove the Tj operator (or its relatives TJ, ', and ") that draw the sensitive text. Just drawing a black rectangle on top of that text, which is all that many "redaction" workflows actually do, adds a new drawing instruction after the text instruction. Both instructions remain in the stream.

A genuine redaction tool needs to parse the content stream, identify the text-drawing operators that produce the sensitive characters, and either remove them or replace the string data with blank content. This is harder than it sounds, because content streams can use different text encodings, embed text across multiple operators, and reference fonts that remap character codes. But it is the non-negotiable starting point.
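A small Python sketch makes the difference between covering and removing visible. It assumes the simplest possible case: an uncompressed content stream, literal strings, and only the Tj operator. The function name shown_strings is hypothetical, and a real parser also has to handle TJ arrays, hex strings, and font encodings.

```python
import re

# Matches a simple literal string followed by the Tj operator.
# Escaped characters such as \( and \) are allowed inside the string.
TJ_PATTERN = re.compile(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj")


def shown_strings(content_stream: bytes) -> list[str]:
    """Return every literal string that a Tj operator would draw."""
    return [m.group(1).decode("latin-1") for m in TJ_PATTERN.finditer(content_stream)]


original = b"BT /F1 12 Tf 72 720 Td (Jane Smith) Tj ET"

# "Cover" approach: append a black filled rectangle over the text.
covered = original + b"\n0 0 0 rg 70 715 80 16 re f"

# "Remove" approach: blank the string itself, then draw the rectangle.
removed = b"BT /F1 12 Tf 72 720 Td () Tj ET\n0 0 0 rg 70 715 80 16 re f"

print(shown_strings(covered))  # the name is still in the stream
print(shown_strings(removed))  # only an empty string remains
```

Both variants look identical on screen. Only the second one changes what text extraction, search, and copy and paste return.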

Surface 1: The Info dictionary

Most PDFs carry an Info dictionary. It is optional in the specification but almost always present. This is a single object that stores basic metadata: Title, Author, Subject, Keywords, Creator (the application that made the document), Producer (the library that generated the PDF), and creation/modification dates. It can also contain custom keys; some organisations add fields like CaseNumber or ClientName.

The Info dictionary is not visible on any page. You have to open File then Properties (or Document Info) in your PDF reader to see it. Because it is invisible during normal reading, it is almost always forgotten during redaction.

Leak risk: high. A document titled "Deposition of Maria Gonzalez — Case 2024-CV-4421" with the name "Maria Gonzalez" carefully redacted on every page still carries her name in the Title field. This pattern shows up constantly in court filings.

How to inspect: File then Properties in most readers. In Acrobat, the Description tab. In Chrome or Edge, the document properties panel. On Mac, open in Preview and press Cmd+I.

Surface 2: The XMP metadata stream

Here is where it gets sneaky. In addition to the Info dictionary, most modern PDFs also contain an XMP (Extensible Metadata Platform) stream. This is a separate object that stores metadata in XML format. It was introduced by Adobe in the early 2000s and has been the preferred metadata format since PDF 1.4. The critical thing to understand is that the XMP stream and the Info dictionary are two different copies of the metadata. They are supposed to be kept in sync, but they do not have to be, and tools that clean one sometimes miss the other.

The XMP stream can contain everything the Info dictionary has, plus a lot more: edit history, previous document titles, contributor lists, rights management information, and custom namespace extensions that applications can define freely. Some PDF generators embed the full revision history of the document title in XMP — so even if you changed the title before redacting, the earlier version might still be in the XMP block.

Leak risk: high. Many sanitization tools clear the Info dictionary but leave the XMP stream intact, or vice versa. The result is a document that looks clean in the Properties dialog but still carries the original metadata in a different location within the file.

How to inspect: Most PDF readers show the Info dictionary fields but do not expose the raw XMP. To see the XMP, you need a tool that reads it directly. On the command line, exiftool document.pdf will dump both the Info dictionary and the XMP fields. In Acrobat Pro, check File then Properties then Additional Metadata. Look for any field that contains information you intended to redact.
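To illustrate why the two metadata copies have to be checked separately, here is a minimal Python sketch that reads the title from each location in raw bytes. It assumes an unencrypted file whose Info dictionary uses a literal string and whose XMP packet is stored uncompressed. The sample fragment and the function names info_title and xmp_title are hypothetical, and the fragment is not a valid PDF.

```python
import re
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def info_title(data: bytes):
    """Return the first /Title literal string found in the raw bytes, or None."""
    m = re.search(rb"/Title\s*\(((?:[^()\\]|\\.)*)\)", data)
    return m.group(1).decode("latin-1") if m else None


def xmp_title(data: bytes):
    """Return the first dc:title value from an uncompressed XMP packet, or None."""
    m = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data, re.DOTALL)
    if m is None:
        return None
    root = ET.fromstring(m.group(0))
    li = root.find(f".//{DC}title/{RDF}Alt/{RDF}li")
    return li.text if li is not None else None


# Hypothetical fragment: the Info title was cleaned, the XMP title was not.
SAMPLE = b"""1 0 obj
<< /Title (Untitled) /Producer (ExampleWriter) >>
endobj
2 0 obj
<< /Type /Metadata /Subtype /XML >>
stream
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Deposition of Maria Gonzalez</rdf:li></rdf:Alt></dc:title>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
endstream
endobj
"""

print("Info title:", info_title(SAMPLE))
print("XMP title: ", xmp_title(SAMPLE))
```

A mismatch between the two values is the signal to look for: it means one copy was edited and the other was left behind.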

Surface 3: Bookmarks and the document outline

PDF bookmarks (formally called the "document outline") are a tree of named links that appear in a side panel. They are meant for navigation — click "Chapter 3" in the bookmarks panel and you jump to page 14. The names are stored as string objects inside the PDF, completely separate from the page content.

The leak scenario is straightforward. A contract has a section titled "Terms Specific to Acme Corporation." The name on the page gets redacted. The bookmark still says "Terms Specific to Acme Corporation." The bookmark panel is collapsed by default in most readers, so no one notices until the file is already shared.

Leak risk: medium. Not every PDF has bookmarks, but documents generated by Word, legal document management systems, and report generators often do. The longer and more structured the document, the more likely it has bookmarks with meaningful titles.

How to inspect: Open the bookmarks or navigation panel in your PDF reader. In Acrobat, click the bookmark icon in the left sidebar. In Chrome and Edge, click the sidebar toggle and look for an outline or table of contents panel. Expand every node and read every title.

Surface 4: Annotations and comments

PDF annotations are objects attached to specific locations on a page. The most familiar types are sticky notes, highlights, and text comments, but the PDF specification defines over 25 annotation types, including: text markup (highlights, underlines, strikeouts), stamps, ink drawings, file attachment annotations, sound annotations, movie annotations, widget annotations (form fields — more on those next), and redaction annotations.

Yes, "redaction annotations" are an actual annotation type in the PDF spec. In Acrobat's redaction workflow, the first step (marking text for redaction) creates redaction annotations. At that point each annotation only marks a region of the page. The original text is still in the content stream underneath it, and depending on the tool and its settings a copy of the marked text may also be stored in the annotation object. The second step (applying the redactions) is what actually removes the underlying text and flattens the annotations. If you skip the second step, or if you "redact" using a tool that only creates the visual appearance of a redaction annotation without the apply step, the original text is still in the file.

Beyond redaction annotations, ordinary review comments are also a leak vector. A sticky note that says "Check with Jane about this paragraph" reveals a name. A highlight comment that quotes the original text defeats the redaction of that text. Review annotations are often left in documents because the person doing the redaction was not the person who did the review, and no one thought to check.

Leak risk: high. Annotations are page-associated objects that most redaction tools ignore entirely. The Acrobat two-step trap alone makes this one of the most common sources of real-world redaction failures.

How to inspect: Open the comments panel in your PDF reader. In Acrobat, click the speech-bubble icon in the right sidebar, or use View then Comments. In Preview on Mac, choose View then Show Markup Toolbar and look for annotation indicators on each page. Scroll through every annotation on every page.

Surface 5: Form fields and AcroForm data

Interactive PDF forms (the AcroForm system) store field definitions and their values as objects in the PDF. A text field has a current value (what the user typed) and optionally a default value (what the field shows before the user types anything). Both are stored in the field dictionary. If someone filled in a form and then "redacted" it by flattening the form — turning the fields into static text — the field data may still exist in the PDF objects even though the form is no longer interactive. The /V (value) and /DV (default value) entries in the field dictionary do not disappear just because the form appearance was flattened.

There is a subtler variant too. Some PDF forms use JavaScript to calculate or format field values. The scripts themselves can contain hardcoded references, validation patterns that reveal the expected format of sensitive data, or string constants that should have been redacted.

Leak risk: medium. Applies mainly to PDFs that originated as fillable forms. If the document was never a form, this surface does not exist. But government documents, medical intake forms, financial applications, and HR paperwork are frequently PDF forms, and these are exactly the kinds of documents people need to redact.

How to inspect: In Acrobat Pro, open the Prepare Form tool to see all field definitions. In a text editor or hex viewer, search for /V and /DV entries in the raw file; this only finds them when the field objects are not stored inside compressed object streams. On the command line, qpdf --json document.pdf | grep -i "/V" can surface form values in the object structure regardless of how the objects are stored.
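The raw search for /V and /DV can be sketched in Python as follows. This is an illustration under stated assumptions: the field dictionaries are uncompressed and unencrypted, and the values are literal strings. Hex strings, UTF-16 values, and fields packed into object streams would be missed. The sample bytes and the function name field_values are hypothetical.

```python
import re

# /V or /DV followed by a literal string. The leading slash must directly
# precede the key, so names such as /Version are not matched.
FIELD_VALUE = re.compile(rb"/(V|DV)\s*\(((?:[^()\\]|\\.)*)\)")


def field_values(data: bytes) -> list[tuple[str, str]]:
    """Return (key, value) pairs for every /V and /DV literal string found."""
    return [
        (m.group(1).decode("ascii"), m.group(2).decode("latin-1"))
        for m in FIELD_VALUE.finditer(data)
    ]


# Hypothetical field dictionary left behind after a form was "flattened".
SAMPLE = b"<< /FT /Tx /T (phone) /V (555-0100) /DV () >>"

for key, value in field_values(SAMPLE):
    print(key, repr(value))
```

Any non-empty value this turns up on a document that is supposed to be flattened deserves a closer look with a PDF-aware tool.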

Surface 6: Embedded file attachments

PDFs can carry other files inside them, like a zip archive. These are called embedded file streams, and they can be anything: spreadsheets, earlier drafts of the same document, images, other PDFs. They appear in a panel (usually called Attachments) in the PDF reader, or they might be associated with specific annotations on a page.

The scenario that causes problems: a law firm prepares a redacted version of a contract and saves it as a PDF. The original Word file, or an earlier unredacted PDF draft, is embedded as an attachment because someone used the "attach file" feature during the review phase and forgot to remove it. The redacted PDF ships with the unredacted original tucked inside it.

Leak risk: medium to high. When it happens, the damage is total — the attachment is a complete unredacted copy. It does not happen on every document, but document management systems and legal review tools sometimes embed files automatically, and the attachments panel is collapsed by default in most readers.

How to inspect: In Acrobat, check View then Show/Hide then Navigation Panes then Attachments. Look for the paperclip icon in the left sidebar. In other readers, look for an attachments tab or sidebar panel. On the command line, qpdf --list-attachments document.pdf will list all embedded files.

Surface 7: JavaScript

PDFs can contain JavaScript that runs when the document is opened, when a page is viewed, or when a user interacts with a form field. The scripts are stored as string objects in the PDF, associated with either the document catalog (document-level scripts) or specific pages and annotations (page-level and field-level scripts). JavaScript in PDFs can do things like calculate form values, validate input, format dates, submit data to a URL, or trigger actions on page open. The /JS and /JavaScript keys in the PDF object tree hold these scripts.

From a data leakage perspective, JavaScript is a risk for two reasons. First, the scripts themselves can contain string literals — names, URLs, account numbers — that mirror content you redacted on the page. Second, some scripts submit form data to external URLs. A redacted PDF that phones home when opened is a problem most people would never think to check for.

Leak risk: low to medium. Most ordinary documents do not contain JavaScript. But government forms, financial applications, and interactive reports sometimes do. When JavaScript is present and contains sensitive data, it is almost never caught during redaction review.

How to inspect: In Acrobat Pro, open the JavaScript console (Ctrl+J) and check the document-level scripts, or use Edit then Preferences then JavaScript to see if scripts are present. On the command line, searching the raw PDF for /JS or /JavaScript will locate script objects.
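The raw-file search can be sketched in Python like this. It is a rough triage pass, not a parser: it assumes the relevant objects are uncompressed, and it counts name tokens without checking what they belong to. The extra keys beyond /JS and /JavaScript (/OpenAction, /AA, /SubmitForm) are additions for illustration, and the sample bytes and the function name count_keys are hypothetical.

```python
import re

ACTIVE_KEYS = (b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/SubmitForm")

# A PDF name ends at whitespace or a delimiter, so the lookahead rejects
# longer names that merely start with the key (for example /JSON).
NAME_END = rb"(?![^\s()<>\[\]{}/%])"


def count_keys(data: bytes) -> dict[str, int]:
    """Count occurrences of script-related and action-related name tokens."""
    counts = {}
    for key in ACTIVE_KEYS:
        pattern = re.escape(key) + NAME_END
        counts[key.decode("ascii")] = len(re.findall(pattern, data))
    return counts


# Hypothetical fragment with one document-level script.
SAMPLE = (
    b"<< /Type /Catalog /OpenAction 7 0 R /Names << /JavaScript 8 0 R >> >>\n"
    b"<< /S /JavaScript /JS (var account = 12345;) >>"
)

print(count_keys(SAMPLE))
```

A count of zero here is weak evidence, because scripts inside compressed object streams are invisible to a byte search. A non-zero count is a clear prompt to read the scripts.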

Surface 8: Named destinations and link URIs

Named destinations are a navigation mechanism in PDFs. Instead of a link saying "go to page 7 at position (72, 500)," it can say "go to the destination named section-plaintiff-interview." These names are stored in a name tree in the document catalog. They are often auto-generated from heading text during PDF creation, so a heading that says "Interview with Jane Smith" might produce a named destination called interview-with-jane-smith.

Link annotations (clickable links on the page) can also contain URI actions — the actual URL the link points to. A redacted link that visually shows a black bar still has the URL stored in the link annotation object. If the URL itself is sensitive (say, a link to a client portal with a session token in it), the visual redaction does nothing.

Leak risk: low to medium. Named destinations are a niche vector, but they are especially dangerous in legal documents where headings contain names and case numbers. Link URIs are a more common concern in documents that reference external systems.

How to inspect: There is no easy way to inspect named destinations in a standard PDF reader. On the command line, qpdf --json document.pdf dumps the full object tree including the name tree. For link URIs, hovering over clickable areas in any PDF reader will usually show the URL in a tooltip or status bar.
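Because generated names rarely match the on-page spelling exactly, a plain search for "Jane Smith" can miss interview-with-jane-smith. One way around that is to normalize both sides before comparing. The Python sketch below assumes you have already pulled the destination names and link URIs out of the file (for example from the qpdf JSON dump). The function names slug and flag_names and the sample values are hypothetical.

```python
import re


def slug(text: str) -> str:
    """Lowercase the text and collapse every run of non-alphanumerics to a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def flag_names(names: list[str], terms: list[str]) -> list[tuple[str, str]]:
    """Return (name, term) pairs where a redacted term appears inside a name."""
    hits = []
    for name in names:
        for term in terms:
            needle = slug(term)
            # Skip terms that normalize to nothing; they would match everything.
            if needle and needle in slug(name):
                hits.append((name, term))
    return hits


# Hypothetical values extracted from a document.
names = [
    "interview-with-jane-smith",
    "section-2-background",
    "https://portal.example.com/clients/jane_smith?token=abc123",
]
redacted_terms = ["Jane Smith"]

for name, term in flag_names(names, redacted_terms):
    print(f"{term!r} survives in {name!r}")
```

The normalization is deliberately loose. It will produce some false positives on short terms, which is the right trade-off for a pre-release check.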

Surface 9: Incremental saves and the cross-reference history

This is the most technically subtle surface on the list, and the one that surprises people the most.

When a PDF is modified and saved, many applications do not rewrite the entire file. Instead, they append the changes to the end of the file and write a new cross-reference table that points to the updated objects. The old objects — including the ones that were "deleted" or "modified" — remain in the file, and the old cross-reference table still references them. This is called an incremental save.

What this means in practice: if you open a PDF, redact some text, and save using an incremental save, the original unredacted text objects are still physically present in the file. A standard PDF reader will follow the newest cross-reference table and display the redacted version. But anyone who examines the raw file bytes, or uses a tool that reads the older cross-reference tables, can recover the original content.

To eliminate incremental save history, the file needs to be rewritten from scratch so that only objects reachable from the current cross-reference table are written out. This is not the same thing as linearization, which the PDF specification defines as a layout optimization for fast web viewing, but linearizing a file forces a full rewrite as a side effect. qpdf writes a fresh file whenever it produces output, so the command qpdf --linearize input.pdf output.pdf drops unreferenced objects and replaces the incremental cross-reference chain with a newly generated one. A flattened image export also eliminates this problem entirely, because it builds a completely new PDF from rendered page images.

Leak risk: low in isolation, catastrophic in combination. Incremental save recovery requires someone with technical knowledge and intent. But when it works, it recovers the full original content — not just fragments. Forensic analysts and security researchers check for this routinely.

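How to inspect: Open the file in a hex editor, or run a raw byte search, and look for the end-of-file marker %%EOF and the startxref keyword. Each incremental save appends a new cross-reference section, a new startxref pointer, and a new %%EOF, so more than one of each suggests update history. Linearized files can legitimately contain two, so treat the count as a prompt to rewrite the file rather than as proof. Searching for the bare string xref is less reliable, because it also matches startxref and misses files that use cross-reference streams. On the command line, qpdf --show-xref document.pdf prints the cross-reference entries for the current version of the file, which shows where objects live but does not list earlier revisions separately.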
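One rough way to look for update history is to count the markers that every save appends. The Python sketch below is a heuristic under stated assumptions: linearized files can contain two of each marker for unrelated reasons, and a marker that happens to occur inside a stream would inflate the count. The sample bytes (including the placeholder offsets) and the function name revision_markers are hypothetical.

```python
def revision_markers(data: bytes) -> dict[str, int]:
    """Count the trailer markers that each save appends to the file."""
    return {
        "startxref": data.count(b"startxref"),
        "%%EOF": data.count(b"%%EOF"),
    }


# Hypothetical original revision: object 4 holds the sensitive string.
BASE = (
    b"%PDF-1.7\n"
    b"4 0 obj\n(Jane Smith)\nendobj\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 5 >>\n"
    b"startxref\n36\n%%EOF\n"
)

# Hypothetical incremental update: a new version of object 4 is appended.
# The old bytes are not touched.
UPDATE = (
    b"4 0 obj\n()\nendobj\n"
    b"xref\n4 1\n0000000123 00000 n \n"
    b"trailer\n<< /Size 5 /Prev 36 >>\n"
    b"startxref\n141\n%%EOF\n"
)

saved = BASE + UPDATE
print(revision_markers(BASE))
print(revision_markers(saved))
print(b"(Jane Smith)" in saved)  # the superseded object is still in the file
```

The last line is the whole problem in miniature: the update replaces what readers display while leaving the earlier bytes exactly where they were.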

Surface 10: Thumbnails and preview images

Some PDFs contain pre-rendered thumbnail images for each page, stored as separate image objects. These thumbnails are generated at the time the PDF was created or last saved, and they are not automatically updated when the page content changes. If you redact text on a page but the thumbnail was generated before the redaction, the thumbnail shows the original unredacted page.

Most modern PDF readers generate thumbnails on the fly and ignore embedded ones, so this is less of a risk than it used to be. But older readers and some document management systems still use embedded thumbnails, and the data is in the file regardless of whether any particular reader displays it.

Leak risk: low. It is a real vector but a rare one in modern workflows. Still worth stripping during sanitization because the cost of removal is zero.

Which surfaces leak the most in practice

Not all of these surfaces are equally dangerous. Based on what actually shows up in real incidents, here is a rough ranking:

  1. Content stream text (not removed, just covered) — by far the most common failure. The majority of public redaction incidents are this.
  2. Info dictionary and XMP metadata — second most common. Document titles and author names survive almost every redaction workflow that does not include a separate sanitization step.
  3. Annotations (especially unapplied redaction annotations) — third. The Acrobat two-step trap is a significant contributor here.
  4. Bookmarks — fourth. Common in structured documents from legal and government workflows.
  5. Embedded attachments — less frequent, but when it happens the damage is total.
  6. Everything else (form fields, JavaScript, named destinations, incremental saves, thumbnails) — rarer, but each one has appeared in documented incidents.

How the tools handle cleanup

Different tools approach the sanitization problem in fundamentally different ways. Understanding the approach matters, because "I used a redaction tool" does not tell you which surfaces were actually cleaned.

Adobe Acrobat Pro

Acrobat separates redaction from sanitization. The "Mark for Redaction" and "Apply Redactions" workflow handles the content stream — it removes the text under redaction bars when applied. But metadata, bookmarks, attachments, JavaScript, hidden layers, and other structural surfaces are handled by a separate feature: Protection then Remove Hidden Information (or in some versions, the Sanitize Document command).

The Remove Hidden Information dialog shows a list of data categories it can find and remove: metadata, bookmarks, attachments, comments, hidden text, overlapping objects, and more. You check the boxes you want to clean, and it strips them. It works well when you remember to use it. The problem is that it is a separate step with a separate menu, and nothing in the redaction workflow reminds you to do it. Plenty of people apply redactions, save, and share — without ever opening the sanitization panel.

Command-line tools: qpdf and exiftool

These are not redaction tools — they are PDF manipulation and metadata tools. But they are useful for specific cleanup tasks that other tools miss.

qpdf --linearize --remove-unreferenced-resources=yes input.pdf output.pdf rewrites the PDF from scratch, eliminating incremental save history and removing objects that are no longer referenced. This is the standard way to compact a PDF and clean out orphaned data.

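exiftool -all= document.pdf clears the metadata that ExifTool can write, which covers the XMP stream and most Info dictionary fields. The -all= flag means "delete all writable tags." One caveat matters a great deal here: ExifTool edits PDFs by appending an incremental update, so the old metadata is still physically in the file and the change is reversible. Follow it with a qpdf rewrite to drop the superseded objects.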

Neither tool handles bookmarks, annotations, form fields, or JavaScript — you need a PDF-aware tool for those. But for metadata and incremental save cleanup, this combination is hard to beat.
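The order of the two commands matters, because ExifTool writes its PDF changes as an incremental update and the qpdf rewrite is what discards the superseded metadata objects. A minimal Python sketch of that ordering is below. It only builds the argument lists; the helper names cleanup_plan and run_plan and the intermediate file name are hypothetical, and run_plan assumes both tools are installed and on the PATH.

```python
import subprocess


def cleanup_plan(src: str, dst: str, tmp: str = "metadata-stripped.pdf") -> list[list[str]]:
    """Return the two commands in the order they should run."""
    return [
        # Step 1: write a copy with all writable metadata tags deleted.
        ["exiftool", "-all=", "-o", tmp, src],
        # Step 2: rewrite the copy so the superseded objects are dropped.
        ["qpdf", "--linearize", tmp, dst],
    ]


def run_plan(plan: list[list[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for argv in plan:
        subprocess.run(argv, check=True)


plan = cleanup_plan("input.pdf", "output.pdf")
for argv in plan:
    print(" ".join(argv))
# run_plan(plan)  # uncomment once exiftool and qpdf are available
```

Using -o keeps the source file untouched, which is useful when the original has to be retained. ExifTool refuses to overwrite an existing output file, so the intermediate file needs to be removed between runs.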

RedactVault

We build RedactVault, so read this as a description rather than a neutral recommendation. RedactVault treats sanitization as part of the export, not as a separate step. When you export a redacted PDF, the metadata, bookmarks, annotations, form data, and attachments are stripped automatically. There is no separate "sanitize" button to remember because the cleanup happens during the same export that handles the content-stream redaction.

For flattened image exports, this is straightforward — the export builds a brand new PDF from rendered page images, so none of the original structural data carries over. For native exports, the tool specifically removes the Info dictionary values, clears the XMP stream, strips bookmarks, annotations, and attachments, and writes a clean file without incremental save history.

The design choice here is that forgetting to sanitize should not be possible, because there is nothing to forget. Whether that trade-off suits your workflow depends on whether you ever want to preserve metadata or bookmarks intentionally — if you do, an integrated approach that strips everything is not the right fit.

A worked example: inspecting a PDF surface by surface

To make this concrete, here is how you would inspect a redacted PDF before sharing it, checking the main surfaces covered in this post. Assume you have the redacted file ready and a few minutes before it needs to go out.

  1. Content stream. Open the file in a different reader than the one that redacted it. Try to select text under the redaction bars. Search (Ctrl+F) for a term you know was redacted. Press Ctrl+A then Ctrl+C on a redacted page and paste into a text editor. If any original text appears, the content stream was not properly cleaned.
  2. Info dictionary. Open File then Properties. Read Title, Author, Subject, Keywords, and any custom fields. None should contain information you redacted on the page.
  3. XMP metadata. If you have exiftool installed, run exiftool document.pdf and scan the output for any sensitive values. Pay attention to fields like XMP:Title, XMP:Description, XMP:Creator, and any history entries. If you do not have command-line access, check Acrobat Pro's Additional Metadata dialog.
  4. Bookmarks. Open the bookmarks/outline panel. Expand every node. Read every title.
  5. Annotations and comments. Open the comments panel. Scroll through every annotation on every page. Check for sticky notes, highlights with notes, review comments, and especially any that quote text you redacted.
  6. Form fields. If the document was ever a fillable form, check for residual field data. In Acrobat Pro, open the Prepare Form tool. If field definitions still exist with values, those values need to be cleared or the fields need to be removed.
  7. Attachments. Check the attachments panel (paperclip icon in Acrobat, or View then Attachments). If files are embedded, open them and check whether they contain unredacted content.
  8. JavaScript. In Acrobat Pro, open the JavaScript console (Ctrl+J) or search the raw file for /JS and /JavaScript. If scripts exist, read them and check for hardcoded sensitive values or external URLs.
  9. Incremental saves. Search the raw file for %%EOF markers. More than one suggests incremental update history, although linearized files can legitimately contain two. If in doubt, rewrite the file with qpdf --linearize input.pdf output.pdf.

That is nine checks. For a typical document where you have the right tools available, the full inspection takes ten to fifteen minutes. For a document you are nervous about, fifteen minutes is nothing compared to the cost of the leak you would catch.
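Several of these checks come down to the same question: does a term you redacted still appear anywhere in the file? A Python sketch of a last-pass search is below. It looks in the raw bytes and in every stream that inflates with zlib, which covers the common FlateDecode case. It is a hedged illustration: it does not handle encryption, other filters, hex or UTF-16 strings, or text split across operators, so a clean result is not proof. The sample file and the function names searchable_blobs and find_terms are hypothetical.

```python
import re
import zlib

# Capture everything between the stream and endstream keywords.
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def searchable_blobs(data: bytes):
    """Yield the raw file bytes, then every stream body that inflates."""
    yield data
    for m in STREAM.finditer(data):
        try:
            # decompressobj tolerates the line ending left after the data.
            inflated = zlib.decompressobj().decompress(m.group(1))
        except zlib.error:
            continue  # not a Flate stream, or damaged
        yield inflated


def find_terms(data: bytes, terms: list[str]) -> list[str]:
    """Return the terms found in the raw bytes or in any inflated stream."""
    found = set()
    for blob in searchable_blobs(data):
        lowered = blob.lower()
        for term in terms:
            if term.lower().encode("utf-8") in lowered:
                found.add(term)
    return sorted(found)


# Hypothetical file fragment with one compressed content stream.
body = zlib.compress(b"BT /F1 12 Tf 72 720 Td (Jane Smith) Tj ET")
SAMPLE = (
    b"4 0 obj\n<< /Length " + str(len(body)).encode("ascii")
    + b" /Filter /FlateDecode >>\nstream\n" + body + b"\nendstream\nendobj\n"
)

print(find_terms(SAMPLE, ["Jane Smith", "Acme Corporation"]))
```

Compression is why a plain text search of the file so often comes back empty even when the text is still there, and why inflating the streams first is worth the extra step.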

The bottom line

A PDF is not a page. It is a database of objects, and the visible page is just one of the things stored in that database. Metadata, bookmarks, annotations, form values, attachments, JavaScript, named destinations, incremental save history, and thumbnails are all separate storage surfaces, each capable of carrying the exact information you thought you redacted.

The reason most redaction tools fail at this is not incompetence — it is architecture. Redaction tools were designed to modify page content. The other surfaces are outside their scope. Proper sanitization requires either a separate step (Acrobat's Remove Hidden Information), a combination of command-line tools (qpdf + exiftool), or an integrated tool that treats sanitization as part of export.

Now you know where the data hides and what to look for. That knowledge does not go stale: PDF 1.7 was standardized as ISO 32000-1 in 2008, PDF 2.0 kept the same object model, and these surfaces are not going away. Every redacted PDF you handle from now on, you can check with the confidence of someone who understands the file format rather than someone following a checklist they do not fully understand. For the step-by-step verification routine, see How to check whether a PDF was redacted securely before sharing it. For background on why visual-only redactions fail, see Can redacted text still be recovered from a PDF?. If you want a tool that handles the cleanup automatically at export, try RedactVault.

FAQ

Common questions

What is the difference between redaction and sanitization?

Redaction removes specific content from the visible page — names, numbers, sentences. Sanitization removes data from all the other surfaces in the file: metadata, bookmarks, annotations, form fields, attachments, JavaScript, and incremental save history. A properly cleaned PDF needs both. Many tools handle redaction but leave sanitization to the user as a separate step.

Can I just use exiftool to clean a PDF?

Exiftool handles metadata (the Info dictionary and XMP stream) well, but it writes its changes as an incremental update, so the old values stay in the file until you rewrite it with a tool such as qpdf. It also does not touch bookmarks, annotations, form field values, JavaScript, embedded attachments, or the content stream. It is a useful part of a cleanup pipeline, but it is not sufficient on its own.

Does flattening a PDF to images remove all hidden data?

A flattened image export builds a new PDF from rendered page images. Because the new file is constructed from scratch, none of the original structural data — metadata, bookmarks, annotations, form fields, attachments, JavaScript, incremental saves — carries over. It is the most thorough approach, with the trade-off of larger files and loss of text searchability.

How do I know if a PDF has incremental save history?

Search the raw file for the %%EOF marker and the startxref keyword. More than one of each suggests the file has been incrementally updated, and earlier object versions may still be present in the file. Linearized files can contain two for unrelated reasons, so when in doubt, rewrite the file with qpdf --linearize input.pdf output.pdf.

Is the Acrobat "Remove Hidden Information" feature enough?

It covers most surfaces: metadata, bookmarks, annotations, attachments, hidden text, form data, and more. It is a thorough tool when used correctly. The risk is that it is a separate step from the redaction workflow — nothing prompts you to use it after applying redactions, so it is easy to forget. If you remember to run it every time, it handles the job well.
