PDF Redaction

The Hidden Problem With Redacting Scanned PDFs: If OCR Can't Read It, Your Tool Can't Find It

Image-based PDFs — scans, faxes, phone photos — are a different redaction problem from native PDFs. The detector is not looking at the document. It is looking at an OCR transcription of the document, and every error in that transcription is a sensitive field your tool will silently miss.

Published April 11, 2026 · 15 min read
RedactVault Support
pdf redaction · scanned pdfs · ocr · image pdf · document security · redaction failure

A paralegal is working through a scanned legal exhibit — an old set of loan documents from 2003, faxed and rescanned so many times that the bottom of every page has a grey halo. She runs the PDF through a redaction tool and asks it to find every Social Security number, account number, and date of birth in the file. The tool returns a clean result. No sensitive information found.

She marks a few names manually, applies the redactions, exports the file, and sends it out. A week later, opposing counsel's paralegal opens the same file, zooms in on page 12, and can read a customer's full SSN plainly, right under the header. The bars never covered it. The tool never flagged it. And technically, the tool was not wrong — it had no idea the SSN was there.

This is the failure mode that matters most on scanned and image-based PDFs, and almost nobody writes about it honestly. The image on the page is one thing. The text the software can find is another thing entirely. Whenever those two drift apart, your redactor is flying blind, and so are you.

Why image-based PDFs are a different problem

Most people think of a PDF as a single thing. It is not. There are at least two very different kinds of PDF file hiding under the same extension.

A native PDF is the kind you get when you export from Word, print-to-PDF from a browser, or generate one programmatically. The text on every page is stored as real text objects — actual characters, associated with fonts, positioned at specific coordinates. Your PDF reader draws them. Your search box finds them. Your redactor can see them and operate on them directly.

An image-based PDF is the kind you get when you scan a document on a photocopier, take a photo of a page with your phone, or save an inbound fax. The "page" is just a picture. There is no text layer at all, or there is a text layer that was bolted on afterwards by an OCR step. The characters you see are pixels, not letters.

The difference matters enormously for redaction. On a native PDF, the detector searches the text objects directly. There is no translation layer. If the string "SSN: 123-45-6789" appears on the page, the detector can find it character by character because those characters exist as discrete objects inside the file.

On an image PDF, the detector has no text objects to search. The software has to look at a picture of text and guess what the text says. That guess is called OCR (optical character recognition), and every detection the tool makes after that is downstream of the guess. If OCR guessed wrong, the detection is wrong, and the tool will not know.

OCR, briefly, so the rest of this post makes sense

OCR turns a picture of text into text. That is the whole job. Modern OCR engines are impressively good at it when conditions are right. On a clean, straight, well-lit scan of a laser-printed document at 300 DPI, a decent engine will read the page with very high accuracy — often 99% or better at the character level.

That 1% error rate sounds small until you realise what it means at document scale. A typical page carries 2,000 to 3,000 characters. At 99% accuracy, that is 20 to 30 mis-read characters per page. On a 50-page document, that is more than a thousand errors. Most of those errors do not matter — a mis-read comma or a space in the wrong place will not leak anything. But some of them matter enormously, and those are the ones your redactor is going to quietly miss.
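The arithmetic above is worth making concrete. A minimal sketch, using only the figures already quoted:

```python
def expected_ocr_errors(chars_per_page: int, accuracy: float, pages: int) -> int:
    """Expected number of mis-read characters at a given character-level accuracy."""
    return round(chars_per_page * (1 - accuracy) * pages)

print(expected_ocr_errors(2500, 0.99, 1))   # ~25 errors on a single typical page
print(expected_ocr_errors(2500, 0.99, 50))  # ~1250 errors across a 50-page document
```

Even a fraction of a percent of those errors landing inside a structured identifier is enough to break pattern detection on that identifier.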

The image quality problems that break OCR

Almost everything that makes OCR hard is upstream of OCR itself. It is about the image. If the image is bad, the OCR output is bad, and no amount of clever redaction software downstream can fix it. Here are the ways image quality goes wrong, in roughly the order you are likely to encounter them in a real workflow.

Low DPI scans (under 200)

Scanners and multifunction copiers make no promises about default resolution, and the default is frequently too low for reliable OCR. Plenty of people scan at 150 DPI because the file size is smaller, or because that was the last setting somebody else used. At 150 DPI, character shapes start to blur into each other. A 6 and an 8 can look nearly identical. A lowercase "rn" and an "m" are often indistinguishable. A period becomes a speck of noise.

OCR engines work with pixel patterns. When those patterns are too small or too blurry, the engine has to pick a best-guess character. It will pick one, confidently, and your detector will trust it.

The rule of thumb: OCR wants 300 DPI minimum for standard body text, and 400 to 600 DPI for small print (fine print in contracts, footnotes, the text on standard government forms). Anything less is a coin toss on small characters.
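To see why the thresholds sit where they do, convert DPI into pixels per glyph. Font sizes are measured in points (1 pt = 1/72 inch), so a rough sketch of how much pixel information OCR actually gets per character:

```python
def glyph_height_px(font_size_pt: float, dpi: int) -> int:
    """Approximate pixel height of a glyph: points are 1/72 of an inch."""
    return round(font_size_pt / 72 * dpi)

print(glyph_height_px(10, 150))  # ~21 px -- marginal for telling a 6 from an 8
print(glyph_height_px(10, 300))  # ~42 px -- comfortable for body text
print(glyph_height_px(7, 150))   # ~15 px -- fine print at low DPI is a coin toss
```

The exact numbers depend on the font, but the shape of the problem is clear: halving the DPI halves the pixels available for every distinguishing stroke.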

Skew and rotation

A scan that is tilted a few degrees off straight is a problem. Some engines auto-deskew, some do not. A tilted line of text means the engine has to fit each character to its expected position at an angle, and small errors in the angle translate into large errors in which pixels belong to which character. Tables and forms get hit the hardest, because the line structure gives OCR a spatial anchor that a tilted scan destroys.

The same goes for pages that are rotated 90 degrees because someone fed them in sideways. Most engines notice and rotate. Not all of them do. If the engine misses it, the OCR output is complete gibberish, and a tool that "searches for PII" on gibberish will return a clean result.

JPEG artefacts and phone photos

There is a growing category of image-based PDFs that were never scanned at all. Somebody photographed the document with a phone, opened the image in a photo app, exported to PDF. The result is a JPEG buried inside a PDF wrapper, usually heavily compressed, sometimes at odd angles, almost always with uneven lighting.

JPEG compression damages text specifically. It is a compression format tuned for photographs, where small errors in colour gradients are invisible to the eye. Text has sharp black-white transitions everywhere, which is exactly the pattern JPEG is worst at. The edges of each character get fuzzed out, and OCR starts making creative guesses.

If the PDF you are about to redact came from a phone, assume OCR quality is a problem until proven otherwise.

Faded ink, carbon copies, and old faxes

Legal, medical, and financial work is full of documents from before everything was born-digital. Decades-old contracts, faxed medical records, carbon copies of tax returns. The ink is faded. The paper is tinted. The contrast is bad, and whatever the scanner did to normalise contrast usually made it worse.

On these documents, OCR is not going to return high-confidence results. It will return something, and that something will often look plausible, but character-level accuracy drops off a cliff. A rescan at higher contrast settings sometimes helps. Sometimes nothing helps, and the only safe move is to treat the document as if OCR cannot be trusted at all and do a full manual sweep.

Handwriting

Most general-purpose OCR engines will not attempt to read handwriting. They will either skip it entirely or return an empty string. A handwritten signature, a handwritten note in the margin, a handwritten Social Security number on an intake form — none of that shows up in the OCR text layer.

If your detector is searching the OCR text, none of that will be flagged. The detector will report a clean scan. The handwritten information is still sitting on the page, perfectly readable by any human who opens the file.

Specialised handwriting OCR engines exist, and cloud OCR services have been getting better at this in recent years, but you should not assume your redaction tool is using one. Most are not.

Stamps, signatures, and watermarks over text

When a signature stamp or a "CONFIDENTIAL" watermark crosses over printed text, OCR usually drops the underlying text entirely or reads it as garbage. The pixels are a mess of overlapping elements, and the engine cannot separate them.

This is a common pattern on court filings and medical records. Every page has a filed-stamp or a Bates number. The OCR engine reads the stamp and skips the text underneath. If the skipped text contained a name or an identifier, that information is now invisible to any tool that works on the text layer.

Multi-column layouts and tables

OCR engines try to identify the reading order of a page — which chunk of text comes first, which comes next. On simple single-column prose, they do fine. On multi-column layouts, newspaper-style formatting, tables with merged cells, or forms with complex structure, they often flow the text in the wrong order.

The failure mode this creates for redaction is subtle. The individual characters are correct. The words are correct. But a name from column 2 ends up glued to an unrelated sentence from column 1, and a detector that relies on context (NER models especially) no longer sees the name as a name. It sees it as part of an unrelated phrase. The rule does not fire.

Character confusion: the failure you will not see coming

Of all the ways OCR can go wrong, the most dangerous for redaction is the one that looks correct. Character confusion.

There is a short list of characters that modern OCR engines routinely mix up, even on reasonably clean documents:

  • 0 and O and D and Q
  • 1 and l and I and |
  • 5 and S
  • 8 and B
  • 6 and G
  • 2 and Z
  • rn and m
  • cl and d
  • . and , and '

Now look at what happens when these confusions hit a PII pattern. A Social Security Number is XXX-XX-XXXX — three digits, two digits, four digits. A detector looking for SSNs is typically running a regex like \d{3}-\d{2}-\d{4}, or something smarter that validates the number structure.

If OCR read "123-45-6789" as "l23-4S-67B9", the regex sees a string that is no longer all digits. No match. No detection. No redaction. And the printed page still shows 123-45-6789 clearly for anyone who opens the file.
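You can watch this failure happen in a few lines. A minimal demonstration with Python's `re` module, using the strict SSN pattern described above:

```python
import re

# The kind of pattern a typical SSN detector runs on the text layer.
SSN_STRICT = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

clean   = "SSN: 123-45-6789"   # what the printed page actually shows
mangled = "SSN: l23-4S-67B9"   # what OCR handed to the detector

print(bool(SSN_STRICT.search(clean)))    # True  -- detected, redacted
print(bool(SSN_STRICT.search(mangled)))  # False -- silently missed
```

Both strings describe the same printed digits. Only one of them is visible to the detector.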

Account numbers, routing numbers, medical record numbers, ICD codes, NPI numbers — every structured identifier your detector is looking for runs on the assumption that the text it is reading matches the pattern it expects. OCR confusion breaks that assumption silently.
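One partial mitigation is to search with a deliberately looser pattern that treats the common confusion pairs as digits, and route every hit to a human for review rather than auto-redacting. A hedged sketch — the confusable set below is illustrative, not exhaustive, and the looser pattern will produce false positives by design:

```python
import re

# Characters OCR routinely substitutes for digits (subset of the list above).
D = r"[0-9OoDQlI|SsBbGgZz]"
SSN_FUZZY = re.compile(rf"\b{D}{{3}}-{D}{{2}}-{D}{{4}}\b")

for text in ("123-45-6789", "l23-4S-67B9", "meeting-at-noon"):
    print(text, "->", bool(SSN_FUZZY.search(text)))
# 123-45-6789 -> True
# l23-4S-67B9 -> True
# meeting-at-noon -> False
```

This trades precision for recall: it catches the mangled SSN the strict regex missed, at the cost of flagging some non-identifiers, which is the right tradeoff when the fallback is a human reviewer.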

What makes this particularly dangerous is that the page looks fine. A human reading the exported PDF will not notice that the OCR layer has errors. They will see the correct digits, because the digits on the page are still correct. The failure is invisible unless somebody deliberately compares the OCR text layer against the image, character by character, which nobody does.

Why automated detection silently fails on image PDFs

Put the pieces together. An image PDF goes into a redaction tool. The tool runs OCR on each page and gets back a text layer. The detector runs its patterns and rules on that text layer. It finds whatever it finds and flags it.

Every single assumption in that pipeline is conditional on OCR being correct. When OCR is wrong, the tool is not looking at the document — it is looking at a distorted summary of the document. The tool reports what it found in that summary. It cannot report what it failed to find, because it does not know what it failed to find.

This is why a scanned PDF that comes back with "no sensitive information detected" is a red flag, not a green light. On a well-prepared native PDF, a clean result means the detector searched the real text and found nothing. On an image PDF, a clean result often means the detector searched an imperfect OCR transcription and found nothing there — which tells you absolutely nothing about what is actually on the page.

There is a counterintuitive corollary to this. Adding an OCR layer to an image PDF can actually make it less safe, not more. The raw image, with no text layer at all, forces every human and every tool to treat the document visually. The moment you add an OCR layer, detection pipelines start trusting it. Now the tool is confidently redacting an imperfect transcription while the real image sits untouched underneath. If the OCR layer is bad and nobody knows, the document is more dangerous after OCR than before it, because everyone downstream now believes the detector has "checked" the page.

How to tell if your PDF is image-based in the first place

Before you trust any redaction result on a PDF, you should know which kind of PDF you are working with. There is a one-second test.

Open the PDF in any reader. Try to select a word on the page by click-and-dragging. If the cursor highlights the word cleanly and you can copy it to the clipboard, the page has a real text layer. If nothing selects, or you can only draw a rectangle selection (like selecting an image), the page is an image.

A more informative version: try to select a whole paragraph. If the selection follows the text neatly, you have a native PDF. If the selection jumps around, highlights in blocks that do not match the visible lines, or picks up extra characters that are not visible, you have an image PDF with an OCR text layer bolted on. That text layer is the OCR output — and it is the only thing your redaction tool is going to see.
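The same test can be automated across a whole document. The sketch below assumes you have already pulled per-page text with a PDF library of your choice (for example `pypdf`'s `page.extract_text()`); the triage itself is a simple heuristic, and the character threshold is an assumption to tune, not a standard:

```python
def classify_page(extracted_text: str, min_chars: int = 25) -> str:
    """Rough triage: 'native-or-ocr' if a text layer exists, 'image-only' if not.

    Note: this cannot distinguish a true native text layer from an OCR layer
    bolted onto a scan -- both come back as selectable text. It only flags the
    unambiguous case where there is no text layer at all.
    """
    return "native-or-ocr" if len(extracted_text.strip()) >= min_chars else "image-only"

pages = ["Section 4.2 Loan Terms. The borrower agrees to the following...", "", "  \n "]
print([classify_page(p) for p in pages])
# ['native-or-ocr', 'image-only', 'image-only']
```

Run this over every page, not just the first one — mixed documents are common, and a single image-only page in an otherwise native PDF needs the image-PDF workflow.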

What to actually do about it

Knowing the problem exists is the first half. The second half is a practical workflow for getting the redaction right despite it.

Step 1: Improve the image before you redact

If the document is bad, fix the image before running OCR. The better the image, the better the OCR, the better everything downstream. Concretely:

  • Rescan at 300 DPI or higher if you have access to the original. 400 DPI for documents with small print. This is the single highest-impact thing you can do.
  • Straighten skewed pages — most scanning software has a deskew option, and it is worth turning on.
  • Increase contrast on faded documents. Most scanners have a brightness/contrast dial for a reason.
  • Do not save as JPEG if you have a choice. PNG or a lossless PDF/A format keeps character edges sharp. JPEG is for photographs, not text.
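If you are scripting the cleanup, the list above maps onto a small set of image operations. The sketch below only constructs a command line for ImageMagick (assumed installed; `-colorspace` and `-deskew` are documented ImageMagick options, but verify the flags against your installed version before relying on them):

```python
from pathlib import Path

def cleanup_command(src: Path, dst: Path, deskew: bool = True) -> list[str]:
    """Build an ImageMagick command: grayscale, optional deskew, lossless output."""
    cmd = ["magick", str(src), "-colorspace", "Gray"]
    if deskew:
        cmd += ["-deskew", "40%"]  # 40% is the threshold commonly used in examples
    cmd.append(str(dst))           # a .png extension keeps character edges sharp
    return cmd

print(cleanup_command(Path("page12.jpg"), Path("page12.png")))
# Execute with: subprocess.run(cleanup_command(...), check=True)
```

This does not substitute for rescanning — it cannot add detail the original capture never had — but it can stop a marginal scan from getting worse on the way into OCR.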

If you do not have access to the original, you are stuck with whatever image was handed to you. You can still re-OCR (see Step 2), and you can still do the manual sweep (see Step 3).

Step 2: Re-OCR with a better engine if you have any doubt

The OCR that came with your PDF was whatever engine the original scanner or software bundled. It may be years old. It may be the cheapest available option. It may have been optimised for speed rather than accuracy.

Modern OCR engines vary wildly in quality. Tesseract (the open-source standard) has improved dramatically in recent versions, and the current release is far better than what shipped in older tools. Cloud OCR from Google, Microsoft, and Amazon produces noticeably better output on difficult documents, though it means the file leaves your device — which is a tradeoff you should weigh carefully for confidential material.

If the OCR text layer in the PDF you are about to redact looks obviously wrong when you select and copy it, replace it. Re-run OCR with a better engine and use the new text layer for detection.
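If you script this step, `ocrmypdf` (assumed installed, with Tesseract available) is a common tool for exactly this job: its `--redo-ocr` mode discards the existing text layer and re-recognises the page images. A hedged sketch that only builds the invocation:

```python
def reocr_command(src: str, dst: str, language: str = "eng") -> list[str]:
    """Build an ocrmypdf call: --redo-ocr replaces an existing OCR text layer."""
    return ["ocrmypdf", "--redo-ocr", "-l", language, src, dst]

print(reocr_command("exhibit.pdf", "exhibit-reocr.pdf"))
# Execute with: subprocess.run(reocr_command(...), check=True)
```

Note that re-OCR keeps the document on your machine, which sidesteps the cloud-OCR confidentiality tradeoff entirely for documents where that matters.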

Step 3: Do a manual visual sweep, always

This is not optional on image PDFs. It is the step that catches everything OCR missed, and OCR will always miss something on an image PDF you did not control.

Go through every page. Look for anything the tool did not flag that should have been flagged: names, numbers, handwritten notes, stamps, marginalia. Draw manual redaction bars over anything you find. It is tedious on long documents, quick on short ones, and unavoidable on both.

The mental model: treat the auto-detection output as a first draft. On image PDFs, it is always a first draft, no matter how good the tool is.

Step 4: Export flattened, always

When you export the redacted file, flatten everything to images. A flattened export converts every page to a picture at export time, with the redaction bars baked into the pixels. No text layer survives. No OCR text layer survives. Nothing underneath the bars survives.

This step is important because any OCR text the tool added will still be sitting inside the file after redaction unless you explicitly remove it. If OCR guessed the SSN wrong and the detector missed it, that mangled-but-mostly-correct transcription is still in the text layer, and a copy-paste from the exported PDF can reveal it. Flattening destroys that layer.

For image PDFs specifically, flattened export is the fail-closed default. It trades a slightly larger file size and non-searchable output for a guarantee that nothing from the text layer leaks.

A verification routine for image PDFs

The standard verification routine for native PDFs is to select text under the bars, search for things that should be gone, and check metadata. On image PDFs, most of those steps do not apply directly, because there is no text to select in the first place. For the native-PDF version, see How to check whether a PDF was redacted securely before sharing it. Here is the adapted version for image PDFs.

  1. Visually inspect every page, slowly. Not a skim. Open the page at 100% zoom or larger and look at it. Anything you can read, somebody else can read. If you can read it and it should not be there, your redaction missed it.
  2. Zoom in on the redaction bars themselves. Are they fully covering the content underneath? On image PDFs, a bar is a shape drawn on top of the image. If it is slightly too small or slightly offset, a sliver of the original text will peek out at the edges. Check every bar.
  3. Select the whole page and copy to clipboard. Paste into a text editor. If anything comes out, there is still a text layer. On a properly flattened image PDF, nothing should paste. If text pastes, you did not actually flatten on export. Re-export with flattening enabled.
  4. Check the file in a different reader. Open it in Chrome or Edge instead of whatever you used to create it. Zoom in. Scroll through. Different readers sometimes render PDFs slightly differently, and a bar that looked correctly placed in one reader occasionally reveals a gap in another.
  5. Check the metadata. Scanned PDFs often carry metadata from the scanner — the scanner model, the software version, sometimes the name of the person logged into the scanning workstation. None of that should go out with the file. A flattened export typically strips it; verify that it did, either by checking the reader's document properties dialog or running the file through exiftool if you have it.
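Steps 3 and 5 of this routine can be partly automated. The sketch below assumes you have already extracted per-page text and a metadata dictionary with your PDF library of choice (e.g. `pypdf`); the metadata keys checked here are illustrative, and the checks are pure logic:

```python
def verify_flattened(page_texts: list[str], metadata: dict) -> list[str]:
    """Return a list of failures; an empty list means the mechanical checks passed.

    This covers only the text-layer and metadata checks (steps 3 and 5).
    The visual inspections in steps 1, 2, and 4 still require a human.
    """
    failures = []
    for i, text in enumerate(page_texts, start=1):
        if text.strip():
            failures.append(f"page {i}: text layer survived export -- not flattened")
    for key in ("Author", "Creator", "Producer"):  # illustrative key names
        if metadata.get(key):
            failures.append(f"metadata field {key!r} still present: {metadata[key]!r}")
    return failures

print(verify_flattened(["", "  "], {}))  # [] -- mechanical checks pass
print(verify_flattened(["123-45-6789"], {"Creator": "ScanStation v2"}))
```

A fail-closed policy follows naturally: refuse to send any file for which this list is non-empty, then do the visual sweep on the rest.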

This routine takes longer than the native-PDF version because step 1 is a real visual inspection, not a mechanical click-through. On a long document, that is hours of work. It is also the only thing standing between you and a public leak, and there is no software shortcut for it.

Where RedactVault fits, and where it does not

RedactVault runs OCR on image PDFs in the browser before the detection step. Two things about how it handles the OCR phase are worth knowing if you are working with scanned documents.

First, it surfaces low-confidence OCR regions instead of hiding them. If the engine was unsure about a block of characters, you see that as a warning on the page, not a silent best-guess. You can decide whether to draw a manual redaction over the whole region rather than trust the character-level output.

Second, the export step flattens by default on image PDFs. The OCR text layer is removed, the redaction bars are baked into the page image, and nothing from the intermediate text layer survives the export. This matches the Step 4 recommendation above, and it means a step you could forget on other tools is not something you have to remember here.

What RedactVault does not do is magically make bad scans readable. If the image is blurry, the OCR is going to be limited by the image, and no tool can invent characters that were never captured. The remedy for a bad scan is a better scan or a manual sweep, not a better redactor. Be honest with yourself about that when you are deciding whether a document is ready to share.

The bottom line

On a native PDF, you can mostly trust your redactor to find what it says it found. On an image PDF, you cannot, and pretending otherwise is how sensitive information leaks.

The reason is not that redaction tools are bad. It is that image PDFs introduce a translation step — OCR — that every downstream decision depends on. When OCR is even slightly wrong, the tool is confidently redacting the wrong file: a distorted transcription of the real document, not the document itself. The human reading the exported PDF still sees the original image, untouched, with whatever OCR missed still plainly visible.

The fix is not to pick a better tool. The fix is to treat every image PDF as if auto-detection is a first draft, and to do the visual sweep yourself. It is slow, boring, and unavoidable.

For the broader picture on why redacted text survives in PDFs in general, read Can redacted text still be recovered from a PDF?. For the technique side — how to apply redactions that hold up on native PDFs — see How to redact a PDF properly so the hidden text is actually gone. And if you want a tool that handles image PDFs with low-confidence OCR warnings and flattens by default on export, try RedactVault.

FAQ

Common questions

How can I tell if my PDF is image-based without technical tools?

Open the file in any PDF reader and try to click-and-drag to select a word. If the word highlights and you can copy it, the page has a real text layer. If nothing selects — or if selecting gives you a rectangle like you are selecting an image — the page is an image. Check several pages, because documents often mix page types.

Does cloud OCR produce better results than local OCR?

Usually yes, especially on difficult documents with skew, low contrast, or handwriting. The large cloud OCR services (Google, Microsoft, Amazon) have invested heavily in their models and typically outperform open-source engines on edge cases. The tradeoff is that cloud OCR sends your file to a third-party server, which is a non-starter for many confidential workflows. Weigh the quality benefit against the privacy cost.

Is it safe to re-OCR a confidential document with a cloud service?

It depends on your compliance context. For privileged legal material, PHI under HIPAA, or anything covered by a client confidentiality agreement, you probably should not. For less sensitive internal documents, the tradeoff may be worth it. If you do use cloud OCR on sensitive material, check whether the provider retains the document, whether they use it for training, and whether you have a Business Associate Agreement or equivalent in place.

What DPI should I scan at for reliable redaction?

300 DPI for standard body text, 400 to 600 DPI for small print like contract footnotes or form fields. Anything under 200 DPI starts to produce unreliable OCR on small characters, which is where most sensitive data lives — account numbers and dates of birth are usually printed smaller than the surrounding body text.

Can I trust an AI-based redactor to catch what regex-based tools miss on scanned documents?

Not entirely, and for the same underlying reason. An AI NER model for PII detection still operates on the OCR text layer, not on the image pixels. If OCR mangled the text, the AI model sees the mangled version and makes decisions based on that. AI models are often better than regex at handling mild OCR noise because they use context, but they are still downstream of the OCR step and inherit every failure it produces.

What if the document is partly native text and partly scanned images?

Treat each section according to what it actually is. The native pages can be redacted normally with standard text-based detection. The scanned pages need the image-PDF workflow: better OCR if available, manual visual sweep, flattened export. Most redaction tools handle mixed documents in a single pass, but the verification step has to cover both modes — do the text-based checks on the native pages and the visual checks on the scanned pages.

RedactVault

Want OCR confidence warnings built into redaction?

RedactVault surfaces low-confidence OCR regions on image PDFs so you can catch what character recognition got wrong — and flattens every export so the text layer never leaves the tool.

Open RedactVault
