Scrub, redact, and relax

I was originally going to write about a Python utility that scans PDFs for links to other PDFs and downloads them. But when I tried it on one of my dad’s papers, it gave me this:

    Author = ss999
    CreationDate = D:20081030132020-04'00'
    Creator = PScript5.dll Version 5.2.2
    ModDate = D:20091214160134-05'00'
    Pages = 23
    Producer = Acrobat Distiller 6.0.1 (Windows)
    Title = Microsoft Word - Value-Oct3008.doc

While there aren’t any PDF links in this one, there is some personal metadata. The Author field brazenly exposes pop’s Yale login ID (I’ve obfuscated it here). Another paper revealed that he was still running Windows NT in 2009. 😮
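Those CreationDate and ModDate strings look cryptic, but they are just timestamps with a UTC offset tacked on. Here is a minimal sketch, using only the standard library, of turning one into a Python datetime. The function name `parse_pdf_date` is my own, and it only handles the common `D:YYYYMMDDHHmmSS` layout with an optional offset, so treat it as an illustration rather than a full parser.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional Z, +HH'mm' or -HH'mm' offset.
# Everything after the year is optional.
PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def parse_pdf_date(raw):
    """Parse a PDF date string such as D:20081030132020-04'00'."""
    match = PDF_DATE.match(raw.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()

    tz = None  # no offset given: return a naive datetime
    if sign == "Z":
        tz = timezone.utc
    elif sign in ("+", "-"):
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(delta if sign == "+" else -delta)

    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tz,
    )


print(parse_pdf_date("D:20081030132020-04'00'").isoformat())
# 2008-10-30T13:20:20-04:00
```

So the file records not only who made it, but when, down to the second, and in which UTC offset.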

You don’t need special software for this. All you have to do is right-click on a PDF in OS X and select Get Info.

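And have you ever drawn a black box over some text (like an account number) in Preview in order to redact it? You probably think you’re being super clever, but PDF descends from PostScript, and a page is basically a list of drawing instructions. The box is just one more instruction painted on top; the text underneath is still in the file, where anyone can select, copy, or extract it.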
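To see why the black box doesn’t help, here is a toy sketch in Python. It builds a fake page content stream (not a complete PDF) that draws some text and then fills a black rectangle over it, compresses it the way PDF streams commonly are, and then pulls the text right back out. The account number is made up, and `visible_strings` is a hypothetical helper that only understands the simple `(...) Tj` form; real files can use other text operators and font encodings.

```python
import re
import zlib

# A toy page content stream: draw some text, then paint a black box over it.
content = (
    b"BT /F1 12 Tf 72 720 Td (Account 12345678) Tj ET\n"  # the "secret" text
    b"0 0 0 rg 70 715 140 16 re f\n"                      # black rectangle on top
)

# PDF streams are commonly stored Flate-compressed, which is plain zlib.
stored = zlib.compress(content)


def visible_strings(stream_bytes):
    """Return literal strings shown with the Tj operator in a Flate stream."""
    raw = zlib.decompress(stream_bytes)
    found = re.findall(rb"\(([^()\\]*)\)\s*Tj", raw)
    return [text.decode("latin-1") for text in found]


print(visible_strings(stored))
# ['Account 12345678']
```

The rectangle changes what you see on screen, not what is stored in the file.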

Bottom line: PDFs aren’t very secure at all.

But you can scrub and redact PDFs (ideally before you share them). You just need to use PDF Redact Tools, written by Micah Lee at First Look Media.

It’s actually a pretty low-tech solution: PDF Redact Tools uses ImageMagick to split a PDF into its individual pages, converts each one to a static PNG, and then recombines them into a PDF without any metadata. You can do it right from the command line: pdf-redact-tools --sanitize filename.pdf
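If you’re curious what that rasterize-and-recombine idea looks like, here is a rough Python sketch that shells out to ImageMagick’s `convert`. This is my own approximation, not the commands PDF Redact Tools actually runs; the function names, the `page-%03d.png` naming, and the 150 DPI default are all assumptions. It also assumes ImageMagick is installed and allowed to read PDFs, and I haven’t checked what metadata ImageMagick itself writes into the output.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, out_dir, dpi=150):
    """Build the command that renders each PDF page to a numbered PNG."""
    pattern = Path(out_dir) / "page-%03d.png"
    return ["convert", "-density", str(dpi), str(pdf), str(pattern)]


def recombine_cmd(pngs, out_pdf):
    """Build the command that stitches the page images into a new PDF."""
    return ["convert", *(str(p) for p in sorted(pngs)), str(out_pdf)]


def sanitize(pdf, work_dir, out_pdf):
    """Flatten a PDF to images and rebuild it (needs ImageMagick on PATH)."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, work), check=True)
    pages = sorted(work.glob("page-*.png"))
    subprocess.run(recombine_cmd(pages, out_pdf), check=True)
```

Calling `sanitize("filename.pdf", "pages", "filename-clean.pdf")` would run both steps. The trade-off is that the result is just pictures of pages, so the text is no longer selectable or searchable.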

Redacting is a little more involved, but basically, you split apart the PDF, redact individual images in an image editor, and finally recombine them into a new PDF.
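As a sketch of that workflow, here is a small Python wrapper around the tool. I’m assuming the `--explode` and `--merge` flags from my reading of the project’s README, so check `pdf-redact-tools --help` before trusting them. The helper names `redact_steps` and `redact` are mine.

```python
import shutil
import subprocess

TOOL = "pdf-redact-tools"


def redact_steps(pdf_path):
    """Return the two commands that bracket the manual editing step.

    The --explode and --merge flags are assumed from the project's README.
    """
    return [
        [TOOL, "--explode", pdf_path],  # split the PDF into page images
        [TOOL, "--merge", pdf_path],    # rebuild a PDF from the edited images
    ]


def redact(pdf_path):
    """Explode, wait for you to black things out by hand, then merge."""
    if shutil.which(TOOL) is None:
        raise FileNotFoundError(f"{TOOL} is not installed or not on PATH")
    explode, merge = redact_steps(pdf_path)
    subprocess.run(explode, check=True)
    input("Redact the page images in an image editor, then press Enter... ")
    subprocess.run(merge, check=True)
```

Because the black boxes are painted into the page images themselves, there is no hidden text left underneath them.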

For all the academics out there, I’d be interested to hear if JSTOR or Google scrub the PDFs they index, host, and sell. I’m betting they don’t.

2017 Neal Shyam