Scrub, redact, and relax
11.12.15 · python
I was originally going to write about a Python utility that scans PDFs for links to other PDFs and downloads them. But when I tried it on one of my dad’s papers, it gave me this:
- Author = ss999
- CreationDate = D:20081030132020-04'00'
- Creator = PScript5.dll Version 5.2.2
- ModDate = D:20091214160134-05'00'
- Pages = 23
- Producer = Acrobat Distiller 6.0.1 (Windows)
- Title = Microsoft Word - Value-Oct3008.doc
While there aren’t any PDF links in this one, there is some personal metadata. The Author field brazenly exposes pop’s Yale login ID (I’ve obfuscated it here). Another paper revealed that he was still running Windows NT in 2009. 😮
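For the curious, here is a minimal sketch of how fields like these can be dug out with nothing but the standard library. It is not the utility I mentioned above. The function name `scan_metadata` is made up for this example, and it assumes the metadata sits in the file as plain, unencrypted literal strings (many PDFs compress or encode it differently, in which case this finds nothing).

```python
import re

# Matches simple Info-dictionary entries such as "/Author (ss999)".
# Assumption: values are unencrypted literal strings; escaped characters
# (like "\(") are kept as-is rather than decoded.
INFO_ENTRY = re.compile(
    rb"/(Author|Creator|Producer|Title|CreationDate|ModDate)"
    rb"\s*\(((?:\\.|[^\\()])*)\)"
)


def scan_metadata(pdf_bytes):
    """Return a dict of the Info-style fields found in the raw PDF bytes."""
    found = {}
    for key, value in INFO_ENTRY.findall(pdf_bytes):
        # Keep the first occurrence of each key.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found
```

Feed it the bytes of a file, for example `scan_metadata(open("paper.pdf", "rb").read())`, and print whatever comes back. A proper PDF library will be far more thorough; this just shows how little effort it takes.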
You don’t need special software for this. All you have to do is right-click on a PDF in OS X and select Get Info.
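And have you ever drawn a black box over some text (like an account number) in Preview in order to redact it? You probably think you’re being super clever, but the box just sits on top of the page. PDF grew out of PostScript, and the text under your box is still in the file, where anyone can select it, copy it, or pull it out with a script.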
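To make that concrete, here is a rough sketch of pulling text back out of a page's content streams, black box or no black box. It rests on a few simplifying assumptions: streams are either uncompressed or Flate-compressed, and text is drawn as plain literal strings with the `Tj` operator. Real PDFs often use other operators, hex strings, and custom font encodings, so treat this as an illustration rather than an extraction tool. The name `drawn_text` is invented for the example.

```python
import re
import zlib

# Everything between the "stream" and "endstream" keywords.
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string followed by the Tj (show text) operator.
SHOWN_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def drawn_text(pdf_bytes):
    """Return every literal string shown with Tj, whether or not it is covered."""
    strings = []
    for raw in STREAM.findall(pdf_bytes):
        try:
            # decompressobj tolerates the line break left after the stream data.
            content = zlib.decompressobj().decompress(raw)
        except zlib.error:
            content = raw  # assume the stream was stored uncompressed
        strings.extend(s.decode("latin-1") for s in SHOWN_TEXT.findall(content))
    return strings
```

A filled rectangle is just another drawing instruction that comes after the text in the stream, so it hides the string on screen without removing it from the file.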
Bottom line: PDFs aren’t very secure at all.
The fix is actually pretty low-tech: PDF Redact Tools uses ImageMagick to split a PDF into its individual pages, converts each one to a static PNG, and then recombines them into a PDF without any metadata. You can do it right from the command line:
pdf-redact-tools --sanitize filename.pdf
Redacting is a little more involved, but basically, you split apart the PDF, redact individual images in an image editor, and finally recombine them into a new PDF.
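If you’d rather drive this from Python, the whole workflow can be sketched as a thin wrapper around the command line tool. Only `--sanitize` is shown above; the `--explode` and `--merge` flags are what I recall from the tool’s README, so treat them as an assumption and check `pdf-redact-tools --help` before relying on them. The function names are made up for this example.

```python
import subprocess

TOOL = "pdf-redact-tools"


def sanitize_command(pdf_path):
    """Command that rebuilds the PDF from page images, dropping metadata."""
    return [TOOL, "--sanitize", pdf_path]


def redact_commands(pdf_path):
    """Return the (explode, merge) commands that bracket the manual edit.

    Assumption: --explode writes one image per page and --merge stitches
    the edited images back into a PDF.
    """
    return [TOOL, "--explode", pdf_path], [TOOL, "--merge", pdf_path]


def redact(pdf_path, wait_for_edit=input):
    """Explode the PDF, pause while you edit the page images, then merge."""
    explode, merge = redact_commands(pdf_path)
    subprocess.run(explode, check=True)
    wait_for_edit("Black out the page images in an editor, then press Enter: ")
    subprocess.run(merge, check=True)
```

Calling `redact("filename.pdf")` pauses in the middle so you can paint over the sensitive bits. Because the boxes end up baked into flat images, there is no text left underneath to recover.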
For all the academics out there, I’d be interested to hear if JSTOR or Google scrub the PDFs they index, host, and sell. I’m betting they don’t.