Calling It Splits, a PDF Breakup Story

As a former book designer and engineer, I’m wary of PDFs. They come with too many gotchas. Fixed-layout documents can’t be reformatted for smaller screens. Poor layering and organization lead to copy & paste snafus. And if you need data from a scan, you’re stuck hand-jamming.

Bottom line, PDFs are for presenting information, not exchanging it. And while we’re on the subject, Excel, Word, and other proprietary formats aren’t much better. This is what APIs & datastores are for!

Of course, you don’t always have a choice. That’s why you need Docsplit from the team at DocumentCloud. Using Ruby to pipe data in & out of GraphicsMagick, Poppler, PDFtk, Tesseract, and LibreOffice, Docsplit can extract text from and break apart multi-page PDFs & Office documents (with OCR fallback).

Docsplit is packaged as a CLI and a Ruby library, putting you in charge of your data.

    # CLI example: extract text from each page of a PDF, with OCR fallback
    docsplit text path/input.pdf --pages all

    # API example: extract text from an MS Word doc without OCR
    require 'docsplit'
    docs = Dir['path/input.doc']
    Docsplit.extract_text(docs, :ocr => false, :output => 'output/')

    # Read metadata (number of slides) from a PowerPoint file, then convert it to single-page PDFs
    require 'docsplit'
    Docsplit.extract_length('input.pptx') # => 9
    Docsplit.extract_pages('input.pptx')
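
Once the text is out of the PDF, the next step is usually getting it into something you can actually exchange. Here is a minimal sketch, not part of Docsplit itself, that gathers per-page text files into a hash keyed by page number. It assumes the files follow a `<basename>_<page>.txt` naming pattern in the output directory; check what your Docsplit version writes before relying on that. The method name `collect_pages` and the paths in the usage note are hypothetical.

```ruby
require 'json'

# Collect per-page text files into a Hash of { page_number => text }.
# ASSUMPTION: files are named "<basename>_<page>.txt" inside `dir`.
def collect_pages(dir, basename)
  pattern = File.join(dir, "#{basename}_*.txt")
  name_format = /\A#{Regexp.escape(basename)}_(\d+)\.txt\z/

  pages = Dir.glob(pattern).each_with_object({}) do |path, acc|
    # Pull the page number out of the file name; skip files that don't match.
    match = File.basename(path).match(name_format)
    next unless match

    acc[match[1].to_i] = File.read(path).strip
  end

  # Sort numerically so page 10 follows page 9 rather than page 1.
  pages.sort.to_h
end

# Usage (hypothetical paths):
#   puts JSON.pretty_generate(collect_pages('output', 'input'))
```

Keying by integer page number keeps the pages in reading order, and the result serializes straight to JSON for whatever API or datastore comes next.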


2017 Neal Shyam