Quick text grab from PDF

I needed to grab some text from a PDF for further processing and, preferring CLI to GUI, I thought I’d look for something that would work with my favourite tool, Python.

I tried the following first:

pyPDF – Output a lot of junk, maybe a Unicode issue?
swfTools – Required installing a bunch of libraries and seemed to be overkill for just grabbing a bit of text.

Then I came across pdfMiner, a PDF parser and analyzer written entirely in Python.

Installation is simple: download, unpack and run setup.py:

>tar -xzvf pdfminer-20110515.tar.gz
>cd pdfminer-20110515
>python setup.py install

If you get a permissions error when installing, you may need to use sudo:

>sudo python setup.py install

pdfMiner also comes with a couple of additional tools and some sample content so you can get started straight away :)

>pdf2txt.py samples/simple1.pdf
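
Since the whole point was further processing in Python, here is a minimal sketch of calling the tool from a script and tidying its output. It assumes pdf2txt.py is on your PATH and executable, that it writes the extracted text to standard output (as in the command above), and that you run the script with Python 3.7 or later. The function names extract_text and non_empty_lines are my own, not part of pdfMiner.

```python
import subprocess


def extract_text(pdf_path, command=("pdf2txt.py",)):
    """Run a text-extraction command on pdf_path and return its stdout.

    `command` defaults to pdf2txt.py found on PATH, which is an
    assumption about how pdfMiner was installed on your machine.
    """
    result = subprocess.run(
        list(command) + [pdf_path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


def non_empty_lines(text):
    """Split extracted text into stripped lines, dropping blank ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Sample file shipped with pdfMiner; run this from the unpacked directory.
    text = extract_text("samples/simple1.pdf")
    for line in non_empty_lines(text):
        print(line)
```

Shelling out keeps the script independent of pdfMiner's internal API, at the cost of starting a new process per file. If the script is not directly executable on your system, passing something like command=("python", "/path/to/pdf2txt.py") should work instead.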

It does a reasonable job of delivering text in the correct order, but your mileage will vary depending on how complex the PDF is and what was used to create it. Titles sat amongst multiple columns, for example, can cause some issues.
