I needed to grab some text from a PDF for further processing and, preferring the CLI to a GUI, I thought I’d look for something that would work with my favourite tool, Python.
Tried the following first;
Then I came across pdfMiner, a PDF parser and analyzer written entirely in Python.
Installation is simple: download, unpack and run setup.py:
>python setup.py install
If you get a permissions error when installing, you may need to run the same command with sudo (sudo python setup.py install).
pdfMiner also comes with a couple of additional tools and some sample content, so you can get started straight away.
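If you would rather call the library from your own script than use the bundled tools, something like the sketch below should get you going. It rests on a few assumptions that are not from my original setup: it uses the maintained pdfminer.six fork, whose high-level extract_text function is missing from older releases, and tidy_text is a hypothetical helper I made up purely to illustrate the "further processing" step.

```python
import re


def extract_pdf_text(pdf_path):
    """Return the text of the PDF at pdf_path as a single string."""
    # Assumption: the pdfminer.six fork is installed. The import is kept
    # inside the function so the helper below works without it.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def tidy_text(raw):
    """Hypothetical clean-up helper to run before further processing."""
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Reduce three or more newlines to a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Usage, with a placeholder file name:
#     print(tidy_text(extract_pdf_text("sample.pdf")))
```

The clean-up rules are deliberately minimal; what you actually need will depend on how the PDF was produced.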
It does a reasonable job of delivering text in the correct order, but your mileage will vary depending on what you used to create the PDF and how complex its layout is. Titles sat amongst multiple columns, for example, can cause some issues.