Monthly Archives: September 2013

Two column PDF to eReader format

Lately I’ve come across a few old PDFs from my faculty days. Some of them are still quite interesting today, but reading a PDF on anything other than a large screen is painful, at least to me. So I tried to figure out how to convert them to a tablet or eReader friendly format; in my case, AZW3 for the Kindle reader app on my tablet.

When the source PDF is well formatted, you’re in luck, because Calibre will do an excellent job. The conversion page in the Calibre manual pretty much explains it. One thing that bothered me was the page numbers: I couldn’t get rid of them in the target format. So, Search & Replace and some regex magic to the rescue.


When in Search & Replace, you can start the wizard and browse the text to be converted. It is shown as HTML, so the tags are easy to read, and it is shown the way it will get converted. Thus, you can see all the mistakes Calibre made out of the box. The Search & Replace rules apply before the other conversion phases kick in. So, for example, you can mark headers that were not recognized, and Calibre will then use them to create the table of contents. In my case, I used two replacements:

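  • <p>([A-Z\s]+)</p> => <h1>\1</h1>, this will convert e.g. <p>SOME TITLE</p> to <h1>SOME TITLE</h1> (the group has to wrap the whole run; with ([A-Z]|\s)+ the \1 reference would hold only the last character)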
  • <p>\d+</p> => nothing, which will essentially delete page numbers
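
Calibre's search expressions are Python regular expressions, so the two rules can be tried on a snippet before running a full conversion. The sketch below is only an illustration: the sample HTML and the function name `apply_rules` are made up for the example, and it does not reproduce what Calibre does internally.

```python
import re

# Hypothetical sample of the HTML shown in the Search & Replace wizard.
SAMPLE = "<p>SOME TITLE</p>\n<p>Body text here.</p>\n<p>42</p>\n<p>More text.</p>"

# One capturing group around the whole run of capitals and whitespace.
HEADING = re.compile(r"<p>([A-Z\s]+)</p>")
# A paragraph that contains nothing but digits is treated as a page number.
PAGE_NUMBER = re.compile(r"<p>\d+</p>")


def apply_rules(html: str) -> str:
    """Promote all-caps paragraphs to headings, then drop page numbers."""
    html = HEADING.sub(r"<h1>\1</h1>", html)
    return PAGE_NUMBER.sub("", html)


if __name__ == "__main__":
    print(apply_rules(SAMPLE))
```

Note that the page number rule will also delete any legitimate paragraph that consists of digits only, so it is worth browsing the matches in the wizard first.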

You can think of Search & Replace as a sort of preprocessor and depending on the quality of the PDF, you can tweak the conversion to your liking.
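
If you prefer the command line, the same preprocessing can probably be scripted. The sketch below assumes Calibre's `ebook-convert` tool is on the PATH and uses its `--sr1-search` / `--sr1-replace` style options as I understand them from the Calibre documentation; the file names and the helper names `build_convert_command` and `convert` are hypothetical. I did my conversions in the GUI, so treat this as a starting point, not a tested recipe.

```python
import subprocess


def build_convert_command(source: str, target: str) -> list[str]:
    """Build an ebook-convert call with two search-and-replace rules."""
    return [
        "ebook-convert", source, target,
        # Rule 1: all-caps paragraphs become headings.
        "--sr1-search", r"<p>([A-Z\s]+)</p>",
        "--sr1-replace", r"<h1>\1</h1>",
        # Rule 2: digit-only paragraphs (page numbers) are replaced with nothing.
        "--sr2-search", r"<p>\d+</p>",
        "--sr2-replace", "",
    ]


def convert(source: str, target: str) -> None:
    """Run the conversion; requires Calibre to be installed."""
    subprocess.run(build_convert_command(source, target), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert() to actually run it.
    print(build_convert_command("paper.pdf", "paper.azw3"))
```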

Now, there are kinds of PDFs that are downright impossible to convert in this manner. One of the issues is a PDF with multiple columns per page. I didn’t find an easy way to tell Calibre about the columns. The text would get scrambled, with pieces of the left column mixed in with the right one, etc.

So, paperCrop to the rescue! paperCrop is a tool that splits the columns, go figure 🙂 It actually does quite a bit more, but I used it for that purpose. It suggests reasonably well what to split and where, and you can tweak the parameters easily. It also offers manual clipping if you like that approach. In my case, I used only parts of the given PDF.


Once the single-column PDF was produced, converting it with Calibre was again straightforward.

If the source PDF is text only, there is another option: just copy the PDF content into a TXT file. There you can edit the content easily and prepare it for conversion in Calibre with default parameters. E.g. I tend to remove the table of contents and let Calibre generate it from the rest of the text. This approach also came in handy for the word splitting often seen in PDFs: words near the right margin tend to be split across two lines with a hyphen in between, e.g. seren-dipity. You can rejoin those in your favourite editor, again using regexps. Or, if you prefer, you can try that in Calibre.
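
As a rough sketch of the rejoining step, assuming the copied text keeps the hyphen at the end of one line and the rest of the word at the start of the next, a single regex does the job. The function name `join_split_words` is made up for the example.

```python
import re

# A word fragment, a hyphen, a line break, then the rest of the word.
SPLIT_WORD = re.compile(r"(\w+)-\n(\w+)")


def join_split_words(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return SPLIT_WORD.sub(r"\1\2", text)


if __name__ == "__main__":
    print(join_split_words("a case of seren-\ndipity in the lab"))
```

Hyphens in the middle of a line are left alone, but a genuinely hyphenated word that happens to break at the hyphen will be joined too, so a quick review of the result is still a good idea.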

In the end, the result is a very usable eReader formatted book. I haven’t tried Calibre conversion for PDFs with graphs or other images, but hopefully it will work just as well.

And a friendly note: for Linux boxes, the Calibre install instructions strongly suggest using their binary install instead of distribution-packaged versions.


  1. Calibre
  2. Calibre blog
  3. Calibre conversion manual
  4. Calibre regex manual
  5. Paper Crop
  6. Briss, another tool for column splitting, haven’t tried it

Jenkins Gitlab Hook Plugin reorganized

The Jenkins Gitlab Hook Plugin received a major refactoring. The goal was to separate concerns from existing modules and to make the project testable. The GitHub repo now also contains the Java binaries needed to run the rspec tests, and hopefully you’ll find the new organisation a bit more intent-revealing and easier to follow.

I’ve used the use case approach and extracted the related services, so now all the domain knowledge is contained within subfolders of the models folder.


The remaining models in the root models folder are all directly Jenkins related and left there so Jenkins can load them first and register the plugin and the related web hook correctly.

The entire domain knowledge is now also testable. I chose rspec to run the tests and have created the related spec helper that loads all the Java dependencies and models from the root folder. To run the specs, you’ll need to set up JRuby so it runs in Ruby 1.9 compatibility mode. Just add the following switches to your JRUBY_OPTS environment variable: --1.9 -Xcext.enabled=true -X+0.

The v1.0.0 release has all the goodies, so feel free to upgrade your Jenkins environments.
