Two column PDF to eReader format

Lately I’ve come across a few old PDFs from my faculty days. Some of them are still quite interesting, but reading a PDF on anything other than a large screen is painful, at least to me. So, I tried to figure out how to convert them to a tablet or eReader friendly format. In my case, that is AZW3 for the Kindle reader app on my tablet.

When the source PDF is well formatted, you’re in luck, because Calibre will do an excellent job with it. The conversion page in the Calibre manual pretty much explains the process. One thing that bothered me was the page numbers: I couldn’t get rid of them in the target format. So, Search & Replace and some regex magic to the rescue.

When in Search & Replace, you can start the wizard and browse the text to be converted. It is shown as HTML, so the tags are easy to read, and it appears the way it will get converted. Thus, you can see all the mistakes Calibre made out of the box. The Search & Replace rules apply before the other conversion phases kick in. So, for example, you can mark up headers that were not recognized, and Calibre will then use them to create the table of contents. I used two replacements:

<p>([A-Z\s]+)</p> => <h1>\1</h1>, which converts e.g. <p>SOME TITLE</p> to <h1>SOME TITLE</h1>
<p>\d+</p> => nothing, which essentially deletes page numbers

You can think of Search & Replace as a sort of preprocessor: depending on the quality of the PDF, you can tweak the conversion to your liking.
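
To sanity-check rules like these before pasting them into Calibre, here is a minimal Python sketch that applies the same two patterns with the standard `re` module. The `preprocess` helper and the sample HTML are made up for illustration, and the title pattern is written with a single character class so that group 1 holds the whole title. Calibre's own wizard remains the real test.

```python
import re

# Rules in the order they are applied: (compiled pattern, replacement).
RULES = [
    # An all-caps paragraph is treated as a heading.
    (re.compile(r"<p>([A-Z\s]+)</p>"), r"<h1>\1</h1>"),
    # A paragraph holding only digits is treated as a page number and dropped.
    (re.compile(r"<p>\d+</p>"), ""),
]


def preprocess(html: str) -> str:
    """Apply every search-and-replace rule to the HTML, one after another."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html


sample = "<p>SOME TITLE</p><p>Body text.</p><p>42</p>"
print(preprocess(sample))  # <h1>SOME TITLE</h1><p>Body text.</p>
```

Note that a paragraph containing only whitespace would also match the first rule, so it is worth browsing the result for empty headings.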

Now, there are kinds of PDFs that are downright impossible to convert in this manner. One such case is a PDF with several columns per page. I didn’t find an easy way to tell Calibre about the columns, so the text would get scrambled, with pieces of the left column mixed in with the right one.

So, paperCrop to the rescue! paperCrop is a tool that splits the columns, go figure 🙂 It actually does quite a bit more, but that is what I used it for. It suggests reasonably well what to split and where, and you can easily tweak the parameters. It also offers manual clipping if you prefer that approach. For my conversion, I used only parts of the given PDF.

Once the single column PDF was produced, converting it with Calibre was again straightforward.
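
I did all of this in the Calibre GUI, but the conversion can presumably be scripted with Calibre's command-line tool, ebook-convert. The sketch below only builds and prints the command. The file names and the `build_command` helper are hypothetical, and the `--sr1-search` / `--sr1-replace` style options are as I understand them from the ebook-convert documentation, so check them against your Calibre version before relying on this.

```python
import shlex
from typing import List


def build_command(src: str, dst: str) -> List[str]:
    """Build an ebook-convert call carrying the two search-and-replace rules."""
    return [
        "ebook-convert", src, dst,
        # Rule 1: promote all-caps paragraphs to headings.
        "--sr1-search", r"<p>([A-Z\s]+)</p>",
        "--sr1-replace", r"<h1>\1</h1>",
        # Rule 2: drop paragraphs that contain only a page number.
        "--sr2-search", r"<p>\d+</p>",
        "--sr2-replace", "",
    ]


# Hypothetical file names: the single column PDF and the Kindle target.
command = build_command("single-column.pdf", "book.azw3")
print(shlex.join(command))
```

To actually run the conversion, the list can be passed to `subprocess.run(command, check=True)` on a machine where ebook-convert is on the PATH.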

If the source PDF is text only, there is another option. Just copy the PDF content into a TXT file. There you can edit the content easily and effectively prepare it for conversion in Calibre with default parameters. E.g. I tend to remove the table of contents and let Calibre generate it from the rest of the text. This solution came in handy with the word splitting often seen in PDFs. Words near the right margin tend to be split across two lines with a dash in between, e.g. seren-dipity. You can rejoin those with your favourite editor, again using regexps. Or, if you prefer, you can try that in Calibre.
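
For the split words, a regex along these lines could do the job. This is a minimal Python sketch, assuming the copied text keeps the line break right after the dash. The `join_split_words` name and the sample sentence are made up for illustration.

```python
import re

# A word character, a dash at the end of the line, then a lowercase letter
# at the start of the next line (stray spaces around the break are allowed).
HYPHEN_BREAK = re.compile(r"(\w)-[ \t]*\n[ \t]*([a-z])")


def join_split_words(text: str) -> str:
    """Glue together words that were split across two lines with a dash."""
    return HYPHEN_BREAK.sub(r"\1\2", text)


sample = "a case of seren-\ndipity in the lab"
print(join_split_words(sample))  # a case of serendipity in the lab
```

Genuine compounds that happen to break at the dash (say "well-" followed by "known" on the next line) would be glued together too, so a quick look through the result is still a good idea.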

In the end, the result is a very usable eReader formatted book. I haven’t tried Calibre conversion for PDFs with graphs or other images, but hopefully it will work just as well.

And a friendly note: for Linux boxes, the Calibre install instructions strongly suggest using their binary install instead of distribution packaged versions.

Resources:

Calibre
Calibre blog
Calibre conversion manual
Calibre regex manual
Paper Crop
Briss, another tool for column splitting (haven’t tried it)

8 thoughts on “Two column PDF to eReader format”

mjmatos says:

27. September 2013 at 04:29

Mac users can use Briss (http://sourceforge.net/projects/briss/) for the same effect Papercrop is used.

- elvanja says:
  
  27. September 2013 at 05:43
  
  I’ve stumbled across paperCrop first, so didn’t try Briss. But, it’s been added to the resources section.
  
anonymous says:

24. November 2013 at 12:48

Give k2pdfopt a try. It is similar to PaperCrop but has had more development time. You might check the PDF forum section on mobileread.com as well.

- elvanja says:
  
  24. November 2013 at 20:08
  
  Thanks for the tips!
  I found http://www.willus.com/k2pdfopt/ but could not locate the PDF section on mobileread.
  Care to share the link?
  
  - anonymous says:
    
    24. November 2013 at 22:19
    
    http://www.mobileread.com/forums/forumdisplay.php?f=184

Tom says:

05. November 2014 at 08:55

If you need to convert pdf files to an eReader friendly format, check out this free conversion toolkit http://kitpdf.com/ which provides fast results.

- elvanja says:
  
  05. November 2014 at 09:05
  
  Nice one! Thanks 😀
  
fatih says:

15. November 2015 at 20:54

thanks for the topic:) k2pdfopt worked for me.
