Lately I’ve come across a few old PDF’s from faculty days. Some of them were quite interesting even today but reading PDF on anything other than large screen is painful, at least to me. So, I tried to figure out how to convert to a tablet or eReader friendly format. In my case, AZW3 for Kindle reader app on my tablet.
When the source PDF is well formatted, you’re in luck because you can do wonders with Calibre. Essentially, it will do an excellent job. The conversion page in Calibre manual pretty much explains it. One of the things that bothered me were the page numbers. Couldn’t get rid of them in the target format. So, Search & Replace and some regex magic to the rescue.
When in Search & Replace, you can start the wizard and browse the text to be converted. It is shown in HTML format, so it is easy to read tags, and it is shown in the way it will get converted. Thus, you can see all the mistakes Calibre made out of the box. The Search & Replace rules apply before other conversion phases kick in. So, for example, you can denote headers that were not recognized, and Calibre will then use this to create the table of content. In the above example, I used two replacements:
- <p>([A-Z]|\s)+</p> => <h1>\1</h1>, this will convert e.g. <p>SOME TITLE</p> to <h1>SOME TITLE</h1>
- <p>\d+</p> => nothing, which will essentially delete page numbers
You can think of Search & Replace as a sort of preprocessor and depending on the quality of the PDF, you can tweak the conversion to your liking.
Now, there are kinds of PDF’s that are downright impossible to convert in this manner. One of the issues is a PDF with columns per page. I didn’t find an easy way to tell Calibre about the columns. The text would get scrambled, with pieces of left column being mixed with the right one etc.
So, paperCrop to the rescue! paperCrop is a tool that splits the columns, go figure :-) It actually does quite a bit more, but I used it for that purpose. It suggests reasonably well what to split and where, but you can tweak the parameters easily. It also offers manual clipping if you like that approach. In the below example, I used only parts of the given PDF.
When the single column PDF was produced, it was again a straight job to convert using Calibre.
If the source PDF is text only, there is another option you can use. Just copy the PDF content into a TXT file. There you can edit the content easily and effectively prepare it for a conversion in Calibre with default parameters. E.g. I tend to remove the content, and let Calibre generate it from the rest of the text. This solution came in handy with word splitting often seen in PDF’s. Words near the right margin in PDF tend to be split in two rows with a dash in between, e.g. seren-dipity. So, you can replace those with your favourite editor, again using regexps. Or, if you prefer, you can try that in Calibre.
In the end, the result is a very usable eReader formatted book. Haven’t tried Calibre conversion for PDF’s with graphs or other images, but hopefully it will work just as well.
And a friendly note, for Linux boxes Calibre install instructions strongly suggest using their binary install instead of distribution packaged versions.
- Calibre blog
- Calibre converson manual
- Calibre regex manual
- Paper Crop
- Briss, another tool for column splitting, haven’t tried it