
Two column PDF to eReader format

Lately I’ve come across a few old PDFs from my university days. Some of them are still quite interesting today, but reading a PDF on anything other than a large screen is painful, at least for me. So I tried to figure out how to convert them to a tablet- or eReader-friendly format; in my case, AZW3 for the Kindle reader app on my tablet.

When the source PDF is well formatted, you’re in luck, because Calibre does an excellent job with it. The conversion page in the Calibre manual pretty much explains the process. One thing that bothered me was the page numbers: I couldn’t get rid of them in the target format. So, Search & Replace and some regex magic to the rescue.

[Screenshot: Calibre Search & Replace]

When in Search & Replace, you can start the wizard and browse the text to be converted. It is shown as HTML, so the tags are easy to read, and it appears the way it will be converted, so you can see all the mistakes Calibre made out of the box. The Search & Replace rules apply before other conversion phases kick in. So, for example, you can mark up headers that were not recognized, and Calibre will then use them to create the table of contents. In the example above, I used two replacements:

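  • <p>([A-Z\s]+)</p> => <h1>\1</h1>, this will convert e.g. <p>SOME TITLE</p> to <h1>SOME TITLE</h1> (the + sits inside the group, so \1 holds the whole title and not just its last character)
  • <p>\d+</p> => nothing (empty replacement), which essentially deletes the page numbers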

You can think of Search & Replace as a sort of preprocessor and depending on the quality of the PDF, you can tweak the conversion to your liking.
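
Calibre's search expressions are Python regular expressions, so the two rules can be tried out in plain Python before typing them into the dialog. The snippet below is only a sketch: the sample HTML and the function name apply_rules are made up for illustration and are not part of Calibre. The heading pattern keeps the repetition inside the capturing group, because a repeated group such as ([A-Z]|\s)+ only remembers its last iteration.

```python
import re

# Hypothetical sample of the intermediate HTML; not taken from a real book.
sample = "<p>SOME TITLE</p>\n<p>Body text.</p>\n<p>42</p>\n<p>More text.</p>"


def apply_rules(html):
    """Apply the two search-and-replace rules in order (illustrative helper)."""
    # Promote paragraphs made only of capitals and whitespace to headings.
    html = re.sub(r"<p>([A-Z\s]+)</p>", r"<h1>\1</h1>", html)
    # Delete paragraphs that contain nothing but digits (page numbers).
    html = re.sub(r"<p>\d+</p>", "", html)
    return html


result = apply_rules(sample)
print(result)
```

The body paragraphs are left alone because they contain lowercase letters, and the page number paragraph disappears, leaving an empty line behind.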

Now, some kinds of PDFs are downright impossible to convert in this manner. One such case is a PDF with multiple columns per page. I didn’t find an easy way to tell Calibre about the columns. The text would get scrambled, with pieces of the left column mixed in with the right one, and so on.

So, paperCrop to the rescue! paperCrop is a tool that splits the columns, go figure 🙂 It actually does quite a bit more, but that is all I used it for. It suggests reasonably well what to split and where, and you can easily tweak the parameters. It also offers manual clipping if you prefer that approach. In the example below, I used only parts of the given PDF.

[Screenshot: paperCrop in action]

Once the single-column PDF was produced, converting it with Calibre was again a straightforward job.

If the source PDF is text only, there is another option. Just copy the PDF content into a TXT file. There you can edit the content easily and effectively prepare it for conversion in Calibre with default parameters. E.g. I tend to remove the table of contents and let Calibre generate it from the rest of the text. This approach came in handy for the word splitting often seen in PDFs. Words near the right margin tend to be split across two lines with a hyphen in between, e.g. seren-dipity. You can rejoin those in your favourite editor, again using regexps. Or, if you prefer, you can try that in Calibre.
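
The hyphen cleanup can also be scripted. What follows is a rough Python sketch rather than a recipe: the function name join_hyphenated and the sample text are invented for illustration, and it assumes that a lowercase letter, a hyphen and a line break followed by another lowercase letter always mean a split word. That assumption is wrong for real compounds that happen to break at the line end (e.g. "well-" / "known"), and accented letters are not covered, so check the result by eye.

```python
import re


def join_hyphenated(text):
    """Rejoin words split by a hyphen at the end of a line (illustrative helper)."""
    # Lowercase letter, hyphen, optional trailing spaces, line break,
    # optional leading spaces, lowercase letter: keep only the two letters.
    return re.sub(r"([a-z])-[ \t]*\r?\n[ \t]*([a-z])", r"\1\2", text)


# Hypothetical text copied from a PDF.
sample = "This was pure seren-\ndipity, a well-known\nfeeling."
fixed = join_hyphenated(sample)
print(fixed)
```

Hyphens in the middle of a line, like the one in "well-known" above, are not touched because no line break follows them.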

In the end, the result is a very usable eReader-formatted book. I haven’t tried Calibre conversion on PDFs with graphs or other images, but hopefully it works just as well.

And a friendly note: for Linux boxes, the Calibre install instructions strongly suggest using their binary install instead of distribution-packaged versions.

Resources:

  1. Calibre
  2. Calibre blog
  3. Calibre conversion manual
  4. Calibre regex manual
  5. Paper Crop
  6. Briss, another tool for column splitting (I haven’t tried it)

Full-fed or a Developer?

A week ago I had the pleasure of speaking to/with Computer Science students at the University of Pula. The working title was “Full-fed or a Developer?” and the idea was to present a real-life view of what it is to be a developer. Although this was formally a lecture, my colleagues and I (5 of us) imagined it as more of an “involved theatrical show”, but it seems the students were too deep in their “sit and listen” mode to break the habit. Or we were boring. Take your pick 🙂

From gaining weight, through losing eyesight and suffering socially, all the way to the preferred language and the rest of the developer’s tool belt, I believe they got at least a glimpse of the “glorious” life to come 🙂 Well, it is not all that bleak, and in the end the conclusion was that we do it because we like it, so the difficulties don’t actually weigh in too much.

From all this, what actually stuck with me were a few questions from a student.

First question was: How to practice our skills?

Among a number of answers, a few were common:

  • get on GitHub and read code
  • listen to the community / podcasts of the platform / language of their choice
  • get out and write something, post it to GitHub, let people comment
  • Stack Overflow
  • write a blog about their experiences
  • join some open source project and offer help (documenting if nothing else)

You might notice that we didn’t offer one very important thing, and that was actually the second question: How to decide which direction to take in practice? How do I choose a thing to do? And of course, she was spot on.

So, the best we could offer was to listen to FLOSS Weekly. It gives a beginner the chance to listen to the people who are actually behind so many great projects, and then decide which one appeals the most. Whether they choose by the author of the project, by the technology used, or by whatever else points them in that direction, I feel it is a great way of choosing. Somehow it seems to me that it is a matter of the heart, and that can never fail.

Very good questions that didn’t spring to mind until she asked them. We talked about the importance of practice, of reading other people’s code, of communication in teams, tolerance and whatnot, but skipped that very important part: how to learn all of that. Actually, I found these questions to be very important for life (and craft), and the attitude correct and fair. I hope it brings good things to her, and maybe she’ll be both full-fed and a happy developer 🙂
