August 28, 2008

Make etexts pretty with GutenMark

Author: Dmitri Popov

Project Gutenberg, the online library of more than 25,000 free books, is a treasure trove for bookworms and casual readers alike, but turning electronic text files into a readable form is not as easy as it may seem. In theory, since etexts are just plain text files, you should be able to open and read them on any platform without any tweaking. In practice, however, this approach rarely works. Hard line breaks, for example, may ruin the text flow, making it virtually impossible to read the book on a mobile device. Another problem is that most books are stored as single files, so locating a particular chapter or section in a lengthy book can be a serious nuisance. Then there are minor but annoying formatting quirks, such as inconsistent handling of italicized text, use of straight quotes instead of smart ones, and so on.

Fixing all these and other issues manually to make an etext readable -- or even printable -- is a daunting proposition. Thankfully, the GutenMark tool can take most of the burden off your shoulders. The utility converts Project Gutenberg etexts into neatly formatted HTML or LaTeX files.

The goal of the GutenMark project is to create a tool that produces files that don't require any additional cleanup or tweaking. While it still has some way to go before it achieves this goal, GutenMark does a remarkable job of turning etexts into readable and printable files.

Initially, GutenMark was a command-line tool, but the latest version comes with the GUItenMark graphical interface and the GutenSplit tool, which splits a single file into multiple chapters. All three are bundled in a single installer, but before you download and run it, make sure that your system has the required packages: glitz, libpng, and libtiff. On Ubuntu, you can install them with the sudo apt-get install libglitz1 libpng libtiff command. You also need to create a couple of symbolic links, as follows:

sudo ln --symbolic /usr/lib/libtiff.so.4 /usr/lib/libtiff.so.3
sudo ln --symbolic /usr/lib/libexpat.so.1 /usr/lib/libexpat.so.0
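If you would rather not create links blindly, the same step can be wrapped in a small guard. The following is only a sketch: link_if_missing is a hypothetical helper written for this article, not part of GutenMark, and it assumes the library paths are the ones shown above.

```shell
# link_if_missing TARGET LINK
# Hypothetical helper: creates the symbolic link only when the real
# library exists and the link name is not already in use.
link_if_missing() {
  target="$1"
  link="$2"
  if [ ! -e "$target" ]; then
    echo "missing: $target" >&2
    return 1
  fi
  if [ -e "$link" ] || [ -L "$link" ]; then
    echo "exists, skipping: $link"
    return 0
  fi
  ln -s "$target" "$link"
}

# Example calls (they need root privileges because they write to /usr/lib):
#   link_if_missing /usr/lib/libtiff.so.4 /usr/lib/libtiff.so.3
#   link_if_missing /usr/lib/libexpat.so.1 /usr/lib/libexpat.so.0
```

The guard leaves an existing link or file untouched and reports a missing library instead of creating a dangling link.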

Now you can download the GutenMark installer and make it executable. The installation instructions on GutenMark's Web site recommend that instead of using the chmod command, you make the installer executable by right-clicking on it and ticking the Execute check box. Run the installer and GutenMark is ready to go.

Using the GUI version of GutenMark to convert etexts is straightforward. Use the Input Files pane on the left to add one or more etexts, then configure the conversion options by ticking the desired check boxes. Most of the options are self-explanatory, and you can experiment with different settings to achieve the best results. GutenMark also lets you save settings as profiles. You can, for example, create two separate profiles for converting etexts to HTML and LaTeX, or you can set up different profiles for different languages.

When converting etexts to HTML, you have the option to split the source file into multiple chapters. To enable this feature, tick the Split at headings check box, and specify the splitting points. Usually, ticking the H1 (Heading 1) check box works just fine, but you can chop an etext into smaller pieces by enabling other heading options. If you choose to split the etext, make sure you enable the Table of contents option, which creates a separate HTML file with links to the generated chapters.
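To get a feel for what splitting at H1 headings involves, here is a rough shell sketch that cuts an HTML file at every line containing an opening h1 tag. It is an illustration only and not how GutenSplit works internally: the split_at_h1 function and the partNNN.html file names are made up for this example, and the pieces it writes are raw fragments without their own HTML headers or a table of contents.

```shell
# split_at_h1 INPUT.html OUTPUT_DIR
# Hypothetical helper: starts a new output file at each line that
# contains an opening <h1> tag. Anything before the first heading
# ends up in part000.html.
split_at_h1() {
  mkdir -p "$2" || return 1
  awk -v dir="$2" '
    /<[hH]1[ >]/ {
      n++                      # a new chapter begins on this line
      if (out) close(out)      # finish the previous piece
    }
    {
      out = sprintf("%s/part%03d.html", dir, n)
      print > out
    }
  ' "$1"
}
```

Calling split_at_h1 book.html chapters would write chapters/part000.html, chapters/part001.html, and so on, one file per H1 heading.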

To convert the selected etext, press the Arrow button, and the converted files appear in the Output Files pane. You can then open the converted files directly from within GutenMark by double-clicking on them.

If you prefer to use GutenMark from the command line, the Usage page provides a detailed description of the available command-line options. Even if you stick to the GUI, the page can help you to figure out what each option does.
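The command line is handy when you have a whole directory of etexts to convert. The sketch below loops over every .txt file in a directory and hands each one to the converter. It assumes that the program is called GutenMark and accepts an input file followed by an output file; treat that as an assumption and check the Usage page for the actual syntax and options. The convert_all function and the GUTENMARK_BIN variable are names invented for this example.

```shell
# Assumption: the converter is invoked as "GutenMark INPUT OUTPUT".
# Verify this against the Usage page before relying on it.
GUTENMARK_BIN="${GUTENMARK_BIN:-GutenMark}"

# convert_all SOURCE_DIR OUTPUT_DIR
# Hypothetical helper: converts every .txt file in SOURCE_DIR and
# writes the results to OUTPUT_DIR with an .html extension.
convert_all() {
  src_dir="$1"
  out_dir="$2"
  mkdir -p "$out_dir" || return 1
  for etext in "$src_dir"/*.txt; do
    [ -e "$etext" ] || continue      # no matching files: skip the literal pattern
    name=$(basename "$etext" .txt)
    "$GUTENMARK_BIN" "$etext" "$out_dir/$name.html" \
      || echo "failed: $etext" >&2   # report the failure and keep going
  done
}
```

For example, convert_all ~/etexts ~/etexts/html would process each etext in turn and report any file that the converter rejects.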

Although GutenMark does an admirable job of converting etexts to HTML, a format that is readable on virtually any device, the converted files might still need some manual tweaking. It's a good idea to go through a converted file and correct any remaining issues before you load it onto your device. This is, however, a minor nuisance compared to converting an entire etext by hand.
