DjVu: Saving for our paper heritage

Author: Marco Fioretti

Did you know that you can load in your browser an old map showing the operations of General Washington in 1776-77, or the 1927 handmade sketch of the first negative feedback amplifier? Both are published on the Web in DjVu, a highly compressed image format that is well suited to archival material. Here’s how you can start enjoying DjVu.

According to DjVu’s developers, more than 90% of the information in the world is stored on paper. Many of these documents contain graphics and photos that are of significant value, and almost none of that information is available on the Internet.

It hasn’t been feasible to put all of that information on the Internet because commonly used image formats — JPEG, GIF, and PNG — would create unreasonably large files if historic documents were scanned at a resolution that kept images and text readable. AT&T Labs developed DjVu to solve this problem. DjVu gives you the ability to zoom in on documents in real time or to pan around images larger than your screen (see Figure 1).

Short DjVu technical tour

DjVu is actually a bundle of three different image compression technologies. The first, DjVuPhoto, does progressive decoding and rendering. You get an initial, usable version of the whole image quickly, and it gets better as the download continues. This lets you figure out immediately if you aren’t interested in a document and move on.

Otherwise, the more you look at something — that is, the more time you spend on it — the better its resolution becomes. Even when bandwidth is not the bottleneck, according to the DjVu project, DjVu files render faster than PDF, PostScript, or most popular graphic formats. DjVu decoders don’t decompress whole images before displaying them unless absolutely necessary; therefore, they consume less RAM and work better on limited hardware.

DjVuPhoto goes hand in hand with DjVuDocument, which partitions images into layers. This yields high-resolution text that loads in the browser quickly, before anything else. You can immediately start reading a scanned page while its pictures and background are still arriving. Since the text is in a separate layer, you can also search it, index it, print it without the background, and load it into optical character recognition (OCR) tools without the images. Text and images can be compressed at different resolutions to reduce a document’s total size without too many compromises on quality.
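As a rough illustration of searching that text from the command line, here is a minimal sketch. It assumes the DjVuLibre tools are installed and that the file actually carries a searchable text layer (for example, one produced by OCR). The function name search_djvu_text is made up for this example; djvutxt is DjVuLibre's text extraction utility.

```shell
#!/bin/sh
# Hypothetical helper: search the text layer of a DjVu file.
# Assumes DjVuLibre's djvutxt is on the PATH and the file has a text layer.
search_djvu_text() {
    file=$1      # path to a .djvu document
    pattern=$2   # text to look for (grep basic regular expression)

    if ! command -v djvutxt >/dev/null 2>&1; then
        echo "djvutxt not found; install DjVuLibre first" >&2
        return 127
    fi

    # djvutxt writes the document's text to standard output;
    # grep -i -n prints matching lines, ignoring case, with line numbers.
    djvutxt "$file" | grep -i -n -- "$pattern"
}
```

Calling search_djvu_text scan.djvu "feedback" would then list every extracted line that mentions the word, where scan.djvu is a placeholder file name.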

Black and white text and simple drawings are best handled by DjVu’s DjVuBitonal compression technique. DjVuBitonal also detects when identical shapes, such as characters or logos, appear many times in an image, and when compressing large, multi-page documents, it can store these recurring objects just once in shared dictionaries.
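If you want to try the photo and bitonal encoders yourself, the sketch below wraps two command-line encoders that ship with DjVuLibre: c44 for photographs and cjb2 for black and white pages. The function names, the file names, and the 300 dpi default are assumptions made for this example, not part of DjVu itself.

```shell
#!/bin/sh
# Hypothetical helpers wrapping two DjVuLibre encoders.
# Assumes c44 and cjb2 are installed; all file names are placeholders.

# Photographs and other continuous-tone images: c44 is the wavelet
# (DjVuPhoto) encoder and reads PNM or JPEG input.
# Usage: encode_photo photo.ppm photo.djvu
encode_photo() {
    c44 "$1" "$2"
}

# Black and white pages: cjb2 is the bitonal (DjVuBitonal) encoder and
# reads a bitonal PBM image. -dpi records the scan resolution.
# Usage: encode_bitonal page.pbm page.djvu [dpi]
encode_bitonal() {
    dpi=${3:-300}   # 300 dpi is an assumed default for this sketch
    cjb2 -dpi "$dpi" "$1" "$2"
}
```

Note that cjb2 encodes a single page per run, so this sketch does not exercise the shared dictionaries for multi-page documents.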

Version 3.0 of DjVu supports thumbnails and can save documents in two multi-page formats (see Figure 2). In the first (called Indirect), every page is stored, together with a complete index, in a separate file or set of files: you need to download only the pages you actually read.

Bundled DjVu files instead pack everything into one archive, which is easier to save to CD or send as an email attachment. Multi-page documents, such as the one containing the 1927 amplifier sketch, are rendered in a way similar to that used by most PDF readers (see Figure 3).
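The two layouts can be produced with DjVuLibre's djvm and djvmcvt utilities. The following is only a sketch: it assumes those tools are installed, and the function names and file names are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for the two multi-page layouts.
# Assumes DjVuLibre's djvm and djvmcvt are installed.

# Bundled: pack single-page files into one multi-page document.
# Usage: make_bundled book.djvu page1.djvu page2.djvu ...
make_bundled() {
    out=$1
    shift
    djvm -c "$out" "$@"
}

# Indirect: split a bundled document into per-page files plus an index.
# Usage: make_indirect book.djvu outdir index.djvu
make_indirect() {
    # Create the output directory first, then convert into it.
    mkdir -p "$2" && djvmcvt -i "$1" "$2" "$3"
}
```

A bundled file suits email or CD distribution, while the indirect directory is the one to publish on a Web server so that readers fetch only the pages they open.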

Depending on the initial content, the resulting files can be 10 to 20% of the size of equivalent JPEGs, or 1/3 to 1/8 as big as PDF files. DjVu’s developers have posted a DjVuPhoto vs. JPEG comparison online. Compared with PostScript, DjVu fares even better in producing compact content, especially when considering color or scanned images and browser compatibility. Furthermore, DjVu provides tools to convert existing PDF and PostScript files to the DjVu format.
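To make those ratios concrete, here is a back-of-the-envelope sketch in shell arithmetic. The function names are hypothetical, and the results are only the ranges implied by the figures quoted above; real savings depend heavily on the content.

```shell
#!/bin/sh
# Rough size estimates from the ratios quoted in the text.
# Hypothetical helpers; integer arithmetic, sizes in kilobytes.

# DjVu at 10% to 20% of the equivalent JPEG size.
estimate_from_jpeg_kb() {
    jpeg_kb=$1
    low=$(( jpeg_kb * 10 / 100 ))
    high=$(( jpeg_kb * 20 / 100 ))
    echo "$low-$high KB"
}

# DjVu at 1/8 to 1/3 of the equivalent PDF size.
estimate_from_pdf_kb() {
    pdf_kb=$1
    low=$(( pdf_kb / 8 ))
    high=$(( pdf_kb / 3 ))
    echo "$low-$high KB"
}
```

For instance, a 5000 KB JPEG scan would come out at roughly 500 to 1000 KB, and a 2400 KB PDF at roughly 300 to 800 KB.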

Real world examples and use cases

DjVu shines when used to preserve and publish historical documents. Many paper documents, from books to maps to sketches and newspapers, are too fragile to be made publicly available. With DjVu, however, they can be browsed from the Internet, or at least inside a library intranet, without fear of harming the original source materials.

The VESTNORD digital library already serves more than 800,000 pages of newspapers and magazines from the Faeroe Islands, Greenland, and Iceland in this way — some from as far back as 1773. Many other Web sites already use DjVu. Visit the DjVu demo gallery to get an idea of what types of materials are already available.

However, DjVu is not just for newspaper archives or large organizations. It may be a good choice to store your own birth certificate on a hard disk, for example. Online shops could use it to allow customers to look more closely at each item. Mathematicians, artists, or anyone who owns years’ worth of complex, hand-written material filled with formulas or drawings could use DjVu to publish it online. Personally, I’d like to see classic comics distributed in this way.

DjVu developers’ corner

DjVu is an open standard. The project’s Web site hosts the file format specification, as well as open source implementations of the decoder and part of the encoder.

If you want a more elaborate package, LizardTech offers commercial DjVu products and support. DjVuLibre is a GPLed implementation of DjVu. DjVuLibre includes viewers, browser plugins, decoders, simple encoders, and utilities. A tutorial on DjVuZone.org provides lots of information, from how to host DjVu material on a Web server, to the creation of DjVu files containing hyperlinks and other metadata.

DjVu plugins are available for standard Web browsers on Linux and most versions of Unix, Windows, and Mac OS. All the screenshots in this article come from Firefox 1.5 on openSUSE. As you can see in Figure 4, the plugin toolbar supports printing, zooming, and more, straight from within the browser. You can configure plugin behavior within the browser. After installation, you only have to link the right file from the Firefox plugin directory in order to have Firefox load it the next time you start it, with a command similar to this:

cd /usr/lib/firefox/plugins/ && ln -s /usr/lib/browser-plugins/nsdejavu.so .

Another tool worth mentioning is Any2DjVu, a free conversion service that allows you to use or test DjVu in a two-step process. After choosing the starting format (PDF, PostScript, TIFF, JPEG, or many more), you’ll get a form which will allow you to tweak the conversion process (perform OCR, define background quality, etc.) and upload the file to be converted.

Unfortunately, some DjVu interface functions and optimizations are not available under free licenses. Tools such as the PostScript/PDF converter, for example, come under GPL-incompatible licenses.

Still, DjVu can allow free software users to enjoy documents they wouldn’t be able to access online otherwise, and nothing prevents people from storing and distributing the originals in other formats too.

Maybe DjVu’s greatest value is not that it makes scanned documents quickly viewable from the Internet, but rather that it makes us realize that almost none of our paper memory is on the Internet yet, regardless of the format.