November 29, 2006

From XML to paper with Prince

Author: Keith Winston

Extensible Markup Language (XML) is a general-purpose text markup language often used for data storage or passing messages between applications. There are a number of libraries available for processing arbitrary XML within programs, but fewer options for translating XML into professional printed documents. Here is one way to get from XML to PDF.

Recently, I faced the task of getting XML out of a PostgreSQL database and into a nicely formatted print document. The data encoded in the XML was a large legal document (more than 1,000 pages). The Document Type Definition (DTD) describing the XML was unique and contained elements that resembled, but differed from, XHTML. It also contained images that had been optimized for the Web, adding another wrinkle to the problem. The target format was a two-column layout with special formatting for notes, tables, and images.

When tackling a new problem, I always start by searching for open source solutions. My first attempt was to use the Apache Formatting Objects Processor (FOP) system. Apache FOP is a Java program that uses the Extensible Stylesheet Language (XSL) standard to transform XML into PDF. After downloading and trying both the older stable version and the recent beta version, I decided that the XSL transformation syntax was too painful and that each change would be too time-consuming. I knew many iterations would be required to get the desired output.

I briefly looked into LaTeX, but I did not find a template that matched my target format.

I settled on an application called Prince that specializes in converting XML to PDF. While proprietary, it is relatively inexpensive, runs from the command line on Linux and Mac OS X and as a GUI app on Windows, and has many advanced features not available elsewhere. It uses standard CSS to control formatting instead of something like XSL templates or LaTeX markup. In addition to pure XML, Prince can create PDFs from [X]HTML. It supports common image formats such as JPEG, PNG, TIFF, and GIF, and a subset of Scalable Vector Graphics (SVG). By default, Prince uses the free Microsoft TrueType fonts, available for Linux on SourceForge.

Prince can be installed by a normal Linux user; you don't need root access to use it. Prince offers a free, non-expiring demo, packaged as RPM, DEB, and tar.gz.

The path to paper

To get the raw XML, I wrote a script that sequentially read the database and wrote the output as XHTML. This allowed me to insert class and id attributes for handling special parts of the document. It also simplified testing and allowed me to see how minor changes in the XHTML or CSS affected the resulting PDF.

The translation to XHTML was fairly straightforward. I assigned H1, H2, and H4 tags to the title, chapter, and section headings. I then mapped other XML elements to their corresponding (or best match) XHTML tags. I used the GIMP to optimize the images for hard copy.
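The mapping step might look something like the following Python sketch. The article does not show the real DTD or the script, so every source element name in TAG_MAP (doctitle, chaptertitle, and so on) is invented for illustration, and unknown elements simply fall back to a div that carries the original name as its class.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source element names to (XHTML tag, class).
# The real DTD is not shown in the article, so these names are invented.
TAG_MAP = {
    "doctitle": ("h1", None),
    "chaptertitle": ("h2", "chapter"),
    "sectiontitle": ("h4", None),
    "para": ("p", None),
    "term": ("em", None),
    "note": ("div", "note"),
}


def to_xhtml(elem):
    """Return a copy of elem with source tags mapped to XHTML tags."""
    # Unknown elements become <div class="original-name"> so that
    # nothing is lost and the CSS can still target them.
    tag, css_class = TAG_MAP.get(elem.tag, ("div", elem.tag))
    out = ET.Element(tag)
    if css_class:
        out.set("class", css_class)
    out.text = elem.text   # text before the first child
    out.tail = elem.tail   # text following this element in its parent
    for child in elem:
        out.append(to_xhtml(child))
    return out


def convert(xml_text):
    """Convert one XML fragment to an XHTML string."""
    root = to_xhtml(ET.fromstring(xml_text))
    return ET.tostring(root, encoding="unicode")
```

Keeping the mapping in one table means a change to the target markup, such as adding a class for the CSS to hook into, is a one-line edit followed by a re-run.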

To create a PDF using Prince, I used the command:

/path/to/prince --style myformat.css mydata.html

In the above command, the CSS file (myformat.css) controls formatting and the XHTML file (mydata.html) contains the content that is converted to PDF. The output defaults to the name of the data file with a PDF extension (mydata.pdf in the example).
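Because the layout took many iterations, it can help to wrap the command in a small script. The Python sketch below is one way to do that, not the author's actual setup: the function names are made up, it assumes the prince executable is on the PATH unless a path is given, and it assumes Prince's -o option for naming the output file.

```python
import subprocess


def prince_command(html_path, css_path, pdf_path=None, prince="prince"):
    """Build the argument list for one Prince run."""
    cmd = [prince, "--style", css_path, html_path]
    if pdf_path is not None:
        # -o names the output file explicitly; without it the PDF
        # takes the input's name (mydata.html -> mydata.pdf).
        cmd += ["-o", pdf_path]
    return cmd


def build_pdf(html_path, css_path, pdf_path=None, prince="prince"):
    """Run Prince; raises CalledProcessError on a non-zero exit."""
    cmd = prince_command(html_path, css_path, pdf_path, prince)
    subprocess.run(cmd, check=True)
```

Calling build_pdf("mydata.html", "myformat.css") is then equivalent to the command shown above, with the executable looked up on the PATH.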

CSS considerations

As I created my CSS file, I picked up some useful tips from the demo CSS file included with Prince. It contained examples of print formatting, such as @page settings, automatic page numbering, multi-column layout, and page-break properties. I was particularly impressed with the layout using the multi-column setting. Simply changing one number reformatted the entire document with one, two, or three columns, flowing all the text as needed.

I had some issues with wide tables that overflowed a column. The workaround was to either reduce the font size for that table using a special CSS class or break out of two-column mode for the table. I also had to break out of two-column mode for the images.
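To make the "one number" point concrete, here is a hedged Python sketch that writes out a print stylesheet with the column count as its only parameter. The class names small and fullwidth are hypothetical, the page size and margins are placeholders, and column-span is the standard CSS multi-column property for leaving column mode; the article does not say which properties the author actually used or whether the Prince release of the time supported that one.

```python
def build_stylesheet(columns=2):
    """Return print CSS with the column count as the only variable."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    return f"""\
@page {{
    size: letter;
    margin: 2cm;
}}
/* Changing this one number reflows the whole document. */
body {{ column-count: {columns}; column-gap: 1.5em; }}
/* Start each chapter heading on a new page. */
h2 {{ page-break-before: always; }}
/* Workaround 1: shrink the text of a wide table (hypothetical class). */
table.small {{ font-size: 8pt; }}
/* Workaround 2: let a wrapper span all columns (hypothetical class). */
div.fullwidth {{ column-span: all; }}
"""
```

Writing the result of build_stylesheet(3) to myformat.css and re-running Prince would be enough to compare a three-column layout against the two-column one.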

One feature I needed was the ability to print the name of the current chapter in the footer of each page. The Prince FAQ page showed how to assign a value to a variable in CSS that can be used later. The "chapter" class definition shown below stores the text of the chapter title in a variable, and it is printed with the page number in the footer.

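/* Store each chapter title in the named string "doctitle". */
.chapter { string-set: doctitle content(); }

/* Print the stored title and the page number in the footer. */
@page {
    @bottom-center {
        font: 12pt "Tahoma", serif;
        content: string(doctitle) counter(page);
        vertical-align: top;
        margin: 0.3em 0;
    }
}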

As expected, my customer wanted to change a few formatting elements. With Prince, adjusting the CSS to produce the final document was quick and painless.

Using common CSS was a bonus, and allowed me to complete an important project well ahead of my deadline.
