September 29, 2004

XML-based print publishing with DSSSL and OpenJade

Author: Jonathan Bartlett

XML publishing has been thrown around as a catchphrase in the industry quite a bit. What does it really mean? Most XML-based publishing happens entirely on the Web, although great tools exist for doing print publishing with XML. This two-part article will guide you through the details of print publishing with XML, using tools that come standard with most Linux distributions.

What's so great about XML publishing? You can completely separate the look of the document from the content of the document. As programmers would say, you are separating the model from the view. XML by itself doesn't do this for you -- you have to come up with a reasonable tagset yourself that adequately represents the content, not the style, of your document.

Why go to all the trouble? Flexibility. The documents we will discuss are group study guides, with content for the leader's guide as well. Separating content from presentation allows us to generate a student guide, a leader guide, and later a compendium, all from the same content files. Simply applying a new stylesheet modifies the content's format to fit whatever kind of document we want to produce. In addition, we can also have Web-optimized stylesheets if we want to copy the material to the Web.

Print-specific issues

Many people who already use XML for the Web may wonder what is so different about print publishing. There are several differences:

Page size: Your document has to be optimized to fit whatever page size you want to print to. This includes forcing and preventing page breaks at certain locations. Decide what page size you want to have based on your target market and your printing budget.

Page sides: On the Web, pages only have one side. In print, however, pages have two sides, and they often need to be styled differently. In addition, some content belongs on certain page sides. Chapters, for instance, usually start on odd-numbered pages. Also, the location of header and footer items changes depending on whether you are printing an odd-numbered or even-numbered page.

Cross-references: On the Web, cross-references within a document (including tables of contents and indexes) are handled by a link, while in print you have to include either the section number or the page number.

Color: It's much cheaper to print in black and white than it is to print in color. Therefore, for your copy text, you want to make sure that all layout is done in black and white. You will likely want your cover to be in full color, but color in the print world is much different from color on RGB monitor screens.

Resolution: Web text and graphics usually appear at a low resolution, such as 72 or 96 dots per inch (dpi). However, in the print world, that resolution gives substandard quality; you will usually want 200 to 300 dpi.

Before you begin creating XML-based documents for printing, you'll need to have the following software installed on your Linux system:

  • OpenJade
  • JadeTeX
  • Ghostscript
  • xmltex
  • teTeX
  • vi or emacs
  • The GIMP

Writing the content

You may have thought that the first step would be writing the document type definition (DTD). Maybe in academia it is, but in the real world you won't really know what tags you'll use before you use them. Luckily, XML doesn't force us to have a DTD. Simply use tags as you see the need for them. Some notes on tag creation:

  • Most documents have a root tag, with a title right under it. Be sure to name your root tag somewhat uniquely, so you won't cross namespaces if you later want your document to be part of a collection.
  • Using shorter tags makes it easier to type.
  • Use standard HTML tags when it makes sense. Almost all of my documents contain p, ul, li, ol, and title.
  • Don't use HTML's header tags. You should block off your document into sections. For example, instead of having an h1 tag, you should have a mainsection tag that spans the entire main section, and then a title subtag that has the actual title. This makes your document reflect the content rather than the presentation.
  • For inline markup, use descriptive names, not just typesetting names. Remember, we're going for content, not presentation. For example, use booktitle, not underline.
  • You can use attributes in your tags, too. For example, when doing quotations, I usually do something like this: <blockquote author="..." publisher="..."> and give the full bibliographic data within the tag. I can add or leave out as much of the data as I want at the site of the quote, and create a bibliography at the end out of all of the quotes.

You'll probably want to be able to view what you're typing as you type it. For this reason, I usually include a basic stylesheet so that I can view my work in my browser and print it out for others to mark up. To do this, link a CSS stylesheet in like this:

<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet type="text/css" href="basic.css" ?>
<!--Document Goes here -->

That weird almost-tag <?xml-stylesheet> is what's called a processing instruction. It is not part of your marked up content, but instead provides signals to certain processing engines and gives them additional data about how to process your document. Modern Web browsers are supposed to use the xml-stylesheet processing instruction to locate stylesheets for your document.

CSS is easy to learn, but since this article is about print publishing, we won't go into depth in CSS. However, here is a short example document and a stylesheet that lets you display it.

The document:

<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet type="text/css" href="basic.css" ?>
<?xml-stylesheet type="text/dsssl" href="basic.dsl" ?>
<groupstudyguide>
<title>Discipleship</title>
<author>
<firstname>John</firstname>
<lastname>Doe</lastname>
</author>

<aboutthisstudy>
<p>
This study goes into depth of how to code XML documents.
</p>
</aboutthisstudy>

<lesson>
<title>Overview of XML</title>

<leadernotes>
<p>
Leaders, use this lesson to familiarize students with the basics of XML.
</p>
</leadernotes>

<p>
Document text here...
</p>

<p>
Document text here...
</p>

<homework>
<p>
Homework here
</p>
</homework>
</lesson>

</groupstudyguide>

Copy, paste, and save this document as study.xml, as we will be using it as the example for the rest of the article. You'll notice that we have a second stylesheet processing instruction in there as well -- that will be for our DSSSL stylesheet when we get to it.

The style sheet, basic.css, looks like this:

groupstudyguide {
	display: block; /* elements in CSS can be either "block" or "inline" */
	margin: 20px;   /* Put 20 pixels of space on all sides of the document */
	font-family: "verdana", "helvetica", sans-serif;
	font-size: 12px; /* The base font size should be 12 pixels high */
}

groupstudyguide > title {
	display: block;
	font-size: 18px;
	font-weight: bold;
	text-align: center;
}

author {
	display: block;
	text-align: center;
}

aboutthisstudy:before { /* Using :before allows you to insert content before an element */
	display: block;
	font-size: 16px;
	font-weight: bold;
	content: "About This Study";
}

aboutthisstudy {
	display: block;
}

p {
	display: block;
	margin-bottom: 20px; /* put space after paragraphs */
}

lesson > title {
	display: block;
	font-size: 14px;
	font-weight: bold;
}

leadernotes:before {
	display: block;
	font-weight: bold;
	content: "Notes for leaders:";
}

leadernotes:after {
	display: block;
	font-weight: bold;
	content: "End of Notes";
}

homework:before {
	display: block;
	font-weight: bold;
	content: "Homework: ";
}

Your stylesheet can be much fancier, but this is enough to get you started. Luckily, the CSS standard is really easy to read and very well organized. Mozilla and Opera implement CSS very well, but Internet Explorer doesn't, and both IE and Konqueror have trouble with :before and :after content.

Getting started with typesetting

Once you have your content written, it's time to put it into book form. To do that we use a little-known language called Document Style Semantics and Specification Language (DSSSL). This language, which is based on the Scheme programming language, is currently used mostly for typesetting DocBook documents using Norman Walsh's stylesheets, but it works equally well with any set of tags using your own stylesheets.

The language is fairly easy to learn. DSSSL is based on flow objects, or more specifically, specifications of sequences of flow objects, called sosofos for short. Basically, you use built-in flow objects to specify how your tags should be laid out, and the DSSSL engine converts your specification into a print specification. OpenJade, the software that implements the DSSSL standard (Jade stands for James' DSSSL Engine, after its author James Clark), uses TeX on the back end for print typesetting.

Let's look at a simple DSSSL stylesheet for our document. Copy, paste, and save it as basic.dsl.

<!DOCTYPE style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">
<style-sheet>
<style-specification>
<style-specification-body>

(define default-font-size 12pt)
(define heading-font-size (* default-font-size 1.5))

(element (groupstudyguide)
	(make simple-page-sequence
		right-margin: 1in
		left-margin: 1in
		top-margin: 0.5in
		bottom-margin: 0.5in
		page-width: 8.5in
		page-height: 11in
		(process-children)
	)
)

(element (groupstudyguide title)
	(make paragraph
		font-family-name: "Helvetica"
		font-size: heading-font-size
		line-spacing: 12pt
		space-before: 20pt
		start-indent: 0pt
		quadding: 'center
		(process-children)
	)
)

(element (author)
	(make paragraph
		font-family-name: "Helvetica"
		font-size: default-font-size
		line-spacing: 12pt
		space-before: 6pt
		start-indent: 0pt
		(process-children)
	)
)

(element (lesson title)
	(make paragraph
		font-family-name: "Helvetica"
		font-size: heading-font-size
		line-spacing: 12pt
		space-before: 20pt
		start-indent: 0pt
		(process-children)
	)
)

(element (p)
	(make paragraph
		font-family-name: "Helvetica"
		font-size: default-font-size
		line-spacing: 12pt
		space-before: 6pt
		start-indent: 0pt
		(process-children)
	)
)

</style-specification-body>
</style-specification>
</style-sheet>

We have three basic constructs shown here: define, element, and make. These are processed as follows:

define is used to define constants and functions used throughout the style specification. Constants and functions are used in DSSSL to maintain consistency throughout the stylesheet, and make it easier to modify. For example, we can define the other font sizes in terms of a base font size, and then we can easily produce a large-print edition by simply modifying the size of the base font. Everything else will fall in line. The syntax for the definitions is that of the Scheme programming language.
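
As a minimal sketch of that idea, a large-print edition might change only the definitions at the top of basic.dsl. The 16pt value and the name default-line-spacing are assumptions chosen for illustration; neither appears in the stylesheet above.

```scheme
;; Hypothetical large-print variant of the definitions in basic.dsl.
(define default-font-size 16pt)                         ; was 12pt
(define heading-font-size (* default-font-size 1.5))    ; follows the base size: 24pt

;; Assumed helper, not in the original stylesheet: derive the line
;; spacing from the base size as well, instead of hard-coding 12pt.
(define default-line-spacing (* default-font-size 1.2))
```

For the spacing to follow along, each line-spacing: 12pt in the rules would have to become line-spacing: default-line-spacing. As written, basic.dsl keeps its 12pt line spacing even when the font grows.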

element is a selector. When an element in the document matches what's in the parentheses after the element keyword, that construction rule is fired. In the first instance, when the DSSSL processor hits the groupstudyguide tag, the (element (groupstudyguide) ...) rule is fired. If you have multiple tags in the parentheses, the rule fires only if they match the tag hierarchy in the document in the order listed: the last name is the element itself, and the names before it are its ancestors, outermost first. So (element (lesson title) ...) matches a title whose parent is a lesson, but not the title directly under groupstudyguide.
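
To illustrate the hierarchy matching, here is a sketch of an extra rule that is not part of basic.dsl. It assumes you want paragraphs inside leadernotes set in italics, and it reuses default-font-size from the stylesheet above. When both this rule and the plain (element (p) ...) rule match, the more specific one should take precedence, so ordinary paragraphs are unaffected.

```scheme
;; Hypothetical rule, not in basic.dsl: a <p> whose parent is
;; <leadernotes> is set in italics. All other <p> elements still
;; fall through to the plain (element (p) ...) rule.
(element (leadernotes p)
	(make paragraph
		font-family-name: "Helvetica"
		font-size: default-font-size
		font-posture: 'italic
		line-spacing: 12pt
		space-before: 6pt
		start-indent: 0pt
		(process-children)
	)
)
```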

make builds a flow object. Strictly speaking, the construction rule is the whole element form; the make expression inside it creates a sosofo of the given type, which becomes the result of that rule. make paragraph makes paragraph flow objects, while make simple-page-sequence makes page sequence flow objects. Other flow objects include sequence, box, and display-group. The attributes of a flow object are specified with a keyword list, with each flow object having its own set of keywords. paragraph is the most common flow object, since it is used both for paragraphs and for just about all block-level sequences of characters. After the attributes, you specify the children of the flow object. The (process-children) function processes all of the subelements and text within the current tag, and makes them children of the current flow object.
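
The following sketch shows two other things a rule can produce besides a paragraph. Both rules are assumptions for illustration and are not part of basic.dsl: the first styles an inline booktitle element (the tag suggested earlier, which study.xml does not actually use), and the second shows how a student edition might drop the leader's notes entirely.

```scheme
;; Inline markup: a sequence flow object changes the font for its
;; children without starting a new paragraph.
(element (booktitle)
	(make sequence
		font-posture: 'italic
		(process-children)
	)
)

;; Student edition (hypothetical): produce nothing for <leadernotes>.
;; Its children are never processed, so none of its text is printed.
(element (leadernotes)
	(empty-sosofo)
)
```

Keeping the second rule in a separate stylesheet for the student guide is one way to get both editions from the same study.xml.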

Now, to generate the printed document as a PDF file, run the following commands:

openjade -t tex xml.dcl study.xml
pdfjadetex study.tex

This produces a file called study.pdf, which you can view with any PDF reader. It also produces a number of auxiliary files, such as study.tex, study.log, and study.aux; you can safely delete these once you have the PDF.

Congratulations -- you have now typeset your first document! Play around with the settings and see if you can modify the styles without breaking anything. Change the page-width and page-height parameters; few books are actually 8.5 by 11 inches. One of the nice things about DSSSL is that you can conserve paper by printing your draft versions on full-size 8.5 by 11 paper, and then just adjust two lines of your DSSSL file when you are ready to hand your document off to the printer!
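
As a sketch of that two-line change, here is the page rule from basic.dsl adjusted for a 6 by 9 inch page. The trim size is only an example; use whatever size your printer and budget call for.

```scheme
;; Hypothetical 6in x 9in variant of the page rule from basic.dsl.
;; Only page-width: and page-height: differ from the draft version.
(element (groupstudyguide)
	(make simple-page-sequence
		right-margin: 1in
		left-margin: 1in
		top-margin: 0.5in
		bottom-margin: 0.5in
		page-width: 6in      ; draft: 8.5in
		page-height: 9in     ; draft: 11in
		(process-children)
	)
)
```

With the 1in side margins left as they are, the text column on a 6in page is only 4in wide, so you may want to narrow the margins at the same time.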

Now that you've mastered basic typesetting with DSSSL, it's time to get fancy. We'll cover some more advanced concepts in the second and final part of this article, and look at getting our documents professionally printed.

Jonathan Bartlett is the director of technology for New Media Worx and is the owner of Bartlett Publishing, a Linux-based independent publisher. Jonathan's latest book is Programming from the Ground Up, an introduction to programming using Linux assembly language.
