Opportunities for Open Source software in the publishing industry

– by Chris Gulker
The modern publishing industry can be said to have started in 1455 when Gutenberg’s first Bible was printed. Since that day publishers have been looking for cheaper ways to get things done, and today they are clearly interested in Open Source. At the Seybold publishing conference in San Francisco last week, we were surprised to find Open Source alive and well in four of the first six booths we visited.

Publishing, whether by firms that produce newspapers, magazines, books of all kinds, or even corporate documents, is a very well understood business. Its leaders are firms that have cut costs to the absolute minimum and survive on very thin margins, thanks to intense competition from other publishers and from other media, including, nowadays, the Internet.

The print publishing industry has pretty much standardized on a handful of creation tools. While there are good Open Source tools like The GIMP that have a lot to offer, most people in publishing won’t consider changing their current tools and platform. They have too much invested in training, and most print production jobs advertise for specific skills in applications like QuarkXPress (page layout), Adobe Photoshop (image editing), and Adobe Illustrator (vector graphics).

Print people are also often too busy to learn new software: modern publishing workers are expected to be very productive. They have automated lots of their daily chores using scripting, but it’s not often Perl. An amazing amount of this work is done on Macintoshes, and AppleScript is probably the most-used automation tool. Chicago-based R.R. Donnelley, one of the world’s biggest printers, has literally millions of lines of AppleScript, and a coding department to maintain them.

As it was in the beginning …

It’s not widely known, but Gutenberg’s first Bible was actually printed by a venture capitalist after he’d repossessed Gutenberg’s first press. Therein lies a tale for modern Open Source developers looking for opportunities in the multi-billion-dollar publishing technology market.

The venture capitalist, Johann Fust, had invested 800 guilders in Gutenberg’s startup and was eager to see a return after a couple of years’ work resulted in a working press. Gutenberg, however, discovered that his fonts, which could fit 42 lines on a page, were too big, and he couldn’t print books any more cheaply than they could be had from the competition: monks who copied out the books in longhand.

Gutenberg wanted to make the fonts smaller, so he could use fewer pages for the same length book, even though it meant essentially starting from scratch on the fonts. Paper was handmade at that time and constituted one of the largest expenses in bookmaking. Incidentally, a book, in 1455, cost the equivalent of about $300 in modern dollars.

Fust wanted Gutenberg to ship now, regardless. And he had made it a condition of his investment that Gutenberg hire Fust’s son-in-law (some say brother-in-law) as an assistant. So, Gutenberg went on to “pursue other interests,” and Fust and Peter Schöffer, the son-in-law, actually printed “Gutenberg’s” first 42-line Bibles.

Two of the growth areas in publishing technology are digital asset management and production automation. Publishers are eager to wring every last dollar out of the materials they create and it’s usually much cheaper to create new products out of work that’s already on hand than to start from scratch. An example would be a magazine company repackaging related articles into a special edition. One problem is finding the material — large companies like AOL Time-Warner have literally billions of files on hand, often spread around plants and servers all over the world. Another problem is automating the repurposing process — it can be just as time-consuming (read “expensive”) for human operators to reformat content by hand from, say, print to HTML as it is to create the stuff in the first place.

To offer help with digital asset management, Google was at Seybold last week pitching its Search Appliance, a bright yellow, 1U rack-mount server that is designed to be dropped into a file server farm, where it will handle the chore of indexing everything it can find. Google publicist Nathan Tyler says the machine runs a custom version of Linux, derived from Red Hat, and optimized to run Google’s proprietary search algorithms.

So much for finding stuff, although one wonders if Nutch, the recently announced Open Source search engine project, may result in code that could be adapted to make custom search appliances that run on commodity hardware. There could be an opportunity for developers to install and customize these appliances for large media companies or to develop their own products, possibly at lower cost than Google and other proprietary search vendors.

In the area of production automation, two companies that use Open Source to make it easier to repurpose content are Exegenix and Innovation Gate. Both companies build proprietary XML products on Open Source technologies like Tomcat, Apache, and Linux. XML, by the way, has been a kind of Holy Grail in print publishing for years. Publishers know that if they can tag content for what it is — headline, byline, body text, etc. — it becomes a lot easier to automate the repurposing process. Unfortunately, the products they have used for more than a decade only relatively recently added capabilities to tag content for what it is — e.g. “body copy” — rather than for what it should look like — e.g. “12-point Times Roman.”
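To make the idea of tagging content “for what it is” concrete, here is a minimal sketch in Python using only the standard library. The element names (article, headline, byline, body, p) and the function name tag_article are illustrative assumptions, not the schema of any product mentioned here.

```python
import xml.etree.ElementTree as ET


def tag_article(headline, byline, paragraphs):
    """Label each piece of content by its role, not by its appearance.

    The element names used here are hypothetical, chosen for illustration.
    """
    article = ET.Element("article")
    ET.SubElement(article, "headline").text = headline
    ET.SubElement(article, "byline").text = byline
    body = ET.SubElement(article, "body")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    return article


article = tag_article(
    "Open Source at Seybold",
    "Chris Gulker",
    ["First paragraph.", "Second paragraph."],
)

# Serialize to a string; nothing in the markup says "12-point Times Roman".
xml_text = ET.tostring(article, encoding="unicode")
print(xml_text)
```

Because the markup records only roles, the decision about fonts and sizes can be made later, and made differently for print, HTML, or PDF.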

Hand conversion of giant archives is regarded as impossibly expensive. So Exegenix and Innovation Gate offer products that take files in a wide variety of input formats, plug them into XML as best they can, then display the results to operators who can catch and fix problems. Once the files are in XML, operators can set up templates that automate the creation of HTML pages, PDF files, and new print pages much more quickly than formatting each by hand. In the case of the similar classes of documents found at insurance companies and financial services firms, the time savings can be huge. Innovation Gate, which runs on any platform that supports J2EE, and Exegenix, which runs natively on Win32 and Linux, both have rosters of corporate clients who use their products to reduce costs and speed work flow.
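The template step can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not how Exegenix or Innovation Gate work internally: the XML element names, the PAGE template, and the render_html function are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape
from string import Template

# Hypothetical tagged content, labeled by role rather than by appearance.
SOURCE = """<article>
  <headline>Repurposing &amp; reuse</headline>
  <byline>Staff writer</byline>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</article>"""

# One template serves every document that follows the same structure.
PAGE = Template(
    "<html><head><title>$headline</title></head>"
    "<body><h1>$headline</h1>"
    '<p class="byline">$byline</p>'
    "$body</body></html>"
)


def render_html(xml_text):
    """Fill the HTML template from role-tagged XML."""
    article = ET.fromstring(xml_text)
    body = "".join(
        f"<p>{escape(p.text or '')}</p>" for p in article.findall("body/p")
    )
    return PAGE.substitute(
        headline=escape(article.findtext("headline", default="")),
        byline=escape(article.findtext("byline", default="")),
        body=body,
    )


html_page = render_html(SOURCE)
print(html_page)
```

Once the template exists, each additional document of the same class costs a function call rather than an operator’s time, which is where the savings for large batches of similar documents would come from.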

Artifex is one of the oldest successful publishing-oriented Open Source developer shops. According to CTO Raph Levien, Artifex has been developing products based on Ghostscript since 1988, and claims more than 80 OEM customers, including IBM, HP, Macromedia, and Xerox. Artifex offers a very complete set of graphics libraries that let other products — everything from software to ink jet printers — render pages from PostScript, PDF, PCL, and other formats. The core company has 10 people, and often taps contractors and volunteers from legions of Ghostscript developers all over the world for specific projects.

There are more opportunities. The raster-image processors used by printers and plate-makers have traditionally been expensive, proprietary software running on proprietary Unices like IRIX. No one has yet managed to build a general purpose publishing workflow system that has attracted more than a point or two of market share, even though the publishing industry is standardizing on an XML format called Job Definition Format (JDF). Proprietary systems have in the past been too expensive and inflexible, and publishers have been burned by vendor lock-in strategies. There may well be opportunities for MySQL and PostgreSQL developers to learn how to apply their skills to workflow systems.

Open Source developers can also find lots of niche opportunities — publishing is a huge and varied field, and these customers will listen to developers who can save them money. Where Gutenberg failed, an Open Source developer may well succeed.

Chris Gulker, a Silicon Valley-based freelance technology writer, has authored more than 130 articles and columns since 1998. He shares an office with 7 computers that mostly work, an Australian Shepherd, and a small gray cat with an attitude.