DocVert Can Handle All Your Document Conversion Needs

Thought you’d be living in a Microsoft Office-free world by 2011? Unless you’re in a Linux-only shop that does business only with other Linux-only shops, the chances are that dream remains a few years away, and you still have to drag out an office file converter periodically. The trouble is, each free software office suite has its own, and they vary in their capabilities. Enter DocVert, a worthy GPLv3-licensed utility to keep at the ready, thanks to its choice of CLI- and Web-based interface options, and its flexible output formatting.

DocVert actually started out as a Web application back in 2004, and earned some success in that form. Users could upload a word processor file, hit the convert button, and get properly formatted HTML back in short order. If you think dealing with .DOC files in Abiword or LibreOffice is painful, just ask a Web developer, many of whom still have to deal regularly with clients who deliver content in .DOC format. The free software office suites don’t offer much in the way of Web-centric conversion tools, and recreating content by hand is a colossal pain.

Development continued at a relatively steady pace for the next few years, with new features added incrementally and updates to match format changes. But in March of 2011, maintainer Matthew Holloway undertook a complete revision, rewriting the core in Python alone, without any PHP dependencies. The result of the effort is the 5.x series, which can be run as a Web service, or accessed at the command line.

Setup

The program uses XSLT to transform documents from one form to another, with a distinct “pipeline” for each document type desired — LibreOffice ODT, HTML, plain XML, even DocBook. The newest release is version 5.1, which dropped on August 16, 2011. It is mainly a bugfix release, including improved handling of internal links such as footnotes.

You can download the release as a tarball from the project’s Web site, or you can grab the source from the project’s Git repository and pick up more recent changes. In an email exchange, Holloway told me that direct ODT output is only supported in the Git version, so that is what I downloaded. ODT output should be ready for the 5.2 release in September. Because the application is written in Python, there is no installation process required, but you must install the dependencies yourself. DocVert requires Python 2.6, the LXML and pdf2svg packages, and the Python-UNO library.

UNO (Universal Network Objects) is the runtime library offered by OpenOffice.org and LibreOffice. DocVert uses it to call those applications’ internal conversion routines. If you are not sure which to install, the safer option is to go with the LibreOffice version. Holloway is using LibreOffice for development, and OpenOffice support could go away in the future.

Usage

From the command line, cd into the directory where you downloaded DocVert, and type python ./docvert-cli.py --list-pipelines. This will return a list of all of the available conversion pipelines, which have simple names like “docbook,” “open document,” and “web standards.” You can convert a single document with python ./docvert-cli.py -pipeline pipelineName inputFile.DOC. DocVert will generate your output (either in a single file for ODT or in a folder for HTML) and report back its name on stdout.
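If you want to run the same conversion over a folder of files, the single-document command above can be wrapped in a short script. What follows is only a rough sketch: the function names are made up for illustration, the flag is spelled exactly as printed in this article (check it against your copy of docvert-cli.py), and the wrapper simply hands each file to whatever python is on your path.

```python
import subprocess
from pathlib import Path

# Assumption: flag spelling copied from the article text; verify it
# against your own copy of docvert-cli.py before relying on it.
PIPELINE_FLAG = "-pipeline"


def build_command(cli_script, pipeline, input_file, python_bin="python"):
    """Return the argument list for converting one document."""
    return [python_bin, str(cli_script), PIPELINE_FLAG, pipeline, str(input_file)]


def convert_folder(cli_script, pipeline, folder, pattern="*.doc"):
    """Convert every matching file in `folder`, one CLI call per file.

    Returns a dict mapping each input path to the text the CLI printed
    on stdout (the article says this is the name of the output).
    """
    results = {}
    for doc in sorted(Path(folder).glob(pattern)):
        completed = subprocess.run(
            build_command(cli_script, pipeline, doc),
            capture_output=True,  # collect stdout and stderr
            text=True,            # decode the output to str
            check=True,           # raise if the conversion fails
        )
        results[doc] = completed.stdout.strip()
    return results
```

Calling convert_folder("./docvert-cli.py", "docbook", "incoming") would then attempt every .doc file in a hypothetical incoming folder and stop at the first failure.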

[Image: DocVert in Action]

The “basic” and “web standards” pipelines are both used for HTML output. Basic just has more relaxed HTML formatting rules. If you are interested in HTML output, you can also add the --autopipeline argument to your command, specifying either of the supplied themes afterward. The current version includes two: one that creates a single long page as the output, and one that breaks up the document into multiple HTML pages, separated at the headings found inside the document.

For single-file usage, this method works great, and you can use it to test DocVert’s compatibility with the wild-and-crazy complexities of real-world Word documents. DocVert supports not only the basic formatting you might find in a simple letter or paper, but complex structures as well, including embedded images, tables, nested lists, and footnotes.

But if you need to convert a large collection of documents, you might find the Web service invocation faster. You start DocVert in this mode with python ./docvert-web.py -p PortNumber. If you leave off the port number, DocVert will default to port 8080, running on localhost. The Web UI allows you to select a pipeline and upload a file, and it returns the converted output to you in a ZIP file.

But the real advantage of Web mode is that it implements a full REST API. The script responds to HTTP POSTs that contain word processor documents, and sends a ZIP file in reply. That makes the interface scriptable, and in most cases it is easier to script than the command-line tool. If you are not ready to install and run the Web version yourself, you can also test out the project’s live demo on the Web site.
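To give a feel for what scripting against the Web service might look like, here is a minimal sketch that POSTs one document and saves the ZIP reply, using only the Python standard library. Treat the details as assumptions: the endpoint path and both form field names below are placeholders I made up, not documented DocVert parameters, so look at the upload form that docvert-web.py serves and adjust them to match.

```python
import os
import urllib.request
import uuid

# Assumptions (not from DocVert's documentation): the endpoint path and
# both form field names are hypothetical placeholders.
DOCVERT_URL = "http://localhost:8080/"  # port 8080 is the stated default
FILE_FIELD = "upload_file"              # hypothetical field name
PIPELINE_FIELD = "pipeline"             # hypothetical field name


def encode_multipart(fields, filename, file_bytes, boundary=None):
    """Build a multipart/form-data body; returns (content_type, body)."""
    boundary = boundary or uuid.uuid4().hex
    marker = b"--" + boundary.encode()
    parts = []
    for name, value in fields.items():
        header = 'Content-Disposition: form-data; name="%s"' % name
        parts += [marker, header.encode(), b"", value.encode()]
    file_header = 'Content-Disposition: form-data; name="%s"; filename="%s"' % (
        FILE_FIELD, filename)
    parts += [
        marker,
        file_header.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        file_bytes,
        marker + b"--",  # closing boundary
        b"",
    ]
    return "multipart/form-data; boundary=" + boundary, b"\r\n".join(parts)


def convert_via_web(doc_path, zip_path, pipeline):
    """POST one document to a running DocVert service and save the reply."""
    with open(doc_path, "rb") as handle:
        content_type, body = encode_multipart(
            {PIPELINE_FIELD: pipeline}, os.path.basename(doc_path), handle.read())
    request = urllib.request.Request(
        DOCVERT_URL, data=body,
        headers={"Content-Type": content_type}, method="POST")
    with urllib.request.urlopen(request) as response, open(zip_path, "wb") as out:
        out.write(response.read())
```

Building the request body by hand keeps the sketch free of third-party packages; a higher-level HTTP client would shorten it considerably.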

How the Magic Works

I tested DocVert on old Word Doc files from emails and miscellaneous files from HR departments (none of my real co-workers have sent me a .doc file in years). I wasn’t able to trip it up, although it is worth noting that DocVert does not attempt to capture and reproduce the visual features of the document, such as encoding the exact font used.

The clever bit is that in 99.9 percent of cases, that’s fine. You rarely if ever need to preserve the font or the margin, or other Word-specific markup features, and it is exactly those elements that cause most Word-to-HTML converters to produce such horrifyingly bad HTML output. DocVert, in contrast, converts the document to an intermediary format that preserves its overall structure: headings, relative positioning, indentation, and so forth. It then uses XSLT to transform between the intermediary format and the selected final output, in a clean and predictable way.
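The general idea of a structure-only XSLT step can be tried out with the LXML package that DocVert already depends on. The sketch below is not DocVert’s real intermediary format or one of its shipped stylesheets: the element names (document, heading, para) and the stylesheet are invented purely to show how structural markup maps onto clean HTML.

```python
from lxml import etree

# Hypothetical intermediary document: these element names are invented
# for illustration and are not DocVert's actual intermediary format.
INTERMEDIATE = b"""<document>
  <heading level="1">Quarterly Report</heading>
  <para>Sales rose in the third quarter.</para>
</document>"""

# A tiny stylesheet that maps structure to HTML and nothing else:
# no fonts, margins, or other presentational details are carried over.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="no"/>
  <xsl:template match="/document">
    <html><body><xsl:apply-templates/></body></html>
  </xsl:template>
  <xsl:template match="heading[@level='1']">
    <h1><xsl:value-of select="."/></h1>
  </xsl:template>
  <xsl:template match="para">
    <p><xsl:value-of select="."/></p>
  </xsl:template>
</xsl:stylesheet>"""


def to_html(xml_bytes, xslt_bytes):
    """Apply an XSLT stylesheet to an XML document; return the result as text."""
    transform = etree.XSLT(etree.XML(xslt_bytes))
    return str(transform(etree.XML(xml_bytes)))


html = to_html(INTERMEDIATE, STYLESHEET)
```

Swapping in a different stylesheet changes the output format without touching the input, which is roughly what a separate pipeline per output type amounts to.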

In fact, if you know XML and XSLT, you can even write your own transformations. A good place to begin is with studying the pipelines that ship in DocVert 5.1. At the moment, due to the rewrite, there is only a basic list of output formats and variations supplied. But Holloway reports that the migration to pure Python amounts to a 300% speedup compared to the earlier mixed Python/PHP version, which in my estimation makes the shorter pipeline list a worthwhile trade.

Expect more pipelines to come as the 5.x series matures; earlier versions of the application supported a wide range of outputs, including S5 HTML slideshows, and even a “reverse” HTML-to-OpenDocument transformation, which Holloway describes as a tool to help you migrate content out of a bad legacy CMS. In short, if it can be represented in XML, DocVert is capable of consuming it and of generating it.