March 1, 2006

Viewing Word files at the command line

Author: Scott Nesbitt

As a Linux user, you sometimes have to play nicely with users of Windows or Mac OS -- such as when they send you Microsoft Word files. When you receive a Word file, you can either follow Richard Stallman's advice and refuse it, or bite the bullet and work with it. Modern Linux word processors -- such as OpenOffice.org Writer, AbiWord, KWord, and TextMaker -- can deal with most Word files. But if you don't want to fire up a word processor just to read or print the document, you can turn to the command line. A handful of small but powerful Linux command line utilities make viewing, printing, and even converting Word files to another format a breeze.

Antiword

Antiword is a nifty application that can convert Word documents to plain text, PostScript, and PDF. It can also convert them to DocBook XML, but according to the developer that conversion is still experimental and doesn't always work well.

Antiword is very flexible. It can read and convert files created with Word versions 2.0 to 2003, and you can run it on multiple operating systems, including Linux, Mac OS X, RISC OS, FreeBSD, and OpenVMS. On top of that, you can set the paper size for documents converted to PostScript or PDF, include any text that was removed from the file (but which Word notoriously keeps a record of), and display any hidden text.

For the most part, you'll just want to view a Word document. To do that, type the following command:

antiword file.doc

The Word document will be converted to text and printed to the screen. If you're running Antiword in a terminal window, you'll have to scroll up to view the full text of the document. To get around this, you can pipe the output from Antiword to the less utility, which will allow you to scroll through the document page by page from the top:

antiword file.doc | less
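The other output formats mentioned above are selected with command line options. The following is a minimal sketch, not a definitive recipe: the function name doc_convert and the default paper size are assumptions made for illustration, and the -t, -p, and -a options are the ones described in the Antiword man page, so check them against your installed version.

```shell
# Hypothetical helper: convert a Word file with Antiword to text, PostScript, or PDF.
# Usage: doc_convert FORMAT FILE.doc [PAPER]   (the result is written to stdout)
doc_convert() {
    format=$1
    file=$2
    paper=${3:-a4}   # assumed default; only used for PostScript and PDF
    case $format in
        text) antiword -t "$file" ;;            # plain text (Antiword's default)
        ps)   antiword -p "$paper" "$file" ;;   # PostScript at the given paper size
        pdf)  antiword -a "$paper" "$file" ;;   # PDF at the given paper size
        *)    echo "doc_convert: unknown format '$format'" >&2
              return 2 ;;
    esac
}

# Example: doc_convert pdf report.doc letter > report.pdf
```

Because the result goes to standard output, you can redirect it to a file or pipe it straight to a printing command.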

Catdoc

Slightly less flexible than Antiword, but still useful, is Catdoc, whose developer explains that "it does same work for .doc files as the Unix cat command for plain ASCII files."

While Antiword tries to retain some of the formatting of a Word file, Catdoc is a quick and dirty tool. It outputs either LaTeX or plain text, and little else. The LaTeX output leaves a lot to be desired -- it does nothing beyond adding the LaTeX formatting for tables or special characters. You'll have to add the LaTeX preamble and any other formatting code yourself.

Catdoc has some rudimentary support for tables. If it's converting a simple table, the output will be passable. If the table is more complex, say with nested elements, it won't be pretty.

To run Catdoc, type the following command:

catdoc <output_format> filename.doc

You can specify the output format using the -a (text) or -t (LaTeX) option. So, to convert the Word file whitepaper.doc to text, type:

catdoc -a whitepaper.doc

As with Antiword, you can pipe the output from Catdoc to the less utility.
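Since Catdoc's LaTeX output lacks a preamble, one way to get a file that LaTeX will accept is to wrap the output yourself. This is only a sketch: the function name doc_to_latex is hypothetical, and the article document class is an assumption that you will probably want to change.

```shell
# Hypothetical helper: wrap Catdoc's LaTeX output in a minimal preamble.
# The article class is an assumption; adjust the preamble to suit the document.
doc_to_latex() {
    printf '%s\n' '\documentclass{article}' '\begin{document}'
    catdoc -t "$1" || return 1   # -t asks Catdoc for LaTeX output
    printf '%s\n' '\end{document}'
}

# Example: doc_to_latex whitepaper.doc > whitepaper.tex
```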

wvWare

wvWare is part of wv, a library that enables developers to code software that can read and write Word files. In fact, both AbiWord and KWord use wv for importing Word documents. wvWare can handle documents created with Word from version 6 to 2000. It converts Word 2.0 documents to text only.

Used by itself without any command line options, wvWare will convert a Word document to HTML and display the code on the screen. If you want to write the HTML to a file, use the following command:

wvWare file.doc > file.html

But you're not stuck with HTML. wvWare comes with a set of scripts that can convert Word files to a number of other formats, including plain text, HTML, LaTeX, PDF, PostScript, LaTeX DVI, and WML. These scripts are usually installed in the folder /usr/bin. You can get a list of them by typing ls /usr/bin/wv* at the command line.

If you want to convert a Word document to text, use the following command:

wvText file.doc file.txt

I've never been able to pipe the output to the less utility or a text editor. I've always had to open a file converted with wvWare in an editor or browser.
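One possible workaround is to let wvText write to a temporary file and then page through that file. The sketch below assumes the mktemp utility is available; the function name view_with_wvtext is hypothetical, and the pager defaults to less unless the PAGER environment variable says otherwise.

```shell
# Hypothetical helper: convert with wvText into a temporary file, then page it.
view_with_wvtext() {
    tmp=$(mktemp) || return 1
    if wvText "$1" "$tmp"; then
        ${PAGER:-less} "$tmp"   # falls back to less when PAGER is unset
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"                # clean up whether or not the conversion worked
    return $status
}

# Example: view_with_wvtext file.doc
```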

You can view Word files using wvWare and the w3m text-mode Web browser, as detailed in the book Linux Desktop Hacks. I've tried this hack with the text-based browsers Lynx and Links as well, but w3m does the best job out of the three.

To use this hack, type the following command:

wvWare -x /usr/lib/wv/wvHtml.xml file.doc | w3m -T text/html

You can also encapsulate the above command in a script if you decide to use this hack regularly. wvWare converts the Word file to HTML using the configuration file named wvHtml.xml, then pipes the output to the w3m browser.
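Such a script might look like the following sketch. The function name viewdoc and the WV_CONFIG variable are made up for illustration, and the default configuration path is simply the one used in the command above, which may differ on your system.

```shell
# Hypothetical wrapper around the wvWare and w3m pipeline shown above.
# WV_CONFIG defaults to the path used in this article; it may differ on your system.
viewdoc() {
    config=${WV_CONFIG:-/usr/lib/wv/wvHtml.xml}
    if [ $# -ne 1 ]; then
        echo "usage: viewdoc file.doc" >&2
        return 2
    fi
    if [ ! -r "$1" ]; then
        echo "viewdoc: cannot read $1" >&2
        return 1
    fi
    # Convert to HTML and hand the result to w3m on standard input.
    wvWare -x "$config" "$1" | w3m -T text/html
}

# Example: viewdoc file.doc
```

Put the function in your shell startup file, or save its body as an executable script somewhere in your path.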

A gotcha or two

While Antiword, Catdoc, and wvWare do a good job handling most Word files, you might run into documents that don't want to cooperate with you. I've found that these utilities sometimes can't process documents that are saved with Word's Fast Save feature, which quickly saves a file by tacking any changes to the end of the file. For example, Antiword might display the cryptic message "The Small Block Depot is damaged" when it encounters a Fast Saved file. This doesn't happen with all Fast Saved files, however.
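If you have more than one of these utilities installed, you could try them in turn until one of them manages to read the file. The sketch below rests on an assumption that is worth verifying: that each tool returns a non-zero exit status when it fails. The function name doc_to_text is hypothetical.

```shell
# Hypothetical fallback: try Antiword first, then Catdoc.
# Assumes each tool returns a non-zero exit status when it cannot read the file.
doc_to_text() {
    for tool in antiword catdoc; do
        # Skip tools that are not installed.
        command -v "$tool" >/dev/null 2>&1 || continue
        if out=$("$tool" "$1" 2>/dev/null); then
            printf '%s\n' "$out"
            return 0
        fi
    done
    echo "doc_to_text: no converter could read $1" >&2
    return 1
}

# Example: doc_to_text file.doc | less
```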

As well, out of the box these programs might display garbage characters when converting Word files that use non-Latin character sets or that contain graphics. Check the documentation for the program that you're using for information on how to deal with character sets and graphics.

You don't need a word processor to view Microsoft Word documents on Linux. With the right command line apps, you can view or print those files in a flash with just a few keystrokes.

Scott Nesbitt is a Toronto-based technical writer and journalist who is a big fan of useful little command-line utilities.