April 2, 2010

Plain Text, Archiving, and Presentation Fidelity

My introduction to computers was as a hobby. My first computer had Microsoft Works for MS-DOS version 1.05 installed on it. Among other things, I decided to use the computer to keep a journal. It was perfect. I could do all my writing on the computer and even edit the text without wasting paper.  

What was even better, I could store my journal on floppy disks, which I assumed were more durable than paper.  

Several years later I decided to open and read the journal I wrote on that first computer. I no longer had Microsoft Works, and the word processor I was using by then could not open my journals. I learned my first lesson in proprietary file format lock-in that day.  

I had failed to consider the long-term consequences of storing documents in proprietary formats, or even to consider formats at all, really. I had sacrificed those concerns in favor of the editing and storage efficiency of computer-based documents over paper-based ones.  

I immediately searched for a format that could work across applications and operating systems (I was using OS/2 by then). I tried various formats with varying degrees of failure. The only format that worked 100% of the time was plain text. It was also clear to me that this format was likely to keep working well into the future, because it had already been in use for a long time (in computer years).  

It was also the only format that worked with 100% of the programs that offered text editing capabilities. It didn't matter if I used a word processor or a text editor, and it didn't matter if I used a Microsoft operating system or one from some other vendor.  

I recognized even then that the file format of Microsoft's office software chained its customers to its platforms and that entrusting data to proprietary file formats put it at risk. I refused to use word processor formats for anything I wanted to preserve long-term.  

But this created other problems. If I wanted to print a document, I still had to use a word processor, and if I wanted to store that document long-term, I had to keep it in plain text. I was not aware of typesetting systems like LaTeX at the time, so I saw only two options. I could keep two versions of my document on the computer, one in plain text and the other in a word processor format of some kind. Or I could keep a plain text version on the computer and a paper copy.  

The first option had the advantage of allowing me to store the document entirely on electronic media. But this still had one major drawback. The visual representation of the document could not be preserved long-term.  

The second option had all the disadvantages of paper documents, with the added drawback that the plain text file and its printed version had to be stored separately. But its visual representation was far more durable.  

The long-term office document storage problem is now being addressed by OpenDocument Format. In my opinion, it still has not proved itself an equal of plain text in solving that issue, let alone the issue of cross-application fidelity. Until it solves the second issue, long-term preservation of the printed appearance of office documents will remain out of reach.  

Portable Document Format helps address this issue, but it fails to address others. Paper manuscripts often carry notes in the margins, stricken text, and other information that PDF documents do not readily preserve. Word processor files preserve these details better than plain text files do, unless the plain text uses additional markup that carries such meta-information while leaving the file readable as plain text. For these and other reasons, many users regard plain text as inferior to word processor formats.  

But, in recent years the Internet has elevated the status of plain text. The promise of the World Wide Web was that collaborative publishing would be open to all. This vision, held by its original creator, was not realized fully until the invention of wikis and blogs. And these do not rely on the features of word processors, but work with formats that are entirely dependent on plain text. Word processors, with their paper-centric interfaces and output medium, are increasingly becoming obsolete as this new publishing paradigm takes hold.  

But, ordinary authors are not necessarily savvy in the use of HTML and other markup systems used on the World Wide Web. For this reason, simplified markup languages were created that remove the requirement to know HTML in order to use wikis and blogs.  

The problem was that different systems used different markup, so an author had to learn a new syntax for each website that used a different system. This was an added source of confusion.  

To address this issue, even simpler markup languages were created, along with utilities that translate their syntax into HTML and other markup systems. One such utility is txt2tags. Its syntax can be translated into HTML, several wiki formats, and LaTeX, which can in turn be translated into PDF. It also allows embedded comments, which addresses the issue of author notes and other information that is not part of the final document.  

Another utility that partly addresses this issue is Markdown. Markdown borrows conventions used in email messages and adds further features for formatting text. It converts its syntax to HTML. This allows users to create valid HTML documents with a syntax familiar to them from reading email.  
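
To make the email connection concrete, here is a minimal sketch in Python of how two email habits, quoting with ">" and emphasis with asterisks, can be turned into HTML. It is not Markdown itself and covers only a tiny, hypothetical subset; the function name tiny_markdown_to_html is my own invention for illustration.

```python
import html
import re


def tiny_markdown_to_html(text: str) -> str:
    """Convert a very small, Markdown-like subset to HTML.

    Handles only '>' quoted lines, *emphasis*, and paragraphs separated
    by blank lines. Illustration only, not a real Markdown implementation.
    """
    if not text.strip():
        return ""
    blocks = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        quoted = all(line.startswith(">") for line in lines)
        if quoted:
            # Drop the email-style '>' marker and one optional space.
            lines = [re.sub(r"^> ?", "", line) for line in lines]
        # Escape HTML special characters before adding our own tags.
        body = html.escape(" ".join(lines))
        # *word* becomes <em>word</em>, as in plain-text email emphasis.
        body = re.sub(r"\*([^*]+)\*", r"<em>\1</em>", body)
        paragraph = f"<p>{body}</p>"
        blocks.append(f"<blockquote>{paragraph}</blockquote>" if quoted else paragraph)
    return "\n".join(blocks)
```

The point of the sketch is that the source stays readable as an ordinary email-style text file, while the HTML is generated from it whenever it is needed.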

There are other markup systems, such as reStructuredText, that go further than Markdown in producing multiple output formats. They all have advantages and disadvantages. In my opinion, txt2tags has the advantage of offering multiple output formats in addition to HTML, and it is aimed at a wider audience than Markdown or other systems. By storing a txt2tags document with its LaTeX and PDF versions in a single archive, you can preserve the document text, the notes, and the visual representation over a long period of time. txt2tags can also produce HTML and various wiki markup from the same source document. It may not be a perfect solution, but it goes a long way toward one.  
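
As a rough illustration of the single-archive idea, the following Python sketch bundles a plain text source with its already-rendered versions into one compressed tar file. It assumes the LaTeX and PDF files were produced beforehand by txt2tags and a LaTeX toolchain; the function name bundle_document and the file names in the usage note are hypothetical.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List


def bundle_document(source: Path, rendered: Iterable[Path], archive: Path) -> List[str]:
    """Store a plain text source and its rendered versions in one tar.gz.

    Returns the member names written, in the order they were added.
    """
    names = []
    with tarfile.open(archive, "w:gz") as tar:
        for path in [source, *rendered]:
            # arcname keeps only the file name, so the archive does not
            # depend on the directory layout of the machine that made it.
            tar.add(path, arcname=path.name)
            names.append(path.name)
    return names
```

A call might look like bundle_document(Path("journal.t2t"), [Path("journal.tex"), Path("journal.pdf")], Path("journal.tar.gz")). The plain text source remains the part most likely to be readable decades later; the rendered files ride along to record how the document looked.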

These utilities preserve plain text without sacrificing presentation.  They let you have your cake and eat it too, instead of forcing you to choose one or the other.  

Ironically, plain text, the archaic format looked down upon during the rise of the word processor, with its potential to lock customers into a single vendor's product, is the format best suited to unseat the word processor from its dominant position. Word processors are a relic from a pre-networked world dominated by printed documents. They are ill-suited to today's instantly published, Internet-connected, platform-neutral world where a document is more likely to appear on a blog or a wiki than to be printed.
