February 21, 2005

CLI Magic: HTML Tidy

Author: Joe Barr

I've been writing online for years, but there are still things I need help with to make my words production-ready: not just grammar and spelling, mind you, but standards-compliant HTML. This week we'll take a look at a wonderful command-line tool that fixes my HTML errors and makes the markup pretty as well. It's called HTML Tidy. Shut down OpenOffice.org or whatever word processor you've been using to generate GUI-contaminated HTML, and let's take a look.

HTML Tidy is the creation of Dave Raggett, who has been
involved in the development of HTML itself almost from the
beginning. At present, he is working on voice-controlled browsing
at W3C. These days, the HTML Tidy project is maintained by a group
of developers, not only to provide a central repository for
patches, but also to extend the project's reach by providing a
library of HTML Tidy functions.

Here is a sample of the Pidgin-HTML I typically create when
writing an article. It's very bare-bones, as you can see.

<b>THIS IS RAW TEXT</b>
<p>
This is a paragraph of text. No HTML at all. Blah, blah, blah.
<p>
This is a second paragraph, it is much more interesting than the
first one, because it contains commas as well as periods. Just
imagine.
<p>
This is the third and final paragraph. Like the others, it is
bereft in its native form of any sort of HTML to indicate the
breaks in text.

Now let's run that raw text past HTML Tidy, like this:

tidy raw.txt

HTML Tidy responds with:

Tidy (vers 4th August 2000) Parsing "raw.txt"
line 3 column 1 - Warning: inserting missing 'title' element

raw.txt: Document content looks like HTML 3.2
1 warnings/errors were found!

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy, see
www.w3.org">
<title></title>
</head>
<body>
<b>THIS IS RAW TEXT</b>

<p>This is a paragraph of text. No HTML at all. Blah, blah,
blah.</p>

<p>This is a second paragraph, it is much more interesting
than the first one, because it contains commas as well as periods.
Just imagine.</p>

<p>This is the third and final paragraph. Like the others, it
is bereft in its native form of any sort of HTML to indicate the
breaks in text.</p>
</body>
</html>

HTML & CSS specifications are available from
http://www.w3.org/
To learn more about Tidy see
http://www.w3.org/People/Raggett/tidy/
Please send bug reports to Dave Raggett care of
<html-tidy@w3.org>
Lobby your company to join W3C, see
http://www.w3.org/Consortium

Run naked at the command line, with no arguments given other
than the name of the file to examine, HTML Tidy behaves as shown
above: dumping errors, comments, and revised text to the console.
But there are a number of switches which can be used to modify
that behavior.

But before we get into the options, let me point out the two
things that attracted me to HTML Tidy in the first place. The
first is whitespace. Notice how in the tidied-up version there is
a blank line (whitespace) between the paragraphs. That means that
when another human edits the text, their eyes will have the
content fed to them in bite-size, focused chunks. As a programmer
for lo these many years -- can it be 30 already? -- I appreciate
clean, easy-to-read code. And HTML.

Which brings me to the second attraction. My dark side,
HTML-wise. I never close <p> tags. I think it is a stupid,
meaningless practice which only serves to clutter up and obfuscate
the text. But you know how standards are. Anyway, HTML Tidy not
only closes those tags for me, it does so without losing the
whitespace.

Now let's look at some options:

  • -i indent the contents of elements
  • -o omit optional end tags
  • -u use UPPER CASE for tag names
  • -m modify in place (original file is changed)
  • -q quiet by suppressing welcome message and summary
  • -asxml convert HTML to XML
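
To show how these switches combine, here is a minimal sketch rather than a definitive recipe. It assumes a tidy binary is on your PATH and does nothing if one is not. The output file names (tidied.html, tidied.xml, scratch.html) are made up for illustration. It deliberately avoids -o, because later Tidy releases use that switch to name an output file instead of omitting end tags.

```shell
# Recreate a shortened version of the sample input used above.
cat > raw.txt <<'EOF'
<b>THIS IS RAW TEXT</b>
<p>
This is a paragraph of text. No HTML at all. Blah, blah, blah.
<p>
This is a second paragraph, with commas as well as periods.
EOF

if command -v tidy >/dev/null 2>&1; then
    # Recent Tidy releases exit non-zero when they report warnings,
    # so "|| true" keeps that from being treated as a failure here.

    # -q drops the welcome message and summary, -i indents element
    # contents. The cleaned markup goes to stdout, so redirect it.
    tidy -q -i raw.txt > tidied.html || true

    # -asxml converts the HTML to XML.
    tidy -q -asxml raw.txt > tidied.xml || true

    # -m changes the file in place, so work on a scratch copy
    # (scratch.html is a hypothetical name).
    cp raw.txt scratch.html
    tidy -q -m scratch.html || true
else
    echo "tidy is not installed; skipping" >&2
fi
```

Redirecting stdout keeps the cleaned markup apart from the warnings, which still appear on the console.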

Please note that there are errors which HTML Tidy cannot
correct, and some may be serious enough that it doesn't even try
to proceed. In our example here, HTML Tidy warned us about the
missing TITLE element and inserted the tags, but of course it had
no way of knowing what the title should be. In other words, you
should read the warnings and errors it reports and, if necessary,
correct them manually before proceeding.
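
One way to make that review step routine is to save Tidy's complaints to a file and look at them before touching the markup. The sketch below rests on a few assumptions: it uses the -e (errors only) switch, which is not in the list above; it relies on the exit codes that recent Tidy releases document (0 clean, 1 warnings, 2 errors), which the older version shown in this article may not follow; and tidy-report.txt is a made-up file name.

```shell
# A small stand-in for the article's sample input.
printf '%s\n' '<b>THIS IS RAW TEXT</b>' '<p>' 'A paragraph of text.' > raw.txt

if command -v tidy >/dev/null 2>&1; then
    # -e reports errors and warnings only, with no cleaned markup.
    # The diagnostics arrive on stderr; capture them in a report file.
    status=0
    tidy -e raw.txt 2> tidy-report.txt || status=$?
    cat tidy-report.txt

    # Assumed exit codes (recent releases): 0 clean, 1 warnings, 2 errors.
    case "$status" in
        0) echo "no problems reported" ;;
        1) echo "warnings: read tidy-report.txt before publishing" ;;
        *) echo "errors: fix the markup by hand and run tidy again" ;;
    esac
else
    echo "tidy is not installed; skipping" >&2
fi
```

A warning such as the missing title still needs a human to supply the real content, so the report is a to-do list, not a verdict.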
