September 25, 2006

CLI Magic: See changes word by word with dwdiff

Author: Joe 'Zonker' Brockmeier

Unix text utilities were designed primarily for programmers and admins, but here's a little secret: the utilities also work well for writers. Rather than comparing versions of a program, I often use diff utilities to see what has changed between one draft of an article and the next. A few weeks ago I came across dwdiff, which suits that job even better.

The dwdiff utility, written by G.P. Halkes and distributed under the Open Software License 2.0, is a front end for diff that displays a word-by-word comparison of files.

The diff utility is great by itself and fine for programmers, but it's less useful for anyone who wants a word-by-word comparison rather than a line-by-line one.

Let's take a look at two drafts of some regular text, and see how they differ. The first example shows two drafts of the same paragraph that I've run through GNU diff, without any options. The first draft is in a text file called draft1, and the second is in a text file called draft2, so the syntax for diff is diff draft1 draft2, which gives us:

1,6c1,5
< To start with, you may need to install Tomboy, since it's not yet
< part of the stable GNOME release. Most recent distros should have
< Tomboy packages available, though they may not be installed by
< default. On Ubuntu, run apt-get install tomboy, which should pull
< down all the necessary dependencies -- including Mono, if you don't
< have it installed already.
---
> You may need to install Tomboy, since it's not yet part of the
> stable GNOME release. Most recent distros should have Tomboy packages
> available, though they may not be installed by default. On Ubuntu,
> run apt-get install tomboy, which should pull down all the necessary
> dependencies, including Mono, if you don't have it installed already.

That's not very useful when it comes to trying to see what has changed. Because of the way that regular text is formatted, when you make a change in one line of text, the odds are that it will affect the next line, and the line after that, and so on. So, plain ol' diff just doesn't cut it.

Let's take a look at what happens when we use dwdiff instead. The syntax for dwdiff is the same; just run dwdiff draft1 draft2:

[-To start with, you-]{+You+} may need to install Tomboy, since it's not yet part of the
stable GNOME release. Most recent distros should have Tomboy packages
available, though they may not be installed by default. On Ubuntu,
run apt-get install tomboy, which should pull down all the necessary [-dependencies ---]
{+dependencies,+} including Mono, if you don't have it installed already.

Text that has been deleted is enclosed in brackets with minus signs, [-like this-], and text that has been added is enclosed in braces with plus signs, {+like this+}. Dwdiff's output is much easier to read when you're working with prose rather than code.
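To make the idea concrete, here is a minimal Python sketch of a word-by-word comparison that prints the same style of markers. It is not how dwdiff is implemented (dwdiff hands the work to diff); it uses Python's standard difflib module instead, and the function name word_diff is my own invention for illustration. It also ignores line breaks, which dwdiff preserves.

```python
import difflib

def word_diff(old_text, new_text):
    """Return a word-by-word diff using dwdiff-style markers (illustrative only)."""
    old_words = old_text.split()
    new_words = new_text.split()
    # Compare the two word lists rather than the raw lines.
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(" ".join(old_words[i1:i2]))
            continue
        chunk = ""
        if i2 > i1:  # words present only in the old text
            chunk += "[-" + " ".join(old_words[i1:i2]) + "-]"
        if j2 > j1:  # words present only in the new text
            chunk += "{+" + " ".join(new_words[j1:j2]) + "+}"
        pieces.append(chunk)
    return " ".join(pieces)

old = "To start with, you may need to install Tomboy"
new = "You may need to install Tomboy"
print(word_diff(old, new))
# prints: [-To start with, you-]{+You+} may need to install Tomboy
```

Because the comparison works on words, reflowing a paragraph does not mark every line as changed, which is the whole point of a word diff.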

Another way to view the text is to use the --less-mode option, which will provide markup suitable for viewing in less. Instead of using brackets and braces to denote changes, text that has been added will be displayed in bold, and text that has been removed will be displayed underlined. The result may vary, depending on your terminal emulator -- in my use of dwdiff, the bold text shows up just fine in an xterm and GNOME Terminal, but doesn't show up properly in Konsole.

The dwdiff utility also supports a color mode, which displays deleted text in red and added text in green. The syntax for color is dwdiff -c draft1 draft2.

In the dwdiff example above, you can see that the word "you" is actually common to both files -- the only difference is that the phrase "To start with," has been removed and "you" has been capitalized because it's now at the beginning of the sentence.

While case-sensitivity is often important for programming, you may want to overlook case when looking at documentation or a short story. To do that, you can use dwdiff's -i option (longer version, --ignore-case), which tells dwdiff to ignore case when comparing words.
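As a rough illustration of what ignoring case means for a word diff, the sketch below compares lowercased copies of the words but prints the original spelling. It uses Python's difflib rather than dwdiff itself, the function name is hypothetical, and printing the new file's spelling for common words is my assumption, not documented dwdiff behavior.

```python
import difflib

def word_diff_ignore_case(old_text, new_text):
    """Word diff that treats 'You' and 'you' as the same word (illustrative only)."""
    old_words = old_text.split()
    new_words = new_text.split()
    # Match on lowercased copies so that case differences are not reported.
    matcher = difflib.SequenceMatcher(
        None,
        [word.lower() for word in old_words],
        [word.lower() for word in new_words],
        autojunk=False,
    )
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Assumption: show the new file's spelling for common words.
            pieces.append(" ".join(new_words[j1:j2]))
            continue
        chunk = ""
        if i2 > i1:
            chunk += "[-" + " ".join(old_words[i1:i2]) + "-]"
        if j2 > j1:
            chunk += "{+" + " ".join(new_words[j1:j2]) + "+}"
        pieces.append(chunk)
    return " ".join(pieces)

old = "To start with, you may need to install Tomboy"
new = "You may need to install Tomboy"
print(word_diff_ignore_case(old, new))
# prints: [-To start with,-] You may need to install Tomboy
```

With case ignored, only the removed phrase is flagged; the capitalized "You" no longer counts as a change.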

Just the facts

I like to see changes in context, but if you prefer to see changes by themselves, dwdiff has the --no-common option, which tells dwdiff to print only the words that have changed between files. The short version of the option is -3. For example, running dwdiff -3 draft1 draft2 provides a much more concise report:

======================================================================
[-To start with, you-]{+You+}
======================================================================
 [-dependencies ---]
{+dependencies,+}
======================================================================

If you'd like to shorten that even further, you can use the --no-deleted option, which tells dwdiff to omit words that were deleted from the first file. The short option for --no-deleted is -1. Using dwdiff -1 -3 draft1 draft2 provides output like this:


======================================================================
You
======================================================================

dependencies,
======================================================================

To omit words added to the second file, you can use the --no-inserted option. The short option for that is -2.
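The three filters can be pictured with a short Python sketch. Again, this is difflib standing in for dwdiff, and the function and parameter names are made up for the example. Unlike the dwdiff output shown above, the sketch keeps the markers even when one side is hidden, purely to make the result easier to read.

```python
import difflib

def changes_only(old_text, new_text, show_deleted=True, show_inserted=True):
    """List only the changed words, optionally hiding deletions or insertions."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # drop the words both files share
        chunk = ""
        if show_deleted and i2 > i1:
            chunk += "[-" + " ".join(old_words[i1:i2]) + "-]"
        if show_inserted and j2 > j1:
            chunk += "{+" + " ".join(new_words[j1:j2]) + "+}"
        if chunk:
            changes.append(chunk)
    return changes

old = "To start with, you may need all the necessary dependencies -- including Mono"
new = "You may need all the necessary dependencies, including Mono"

print(changes_only(old, new))                       # changes from both files
print(changes_only(old, new, show_deleted=False))   # insertions only
print(changes_only(old, new, show_inserted=False))  # deletions only
```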

You may be less interested in specific changes than seeing just how much has changed between versions of a file. To see the word count and percentage changed between two files, use the -s option, which produces output similar to this:

old: 1662 words  1597 96% common  10 0% deleted  55 3% changed
new: 1666 words  1597 95% common  13 0% inserted  56 3% changed

This gives us the word count of the original file and of the new file, followed by the number and percentage of words the two have in common, the words deleted from the original or inserted in the new file, and the words that changed.
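To show how such figures fit together, here is a hedged Python sketch that tallies the same categories with difflib. How dwdiff itself classifies words is an assumption on my part: I count a word as "changed" when it sits in a block that was replaced, and as "deleted" or "inserted" when the block exists on one side only. The percentages in the sample output above are consistent with truncating integer division, so the sketch does the same.

```python
import difflib

def diff_stats(old_text, new_text):
    """Count common, deleted, inserted and changed words (illustrative only)."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    stats = {"common": 0, "deleted": 0, "inserted": 0,
             "old_changed": 0, "new_changed": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            stats["common"] += i2 - i1
        elif tag == "delete":
            stats["deleted"] += i2 - i1
        elif tag == "insert":
            stats["inserted"] += j2 - j1
        else:  # "replace": words differ on both sides
            stats["old_changed"] += i2 - i1
            stats["new_changed"] += j2 - j1
    stats["old_words"] = len(old_words)
    stats["new_words"] = len(new_words)
    return stats

def percent(part, whole):
    """Integer percentage, truncated rather than rounded."""
    return part * 100 // whole if whole else 0

# Checking the arithmetic of the sample output above:
print(percent(1597, 1662), percent(1597, 1666))  # prints: 96 95
print(1597 + 10 + 55, 1597 + 13 + 56)            # prints: 1662 1666
```

In each row, the common, deleted (or inserted), and changed counts add up to that file's word total.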

The dwdiff utility uses diff, and you can pass some options to diff using dwdiff's -D option to change the way diff behaves -- so long as that doesn't change the format of the output coming from diff. The syntax for this is dwdiff -D-option file1 file2, so the diff option follows immediately after dwdiff's -D, with no space in between.

For instance, diff has an option (-y) to display changes side-by-side. This won't work with dwdiff, because it throws off the formatting that the program expects. You could, however, pass options like diff's -d option, which tells diff to try harder to find a smaller set of changes. This will affect diff's behavior, but not the format of its output.
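If you call dwdiff from a script, the -D syntax described above can be assembled like this. This is only a sketch: the helper name build_dwdiff_command is hypothetical, draft1 and draft2 are the example files from earlier, and the command is run only if dwdiff is actually installed.

```python
import shutil
import subprocess

def build_dwdiff_command(old_file, new_file, diff_options=()):
    """Build a dwdiff command line, gluing each diff option onto -D."""
    command = ["dwdiff"]
    for option in diff_options:
        command.append("-D" + option)  # e.g. "-d" becomes "-D-d"
    command += [old_file, new_file]
    return command

command = build_dwdiff_command("draft1", "draft2", diff_options=["-d"])
print(" ".join(command))
# prints: dwdiff -D-d draft1 draft2

# Only attempt to run it when dwdiff is on the PATH.
if shutil.which("dwdiff"):
    result = subprocess.run(command, capture_output=True, text=True)
    print(result.stdout)
```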

Note that dwdiff is not the only game in town. The GNU wdiff utility is another front end to diff that produces word diffs between files, and the options for wdiff are similar, but dwdiff has a few different features.

The dwdiff utility is the only word diff utility with color support, and it also provides support for specifying characters to be treated as whitespace or as delimiters. To specify a character to be used as whitespace, which is ignored, use the -W option.

Delimiters are characters that are treated as words, even if they're not separated by whitespace. Specify delimiters using the -d option -- so if you want to specify a semicolon as a delimiter, you'd use dwdiff -d \; file1 file2. In this case, you need to escape the semicolon using the backslash (\) character so that it's not interpreted by the shell.
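The effect of a delimiter is easiest to see in how the text gets split into words before comparison. The Python sketch below is my own illustration of that splitting step, not dwdiff's code; the tokenize function and its delimiters parameter are hypothetical names.

```python
import re

def tokenize(text, delimiters=";"):
    """Split text into words, making each delimiter character a word of its own."""
    if not delimiters:
        return text.split()  # no delimiters: plain whitespace splitting
    escaped = re.escape(delimiters)
    # Match either a single delimiter, or a run of characters that are
    # neither whitespace nor delimiters.
    pattern = "[" + escaped + "]|[^\\s" + escaped + "]+"
    return re.findall(pattern, text)

print(tokenize("a=1;b=2", delimiters=""))   # prints: ['a=1;b=2']
print(tokenize("a=1;b=2", delimiters=";"))  # prints: ['a=1', ';', 'b=2']
```

Without the delimiter, a change to b=2 would flag the whole of a=1;b=2 as one changed word; with it, only the part after the semicolon differs.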

If you spend any amount of time comparing text files, I'd suggest installing dwdiff and testing it out. It's a handy tool to have alongside the traditional diff utility, and makes spotting changes in text files easier.
