Improve your writing with the GNU style checkers


Author: Michael Stutz

The diction and style tools put a GNU face on an old Unix feature. These tools read text input, either from a file or the standard input. diction checks the input at the sentence level, and marks wordy and trite phrases, cliches, and the like, while style works on the overall document, giving a summary of the writing style with a number of readability tests.

Years ago these tools came with AT&T Unix, packaged in a utility set that included similar tools and was called the Writers’ Workbench (WWB). They fell by the wayside and were generally forgotten, but in recent years the tools were rewritten for Linux by Michael Haardt, and eventually became part of the GNU Project.

The GNU versions of these tools are not clones of the old AT&T originals, but they are very similar, and they keep getting better. The GNU versions work with English and German text, and new features in the 1.10 series include support for British English and recognition of nested sentences inside quotations.

Checking your choice of words

The diction tool will analyze its input and display any doubled words, cliches, and potentially incorrect wording or phrases enclosed in brackets like [this]. Not everything that diction marks will actually be wrong, per se — for example, since writers sometimes confuse the words “desert” and “dessert,” these words are always marked — but you can use its output as a guide to double-check your writing for common errors.

By default, diction expects its input to be in American English; to choose another language, give its code as an argument to the -L option, according to this table:

en	American English
en_GB	British English
de	German

Among the tool’s newer features is the -b (or --beginner) option, which looks for errors commonly made by inexperienced writers, such as confusing “will” and “shall”:

$ echo "How will the company best utilize it's resources?"| ./diction -b
(stdin):1: How [will] the company best [utilize] [it's] resources?

3 phrases in 1 sentence found.

Normally, diction only encloses suspect words and phrases in brackets, and it’s up to you to figure out what might (or might not) be wrong with the marked text. To output the marked text with diction’s suggestions for improvement, use the -s option:

$ echo "How will the company best utilize it's resources?"| ./diction -s
(stdin):1: How will the company best [utilize -> use] [it's -> = "it is" or "its"?] resources?

2 phrases in 1 sentence found.

To search for only one kind of error, use the -s option and pipe the output through grep to keep only the lines that contain the suggestion text you’re looking for. To get a list of sentences containing doubled words, for instance, keep only the lines containing “Doubled word”:

$ diction -s termpaper.txt | grep "Doubled word"

The doubled word search is better than a plain grep solution because diction works on sentences, not lines — it catches doubled words even if there’s a newline character in between them. What it won’t match are doubles whose case is different, or a double where the first word is the end of one sentence and the next word is the beginning of the following sentence.
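The limitation of a line-based search is easy to see with a small experiment. The sketch below uses a made-up two-line sample and an illustrative grep pattern (an assumption for this example, not anything diction uses internally): searched line by line, the doubled word goes unnoticed, and it only shows up once the lines are joined.

```shell
# Made-up sample: the doubled "the" is split across a line break.
sample='This sentence ends with the
the same word on the next line.'

# Illustrative pattern: a whole word, one space, then the same word again.
# (Backreferences in -E patterns are an extension; GNU and BSD grep support them.)
pattern='(^|[^[:alpha:]])([[:alpha:]]+) \2($|[^[:alpha:]])'

# Line-by-line search: each line is checked on its own, so nothing matches.
line_hits=$(printf '%s\n' "$sample" | grep -c -E "$pattern" || true)

# Join the lines with spaces first, and the same pattern finds the double.
joined_hits=$(printf '%s\n' "$sample" | tr '\n' ' ' | grep -c -E "$pattern" || true)

echo "line by line: $line_hits, joined: $joined_hits"
# prints: line by line: 0, joined: 1
```

Joining lines by hand works for a quick test, but it loses the line numbers that diction reports.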

If you give the -d (or --ignore-doubled-words) option, diction will do every check except for the doubled-word check.

With a little customization, diction is an excellent tool for checking documents against local style guides. You can create your own style guide by using as a model the default phrase database file, which is normally stored in either the /usr/share/diction/ or the /usr/local/share/diction/ directory. It’s simply a table where each line contains a target word or phrase followed by a tab character and the suggestion, warning, comment, or replacement text to display for that target. If a suggestion begins with an equals sign followed by another target word or phrase, diction displays the suggestion of that other target instead. Here are a few example lines from the American English database:

 a majority of	most
 accomplished	did
 desert	"Desert" and "dessert" are sometimes confused, to the delight of the masses.
 dessert	= desert
 easier said than done	(cliche, avoid)
 it is apparent that	apparently

Use your custom style file by giving its name as an argument to the -f option; diction will use your file in conjunction with the default file, unless you turn the latter off with the -n (--no-default-file) option:

diction -n -f /usr/local/share/diction/ submission.txt
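As a rough sketch of the whole workflow, the commands below build a tiny phrase file and check that every line has exactly one tab before handing the file to diction. The entries, the temporary file, and the sample sentence are all invented for illustration; only the -n and -f options come from the description above.

```shell
# Hypothetical house-style entries: target, a tab, then the suggestion.
phrases=$(mktemp)
printf '%s\t%s\n' \
    'in order to' 'to' \
    'utilize' 'use' \
    'at this point in time' 'now' > "$phrases"

# Sanity check: count lines that do not have exactly two tab-separated fields.
bad_lines=$(awk -F '\t' 'NF != 2 { bad++ } END { print bad + 0 }' "$phrases")
echo "malformed lines: $bad_lines"

# Run diction with only the custom file, if the tool is installed.
if command -v diction >/dev/null 2>&1; then
    echo 'We utilize a script in order to check each draft.' |
        diction -n -f "$phrases" || true
fi
```

Checking the tab count first is worthwhile because an editor that silently converts tabs to spaces would otherwise leave a phrase file that looks right but is no longer a two-column table.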

GNU style’s readability tests

The Kincaid Formula is particularly good for technical writing — it was originally developed for use on Navy training manuals, and like many readability indicators it outputs a US grade level.

The Automated Readability Index (ARI) uses character and sentence counts to determine an estimated grade level; it was developed by the US Air Force.

The Coleman-Liau Formula also outputs a US grade level, and bases its estimate on character counts.

The Flesch Reading Ease score is a readability index, used by the US government, where lower scores indicate higher difficulty.

Robert Gunning’s Fog Index is roughly based on sentence length and the number of syllables per word, and its output is the approximate US grade level required to immediately comprehend the text.

The Lix formula tests for long words (with more than 7 characters) and outputs a number from very easy (0-24) to difficult (over 54).

Another grade-output test is the easy-to-compute SMOG-Grading, which is a test based on word “complexity.”
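To give a feel for how these tests turn simple counts into a grade level, here is a minimal sketch of the ARI using its commonly published formula, 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43. The counts are made up, and style's own way of counting characters and sentences may differ, so treat the result as an illustration and not a reproduction of its output.

```shell
# Hypothetical counts for a short document.
chars=4500       # characters in words
words=900
sentences=45

# ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43
ari=$(awk -v c="$chars" -v w="$words" -v s="$sentences" \
    'BEGIN { printf "%.1f", 4.71 * (c / w) + 0.5 * (w / s) - 21.43 }')

echo "ARI: $ari"
# prints: ARI: 12.1
```

With these numbers, words average 5 characters and sentences average 20 words, which lands at roughly a twelfth-grade level; shortening either average lowers the score.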

At this time diction only comes with stock phrase files for its three supported languages, but it would be an interesting free software project to build up style files for checking text against the most popular and the better style guides — the Chicago and AP style manuals and Fowler’s Modern English Usage would be great places to start.

Checking your overall document style

Complementing diction, the style tool analyzes all of the sentences in a given document and outputs some facts about its overall readability: the document’s score for a number of readability tests (many developed by the US military), plus sentence counts and word usage information.

The sentence count is like a super wc, showing the number of characters, sentences, and paragraphs, the average word and sentence lengths, the number of short sentences (9 words or less), long sentences (24 words or more), questions, and passive sentences, and which two sentences were the longest and shortest.

The word usage summary gives the number of verbs, with a breakdown by type, and a breakdown of sentence beginnings by type: pronoun, interrogative pronoun, article, subordinating conjunction, conjunction, and preposition.

Regular usage is straightforward. Pipe some text to style or give a file name as an argument. As with diction, you can change the language with the -L option (it currently supports the same three languages), and there are a few other options you can use to get extra output that will display before the report summary (see sidebar).

For example, here’s the command to output sentences with an ARI higher than 25 and get a style summary on a document written in British English, read from the standard input:

$ style -r 25 -L en_GB

GNU style’s command-line options

-l outputs a list of sentences longer than a certain length. Give the number as an argument.

-rX outputs only those sentences with an ARI higher than X.

-p outputs sentences written in a passive voice.

-N outputs sentences containing nominalizations, or verbs that have been transformed into nouns, such as “judgment” (from the verb “judge”), “deduction” (from the verb “deduct”), or recursively, the word “nominalization” itself (from the verb “nominalize”).

-n outputs sentences either in the passive voice or containing nominalizations.
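These options can be tried on a small sample. The sketch below writes two invented sentences to a temporary file and, if style is installed, first lists the passive sentences and then the sentences longer than the length given to -l. The sample text and the threshold of 5 are assumptions chosen for the example.

```shell
# Invented sample: one long passive sentence and one short active one.
draft=$(mktemp)
printf '%s\n' \
    'The report was written by the committee after a long debate.' \
    'We read it.' > "$draft"

if command -v style >/dev/null 2>&1; then
    # Sentences in the passive voice, followed by the usual summary.
    style -p "$draft" || true
    # Sentences longer than the given length.
    style -l 5 "$draft" || true
else
    echo "style is not installed; skipping the checks"
fi
```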

Check more than just text

GNU diction and style won’t work on rough notes or other unpolished material that isn’t properly capitalized. But while they only take plain text input, they come in handy for other kinds of documents, too — just convert the document in question to text, and send the output over to the tool. For instance, you can check the style of a Web page by dumping the text output of lynx:

$ lynx -dump -nolist http://localhost/mypage.html | style

Other tools for converting documents to text include deroff, detex, and dehtml.