December 15, 2008

Condensing with Open Text Summarizer

Author: Bruce Byfield

Properly speaking, Nadav Rotem's Open Text Summarizer (OTS) is not a summarizer at all. True summaries generally involve rewording contents at a higher level of generality while preserving the meaning, not just producing a condensed version of the original the way that OTS does. However, within its limits, OTS is an efficient tool for automatically producing abstracts of non-fiction. In the last 15 months, it has received favorable mention from at least four academic publications, including one in which it outperformed similar utilities, among them commercial ones such as Copernic and Subject Search Summarizer.

OTS is available as a command-line utility in Debian, Fedora, Gentoo, Mandriva, and Ubuntu packages, and as a plugin in the latest versions of AbiWord. According to Rotem, a gedit plugin is also being prepared.

OTS removes common words, such as articles like "the" or "a" or conjunctions like "and" and "but," from consideration by using a dictionary list that accompanies the utility. Of the words that remain, those that occur most frequently in the text are assumed to be the topic, and the sentences that have the highest percentage of these frequently occurring words are the ones that are used in the output.
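
The approach described above can be sketched in a few lines of Python. This is only an illustration of frequency-based sentence extraction, not OTS's actual code: the stop-word list is a tiny hypothetical stand-in for the dictionary that ships with OTS, the sentence splitter is deliberately naive, and the names summarize, ratio, and top_n are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; OTS ships a much larger dictionary file.
STOP_WORDS = {"the", "a", "an", "and", "but", "or", "of", "to", "in", "is", "it"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and return its alphabetic words."""
    return re.findall(r"[a-z']+", sentence.lower())


def summarize(text, ratio=0.2, top_n=5):
    """Keep the sentences with the highest share of frequent (topic) words."""
    sentences = split_sentences(text)

    # Count every word that is not in the stop-word list.
    words = [w for s in sentences for w in tokenize(s) if w not in STOP_WORDS]
    topic_words = {w for w, _ in Counter(words).most_common(top_n)}

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Fraction of the sentence's words that are topic words.
        return sum(t in topic_words for t in tokens) / len(tokens)

    # Output length is a fraction of the number of input sentences.
    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)

    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Calling summarize(text, ratio=0.2) on a five-sentence passage returns the single highest-scoring sentence. The real utility adds refinements, such as stemming, that this sketch leaves out.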

For greater accuracy, OTS also references grammatical rules, so that it does not assume, for instance, that the period used to indicate an abbreviation marks the end of a sentence. Similarly, OTS uses the Porter Stemming algorithm so that variants of the same word, such as "run," "runs," and "running," are grouped together in the frequency count. According to Rotem, Porter Stemming is about 90% accurate, which in turn makes OTS more accurate.
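
To illustrate the abbreviation rule, here is a minimal Python sketch of a sentence splitter that refuses to break after a known abbreviation. The abbreviation list is hypothetical and chosen only for this example; the rules OTS really applies are not reproduced here.

```python
# Hypothetical abbreviation list for illustration; not taken from OTS.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc.", "vs."}


def split_sentences(text):
    """Split on sentence-ending punctuation, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # A token ends a sentence if it ends in '.', '!' or '?'
        # and is not one of the listed abbreviations.
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing text with no closing punctuation is kept as a sentence.
        sentences.append(" ".join(current))
    return sentences
```

A naive splitter would cut "Dr. Smith arrived late." into two pieces, while this version keeps it whole. The cost is that a sentence that really does end with a listed abbreviation, such as "etc.", is merged with the one that follows.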

Using Open Text Summarizer

You can use the command-line version of OTS for plain text files, including HTML files, although the output for HTML files inevitably includes tags. A complete man page is available on the project site, but the Debian package, at least, does not include it, which means that you have to rely on the command ots -? or ots --help to see the options.

The basic command, ots inputfile, prints the output to the terminal. If you prefer, you can save the output to a file with ots --out=outputfile inputfile.

By default, the output is 20% of the length of the input file, measured by the number of sentences in the input file. You can use the --ratio=percentage option to adjust the length of the output.

Adding the option --html produces output in HTML. If you want keywords to use as meta tags, note that the --keyword option is deprecated; --about gives much the same result.

You can change the default dictionary of excluded words using --dic=filename. Unless you are an expert in the field, you are unlikely to improve on the dictionary installed with OTS, but you might want to exclude words specific to an area of expertise that you know are unlikely to be the topic of your input passages.
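
If you want to call OTS from a script, the options above can be assembled programmatically. The following Python sketch assumes that the ots binary is on your PATH and that the flags behave as described in this article; the function names and the file names in the usage note are hypothetical.

```python
import subprocess


def build_ots_command(input_file, ratio=None, out=None,
                      html=False, about=False, dic=None):
    """Build an argument list using only the flags described in the article."""
    cmd = ["ots"]
    if ratio is not None:
        if not 0 < ratio <= 100:
            raise ValueError("ratio must be a percentage between 1 and 100")
        cmd.append(f"--ratio={ratio}")
    if out:
        cmd.append(f"--out={out}")      # write to a file instead of the terminal
    if html:
        cmd.append("--html")            # produce HTML output
    if about:
        cmd.append("--about")           # keyword-style output
    if dic:
        cmd.append(f"--dic={dic}")      # alternative dictionary of excluded words
    cmd.append(input_file)
    return cmd


def run_ots(input_file, **options):
    """Run ots and return what it prints (assumes ots is installed)."""
    result = subprocess.run(build_ots_command(input_file, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, build_ots_command("article.txt", ratio=30, out="summary.txt") yields the equivalent of ots --ratio=30 --out=summary.txt article.txt. Passing the arguments as a list rather than as a single shell string avoids quoting problems with file names that contain spaces.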

With the AbiWord plugin, you have fewer options, but all you need to do is select Tools -> Summarize and choose the percentage length of the output. The result is entered into a new, unnamed file.

The results

Whatever the form in which you use OTS, the usefulness of the result depends partly on the content of your input file. In general, OTS works well with academic articles and news stories, making it a useful tool for those who need to write abstracts of the sort seen on portal Web sites or in annotated bibliographies. You might want to tweak the results to provide a true summary rather than a condensation, but, even so, using OTS requires less time and involves less active thinking than writing a summary from scratch.

With other content, OTS is less successful. In my testing, its results are only fair with fiction, probably because the repetition in fiction does not necessarily indicate the important points. For the same reason, bullet lists of unorganized points do not always condense successfully, and, if you try to summarize a song, a chorus will often be featured in the output at the expense of the content of verses.

Outside of these limitations, Open Text Summarizer performs satisfactorily. It certainly compares favorably to the AutoAbstract feature in OpenOffice.org Writer, which is based -- rather pointlessly, so far as accurate results are concerned -- on style heading levels. So long as you are aware of its limitations, and check the results before you use them, OTS is a minor but useful addition to the arsenal of free software tools.
