KDE 4’s Sonnet will turbocharge language processing

Author: Nathan Sanders

With the Sonnet library for KDE 4, developer Jacob Rideout hopes to reinvigorate the field of desktop linguistics by adding automatic language detection and other innovative features. Sonnet is to be for KDE 4 what KSpell 2 is for the current version of the K Desktop Environment, providing spellchecking facilities to applications as diverse as the Konqueror Web browser, Kopete instant messenger, and KWord office software. Unlike KSpell, however, it will also provide grammar checking, multilingual tools, and perhaps even translation, dictionary, and thesaurus functionality across all of KDE.

KDE 4 may take even the spellchecking feature a step further than it has in the past. Rideout says, “There is currently a discussion in the mailing list on enabling spellcheck for every textedit box in KDE.” Sonnet will also provide text statistics such as word counts and readability scores. Rideout hopes to eventually implement automatic text completion as well.

Because Sonnet is a library accessible to all KDE applications, Rideout foresees applications beyond text editing programs. Its language detection feature is particularly ripe for unexpected usage. Sonnet is capable of determining the language a text is written in given about 20 characters of data. This feature already works for several dozen languages. According to Rideout, the Strigi desktop search developers are considering integrating language detection into their application’s search features. Perhaps users will, one day, be able to search for “documents written in Spanish within the past week.”

Rideout, who recently earned his bachelor’s degree in linguistics, says that improved multilingual support is the “most requested change” from KDE 3, and this is where language detection has the most potential. He says, “Users will be able to have documents checked for correctness in a fine-grained manner. Any separate section of a document (by default, this means a paragraph) will be checked in its respective language by the tools available for that language. For convenience, each section will have its language detected automatically, with the option of a user disabling or overriding the detection.”

Without Sonnet’s language detection, a French user who must frequently correspond with British associates must manually change the language library used by his spellchecker each time he switches languages. In KDE 4, Sonnet will automatically notice that he has begun typing in another language and check for spelling errors accordingly. In a more complex scenario, the user may quote paragraphs from an English speaker within an email he is writing in French. In KDE 3, the English portion will be interpreted by KSpell 2 as horribly misspelled French and continuously underlined in red. With Sonnet, the English text will automatically be spellchecked in its own language, just as the French is in its.

“Language detection in Sonnet was initially based on a Perl script named Languid created by Maciej Ceglowski,” Rideout reports. “I ported the Perl script to C++ and have been regularly modifying it so that, while it shares the same algorithmic approach, it is no longer a direct port…. Sonnet was initially ‘based on’ Languid, but never used any of its code, and now has diverged in several significant ways…. [Languid’s] author has kindly granted KDE a license using the LGPL so that our derivative could be maintained and distributed as part of KDE’s libraries.”

Languid is fundamentally based on a technique called “N-Gram-Based Text Categorization” published by William Cavnar and John Trenkle. An N-gram is a segment of text N characters long. Sonnet uses trigrams, which are three characters long. By analyzing how frequently each trigram occurs within a text, one can make an informed guess about the language the text is written in. Rideout gives an example: “The top trigram for our English model is ‘_th’ and for Spanish ‘_de’. Therefore, if the text contains many words that start with ‘th’ and no words that start with ‘de,’ it is more likely the text is in English [than Spanish]. Additionally, there are several optimizations which include only checking the language against languages with similar scripts and some heuristics that use the language of neighboring text as a hint.”
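To make the trigram idea concrete, here is a minimal Python sketch of the ranking approach described in Cavnar and Trenkle's paper: build a frequency-ordered list of trigrams for each language, do the same for the text in question, and pick the language whose list is ranked most similarly. It is not Sonnet's or Languid's code. The function names, the tiny training sentences, and the fixed penalty of 300 for unseen trigrams are all assumptions made for illustration, and real models would be trained on far more text.

```python
from collections import Counter

# Assumed penalty for a trigram that a language model has never seen.
MAX_PENALTY = 300


def trigram_profile(text, top_n=MAX_PENALTY):
    """Return the text's trigrams ordered from most to least frequent."""
    # Join words with '_' so that word boundaries become part of the trigrams
    # (this is what makes '_th' mean "a word starting with th").
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return [gram for gram, _ in counts.most_common(top_n)]


def out_of_place(sample_profile, model_profile):
    """Sum how far each sample trigram's rank is from its rank in the model."""
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else MAX_PENALTY
        for rank, gram in enumerate(sample_profile)
    )


def detect_language(text, models):
    """Return the key of the model whose trigram ranking is closest to the text's."""
    sample = trigram_profile(text)
    return min(models, key=lambda lang: out_of_place(sample, models[lang]))


# Toy training data, invented for this example only.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the other thing that they thought",
    "es": "el perro de la casa de mi amigo es el mejor de todos los perros de la ciudad",
}
MODELS = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

print(detect_language("the weather there is thick with thunder", MODELS))
print(detect_language("la casa de los perros de mi amigo", MODELS))
```

The optimizations Rideout mentions, such as comparing only against languages that share a script or using neighboring text as a hint, would sit on top of a core like this and are not shown.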

Sonnet checks each paragraph of a text individually for its language, though Rideout says that it would be possible to check per-sentence. Paragraphs are used because larger sample sizes yield more accurate results, and because checking every sentence of a large document for language would be needlessly taxing.

With Sonnet, Rideout says that a user may select a primary and backup dictionary rather than using language detection, in which case a word found to be misspelled in the primary language would be assumed to be of the backup language and spellchecked accordingly. This would be useful, for example, to doctors who must frequently use terms from a medical dictionary. Language detection offers yet more innovative functionality in the way of layout hints. For instance, a paragraph written in Hebrew (a language that is read from right to left) could be automatically right-aligned on the page.
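The primary-and-backup arrangement can be sketched in a few lines of Python. This is only an illustration of the lookup order described above, not Sonnet's interface: the function names and the two tiny word sets are hypothetical, and a real checker would consult full dictionaries.

```python
# Hypothetical word lists standing in for a general and a medical dictionary.
GENERAL = {"the", "patient", "has", "and", "a", "fever"}
MEDICAL = {"tachycardia", "dyspnea"}


def check_word(word, primary, backup):
    """Return which dictionary accepts the word, or None if neither does."""
    lowered = word.lower()
    if lowered in primary:
        return "primary"
    if lowered in backup:
        # Only consulted when the primary dictionary rejects the word.
        return "backup"
    return None


def misspelled(text, primary, backup):
    """List the words in text that neither dictionary recognizes."""
    words = [w.strip(".,;:!?\"'()") for w in text.split()]
    return [w for w in words if w and check_word(w, primary, backup) is None]


print(misspelled("The patient has tachycardia and a feever.", GENERAL, MEDICAL))
```

Here "tachycardia" is rejected by the general dictionary but accepted by the backup, so only "feever" is reported.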

To do all this computational linguistics in the background without disturbing or slowing the user interface, Sonnet uses KDE’s ThreadWeaver technology. By intelligently dividing its work into jobs that run on different threads via ThreadWeaver, Sonnet can perform language detection and spellchecking on a document without interrupting a user’s typing. Rideout is among the early adopters of ThreadWeaver. He says, “On single processor systems the speed might sometimes be slower than the old KSpell code in theory, but not on a user-perceivable scale. On multiprocessor systems, the speed increases greatly. The ease of development is also substantially less than with other approaches.”
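The general pattern, handing each paragraph to a pool of worker threads and collecting the results later, can be sketched with Python's standard `concurrent.futures` module. This stands in for the idea only and does not reflect ThreadWeaver's actual C++ API. The word list and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dictionary used by the toy checker below.
KNOWN_WORDS = {"the", "cat", "dog", "sat"}


def check_paragraph(index, paragraph):
    """Job run on a worker thread: return the paragraph index and its unknown words."""
    unknown = [w for w in paragraph.lower().split() if w not in KNOWN_WORDS]
    return index, unknown


def check_document(paragraphs, max_workers=4):
    """Queue one job per paragraph and gather the results by paragraph index."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit() returns immediately, so the caller is not blocked while jobs run.
        futures = [pool.submit(check_paragraph, i, p) for i, p in enumerate(paragraphs)]
        return dict(future.result() for future in futures)


print(check_document(["the cat sat", "the dgo sat"]))
```

In an interactive editor the results would more likely be delivered through a callback or signal than gathered in one blocking call, and in CPython pure-Python jobs like these do not run in parallel across processor cores, so the sketch shows the structure rather than the speedup Rideout describes.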

Despite innovative features like language detection, Rideout is careful not to reinvent the wheel for Sonnet. For spellchecking, Sonnet uses a plugin system that will most likely defer to AbiWord’s Enchant library. Enchant, in turn, defers spellchecking duties to a variety of standard spellchecking libraries such as Aspell. To complement Enchant, Rideout is developing a grammar-checking library to be titled Elixir. Elixir will serve as a common interface to several existing free-software grammar checkers, such as An Gramadóir and LanguageTool. He expects AbiWord to adopt Elixir once it is completed, and he is seeking FreeDesktop.org standardization of the plugin system to be used in both Enchant and Elixir. He envisions the major KDE and GNOME text editors using Enchant and Elixir directly, while OpenOffice.org and other editors will use the standards-compliant plugins.
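A layered plugin design of this kind can be pictured as a common interface with interchangeable backends behind it. The Python sketch below is a generic illustration of that pattern. It does not reproduce the API of Sonnet, Enchant, or Elixir, and every class, method, and word list in it is hypothetical.

```python
from abc import ABC, abstractmethod


class SpellBackend(ABC):
    """Common interface that every backend plugin implements (hypothetical)."""

    @abstractmethod
    def languages(self):
        """Return the set of language codes this backend can check."""

    @abstractmethod
    def is_correct(self, word, language):
        """Return True if the word is spelled correctly in the given language."""


class WordListBackend(SpellBackend):
    """Toy backend backed by in-memory word sets instead of a real library."""

    def __init__(self, word_lists):
        self._word_lists = word_lists

    def languages(self):
        return set(self._word_lists)

    def is_correct(self, word, language):
        return word.lower() in self._word_lists[language]


class Broker:
    """Front end that forwards each request to the first backend supporting the language."""

    def __init__(self):
        self._backends = []

    def register(self, backend):
        self._backends.append(backend)

    def backend_for(self, language):
        for backend in self._backends:
            if language in backend.languages():
                return backend
        raise LookupError(f"no backend registered for {language!r}")

    def is_correct(self, word, language):
        return self.backend_for(language).is_correct(word, language)


broker = Broker()
broker.register(WordListBackend({"en": {"hello", "world"}}))
broker.register(WordListBackend({"fr": {"bonjour", "monde"}}))
print(broker.is_correct("bonjour", "fr"))
print(broker.is_correct("bonjuor", "fr"))
```

The application talks only to the broker, so backends can be added or swapped without changing the calling code, which is the property that makes a shared plugin standard attractive.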

Genesis

The Sonnet project began with a blog post by Zack Rusin, a Trolltech developer with about half a decade’s worth of KDE experience, and the principal developer of KSpell, Sonnet’s predecessor. In May of 2006, Rusin proposed a “full linguistic framework” for KDE in his blog. He spoke of augmenting the one standard feature of desktop linguistics, spellchecking, with support for grammar, dictionary, thesaurus, and translation tools.

Rusin works on the Qt toolkit from which KDE is built, and thus focuses his work primarily on various aspects of computer graphics, not linguistics. He says, “Linguistics is fascinating and for some reasons there’s not a whole lot of people who’d want to deal with it, at least not as far as its desktop usage goes.”

At the time Rusin proposed Sonnet, Jacob Rideout had only minimal experience with KDE development, having contributed a few bug reports and patches. But Rideout was putting his linguistics degree to good use developing Phrasis, a “stripped-down text editor” designed for writers on the KDE platform.

It seemed necessary to Rideout that his text editor for writers have grammar correction ability. “I looked for a Qt/KDE wrapper for a grammar checker under the GPL that I could integrate into Phrasis and found none in KDE, but did find a plugin for AbiWord. I adapted their wrapper to Qt and informed the KDE developers in case they wanted to use it. Several KDE developers, especially those working on KOffice, were excited, but there was no one willing to do the work at that time. So, after doing some research and talking to people like Zack [Rusin], I started hacking.” Rideout now has his own SVN account with the KDE project and has set Phrasis aside to develop Sonnet. Rusin continues with the project in an advisory capacity.

According to Rideout, “Everything is in a 75% done stage in terms of function.” So far, it seems that Rideout’s greatest barrier has been learning to program in a multi-threaded environment. He says, “My development methodology has traditionally been to jump in and start coding, perhaps 10-20% of the needed code. Then, I step back and examine my goals and design a proper solution. The code is refactored and the architecture is implemented…. Essentially I take the same approach to code as with prose — write many drafts and be a remorseless self-editor.” Rideout laments that this programming style has suited him well in the past, but “can create a mess” in a multi-threaded environment. He says that “Most of the work being done currently, while publicly available in the KDE Subversion repository, has yet to be fully reviewed by the veteran KDE gurus.”

Rideout says that translation is a low-priority feature for Sonnet that may not be present for the release of KDE 4, though he would like to implement it in the future and hopes that others will sign on to aid the effort. Similarly, he does not list dictionary or thesaurus tools under the “core components [that] will be ready for the 4.0 release [of KDE].”

Like much of KDE 4, Sonnet promises to bring significant change to a prevailing feature of desktop operating systems. Multilingual users will see the bulk of the improvement, thanks to automatic language detection, but others may still enjoy grammar correction, text statistics, and future features such as translation, dictionary, and thesaurus functionality.