February 8, 2006

Hacking OpenOffice.org dictionaries

Author: Bruce Byfield

Like many features in free and open source software, OpenOffice.org's spellcheck, hyphenation, and thesaurus dictionaries are based on code from earlier projects. You can learn the basics about them from the Lingucomponent Project, but detailed information is difficult to find. Thanks to an email message from a reader and a discussion on the OpenOffice.org mailing lists, I realized that no one had prepared instructions on how to edit the dictionaries. I began to investigate, using nothing but a file editor and persistence. The investigation was a mixed success, but I did learn enough to make some basic hacks and to note where more advanced methods were needed.

Why edit a dictionary? Some people want to add specialized vocabularies more easily than the built-in tool allows. Others want to expand thesaurus entries or edit all the dictionaries.

If OOo dictionaries are installed for a single user on GNU/Linux, you'll find them in the .openoffice.org2/user/wordbook directory under the user's home directory. If they're installed for the entire system, they will be in /opt/OpenOffice.org2.0/share, assuming that you installed from either the .rpm packages provided by the project or from the .deb packages created with Alien. In other builds of the program, you might have to look further afield. Debian, for example, places system-installed dictionaries in /usr/share/myspell/dicts.

Dictionary files are identified by a language, followed by a locale when necessary. A locale is a variant of the main language. English, for example, has more than a dozen locales, including not only UK and US, but also Canada, South Africa, Australia, and Caribbean. For instance, the main file for the spellcheck dictionary for American English is en_US.dic. Some files now also carry a version number at the end of the name. Hyphenation dictionary files have a prefix of "hyph_" and thesaurus dictionary files have a prefix of "th_". Files also have separate extensions, which I'll explain later. You can find a complete list of available files at the Lingucomponent site.
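The naming scheme is regular enough to take apart mechanically. The following is a minimal sketch in Python, assuming only the conventions described above (a "hyph_" or "th_" prefix, a lower-case language code, an optional upper-case locale, and an optional trailing version such as "v2"). The function name is hypothetical and not part of OpenOffice.org.

```python
from pathlib import Path

def describe_dictionary_file(filename):
    """Guess the type, language, locale, and version from a file name."""
    stem = Path(filename).stem                 # drop the extension
    if stem.startswith("hyph_"):
        kind, rest = "hyphenation", stem[len("hyph_"):]
    elif stem.startswith("th_"):
        kind, rest = "thesaurus", stem[len("th_"):]
    else:
        kind, rest = "spellcheck", stem
    parts = rest.split("_")
    language = parts[0]
    extras = parts[1:]
    # An upper-case part right after the language is taken to be the locale.
    locale = extras.pop(0) if extras and extras[0].isupper() else None
    # Anything left over (for example "v2") is treated as a version suffix.
    version = "_".join(extras) or None
    return kind, language, locale, version
```

For example, describe_dictionary_file("th_en_US_v2.dat") separates the thesaurus prefix, the language "en", the locale "US", and the version "v2".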

Basic procedures

Make sure to back up your current file before making any changes to dictionary files. The wrong hack can cause a dictionary tool to crash OpenOffice.org, so you need a reliable backup. If you forget, you can restore the original versions of the files, because the dictionary wizard doesn't delete the downloaded zip files from the dictionary directory.
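If you prefer to script the backup, a short Python sketch such as the following would do. The ".bak" suffix and the function name are my own choices for illustration, not anything OpenOffice.org requires.

```python
import shutil
from pathlib import Path

def back_up(path):
    """Copy a dictionary file to <name>.bak next to the original."""
    source = Path(path)
    backup = source.with_name(source.name + ".bak")
    shutil.copy2(source, backup)   # copy2 also preserves timestamps
    return backup
```

Calling back_up on en_US.dic, for instance, leaves an en_US.dic.bak in the same directory that you can copy back if a hack goes wrong.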

You can edit most dictionary files in a text editor. However, a few, especially those for the thesaurus, are larger than editors such as Bluefish or gedit can handle. If you can't find anything else, you can edit these large files in OpenOffice.org Writer itself. Be aware, though, that they can crash the program unless you have at least 500MB of RAM. Note, too, that you should use File -> Save As and disable the automatic filename extension. Otherwise, you'll need to open a file manager and change the extension manually so that it can work with OpenOffice.org.

Installing dictionaries manually

Most people install dictionaries using File -> Wizards -> Install new dictionaries. This tool installs the latest dictionaries. However, if you want the latest dictionary for a specific language or locale, you might need to install it manually.

OpenOffice.org is cross-platform, so dictionary files are generally stored in zip files. Once you download the zip file, extract the files to a local or system dictionary directory and edit the dictionary.lst file.
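Any unzip tool will handle the extraction. As a sketch, the same step in Python might look like this, where the archive path and the target directory are whatever applies on your system and the function name is hypothetical.

```python
import zipfile
from pathlib import Path

def extract_dictionary(zip_path, dictionary_dir):
    """Extract every file in a dictionary archive and return their names."""
    target = Path(dictionary_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()
```

The returned list of names is a convenient reminder of which entries you still need to add to dictionary.lst.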

The dictionary.lst file is a record of all installed dictionaries. Its lines are easy to read. For example, the files for US English are:

DICT en US en_US
HYPH en US hyph_en_US
THES en US th_en_US_v2

Each line contains:

  • An identifier for the type of dictionary: DICT, HYPH, or THES
  • The two-letter code for the language, in lower case letters
  • The two-letter code for the locale in upper case letters
  • The actual file name, without an extension

Although you can use any file name you want, you must get the language and locale codes right in order for the dictionary to work with OpenOffice.org.
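Because each line has exactly four whitespace-separated fields, reading and writing dictionary.lst entries is easy to script. The following Python sketch assumes only the layout listed above; both function names are hypothetical.

```python
VALID_TYPES = ("DICT", "HYPH", "THES")

def parse_lst_line(line):
    """Split one dictionary.lst line into its four fields."""
    kind, language, locale, name = line.split()
    if kind not in VALID_TYPES:
        raise ValueError(f"unknown dictionary type: {kind}")
    return {"type": kind, "language": language, "locale": locale, "file": name}

def format_lst_line(kind, language, locale, name):
    """Build a line, forcing the case conventions for language and locale."""
    if kind not in VALID_TYPES:
        raise ValueError(f"unknown dictionary type: {kind}")
    return f"{kind} {language.lower()} {locale.upper()} {name}"
```

Forcing the case in format_lst_line is a small guard against the most likely typing mistake in the two codes.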

The new dictionaries are available for use the next time you start OpenOffice.org. You can check whether they're working by assigning a paragraph to the language, then running a spellcheck for it.

Adding a user-defined dictionary

User-defined dictionaries are usually lists of specialized vocabulary words. You can create a user-defined dictionary for a body of specialized knowledge to make your spellcheck more efficient and less wasteful of your time. User-defined dictionaries for Sun Microsystems and StarOffice are part of OpenOffice.org, and you can view them from Tools -> Options -> Language Settings -> Writing Aids. Another dictionary installed with the program is IgnoreAllList, a list of words that are ignored when you spellcheck (see Tools -> Spellcheck).

You can also create a user-defined dictionary from Tools -> Options -> Language Settings -> Writing Aids. Select the New or Edit button to edit a user-defined dictionary one word at a time. Enter the word in the field provided, then press the Enter key.

The trouble with this interface is that it lacks both tools and a display suitable for large dictionaries. As a workaround, you can create the dictionary and add a single word as a sample, then open the dictionary in a text editor, where you can see all the entries. Here you can search and replace and copy and paste.

When you open a user-defined dictionary in a text editor, you'll find a first line something like this:

##WBSWG6ÿ####Debian

In this example, "Debian" is the first entry you created to show you how it's done. The only rule you have to worry about is that four number signs (or pound signs) must separate each entry. You can add multiple entries per line, but adding one entry per line is clearer. For instance, after you add the next entry, the previous example might look like this:

##WBSWG6ÿ####Debian
####Slackware
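If you have a long word list to add, you could generate the new lines rather than type them. The Python sketch below follows the description above literally: it leaves the existing text, including the header, untouched and appends one entry per line, each preceded by four number signs. It works on text only; I am assuming the file survives a round trip through your editor or script, so keep a backup and test with a short list first.

```python
SEPARATOR = "####"

def append_entries(existing_text, words):
    """Return the dictionary text with one new entry per line appended."""
    lines = existing_text.rstrip("\n").split("\n")
    lines.extend(SEPARATOR + word for word in words)
    return "\n".join(lines) + "\n"
```

You would paste the result back into the file in your text editor, exactly as you would when adding the entries by hand.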

Hacking spellcheck dictionaries

OpenOffice.org spellcheck dictionaries follow a system known as MySpell, which is a modified version of Ispell's structure. Under MySpell, spellcheck dictionaries consist of two text files: one with a .dic extension and one with an .aff extension. The .dic file contains the list of words for the locale, while the .aff file contains a series of rules by which you can add affixes -- prefixes and suffixes -- to the spellchecker without needing to list them in the .dic file.

Removing an entry from a MySpell dictionary is only a matter of opening the .dic file, deleting the word, and saving the file. The .aff file doesn't have to concern you at all.

Adding an entry can be just as simple. The drawback to this solution is that the dictionary won't detect any variants of the word. For example, if a member of the Penguin Company adds the company's name to the dictionary, the dictionary will recognize "Penguin" as a word, but not "Penguin's." You must add "Penguin's" separately in order for the dictionary to recognize it. In some cases, you'd need to think of half a dozen variants and add all of them.

The .aff file offers a more elegant solution. A forward slash and one or more uppercase letters follow many words in a .dic file. These uppercase letters point to specific rules in the .aff file that you can use to extend the recognition of the basic word in the .dic file to variants.

The .aff file for a spellcheck dictionary contains three kinds of entries: prefix rules, suffix rules, and other, more general rules. For the purposes of editing a dictionary, you can ignore the general rules and focus on the prefixes and suffixes.

Each prefix and suffix entry consists of a general summary line, followed by a list of prefixes or suffixes covered by the rule. For example, the first prefix rule in the en_US.aff file reads:

PFX A Y 1

This line is a summary of the rule. It has four columns:

  • Column 1 shows whether it is a prefix (PFX) or suffix (SFX) rule.
  • Column 2 shows the code used to point to the rule in the .dic file. Each prefix or suffix rule has a unique code.
  • Column 3 shows whether the rule can be combined with rules of the opposite kind. Options are Y (Yes) or N (No).
  • Column 4 shows the number of variants included in the rule. Rules are grouped according to a common purpose, such as forming a plural or a past tense.

The lines below the summary give the specific prefixes or suffixes covered by the rule. These entries appear in five columns. The first two are the same as for the summary. The third column lists any characters to strip from the word before the affix is added, and is set to 0 when nothing is stripped. The fourth column is the prefix or suffix itself, and the fifth shows the condition a word must meet for the entry to apply to it.
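To see how the summary and entry lines fit together, here is a rough Python sketch that reads one rule into a small data structure. It assumes the MySpell convention that the third field of an entry line names the characters to strip from the word, with 0 meaning none, and that the fifth field is the condition. The sample entry line used in the note below the code is my recollection of en_US.aff, so check it against your own copy.

```python
def parse_affix_rule(lines):
    """Parse one rule: a summary line followed by its entry lines."""
    kind, code, combine, count = lines[0].split()
    rule = {
        "kind": kind,                    # PFX or SFX
        "code": code,                    # the letter used in the .dic file
        "combinable": combine == "Y",
        "entries": [],
    }
    # The summary's fourth column says how many entry lines follow.
    for line in lines[1:1 + int(count)]:
        _kind, _code, strip, affix, condition = line.split()[:5]
        rule["entries"].append({
            "strip": "" if strip == "0" else strip,
            "affix": affix,
            "condition": condition,
        })
    return rule
```

Passing it the two lines "PFX A Y 1" and "PFX A 0 re ." yields a combinable prefix rule with code A and a single entry that adds "re".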

For hacking purposes, you need two things from the .aff file: the second column of the summary, which is the code you add to the .dic file, and the fourth column of each individual entry, which tells you which prefix or suffix that code produces. When you add a word to the .dic file, open its .aff file, note the codes for all the prefixes and suffixes that might apply to the word, and enter them after the word, separated from it by a forward slash. For example, suppose you add this:

Penguin/M

The entry now covers both Penguin and Penguin's. As you might expect, the more meanings a word has or the more forms of speech it can take, the more affix codes it is going to have.
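To check which forms an entry will cover before you save the .dic file, you can expand it by hand or with a few lines of code. The Python sketch below handles suffix codes only and uses a single hard-coded rule, assumed for illustration, in which code M adds 's to any word. A real script would load the rules from the .aff file instead, and the spellchecker itself is more thorough than this.

```python
import re

# Assumed rule for illustration: (characters to strip, suffix, condition).
SUFFIX_RULES = {"M": [("", "'s", ".")]}

def expand(dic_entry, rules=SUFFIX_RULES):
    """List the base word and every variant its suffix codes produce."""
    word, _, codes = dic_entry.partition("/")
    word, codes = word.strip(), codes.strip()
    forms = [word]
    for code in codes:
        for strip, suffix, condition in rules.get(code, []):
            # The condition is matched against the end of the word.
            if re.search(f"(?:{condition})$", word):
                stem = word[:len(word) - len(strip)] if strip else word
                forms.append(stem + suffix)
    return forms
```

Run on the entry from the example above, it lists Penguin and Penguin's, which matches what the spellchecker should accept.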

Basic thesaurus hacks

OpenOffice.org has two thesaurus files: one with a .dat extension and another with an .idx extension. Both files are editable, but unfortunately, I have been unable to edit the .idx file in any way that would enable users to add words to the thesaurus for a language and locale.

However, you can delete or modify existing entries. These abilities might be useful if you're concerned that profane words might pop up in the software when it is used by children. The English dictionaries, at any rate, include what might be euphemistically called earthy alternatives -- a matter for some debate and hilarity, but a possible hindrance to educational use.

The entry for each word in the .dat file is mostly self-explanatory. Here's a random example:

cake|4
(noun)|bar|block (generic term)
(noun)|patty|dish (generic term)
(noun)|baked goods (generic term)
(verb)|coat|cover (generic term)|spread over (generic term)

The first line gives the word and the number of different meanings. Each meaning is on its own line, starting with its part of speech in parentheses, followed by suggested terms, with a bar separating each field. Most terms are further specified by a description such as generic term, related term, or antonym.
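The layout is simple enough to read with a few lines of code. This Python sketch parses one entry, assuming exactly the format of the example above. The function name is hypothetical.

```python
def parse_entry(lines):
    """Parse one .dat entry into the word and its list of meanings."""
    word, count = lines[0].split("|")
    meanings = []
    # The count on the first line says how many meaning lines follow.
    for line in lines[1:1 + int(count)]:
        part_of_speech, *terms = line.split("|")
        meanings.append((part_of_speech.strip("()"), terms))
    return word, meanings
```

Each meaning comes back as a pair: the part of speech without its parentheses, and the list of suggested terms with their descriptions still attached.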

You can delete an entry by removing it from both the .dat and .idx files. If you want to add additional meanings, follow the formats of other entries and change the number of meanings in the first line for an entry.
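Adding a meaning means appending a line and raising the count on the first line by one, and forgetting the count is the easy mistake to make. Here is a sketch of that bookkeeping in Python. It works on the lines of a single .dat entry and does not touch the .idx file. The function name is hypothetical.

```python
def add_meaning(entry_lines, part_of_speech, terms):
    """Return a copy of the entry with one more meaning and an updated count."""
    word, count = entry_lines[0].split("|")
    new_line = "|".join([f"({part_of_speech})", *terms])
    return [f"{word}|{int(count) + 1}", *entry_lines[1:], new_line]
```

The original list is left unchanged, so you can compare the old and new versions before pasting the result into the file.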

Another simple hack is to add a thesaurus to a locale that you've installed. Compiling a thesaurus is a much larger undertaking than creating a spellcheck dictionary, so it often takes a while for a locale to have its own thesaurus. You can add a thesaurus to a locale by copying and renaming a thesaurus from a similar locale and adding the name of the new file to the dictionary.lst file in the same way as you would when installing a dictionary manually.
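The copy-and-rename hack can be scripted as well. The following Python sketch copies both thesaurus files under a new name and appends the matching THES line to dictionary.lst. The "th_language_LOCALE" naming of the copy and the function name are my assumptions, and the directory is whichever dictionary directory you are using.

```python
import shutil
from pathlib import Path

def clone_thesaurus(dictionary_dir, source_name, language, locale):
    """Copy a thesaurus for use with another locale and register it."""
    folder = Path(dictionary_dir)
    new_name = f"th_{language}_{locale}"
    # A thesaurus consists of a .dat file and an .idx file; copy both.
    for extension in (".dat", ".idx"):
        shutil.copy2(folder / (source_name + extension),
                     folder / (new_name + extension))
    listing = folder / "dictionary.lst"
    text = listing.read_text() if listing.exists() else ""
    if text and not text.endswith("\n"):
        text += "\n"               # avoid gluing the new line onto the last one
    listing.write_text(text + f"THES {language} {locale} {new_name}\n")
    return new_name
```

For instance, clone_thesaurus(directory, "th_en_US_v2", "en", "CA") would give a Canadian English locale a copy of the US thesaurus.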

More advanced hacking

You also might want to learn how to produce your own thesaurus by adding specialty words. For instance, if you speak German, you might want to look at the OpenThesaurus site. There's also a SourceForge site for OpenThesaurus, but little seems to happen on it.

OOo's TeX-based hyphenation dictionaries have been refined over years, so it's unlikely that you'll need to edit them. At any rate, I doubt that OpenOffice.org is equipped to handle some TeX codes, such as the one for kerning. OpenOffice.org has inherited these codes, like some free software version of an appendix. Still, it might be useful to add words that should not be hyphenated to the dictionary.

In this article, I've only mentioned hacks that you can implement without recompiling anything or using special tools. However, if you're more ambitious, you can find help at the Lingucomponent project site.

The main point of this article is not to definitively describe how you can hack dictionaries in OpenOffice.org, but rather to let you know it can be done. If you discover any other dictionary hacks -- or any other hacks for OpenOffice.org files -- let me know. If I collect enough hacks, I might publish the best of them in a follow-up article.

Bruce Byfield is a course designer and instructor, and a computer journalist who writes regularly for NewsForge.
