February 6, 2006

Setting up international character support

Author: Bruce Byfield

Like other operating systems, GNU/Linux is starting to add increased support for international characters. The support is spotty in places, and varies between systems because of differences in keyboards, distributions, fonts, and program support. Even so, if you make a few configuration changes, you can use the keyboard to enter the characters for dozens of languages with only a few problems.

On GNU/Linux, a character encoding is selected as part of a locale, so the two terms are often used loosely as synonyms. The first locales were based on ASCII, originally a 128-character table created for modern English. Other encoding tables, such as Extended ASCII, ANSI, and ISO 8859, expanded the number of characters to support other languages, especially European ones. Today, all these standards are being superseded by Unicode, an encoding scheme that attempts to include every character in every written language. Backward compatible with ASCII, Unicode is most often implemented as the 8-bit Unicode Transformation Format (UTF-8), although other variations also exist, such as UTF-16, which is used in Java.
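To make the backward compatibility concrete, here is a minimal sketch that prints the bytes behind a character. The helper name utf8_bytes is made up for this illustration, and the sketch assumes the terminal and the script itself are encoded in UTF-8.

```shell
# utf8_bytes is a hypothetical helper: it prints the bytes of its argument
# in hexadecimal, exactly as the shell passes them along.
utf8_bytes() {
    set -- $(printf '%s' "$1" | od -An -v -tx1)   # word-split the od output
    echo "$*"
}

utf8_bytes 'e'    # 65: a single byte, identical to ASCII
utf8_bytes 'é'    # c3 a9: two bytes in UTF-8
```

Plain English text therefore looks the same in ASCII and in UTF-8, and only the characters outside the original table take up extra bytes.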

Created in 1992 by Ken Thompson on a placemat in a New Jersey diner, UTF-8 has today become a computing standard. Most recent Linux distributions support UTF-8, although many, including Debian, give users the option of using legacy locales that contain only the characters needed for a specific language.

In theory, UTF-8 allows you to read, edit, and print text in various languages, and even to mix multiple languages in the same document. In practice, however, the usefulness of UTF-8 is limited by the fact that most fonts support only a limited part of the encoding table. In fact, some fonts -- especially free ones -- support fewer characters than the original ASCII table. A particular problem is support for languages that do not use Latin, Greek, or Cyrillic alphabets, such as Hebrew, Arabic, Japanese, Chinese, and Korean. Often, the details of configuring a multilingual system that includes just one of these languages would require an article in itself. By contrast, configuring a computer to support European languages is much easier and often allows users to continue to use fonts designed for legacy locales.

If you want your computer to support a particular language, you can set that language as your locale. You can even add multiple locales and switch between them, as explained below. You may even be able to create your own locale, if you have the time and expertise. But if what you want is a wide variety of diacritical marks in case you need them, then your simplest choice is a UTF-8 locale.

Reconfiguring a system for international characters

Reconfiguring the locale requires two steps: changing the system locale, and changing the keyboard mapping in the X Window System.

You can tell which locales are enabled on your system with the command locale -a. A legacy locale that supports only one language is identified in the locale command's output by an abbreviation for the language followed by one for the variant of the language. For example, en_GB is the legacy locale for English in the United Kingdom. A locale supporting UTF-8 adds the extension .utf8, which means you can still see the language and variant and use them as indicators of the general keyboard layout, as well as how the system should display dates and numbers.
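As a rough illustration of how these names break down, the sketch below splits a locale name into the parts just described. The function describe_locale is a hypothetical helper, not a standard command, and it treats a name without an extension as a legacy locale.

```shell
# Split a locale name such as en_GB.utf8 into its parts.
# describe_locale is a hypothetical helper, not a standard command.
describe_locale() {
    name=${1%%@*}                 # drop any @modifier, such as @euro
    codeset=legacy                # no extension is treated as a legacy encoding
    case $name in
        *.*) codeset=${name#*.}; name=${name%%.*} ;;
    esac
    territory=none
    case $name in
        *_*) territory=${name#*_}; name=${name%%_*} ;;
    esac
    printf 'language=%s variant=%s encoding=%s\n' "$name" "$territory" "$codeset"
}

describe_locale en_GB         # language=en variant=GB encoding=legacy
describe_locale en_GB.utf8    # language=en variant=GB encoding=utf8

# On a live system, the same helper can be fed from the real list:
#   locale -a | while read -r l; do describe_locale "$l"; done
```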

You can enable a locale for the system by logging in as root and adding the locale's name on a separate line to /etc/locale.gen, then entering the command locale-gen. After that, setting the system to use a new locale is a matter of editing the /etc/environment file. For example, to use the UK English UTF-8 locale, change the LANG= line to read:

LANG=en_GB.UTF-8

If you use Debian, you can use the dpkg-reconfigure locales command to take care of all these steps.
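For those who prefer to script the manual steps, the following is a sketch only. Both function names are invented for this example, the file paths are passed in so that you can try the functions on copies first, and the "name charset" entry format is an assumption based on Debian's locale.gen.

```shell
# Hypothetical helpers; try them on copies before touching the real files.

# Append an entry to a locale.gen-style file unless it is already present.
enable_locale() {
    gen_file=$1 entry=$2
    grep -qxF -- "$entry" "$gen_file" || printf '%s\n' "$entry" >> "$gen_file"
}

# Replace the LANG= line in an environment-style file, or add one.
set_system_lang() {
    env_file=$1 lang=$2
    if grep -q '^LANG=' "$env_file"; then
        sed "s/^LANG=.*/LANG=$lang/" "$env_file" > "$env_file.new" &&
            mv "$env_file.new" "$env_file"
    else
        printf 'LANG=%s\n' "$lang" >> "$env_file"
    fi
}

# As root, on a Debian-style system (entry format is an assumption):
#   enable_locale /etc/locale.gen 'en_GB.UTF-8 UTF-8'
#   locale-gen
#   set_system_lang /etc/environment en_GB.UTF-8
```

Because enable_locale checks for an existing line first, running it twice does not add a duplicate entry.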

The next step is to edit the keyboard mapping in the X Window System by editing either the XF86Config-4 or xorg.conf file (whichever the system has) in /etc/X11. Change the XkbLayout line in the file so that the locale matches the one for the system. Note that the locale is entered entirely in lower case letters. For example, if you want to enable UTF-8 with an American English locale, the edited line would be:

Option "XkbLayout" "en_us.utf-8"

If you want to enable multiple mappings for the keyboard, you can use the setxkbmap command, or gnome-keyboard-properties if you're using GNOME, or use the Keyboard Layout settings in the Regional & Accessibility section of the KDE Control Center. Because it is user-specific, setting the keyboard mapping in GNOME can be more flexible, as explained below.

Alternatively, you can add multiple locales in a comma-separated list to the XkbLayout line. If you want to switch between them, you also need to add another option to the keyboard section that defines the key that changes the keyboard being used. For instance, if you wanted to switch keyboards by pressing the Alt and Shift keys, the line would read:

Option "XkbOptions" "grp:alt_shift_toggle"

A complete list of keys that you can use is found in /etc/X11/xkb/rules/xorg.lst (or xfree86.lst). The key definition must end with _toggle.
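Rather than reading through the whole list by eye, you can filter it. The sketch below assumes that each switching option appears in the rules file as a name beginning with grp: and ending in _toggle; list_group_toggles is a made-up helper name.

```shell
# Print the layout-switching options found in an XKB rules listing.
# list_group_toggles is a hypothetical helper, not part of X.
list_group_toggles() {
    grep -o 'grp:[A-Za-z0-9_]*_toggle' "$1" | sort -u
}

# On a live system (the path may differ on your distribution):
#   list_group_toggles /etc/X11/xkb/rules/xorg.lst
#
# To try an option in the running session without editing any file,
# assuming two layouts named us and fr:
#   setxkbmap -layout us,fr -option grp:alt_shift_toggle
```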

Once you have changed the locale and keyboard mapping, either restart the X Window System or reboot the system.

KDE lets the system define locales, but GNOME maintains definitions in the .gconf/desktop/gnome/peripherals/keyboard/xkb/%gconf.xml file of each user's home directory. For this reason, when you start GNOME after changing locales, you get an error message and a dialog window. You can use this window to choose to use either the X or GNOME locale.

GNOME does this only when legacy locales are involved in the changes. GNOME does not react to changes between UTF-8 character encodings. In other words, a change in locale from us (American English) to en_US.UTF-8 (American English with UTF-8 support) is queried, but a change from en_US.UTF-8 (American English with UTF-8 support) to en_GB.UTF-8 (UK English with UTF-8 support) will not be queried. The response can be a nuisance, but it does let you define keyboards differently for each GNOME user by using gnome-keyboard-properties.

Using international characters from the keyboard

You can enter extended Latin characters from the keyboard in three ways: deadkeys, the Compose and AltGr keys, and the Multi_key -- all of which we'll explain in a moment. Which you have, and how functional each one is, depends on the hardware and software details of your system.

You will also need a font that supports UTF-8; otherwise, you may not be able to view what you are typing to see if the mapping is properly set up. When trying out a changed mapping, try using Arial, Times New Roman, or Bitstream Vera Sans, or a modern font such as Gentium, all of which are generally available on GNU/Linux and have a reasonable selection of characters.

Deadkeys are so named because pressing them displays nothing on the screen unless you press another key immediately afterwards. Deadkeys are used in many locales to enter a character with a diacritical mark. For example, if you are using a French locale and you press the apostrophe key followed by the letter "e," you will get an "e" with an acute accent (é). Both UTF-8 and many legacy locales support deadkeys, but us, perhaps the most common legacy locale, does not. In some cases, the XF86Config-4 or xorg.conf file in /etc/X11 may have the following line:

Option "XkbVariant" "nodeadkeys"

You must delete or comment out this option if you want deadkeys to work.
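If you would rather not edit the file by hand, the following sketch comments the option out with sed and keeps a backup copy. The function name is invented for this example, and the pattern assumes the option is written on one line as shown above.

```shell
# Comment out any XkbVariant "nodeadkeys" line in an X configuration file.
# comment_out_nodeadkeys is a hypothetical helper; it leaves a .bak copy.
comment_out_nodeadkeys() {
    conf=$1
    cp "$conf" "$conf.bak" &&
    sed '/^[[:space:]]*Option[[:space:]]*"XkbVariant"[[:space:]]*"nodeadkeys"/s/^/#/' \
        "$conf.bak" > "$conf"
}

# As root, using whichever file your system has:
#   comment_out_nodeadkeys /etc/X11/xorg.conf
```

The change takes effect only after the X Window System is restarted.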

When using a UTF-8 locale, you may also be able to use the Compose key (the right Windows key) or the AltGraph key (Right Alt, or AltGr) to enter other extended characters, such as the copyright sign (©). These keys enter characters defined in /usr/X11R6/lib/X11/locale/<locale name>/Compose. The file lists a number of alternate key combinations for each character, only some of which are likely to work on your keyboard. In most instances, these are defined as starting with a Multi_key, but they may also work with the Compose or AltGraph key. To see if they do, just press the Compose or AltGraph key, release it, then enter one of the listed combinations.
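Since a Compose file can run to thousands of lines, a small filter helps in finding the combinations for one character. The sketch below assumes the common layout in which each line holds a key sequence, a colon, the resulting character, and then a keysym name such as copyright. The helper compose_sequences is hypothetical, and your file's layout may differ.

```shell
# List the Multi_key sequences that produce a given keysym name.
# compose_sequences is a hypothetical helper: $1 is the Compose file,
# $2 is a keysym name such as copyright or eacute.
compose_sequences() {
    awk -v sym="$2" '
        /^<Multi_key>/ {
            split($0, parts, ":")            # key sequence : result
            if (parts[2] ~ ("(^|[ \t])" sym "([ \t]|$)")) {
                sub(/[ \t]+$/, "", parts[1]) # trim trailing blanks
                print parts[1]
            }
        }' "$1"
}

# On a live system, substituting your own locale directory:
#   compose_sequences /usr/X11R6/lib/X11/locale/en_US.UTF-8/Compose copyright
```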

In theory, you should be able to add your own definitions to the Compose file. So far, however, I have been unable to do so under either Debian or Fedora Core 4.

If you find that the Compose and AltGraph keys do not work at all, or only work to input a few characters, one solution may be to try another locale. Another solution is to modify the keyboard definition in the XF86Config-4 or xorg.conf file by adding the following line:

Option "XkbOptions" "compose:rwin,grp:switch"

In GNOME, programs using GTK-2 may require adding a new line to /etc/environment:

export GTK_IM_MODULE=xim

If all else fails, you can define a Multi_key to use in place of deadkeys or Compose or AltGraph. A Multi_key works exactly the same way as the Compose or AltGraph keys, except that you define which key it is. You can define any key as the Multi_key, but your life will be easier if you choose a key that has no other use, such as the Left Windows key. The Multi_key must be defined individually for each user on the system.

To define a Multi_key, you need the keycode for the key you want to redefine. The keycode varies with the keyboard. To learn it, start xev from the command line to open an event window. Then, with the cursor in the event window, press the key to see the keycode listed in the command line output.
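The output of xev is verbose, so it can help to filter it down to the keycode and key name. This sketch assumes that xev reports key events on lines of the form "keycode 117 (keysym 0xffeb, Super_L)"; the function name is made up for the example.

```shell
# Reduce xev output to "keycode keyname" pairs.
# extract_keycodes is a hypothetical helper that reads xev output on stdin.
extract_keycodes() {
    sed -n 's/.*keycode \([0-9][0-9]*\) (keysym [^,]*, \([^)]*\)).*/\1 \2/p'
}

# In an X session, press the key while the event window has focus:
#   xev | extract_keycodes
```

Each key normally appears twice, once for the press and once for the release.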

Once you have the keycode, create a plain text file in the user's home directory. If you name the file .Xmodmap, it will be recognized automatically by the system. Open the file in a text editor and add a line that defines the Multi_key. For instance, on my keyboard, the Left Windows key has a keycode of 117, so I would enter:

keycode 117 = Multi_key

If you choose not to call the file .Xmodmap, enable it by entering the command:

xmodmap file name

This step is unnecessary if you use the default name.

Either way, the next time that you log in, the Multi_key should be functional.
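The whole procedure can be sketched in a few lines of shell. The helper write_multi_key_map is hypothetical, the keycode 117 is only the value from my keyboard, and the function overwrites the target file, so adapt it if you already have an .Xmodmap with other entries.

```shell
# Write a one-line xmodmap file that turns the given keycode into Multi_key.
# write_multi_key_map is a hypothetical helper; it overwrites the target file.
write_multi_key_map() {
    map_file=$1 keycode=$2
    case $keycode in
        ''|*[!0-9]*) echo "keycode must be a number" >&2; return 1 ;;
    esac
    printf 'keycode %s = Multi_key\n' "$keycode" > "$map_file"
}

# In an X session, to apply the mapping immediately and then check it:
#   write_multi_key_map "$HOME/.Xmodmap" 117
#   xmodmap "$HOME/.Xmodmap"
#   xmodmap -pke | grep Multi_key
```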

Whether you are using the Compose, AltGraph, or Multi_key, when you have found which keys work, consider making a list of the keystrokes needed for the top 20 or so characters that you are likely to use. Hung beside your monitor, such a list is probably the least painful way of converting to a new locale, especially a UTF-8 locale.

Annoyances and limitations

Although GNU/Linux support for UTF-8 has expanded greatly in the last few years, it still has limitations. For touch typists used to a legacy English keyboard mapping, probably the greatest annoyance is the need to press the apostrophe key followed by the space bar to type a straight quotation mark. The ability to use the apostrophe key by itself as a deadkey for adding acute accents is handy, but learning a new way to type quotation marks means relearning a reflex action. While keyboard mappings can be edited to correct this problem, this is probably more than casual users are prepared to do.

Another problem comes into play with international characters in HTML. While most browsers support UTF-8 to at least some degree, and best practices suggest that page designers declare the character encoding at the start of an HTML file, many users are still using legacy locales -- often English ones. Therefore, if you want to reach the widest possible audience, you should enter extended characters as decimal character references, such as &#233; for é.
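Converting by hand is tedious, so here is a sketch of a filter that does it. The helper to_decimal_refs is made up for this example. It assumes that its input is valid UTF-8, passes ASCII through untouched, and does not escape characters such as & or < that have their own meaning in HTML.

```shell
# Rewrite every non-ASCII character of the argument as a decimal
# character reference. to_decimal_refs is a hypothetical helper and
# assumes that the argument is valid UTF-8.
to_decimal_refs() {
    printf '%s' "$1" | od -An -v -tu1 | awk '
        {
            for (i = 1; i <= NF; i++) {
                b = $i + 0
                if (b < 128) {                 # plain ASCII byte
                    printf "%c", b
                } else if (b >= 192) {         # lead byte of a sequence
                    if (b >= 240)      { cp = b - 240; need = 3 }
                    else if (b >= 224) { cp = b - 224; need = 2 }
                    else               { cp = b - 192; need = 1 }
                } else {                       # continuation byte
                    cp = cp * 64 + (b - 128)
                    if (--need == 0) printf "&#%d;", cp
                }
            }
        }
        END { printf "\n" }'
}

to_decimal_refs 'café'    # caf&#233;
```

The result is plain ASCII, so it displays correctly whatever locale the reader's system uses.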

An even greater limitation is that many programs only partly support UTF-8. Often the problem is that the application specifies a font that does not support international characters. In some cases, the difficulty runs deeper. For instance, gedit displays international characters, but may crash if you try to print them. Similarly, you may need to configure Mozilla to use xprint if you need to print the full range of international characters. You may also need to reconfigure some versions of vim and Emacs. Even more frustratingly, OpenOffice.org supports only deadkeys, so if you want to use characters such as the Euro sign, you need to record a macro and assign the macro to a key combination in order to enter it from the keyboard in OOo.

However, the greatest problem is the BASH shell. When the changes described here are implemented and BASH is run without the X Window System, it can display UTF-8 characters, but cannot accept them as input. True, you can enable international console support with the help of the commands kbd_mode, dumpkeys, and loadkeys to set the keyboard mappings and the consolechars command to set the font, or with the help of distribution-specific scripts such as Debian's /etc/init.d/keymap.sh. However, a lack of loaded kernel support or of suitable console fonts may complicate the procedure. Moreover, mistakes can easily leave your system in a non-usable state. For these reasons, if you want to use international characters in file names, you might want to reconcile yourself to using ls and copying and pasting rather than typing at the command line -- and to forgoing aliases that use extended characters.

It's difficult to make generalizations about what level of UTF-8 support to expect, because of all the different possible combinations of hardware and software. Much of the time, you can learn only through experimentation. Even so, support for international characters has come a long way in the last few years, from a time when the only way to add characters and diacritical marks foreign to your locale was to pick them out one at a time from a character map window.

Long-time users of legacy locales may have to learn a few new habits and unlearn a few old ones when using UTF-8 characters. Yet these difficulties are nothing that the speakers of many languages have not had to endure for years. And, with the rise of UTF-8 locales and keyboard support, all users are finally starting to participate in computing on an equal footing.

Bruce Byfield is a course designer and instructor, and a computer journalist who writes regularly for NewsForge, Linux.com and IT Manager's Journal.
