May 1, 2006

Controlling your locale with environment variables

Author: Bill Poser

People all over the world use Linux in dozens of languages. Since Linux's source code is free and open, speakers of minority languages can add support for their languages themselves, even though a large corporation might not consider them a worthwhile market. If you use more than one language, or a language other than English, you should know about Linux's use of locales to support different languages. Indeed, understanding locales can be useful even if you only use English.

You choose your locale by setting environment variables. Different variables control different things. LC_MESSAGES determines the language and encoding of messages as well as of labels in GUI components if they use GNU gettext or one of its relatives to obtain translations. A few programs obtain translations in other ways and may not be affected by LC_MESSAGES.

LC_CTYPE defines character classes, which are named sets of characters used by a variety of programs, especially regular expression matchers. In programs such as grep that use character classes like [:alnum:], class membership varies with locale. If the writing system distinguishes upper- and lower-case letters, LC_CTYPE specifies the relationship between the two.
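
As a rough illustration, character classes can be tried out from the shell. The sketch below pins the locale to C so the result is predictable. The commented-out line assumes a locale named fr_FR.UTF-8 is installed, which may not be true on your system.

```shell
# Count the lines that contain at least one alphanumeric character.
# In the C locale only ASCII letters and digits belong to [:alnum:],
# so "x" and "7" match and "-" does not.
alnum_lines=$(printf 'x\n7\n-\n' | LC_ALL=C grep -c '[[:alnum:]]')
echo "$alnum_lines"

# LC_CTYPE also supplies the upper/lower-case mapping that tr relies on.
upper=$(printf 'abc\n' | LC_ALL=C tr '[:lower:]' '[:upper:]')
echo "$upper"

# Assumption: fr_FR.UTF-8 is installed. There, an accented letter such as
# é should be a member of [:alpha:], which it is not in the C locale.
# printf 'é\n' | LC_ALL=fr_FR.UTF-8 grep -c '[[:alpha:]]'
```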

Sort order is controlled by LC_COLLATE. Sort orders can differ within the same writing system. In ASCII order, the upper-case letters as a group precede the lower-case letters: A B C ... a b c.... In French, upper-case letters immediately follow their lower-case counterparts: a A b B c C.... In some other locales upper-case letters immediately precede their lower-case counterparts: A a B b C c....
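
A quick way to see the ASCII ordering is to run sort under the C locale. This is only a sketch: the second, commented-out command assumes a French locale named fr_FR.UTF-8 is installed, and its output depends on that locale's collation rules.

```shell
# Byte-value (ASCII) order: every upper-case letter sorts before any
# lower-case letter. paste joins the sorted lines onto one line.
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | paste -s -d ' ' -)
echo "$c_sorted"

# Assumption: fr_FR.UTF-8 is installed. The same input should then come
# out with each letter's two cases next to each other.
# printf 'b\nA\na\nB\n' | LC_ALL=fr_FR.UTF-8 sort
```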

Sort order affects not only sorting but character ranges in regular expressions. In ASCII order, the range expression [A-Z] stands for the 26 upper-case letters. That is because the expression stands for the characters starting with A and ending with Z. In ASCII, the upper-case letters form a contiguous block of 26 running from A to Z. However, in the French locale the letters starting with A and ending with Z consist of all of the letters except for lower-case a.
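
The effect on range expressions can be checked the same way. The sketch below uses only the C locale, where the result is well defined. How a given grep treats [A-Z] in other locales varies between implementations and versions, so the character class shown second is the safer choice when you mean "upper-case letters".

```shell
# In the C locale [A-Z] covers exactly the 26 upper-case ASCII letters,
# so only the lines "A" and "Z" match.
range_matches=$(printf 'a\nA\nz\nZ\n' | LC_ALL=C grep -c '[A-Z]')
echo "$range_matches"

# A character class states the intent directly and does not depend on
# where the letters fall in the collating sequence.
class_matches=$(printf 'a\nA\nz\nZ\n' | LC_ALL=C grep -c '[[:upper:]]')
echo "$class_matches"
```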

The format of times and dates is controlled by LC_TIME. In American English, the date command produces results like this:

Fri Apr 14 20:33:28 PDT 2006

In Catalan, the language is different but the format is the same:

dv abr 14 20:33:49 PDT 2006

But in Japanese the year comes first, then the month, then the day, then the day of the week, and finally the time:

2006年 4月 14日 金曜日 20:34:42 PDT
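
To compare formats yourself, it helps to format one fixed moment rather than the current time. The following sketch assumes GNU date, whose -d @SECONDS form accepts a Unix timestamp. The Japanese locale name in the commented-out line is an assumption about what is installed on your system.

```shell
# 1145072008 is 2006-04-15 03:33:28 UTC, i.e. 20:33:28 PDT on April 14,
# the moment used in the first example above. -u selects UTC so the
# result does not depend on the machine's time zone.
c_date=$(LC_ALL=C date -u -d @1145072008)
echo "$c_date"

# Assumption: ja_JP.UTF-8 is installed. LC_TIME only takes effect when
# LC_ALL is not set.
# LC_TIME=ja_JP.UTF-8 date -u -d @1145072008
```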

The format of numbers is determined by LC_NUMERIC. In American English, a period serves as decimal point and commas break the integers into groups of three:

652,314,159.278

In German, the grouping is the same but the roles of period and comma are reversed:

652.314.159,278

In Hindi the period and comma play the same roles as in American English, but the digits to the left of the decimal point are grouped in twos, except for the rightmost three, which form a single group:

65,23,14,159.278
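
One way to see LC_NUMERIC at work is the printf utility's ' flag, which requests the locale's digit grouping. This is a sketch that assumes the GNU coreutils printf (run through env so that a shell builtin is not used instead). The German locale name in the commented-out line is likewise an assumption.

```shell
# The ' flag asks printf to group digits according to the locale.
# The C locale defines no grouping separator, so the digits stay together.
c_number=$(env LC_ALL=C printf "%'.3f\n" 652314159.278)
echo "$c_number"

# Assumption: de_DE.UTF-8 is installed. The same call should then use
# periods for grouping and a comma as the decimal separator.
# env LC_ALL=de_DE.UTF-8 printf "%'.3f\n" 652314159.278
```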

The format of amounts of money is controlled by LC_MONETARY. In addition to the formatting of the numbers, this environment variable specifies the symbol for the unit of currency and things like how negative values are written.

Additional variables include:

  • LC_PAPER: paper size
  • LC_NAME: personal name format
  • LC_ADDRESS: address format
  • LC_TELEPHONE: telephone number format
  • LC_MEASUREMENT: measurement units

The environment variable hierarchy

Most of the time you'll want to set all of these variables to the same value, but occasionally it makes sense to use different values. For example, if you want your interface language to be English but are sorting French data, you could set LC_MESSAGES to en_US but LC_COLLATE to fr_FR.

When you don't want to use different values, it isn't necessary to set all of the variables individually. Instead, you can set LC_ALL or LANG. When a program consults the environment to determine which locale to use for a given category, it applies the following rules in order:

  1. If the LC_ALL environment variable is defined and is not null, its value is used.
  2. If the appropriate component-specific environment variable -- e.g. LC_COLLATE -- is set and non-null, its value is used.
  3. If the LANG environment variable is defined and is not null, its value is used.
  4. If the LANG environment variable is not set or is null, an implementation-dependent default locale is used.

This means that if you want to use different locales for different purposes, you must leave LC_ALL unset, because it overrides every category-specific variable.
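
The four steps above can be sketched as a small shell function. The name resolve_locale is hypothetical, and the function takes the three candidate values as arguments instead of reading the environment, so it can be tried without changing your session. Printing POSIX in the last branch is an assumption standing in for the implementation-dependent default.

```shell
# resolve_locale LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# Hypothetical helper: prints the locale chosen for one category.
resolve_locale() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"       # step 1: LC_ALL is set and non-null
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"       # step 2: the category variable, e.g. LC_COLLATE
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"       # step 3: LANG
    else
        printf '%s\n' "POSIX"    # step 4: assumed stand-in for the default
    fi
}

resolve_locale ""    fr_FR en_US   # LC_ALL unset: the category value is used
resolve_locale de_DE fr_FR en_US   # LC_ALL set: it overrides the others
resolve_locale ""    ""    en_US   # only LANG set
```

To apply it to your own session for sorting, you could call resolve_locale "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}".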

There is one further environment variable, LANGUAGE, which is used only by GNU gettext, the system that provides translations of messages for many programs. Unlike the others, this variable can be assigned multiple locales separated by colons. The locales are tried in order until a message catalog is found. For example, the specification sv_SE:nn_NO:de_DE indicates that the user prefers Swedish, will accept Norwegian (Nynorsk) if Swedish is not available, and will accept German if neither of those is available. If defined, LANGUAGE takes precedence over LC_ALL, LC_MESSAGES, and LANG when gettext selects a message catalog, except that gettext ignores it when the locale in effect is the C locale.

Locale names

Locale names take the form:

language(_territory)(.encoding)(@modifier)

The only obligatory part is the language code, such as en for English. The same language may be used in different ways in different countries, so locale names commonly include a territory code as the second component. Thus fr_FR is the locale for French in France, fr_CA the locale for French in Canada, and en_CA the locale for English in Canada. An encoding may also be specified. Without one, the locale's default encoding applies, which for English is traditionally ASCII or a single-byte superset of it such as ISO-8859-1, whereas en_US.UTF-8 specifies American English in the UTF-8 Unicode encoding. Language codes are usually taken from the two-letter codes defined in ISO 639-1, and territory codes from the two-letter codes defined in ISO 3166-1.

The modifier you are most likely to see is euro, which is used for locales using the euro currency where the original locale definition was created before the introduction of the euro in the European Union. For example, es_ES@euro is the current Spanish locale, while es_ES is the pre-euro locale.
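
To make the naming pattern concrete, here is a minimal sketch that splits a locale name into its parts using only POSIX shell parameter expansion. The function name parse_locale is hypothetical, and the code assumes well-formed names of the form shown above. It does not validate the codes themselves.

```shell
# parse_locale NAME
# Hypothetical helper: splits language(_territory)(.encoding)(@modifier).
parse_locale() {
    name=$1
    modifier= encoding= territory=
    # Peel the optional parts off from the right-hand end.
    case $name in *@*) modifier=${name#*@}; name=${name%%@*} ;; esac
    case $name in *.*) encoding=${name#*.}; name=${name%%.*} ;; esac
    case $name in *_*) territory=${name#*_}; name=${name%%_*} ;; esac
    printf 'language=%s territory=%s encoding=%s modifier=%s\n' \
        "$name" "$territory" "$encoding" "$modifier"
}

parse_locale en_US.UTF-8
parse_locale es_ES@euro
parse_locale fr
```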

A few special locales do not follow these naming conventions. The POSIX locale, also known as the C locale, sets up a traditional Unix environment, with the ASCII encoding, POSIX character classes, and American English time, date, number, and monetary formats. If things seem off when you use programs such as sort and grep, the problem may be that you are using a locale that defines a different sort order or character classes from what you are used to. Setting your locale to POSIX may set things right.

To find out what your current locale is, use the locale command with no arguments. It will print the values of all of the relevant environment variables except for LANGUAGE. locale charmap prints the name of the current encoding. To find out what locales are available, type locale -a. To find out what encodings are available, type locale -m.
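
In a script, the output of locale -a can be used to check for a locale before relying on it. The helper name have_locale below is hypothetical, and the sketch simply looks for an exact line in the list. One thing to watch for is that glibc systems usually list UTF-8 locales in a normalized spelling such as en_US.utf8, so you may need to test for that form as well.

```shell
# have_locale NAME
# Hypothetical helper: succeeds if NAME appears verbatim in `locale -a`.
have_locale() {
    locale -a 2>/dev/null | grep -qxF -- "$1"
}

if have_locale C; then
    echo "C locale is available"
fi

# Check both common spellings of a UTF-8 locale name.
if have_locale en_US.UTF-8 || have_locale en_US.utf8; then
    echo "American English in UTF-8 is available"
fi
```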

Installing locales

You may find that the locale you need is not installed on your machine. If so, you may be able to install it yourself. Locale definitions are stored in plain ASCII files that follow a special format described in the locale(5) man page. How to write your own locale definitions would be fodder for a whole other article, but you can find quite a few locale definitions in localedata/locales in the glibc distribution directory, in files whose names are the same as the locale name, e.g. zu_ZA for Zulu in South Africa.

To install a locale file, use localedef. This command installs Zulu:

localedef -i zu_ZA -f ../charmaps/UTF-8 zu_ZA

The last argument here is the locale name. The argument to the -i flag is the locale definition file. The argument to the -f flag is the appropriate character set definition file, most frequently UTF-8.

Looking forward

Many languages do not have a two-letter code, so three-letter codes from ISO 639-2 are increasingly seeing use. Similarly, the three-letter country codes defined in ISO 3166-1 (its alpha-3 set) are on the horizon.

Another development is the Common Locale Data Repository project sponsored by the Unicode Consortium. Recognizing that it is inefficient for different operating systems and developers to compile localization information and store it in different formats, the CLDR project is developing a new, XML-based, cross-platform system for storing locale information which is already in use by Java and some other programs.
