November 1, 2004

Introduction to Unicode

Author: Michał Kosmulski

Unicode, or the Universal Character Set (UCS), was developed to end once and for all the problems associated with the abundance of character sets used for writing text in different languages. It is a single character set whose goal is to be a superset of all others used before, and to contain every character used in writing any language (including many dead languages) as well as other symbols used in mathematics and engineering. Any charset can be losslessly converted to Unicode, as we'll see.

ASCII, a character set based on 7-bit integers, used to be and still is popular. While its provision for 128 characters was sufficient at the time of its birth in the 1960s, the growing popularity of personal computing all over the world made ASCII inadequate for people speaking and writing many different languages with different alphabets.

Newer 8-bit character sets, such as the ISO-8859 family, could represent 256 characters (actually fewer, as not all could be used for printable characters). This solution was good enough for many practical uses, but while each character set contained the characters necessary for writing several languages, there was no way to combine in a single document text from two languages whose characters were found in two distinct character sets. In the case of plain text files, another problem was how to make software automatically recognize the encoding; in most cases human intervention was required to tell which character set was used for each file. A totally new class of problems was associated with using Asian languages in computing; non-Latin alphabets posed new challenges due to some languages' need for more than 256 characters, right-to-left text, and other features not taken into account by existing standards.

Unicode aims to resolve all of those issues.

Two organizations maintain the Unicode standard -- the Unicode Consortium and the International Organization for Standardization (ISO). The names Unicode and ISO/IEC 10646 are equivalent when referring to the character set (however, Unicode Consortium's definition of Unicode provides more than just the character set standard -- it also includes a standard for writing bidirectional text and other related issues).

Unicode encodings

Unicode defines a (rather large) number of characters and assigns each of them a unique number, the Unicode code (formally called a code point), by which it can be referenced. How these codes are stored on disk or in a computer's memory is a matter of encoding. The most common Unicode encodings are called UTF-n, where UTF stands for Unicode Transformation Format and n is the number of bits in the basic unit used by the encoding.

Note that Unicode changes one assumption which had been correct for many years before, namely that one byte always represents one character. As you'll see, a single Unicode character is often represented by more than one byte of data, since the number of Unicode characters exceeds 256, the number of different values which can be encoded in a single byte. Thus, a distinction must be made between the number of characters and the number of bytes in a piece of text.
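The difference between counting characters and counting bytes can be seen in a few lines of Python (our choice of language for illustration; the article does not prescribe one). This is a minimal sketch with an arbitrary sample string. Note that Python's len() on a string counts code points, which is only an approximation of what a reader perceives as characters.

```python
# Sample text: six characters, one of which ("ł") is outside ASCII.
text = "Michał"

utf8_data = text.encode("utf-8")       # bytes as they would be stored in UTF-8
utf16_data = text.encode("utf-16-le")  # bytes as stored in UTF-16 (little endian, no BOM)

char_count = len(text)              # number of code points
utf8_byte_count = len(utf8_data)    # number of bytes in UTF-8
utf16_byte_count = len(utf16_data)  # number of bytes in UTF-16

print("characters:  ", char_count)
print("UTF-8 bytes: ", utf8_byte_count)
print("UTF-16 bytes:", utf16_byte_count)
```

Any code that allocates buffers, truncates strings, or aligns columns has to decide which of these counts it actually needs.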

Two very common encodings are UTF-16 and UTF-8. In UTF-16, which is used by modern Microsoft Windows systems, each character is represented as one or two 16-bit (two-byte) words. Unix-like operating systems, including Linux, use another encoding scheme, called UTF-8, where each Unicode character is represented as one or more bytes (up to four; an older version of the standard allowed up to six).
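To make "one or two 16-bit words" and "one to four bytes" concrete, here is a hedged Python sketch that encodes a single code point by hand. It follows the bit layouts published for UTF-8 and UTF-16; the function names utf8_encode and utf16_units are ours, chosen for illustration, and real programs should use their language's built-in codecs instead.

```python
from typing import List


def utf8_encode(cp: int) -> bytes:
    """Encode one code point as 1 to 4 UTF-8 bytes (illustrative helper)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:        # 7 bits: one byte, identical to ASCII
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: two bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # up to 16 bits: three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),          # up to 21 bits: four bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units(cp: int) -> List[int]:
    """Return the one or two 16-bit units UTF-16 uses for a code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x10000:
        return [cp]                       # fits in a single 16-bit word
    v = cp - 0x10000                      # 20 bits split across two words
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


for ch in "Ał€𝄞":
    cp = ord(ch)
    print(f"U+{cp:04X}: {len(utf8_encode(cp))} UTF-8 byte(s), "
          f"{len(utf16_units(cp))} UTF-16 unit(s)")
```

The four sample characters were picked so that each one falls into a different UTF-8 length class; only the last one needs two 16-bit words in UTF-16.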

UTF-8 has several interesting properties which make it well suited to Unix-like systems. First, ASCII characters are encoded in exactly the same way in ASCII and in UTF-8. This means that any ASCII text file is also a correct UTF-8 encoded Unicode text file, representing the same text. In addition, when encoding characters that take up more than one byte in UTF-8, byte values belonging to the ASCII character set are never used. This ensures, among other things, that if a piece of software interprets such a file as plain ASCII, non-ASCII characters are ignored or in the worst case treated as random junk, but they can't be read in as ASCII characters (which could accidentally form some correct but possibly malicious configuration option in a config file or lead to other unpredictable results). Given the importance of text files in Unix, these properties are significant. Thanks to the way UTF-8 was designed, old configuration files, shell scripts, and even lots of age-old software can function properly with Unicode text, even though Unicode was invented years after they came to be.
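Both properties can be checked directly. The Python sketch below uses made-up sample lines (they are not taken from any real configuration file) and models an "ASCII-only reader" simply as a decoder that replaces bytes it does not understand, which is an assumption about how such software might behave.

```python
# Property 1: pure ASCII text has the same bytes in ASCII and in UTF-8.
config_line = "PATH=/usr/bin:/bin"
same_bytes = config_line.encode("ascii") == config_line.encode("utf-8")

# Property 2: a non-ASCII character contributes only bytes >= 0x80.
mixed = "name=Michał"
raw = mixed.encode("utf-8")
non_ascii_bytes = [b for b in raw if b >= 0x80]   # the bytes that encode "ł"
ascii_bytes = bytes(b for b in raw if b < 0x80)   # everything else

# What a reader that only understands ASCII might make of the same bytes:
# the ASCII part survives, the rest shows up as replacement junk.
seen_by_ascii_reader = raw.decode("ascii", errors="replace")

print(same_bytes)
print([hex(b) for b in non_ascii_bytes])
print(ascii_bytes)
print(seen_by_ascii_reader)
```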

How Linux handles Unicode

When we say that a Linux system "can handle Unicode," we usually mean that it meets several conditions:

  • Unicode characters can be used in filenames.
  • Basic system software is capable of dealing with Unicode file names, Unicode strings as command-line parameters, etc.
  • End-user software such as text editors can display and edit Unicode files.

Thanks to the properties of UTF-8 encoding, the Linux kernel, the innermost and lowest-level part of the operating system, can handle Unicode filenames without even having the user tell it that UTF-8 is to be used. All character strings, including filenames, are treated by the kernel in such a way that they appear to it only as strings of bytes. Thus, it doesn't care and does not need to know whether a pair of consecutive bytes should logically be treated as two characters or a single one. The only risk of the kernel being fooled would be, for example, for a filename to contain a multibyte Unicode character encoded in such a way that one of the bytes used to represent it was a slash or some other character that has a special meaning in file names. Fortunately, as we noted, UTF-8 never uses ASCII characters for encoding multibyte characters, so neither the slash nor any other special character can appear as part of one and therefore there is no risk associated with using Unicode in filenames.
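The slash argument can be illustrated by comparing UTF-8 with an encoding that lacks this property. The Python sketch below is only a model of a byte-oriented path parser, not kernel code; the helper names are ours. It uses the letter U+012F, whose UTF-16 form happens to contain the byte 0x2F, the same value as an ASCII slash.

```python
SLASH = ord("/")  # byte value 0x2F, the path separator


def contains_slash_byte(text: str, encoding: str) -> bool:
    """True if the encoded form of text contains a byte equal to '/'."""
    return SLASH in text.encode(encoding)


letter = "\u012f"  # LATIN SMALL LETTER I WITH OGONEK

utf8_has_slash = contains_slash_byte(letter, "utf-8")       # encoded as C4 AF
utf16_has_slash = contains_slash_byte(letter, "utf-16-le")  # encoded as 2F 01


def utf8_multibyte_never_uses_ascii() -> bool:
    """Check every non-ASCII code point: no UTF-8 byte falls below 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable characters
        if min(chr(cp).encode("utf-8")) < 0x80:
            return False
    return True


print("UTF-8 form contains a slash byte: ", utf8_has_slash)
print("UTF-16 form contains a slash byte:", utf16_has_slash)
print("UTF-8 safe for all code points:   ", utf8_multibyte_never_uses_ascii())
```

A filename stored as raw UTF-16 could therefore be split in the wrong place by code that looks for 0x2F, while the exhaustive check shows that this cannot happen with UTF-8.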

Filesystem types not originally intended for use with Unices, such as those used by Windows, are slightly different, as we'll see later on.

User space programs use so-called locale information to correctly convert bytes to characters, and for other tasks such as determining the language for application messages and date and time formats. The locale is defined by the values of special environment variables. Correctly written applications should be capable of using UTF-8 strings in place of ASCII strings right away, if the locale indicates so.
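As a rough illustration of how a program might work out the character encoding from its environment, the Python sketch below applies the usual POSIX convention: LC_ALL overrides LC_CTYPE, which overrides LANG, and a locale name has the form language[_territory][.codeset][@modifier]. The variable names come from that convention rather than from this article, the function names are hypothetical, and real programs should rely on their C library or language runtime instead of parsing these values themselves.

```python
from typing import Mapping, Optional


def effective_ctype_locale(env: Mapping[str, str]) -> str:
    """Pick the locale name that governs character handling."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):  # highest priority first
        value = env.get(name)
        if value:                                # unset or empty: keep looking
            return value
    return "C"                                   # default when nothing is set


def codeset_of(locale_name: str) -> Optional[str]:
    """Extract the codeset part of a locale name, if it has one."""
    base = locale_name.split("@", 1)[0]          # drop any "@modifier"
    if "." in base:
        return base.split(".", 1)[1]
    return None                                  # codeset is implied by the locale


# A made-up environment for demonstration purposes.
example_env = {"LANG": "en_US.ISO-8859-1", "LC_CTYPE": "pl_PL.UTF-8"}
chosen = effective_ctype_locale(example_env)
print(chosen, "->", codeset_of(chosen))
```

To inspect the real environment, pass os.environ instead of the example dictionary.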

Most end-user applications can handle Unicode characters, including applications written for the GNOME and KDE desktop environments, OpenOffice.org, the Mozilla family of products and others. However, Unicode is more than just a character set -- it introduces rules for character composition, bidirectional writing, and other advanced features that are not always supported by common software.

Some command-line utilities have problems with multibyte characters. For example, tr always assumes that one character is represented as one byte, regardless of the locale. Also, common shells such as Bash (and other utilities using the readline library, it seems) tend to get confused if multibyte characters are inserted at the command line and then removed using the Backspace or Delete key.
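The kind of damage a one-byte-per-character assumption can cause is easy to reproduce. The Python sketch below is a conceptual model of deleting a character from UTF-8 text byte by byte versus character by character; it does not reproduce the code of tr or of any particular implementation, and the function names are ours.

```python
def delete_bytewise(data: bytes, unwanted: bytes) -> bytes:
    """Delete every byte that occurs in unwanted, ignoring character boundaries."""
    return data.translate(None, unwanted)


def delete_charwise(text: str, unwanted: str) -> str:
    """Delete whole characters that occur in unwanted."""
    return "".join(ch for ch in text if ch not in unwanted)


def is_valid_utf8(data: bytes) -> bool:
    """Report whether data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "łoś"  # "ł" is C5 82 and "ś" is C5 9B in UTF-8: they share the byte C5

bytewise = delete_bytewise(word.encode("utf-8"), "ł".encode("utf-8"))
charwise = delete_charwise(word, "ł")

print(bytewise, "valid UTF-8:", is_valid_utf8(bytewise))
print(charwise)
```

Because the two letters share their first byte, removing "ł" byte by byte also strips half of "ś" and leaves a sequence that is no longer valid UTF-8, while the character-aware version leaves the rest of the word intact.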

If using Unicode sounds appealing, come back tomorrow to learn how to deploy Unicode in Linux.

A continually updated version of this article can be found at the author's Web site.

Michał Kosmulski is a student at Warsaw University and Warsaw University of Technology.