Author: Bruce Byfield
Unicode is an effort to map the characters of all human languages for use with computers. Version 5.0 of Unicode, released in the fall of 2006, contains nearly 100,000 characters and has the capacity for about a million. Support for Unicode in software is well underway, usually via one of the Unicode Transformation Formats: UTF-8, UTF-16, or UTF-32.
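To make the difference between the three transformation formats concrete, the following minimal sketch counts how many bytes a few characters occupy in each one. It is written in Python purely for illustration; the sample characters and the helper name `encoded_sizes` are assumptions made for this example and do not come from Unicode or PHP.

```python
# Sample characters, written as escapes so the source stays ASCII-only:
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1D11E (musical G clef).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16, and UTF-32."""
    # The "-le" codec variants are used so that no byte-order mark is counted.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 uses between one and four bytes per character, UTF-16 uses two or four, and UTF-32 always uses four, which is the trade-off a developer weighs when choosing among them.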
An important part of the implementation of Unicode in software is support for the Common Locale Data Repository (CLDR). As Zmievski explains, this concept of locales goes far beyond the traditional concept of locales in POSIX-like systems such as GNU/Linux. It includes not just character sets, but also linguistic and cultural preferences for such things as date formats, currencies, and — of particular interest to programmers — how data is collated and sorted. In German, for example, characters with umlauts appear immediately after regular characters, while in Swedish they are added at the end of the alphabet.
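The German and Swedish difference can be sketched with two hand-written sort keys. This is a deliberately simplified illustration in Python: the rules below are assumptions that approximate the behavior described above, not actual CLDR data, and the word list and function names are hypothetical. A real application would rely on a collation library built on the CLDR tables.

```python
import unicodedata

# Hypothetical word list: zebra, "apple" with a-umlaut, oval, "ol" with o-umlaut, apple.
WORDS = ["zebra", "\u00e4pple", "oval", "\u00f6l", "apple"]

def german_key(word):
    """Simplified German dictionary order: an umlaut sorts with its base letter."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # The original spelling breaks ties, so the plain letter comes first.
    return (base, word)

# Simplified Swedish alphabet: a-ring, a-umlaut, and o-umlaut follow z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet.

    Only handles lowercase words made of letters in SWEDISH_ALPHABET.
    """
    return [SWEDISH_ALPHABET.index(c) for c in word]

german = sorted(WORDS, key=german_key)
swedish = sorted(WORDS, key=swedish_key)
print(german)
print(swedish)
```

Under these simplified rules, the German ordering keeps each umlauted word next to its unaccented neighbor, while the Swedish ordering moves both umlauted words after "zebra".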
Such information is not static, Zmievski emphasizes, but can evolve over time. For example, the euro currency was recently added to many European locales. Similarly, modern Spanish has different rules for collation than traditional Spanish. The CLDR currently lists 360 locales for 121 different languages.
For PHP, Zmievski says, the problem is that the “core language knows little to nothing about encoding and processing multilingual data. In current versions of PHP, extensions such as mbstring rely entirely on POSIX locales.” Unicode support is possible in current versions, Zmievski says, but “there’s a lot of hoops to jump through.”
As a result, those programmers fortunate enough to be comfortable in English, which tends to be the dominant language of computing, often see little reason to care about Unicode. “One of the things I notice,” Zmievski says, “is that when I give my talk in countries outside the US, I get a full room. When I talk in the US? I get maybe a dozen.”
However, with the increasing internationalization of the Internet — as evidenced by the increased demand for international domain names in non-Latin character sets — Zmievski insists that PHP, like every other language, “has got to be able to confront the evolving world. There is a demand, even if people don’t realize it for themselves yet.”
Changes in PHP
According to Zmievski, Unicode support in PHP 6.0 will include a broad selection of the International Components for Unicode (ICU). These components will provide for such actions as converting between one locale or character set and another, collation, transliteration, Unicode text processing, and Unicode regular expressions. Such functionality will be available when a unicode.semantics switch is enabled.
To accommodate this change, PHP 6.0 will switch from having a single, generic string type to having two: a Unicode string type for text data, implemented through UTF-16, and a binary type, which will include actual binary data and text data for legacy locales. Perhaps the most obvious difference in the string types is that each character in a binary string will be one byte long, while in a Unicode string, a character may use more than a single byte, depending on the language and how it is encoded. In addition, within Unicode strings, characters may be referenced by either name or code point.
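The distinction between the two string types, and the idea of referring to a character by name or by code point, can be illustrated with Python's separate text and byte strings. This is an analogy only, assuming nothing about PHP 6's eventual syntax; the variable names are chosen for this example.

```python
import unicodedata

# A text (Unicode) string, with its last character given by code point.
text = "caf\u00e9"
# The same string, with the last character given by its Unicode name.
same = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"
# A binary string: the same text serialized to bytes as UTF-8.
raw = text.encode("utf-8")

print(text == same)                 # both notations name the same character
print(len(text))                    # length counted in characters
print(len(raw))                     # length counted in bytes
print(unicodedata.name(text[-1]))   # look up the character's official name
```

The text string reports four characters while its UTF-8 byte form reports five bytes, because the accented letter needs two bytes in that encoding.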
When a PHP program runs, the runtime encoding setting will specify which encoding to use. The encoding for a script will be specified either as an INI setting or with a declare() statement in the first line, in much the same way as in an XML file. The encoding may be changed later in the script with a pragma. The encoding for standard output and for file and directory names may also be specified, as may the way conversions between the two string types are handled. Since legacy character sets cannot support all Unicode characters, programmers will also be able to set how conversion errors are handled and the format in which PHP reports them.
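The kind of choice involved in handling conversion errors can be sketched with Python's codec error handlers. This shows the general idea only and is not a preview of PHP 6's settings; the helper `to_legacy` and the sample text are assumptions made for this example.

```python
# The euro sign (U+20AC) does not exist in the legacy ISO-8859-1 character set.
text = "Price: 5 \u20ac"

def to_legacy(s, encoding="iso-8859-1", errors="strict"):
    """Convert a Unicode string to a legacy encoding with a chosen error policy."""
    return s.encode(encoding, errors=errors)

# Policy 1: stop and report the problem.
try:
    to_legacy(text)
except UnicodeEncodeError as exc:
    print("strict mode failed:", exc.reason)

# Policy 2: substitute a placeholder for each unsupported character.
print(to_legacy(text, errors="replace"))            # b'Price: 5 ?'
# Policy 3: silently drop unsupported characters.
print(to_legacy(text, errors="ignore"))             # b'Price: 5 '
# Policy 4: keep the information as an escape sequence.
print(to_legacy(text, errors="xmlcharrefreplace"))  # b'Price: 5 &#8364;'
```

Each policy trades safety against convenience: stopping loses nothing but interrupts the program, while substitution and dropping keep the program running at the cost of data.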
With Unicode support, not only will identifiers within the code be able to use Unicode characters, but a whole range of new functionality will become available. Programmers will be able to specify how information is collated by choosing a locale, and by specifying criteria, such as how accented or upper case characters are treated. Even more usefully, text can be converted from one locale to another, so that, for example, English speakers can read Greek names in Latin characters, or a Japanese reader can convert full-width characters to half-width ones on the fly.
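Two of these ideas, comparison criteria for accents and case, and width conversion, can be approximated with Unicode normalization. The sketch below uses Python's standard library rather than ICU, and the function names `fold` and `to_halfwidth` are hypothetical. Note that NFKC normalization does more than change width, so it only approximates a dedicated width transform.

```python
import unicodedata

def fold(text, ignore_accents=True, ignore_case=True):
    """Reduce text to a comparison form according to the chosen criteria."""
    if ignore_accents:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    if ignore_case:
        text = text.casefold()
    return text

def to_halfwidth(text):
    """Map full-width Latin letters and digits to their ordinary forms via NFKC."""
    return unicodedata.normalize("NFKC", text)

accented = "R\u00e9sum\u00e9"
print(fold(accented) == fold("resume"))
print(fold(accented, ignore_accents=False) == fold("resume", ignore_accents=False))
# Full-width "PHP6", written with escapes U+FF30, U+FF28, U+FF30, U+FF16.
print(to_halfwidth("\uff30\uff28\uff30\uff16"))
```

With accents and case both ignored, the two spellings compare as equal; with accents respected, they do not.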
Adding Unicode support, Zmievski warns, will cause a certain amount of obsolescence in PHP. The setlocale() string function, for example, will be deprecated. Zmievski also anticipates that “a couple of .ini options, and a couple of functions” will join it, but insists that “everything will work transparently” in the end.
The current state of Unicode development
According to Zmievski, most of the basic Unicode functionality is complete. The PHP Unicode team is currently analyzing functions to check which ones will require upgrading. As of February 12, he estimates that 61%, or 1,844 of 3,047 extension functions, were ready for Unicode support. He is hoping for an alpha release by the end of the first quarter of 2007, and the final version of PHP 6 by the end of the year.
In addition to upgrading functions, the Unicode team also faces other problems. “We need to start working on documentation,” Zmievski says, “documenting not just the behavior of functions but the features that have changed, and then an introduction to generic Unicode — what it does, what it needs, and how to work with it.”
However, he adds, “The largest problem is figuring how we build this thing so that you can run your PHP 5 scripts on PHP 6 without a few of them blowing up.”
Perhaps equally important, Zmievski sees a need for a good deal of educational work. Many in the PHP community, he suggests, are only vaguely aware of the growing necessity of Unicode, and are holding back from using the developer builds of PHP 6. He compares this attitude to the reaction of most people to the danger of earthquakes. “I live in San Francisco,” he says. “Everyone knows that you should have an earthquake-preparedness kit. But how many people do that, and how many actually keep them up to date?” In much the same way, while the PHP community knows that the changes are coming, Zmievski worries that it may not be preparing for them.
Zmievski talks regularly at conferences about this work in progress, explaining the need for it and encouraging other programmers to start experimenting with the work his team has already done. “That’s why I give my talk,” he says. “So that people will know that, yeah, they can basically start using it.”
Bruce Byfield is a computer journalist who writes regularly for NewsForge, Linux.com, and IT Manager’s Journal.