Posted by: Anonymous Coward
on November 14, 2004 08:22 AM
UTF-8 is far from an ugly hack. It is a way to represent Unicode characters as a sequence of bytes. It was beautifully designed:
o It is ASCII transparent: any sequence of ASCII bytes is a correct sequence of UTF-8 bytes that corresponds to the equivalent Unicode characters. Conversely, any non-ASCII Unicode character is guaranteed to be encoded using only non-ASCII bytes in the UTF-8 stream. All of this guarantees that systems that only care about bytes and some special ASCII characters like slash "/" (typically the case of the Unix VFS) can immediately "speak" Unicode.
o Sorting: sorting UTF-8 strings byte by byte gives the same order as sorting the Unicode strings by code point. Again, not a single line of code to add.
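If you want to check the two points above yourself, here is a quick sketch in Python. The sample strings are made up for illustration, and it relies on Python comparing `str` values by code point, so treat it as a sanity check rather than a proof.

```python
# Point 1: ASCII text encodes to exactly the same bytes in UTF-8.
ascii_text = "/usr/local/bin"
ascii_is_unchanged = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Every byte of an encoded non-ASCII character has the high bit set,
# so none of them can be mistaken for "/" or any other ASCII byte.
non_ascii = "\u00e9\u20ac\U0001D11E"  # e-acute, euro sign, musical G clef
high_bit_only = all(b >= 0x80 for b in non_ascii.encode("utf-8"))

# Point 2: sorting the encoded bytes gives the same order as sorting
# the strings by code point (which is how Python compares str values).
words = ["zebra", "\u00e9coute", "apple", "\u20acuro", "\U0001D11Eclef", "Zulu"]
by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
same_order = by_code_point == by_utf8_bytes

print(ascii_is_unchanged, high_bit_only, same_order)
```

Note that this is plain code point order, not a locale-aware collation; the claim is only that the byte-level sort and the code-point sort agree.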
And for your information, UTF-16 as a fixed-width 16-bit encoding is obsolete, since a single 16-bit unit cannot represent all of the Unicode v3.0 character space. That is exactly why glibc 2.1 (and up) uses UTF-32. When the Windows people realized that in Windows 2000, they did the same "ugly hack": they started to use what are called "surrogates", i.e. they encode one Unicode character as a sequence of two 16-bit UTF-16 code units...
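For anyone curious what the surrogate trick looks like in practice, here is a rough sketch in Python. The helper name `to_surrogate_pair` is mine, not from any library, and the sample character (U+1D11E, the musical G clef) is just a convenient one above U+FFFF.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into two 16-bit UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


clef = "\U0001D11E"
pair = to_surrogate_pair(ord(clef))

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = clef.encode("utf-16-be")
units = (int.from_bytes(encoded[0:2], "big"), int.from_bytes(encoded[2:4], "big"))

# UTF-32 needs no such trick: one character is always one 4-byte unit.
utf32_length = len(clef.encode("utf-32-be"))

print([hex(u) for u in pair], pair == units, utf32_length)
```

So one character becomes two 16-bit units in UTF-16, while UTF-32 stores it in a single fixed-size unit.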
Re:UTF-8 suxx