Unicode is More than Latin-1++

This document tries to explain a few common misconceptions about Unicode. It does not try to fully explain Unicode, or even give a good introduction to Unicode. It should be taken as an encouragement to read more about Unicode, though.

Many misconceptions about Unicode stem from a basic misunderstanding of its scope. It’s not just a larger charset like Latin-1. Since Unicode includes a lot of scripts from around the world, it has to solve a lot of problems that did not exist for the western scripts available in Latin-1. These problems have always been there, but most programmers didn’t know about them because they were not relevant for the common charsets used in western countries. Programmers who had to extend their applications to work with foreign scripts (i18n) often had to find out about them the painful way. This led to a lot of kludges and half-solutions. Unicode already provides solutions for these problems, so we don’t have to work around them anymore. But this also means we have to face them right from the beginning.

Size

In earlier times, all text fit in 8 bits. Characters were indexed by integers, and all was well. Well, not really. In a multi-language application, every string had a charset descriptor attached to it, describing which charset the string had to be interpreted in. Some charsets also required wide characters (mostly 16 bits), so the index functions needed to be rewritten or parameterized, and other provisions for such differences needed to be made.

Unicode luckily unifies the charset. There’s exactly one charset for all strings. The bad news is that this charset doesn’t fit in 8 or even 16 bits anymore. Unicode is defined on 21-bit code points (the name for the 21-bit scalar value), and it is unlikely that this will ever change. That’s a lot. An easy solution would be to use 32-bit wide characters (UTF-32), but since most strings continue to be from western scripts, this would waste a lot of space, and as we will see later, this still doesn’t give us the random access we want!

Another solution is to encode each code point in a variable-width sequence. A very clever solution for this is UTF-8, which even remains byte-compatible with ASCII text. Any ASCII text is valid UTF-8, and any UTF-8 string which uses only characters in ASCII will be viewable in an ASCII environment. Now this is nice. Other code points will be encoded in longer sequences. Every Latin-1 character fits into at most two bytes, and even the biggest code points will fit into four bytes, thus UTF-8 will never occupy more space than UTF-32 would.
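To make these sizes concrete, here is a small Python sketch that encodes a few sample code points both ways. The sample characters and the helper name encoded_sizes are chosen only for illustration.

```python
# Compare how many bytes a few code points need in UTF-8 and UTF-32.
samples = {
    "A": "LATIN CAPITAL LETTER A (ASCII)",
    "\u00e4": "LATIN SMALL LETTER A WITH DIAERESIS (Latin-1)",
    "\u20ac": "EURO SIGN",
    "\U0001F600": "GRINNING FACE (beyond the 16-bit range)",
}


def encoded_sizes(ch):
    """Return (UTF-8 size, UTF-32 size) in bytes for a single code point."""
    # "utf-32-be" is used so that no byte order mark is prepended.
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-be"))


for ch, name in samples.items():
    utf8_size, utf32_size = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {name}: UTF-8 {utf8_size}, UTF-32 {utf32_size}")

# ASCII text is byte-for-byte identical when encoded as UTF-8.
assert "plain text".encode("utf-8") == "plain text".encode("ascii")
```

The UTF-8 size grows from one to four bytes across the samples, while UTF-32 always spends four.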

The main argument against UTF-8 is that it’s not fixed-width anymore, so we lose the random access to strings. This is actually not as bad as it may sound. UTF-8 encodes the length of each encoded code point in the first byte, so we get fast scanning. Also, most string operations don’t need random access, but rather cursors which they use to traverse the string from the beginning to the end. This is still possible at the same speed with UTF-8.
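The scanning described above can be sketched in Python as follows. The function names are made up for this example, and the sketch assumes the input is already well-formed UTF-8; it does not validate continuation bytes or reject overlong forms.

```python
def sequence_length(first_byte):
    """Length in bytes of the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII, one byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte and can never start a sequence.
    raise ValueError("not a UTF-8 lead byte")


def code_point_offsets(data):
    """Yield the byte offset at which each code point in data begins."""
    pos = 0
    while pos < len(data):
        yield pos
        # Only the first byte is inspected to skip to the next code point.
        pos += sequence_length(data[pos])


data = "a\u00e4\u20ac\U0001F600".encode("utf-8")
print(list(code_point_offsets(data)))
```

The cursor only ever moves forward and looks at one byte per code point, which is why a front-to-back traversal costs no more than it would with a fixed-width encoding.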

The only common string algorithms which require fast jumping within strings are the various optimized searching algorithms. But luckily, they work on byte strings, and it’s rather irrelevant whether those bytes encode one character per byte or not. For example, when searching for the byte string CE BB (GREEK SMALL LETTER LAMDA in UTF-8), I will find the sequence CE BB in the haystack string just as well. And because the first byte of a UTF-8 sequence can never be mistaken for a continuation byte, a match cannot begin in the middle of another character. Thus, we don’t lose any speed here.
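As an illustration, the following Python sketch searches for the bytes CE BB in an encoded haystack. The haystack text is invented for the example, and Python's built-in bytes.find stands in for whatever optimized search algorithm an application would really use.

```python
# Search for a UTF-8 encoded needle with a plain byte-string search.
haystack = "\u03b1\u03b2\u03b3 \u03bb-calculus".encode("utf-8")
needle = "\u03bb".encode("utf-8")  # GREEK SMALL LETTER LAMDA

# The needle is exactly the byte pair mentioned in the text.
assert needle == b"\xce\xbb"

byte_offset = haystack.find(needle)

# If a character index is needed, it can be recovered by decoding the
# prefix in front of the match.
char_index = len(haystack[:byte_offset].decode("utf-8"))

print(byte_offset, char_index)
```

The byte offset and the character index differ because the three Greek letters in front of the match occupy two bytes each.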

Combining Code Points

Earlier, I promised to explain why UTF-32 still does not allow real random access to strings. To understand this, it is important to understand that Unicode’s code points are not really equivalent to what the user thinks of as characters, because Unicode includes so-called “combining characters.” For example, the character “ä” can be encoded in Unicode as LATIN SMALL LETTER A WITH DIAERESIS, but also as the two code points LATIN SMALL LETTER A and COMBINING DIAERESIS. The latter is in every respect just as much a correct encoding of “ä” as the former. Unicode has a name for such a sequence of a base code point followed by combining code points: a “grapheme cluster.” In a real string implementation, we would want to treat any grapheme cluster as a single character in every respect. Since a grapheme cluster can span several code points, even UTF-32 would not give us fast random access to such characters.
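The two encodings of “ä” can be inspected with Python's standard unicodedata module. The helper approx_cluster_count is a hypothetical name for this example and is only a rough approximation: it counts code points whose canonical combining class is zero, which is not the full grapheme cluster segmentation that Unicode specifies.

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS


def names(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


def approx_cluster_count(text):
    """Rough approximation: count code points that are not combining marks.

    This ignores many cases handled by real grapheme cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(names(precomposed))
print(names(decomposed))

# Both render as the same character, but the code point counts differ
# and a naive comparison treats the strings as different.
print(len(precomposed), len(decomposed))
print(precomposed == decomposed)
print(approx_cluster_count(precomposed), approx_cluster_count(decomposed))
```

Indexing by code point, as a UTF-32 array would, splits the decomposed form in the middle of what the user sees as one character.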

The two valid encodings for “ä” raise a question, though. How can we compare two strings for equality if the same character can be encoded in several ways? Luckily, Unicode has a solution for this, called normalization. Two strings that represent the same text will have identical code point sequences once both have been normalized using the same method. So if the application normalizes all strings before using them, we retain fast byte comparisons of strings.

(It should be noted that not all possible combinations of code points and combining code points have a single code point equivalent.)
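Normalization is available in Python through unicodedata.normalize. The sketch below uses the NFC and NFD forms; the helper same_text is a hypothetical name, and “q” with a diaeresis serves as an example of a combination without a single code point equivalent.

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

# NFC composes where a single code point exists, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)


def same_text(a, b):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)           # raw comparison
print(same_text(precomposed, decomposed))  # comparison after normalization

# There is no single code point for "q" with a diaeresis, so even the
# composed form keeps two code points.
q_diaeresis = unicodedata.normalize("NFC", "q\u0308")
print(len(q_diaeresis))
```

Which normalization form an application picks matters less than applying the same one to every string it compares.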

Casing Problems

Changing the case of a string used to be quite simple. Every character that could change case had a mapping to an upper case or lower case equivalent. This simple idea already broke when Latin-1 was introduced, as Latin-1 has no upper case form of the German sharp s “ß”. Instead, “ß” is capitalized in German as “SS”, which not only changes the size of the string, but is also impossible to reliably downcase to “ß” again (as “ss” is rather common in German words as well).

Any application which claims to be “internationalized” has to cope with such problems. There are numerous ones. For example, the Greek lower case sigma has two forms, one at the beginning or in the middle of words, and one at the end. So downcasing a capital sigma correctly is context-sensitive. And to complicate things even more, some Turkic languages, such as Turkish, upcase the letter “i” to “İ” (LATIN CAPITAL LETTER I WITH DOT ABOVE), a variant of “I” which retains the dot, so upcasing the letter “i” correctly is locale-dependent. What a mess.
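These cases can be observed with Python's built-in string methods, which apply Unicode's default, locale-independent case mappings. The sample words are chosen only for illustration.

```python
# German sharp s: upcasing changes the length and cannot be undone.
street = "Stra\u00dfe"                 # "Straße"
street_upper = street.upper()          # "STRASSE"
street_round_trip = street_upper.lower()
print(street, street_upper, street_round_trip)

# Greek sigma: the lower case form depends on the position in the word.
greek_word = "\u039f\u0394\u039f\u03a3"   # capital omicron, delta, omicron, sigma
greek_lower = greek_word.lower()          # ends in U+03C2, the final sigma
lone_sigma = "\u03a3".lower()             # U+03C3, the non-final sigma
print(greek_lower, lone_sigma)

# Turkish dotted i: the default mapping ignores the locale, so the
# result is "I" and not the dotted capital that Turkish text expects.
default_upper_i = "i".upper()
print(default_upper_i)
```

A Turkish-aware upcasing therefore needs locale-specific handling on top of these defaults.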

Unicode provides various degrees of help for such issues, and defines four kinds of case-related operations: the familiar up- and downcasing, as well as two new ones, title casing and case folding.

Title case is the case to use for titles, where the first character is a capital letter and the rest of the word remains lowercase. It needs to be a separate operation because some Unicode characters upcase to two characters, only one of which needs to be a capital letter for titles, while both need to be capital letters for true upcasing. For example “ﬆ”, LATIN SMALL LIGATURE ST, upcases to “ST” and titlecases to “St”.
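Python exposes the distinction through str.upper and str.title. The sketch below uses the ligature from the text and, as a second sample picked for this example, the digraph LATIN SMALL LETTER DZ WITH CARON, which has a dedicated title case code point.

```python
import unicodedata

# LATIN SMALL LIGATURE ST expands to two letters when its case changes.
ligature = "\ufb06"
ligature_upper = ligature.upper()   # both letters capitalized
ligature_title = ligature.title()   # only the first letter capitalized
print(ligature_upper, ligature_title)

# LATIN SMALL LETTER DZ WITH CARON has three distinct case forms.
digraph = "\u01c6"
digraph_upper = digraph.upper()
digraph_title = digraph.title()
print(unicodedata.name(digraph_upper))
print(unicodedata.name(digraph_title))
```

For ordinary letters the upper case and title case mappings coincide, which is why the difference is easy to overlook.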

Case folding maps characters which differ only in case to the same code point sequence, so as to allow easy case-insensitive comparisons and searches. As an example, once case-folded, the German “masse” will compare as equal to “maße.”
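In Python, case folding is available as str.casefold. The helper equal_ignoring_case below is a hypothetical name for this example.

```python
def equal_ignoring_case(a, b):
    """Compare two strings case-insensitively by case folding both."""
    return a.casefold() == b.casefold()


with_sharp_s = "ma\u00dfe"   # "maße"
with_double_s = "masse"

# Plain downcasing keeps the sharp s, so the strings still differ.
print(with_sharp_s.lower() == with_double_s.lower())

# Case folding maps the sharp s to "ss", so the strings compare equal.
print(with_sharp_s.casefold())
print(equal_ignoring_case(with_sharp_s, with_double_s))
print(equal_ignoring_case("MASSE", with_sharp_s))
```

The folded strings are meant for comparison only; they are not a form to show to the user.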

Newlines

A pretty infamous problem most programs and protocols encounter is the different line ending conventions used in different environments. Most notably, that includes UNIX-based environments, which traditionally use the ASCII Line Feed (0A) character, and the various DOS-derived systems, which use a Carriage Return, Line Feed (0D 0A) sequence. But one shouldn't forget the old Mac systems, which only used Carriage Return (0D), or even NEL (85), the Next Line control character that EBCDIC newlines are mapped to. What a mess.

What is often forgotten is that there are actually two kinds of newlines. One kind separates lines, the other paragraphs. That these are not the same, and the resulting confusion, can often be seen in emails complaining about long lines. Indeed, the sending program probably did include newlines, but only sent the paragraph newlines in the mail. Unicode does standardize this mess. There are two code points, U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR, for exactly these purposes. Unicode applications should use these.

But the old newline codes still exist, and probably will for a long time. Applications are advised to treat all the code points mentioned above as newline characters, so as to maintain backwards compatibility. Especially on UNIX, the Line Feed character remains predominant, even in Unicode-aware systems.
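Python's str.splitlines follows this advice: it treats all of the line endings mentioned above as line boundaries (along with a few other control characters). A small sketch, with sample text and the helper name to_unix_newlines made up for the example:

```python
# One string that mixes every line ending convention from the text.
mixed = (
    "unix\n"               # Line Feed
    "dos\r\n"              # Carriage Return, Line Feed
    "old mac\r"            # Carriage Return
    "nel\x85"              # NEL
    "line sep\u2028"       # U+2028 LINE SEPARATOR
    "paragraph sep\u2029"  # U+2029 PARAGRAPH SEPARATOR
    "end"
)

lines = mixed.splitlines()
print(lines)


def to_unix_newlines(text):
    """Rewrite every recognized line ending as a single Line Feed."""
    # Note: a trailing line ending at the very end of text is dropped.
    return "\n".join(text.splitlines())


print(repr(to_unix_newlines(mixed)))
```

Collapsing everything to Line Feed discards the distinction between line and paragraph separators, so this is only suitable where that distinction does not matter.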

Unicode Solves Problems

It can’t be stressed often enough that Unicode did not introduce all of these problems. While Unicode surely isn’t perfect and without problems, most of the things explained here are inherent in the domain of internationalized programs. Unicode forces us to be aware of them. They won’t go away by not using Unicode, unless ASCII only is the alternative. Also, Unicode does provide a lot of help to solve these problems. Without that help, we would have needed to hack together “solutions” of our own.