Next: , Up: MULE   [Contents][Index]


66.1 Internationalization Terminology

In internationalization terminology, a string of text is divided up into characters, which are the printable units that make up the text. A single character is (for example) a capital ‘A’, the number ‘2’, a Katakana character, a Hangul character, a Kanji ideograph (an ideograph is a “picture” character, such as is used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands of such ideographs in each language), etc. The basic property of a character is that it is the smallest unit of text with semantic significance in text processing.

Human beings normally process text visually, so to a first approximation a character may be identified with its shape. Note that the same character may be drawn by two different people (or in two different fonts) in slightly different ways, although the “basic shape” will be the same. But consider the works of Scott Kim; human beings can recognize hugely variant shapes as the “same” character. Sometimes, especially where characters are extremely complicated to write, completely different shapes may be defined as the “same” character in national standards. The Taiwanese variant of Hanzi is generally the most complicated; over the centuries, the Japanese, Koreans, and the People’s Republic of China have adopted simplifications of the shape, but the line of descent from the original shape is recorded, and the meanings and pronunciation of different forms of the same character are considered to be identical within each language. (Of course, it may take a specialist to recognize the related form; the point is that the relations are standardized, despite the differing shapes.)

In some cases, the differences will be significant enough that it is actually possible to identify two or more distinct shapes that both represent the same character. For example, the lowercase letters ‘a’ and ‘g’ each have two distinct possible shapes—the ‘a’ can optionally have a curved tail projecting off the top, and the ‘g’ can be formed either of two loops, or of one loop and a tail hanging off the bottom. Such distinct possible shapes of a character are called glyphs. The important characteristic of two glyphs making up the same character is that the choice between one or the other is purely stylistic and has no linguistic effect on a word (this is the reason why a capital ‘A’ and lowercase ‘a’ are different characters rather than different glyphs—e.g. ‘Aspen’ is a city while ‘aspen’ is a kind of tree).

Note that character and glyph are used differently here than elsewhere in SXEmacs.

A character set is essentially a set of related characters. ASCII, for example, is a set of 94 characters (or 128, if you count non-printing characters). Other character sets are ISO 8859-1 (ASCII plus various accented characters and other international symbols), JIS X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208 (Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji), GB2312 (Mainland Chinese Hanzi), etc.

The definition of a character set will implicitly or explicitly give it an ordering, a way of assigning a number to each character in the set. For many character sets, there is a natural ordering, for example the “ABC” ordering of the Roman letters. But it is not clear whether digits should come before or after the letters, and in fact different European languages treat the ordering of accented characters differently. It is useful to use the natural order where available, of course. The number assigned to any particular character is called the character’s code point. (Within a given character set, each character has a unique code point. Thus the word “set” is ill-chosen; different orderings of the same characters are different character sets. Identifying characters is simple enough for alphabetic character sets, but the difference in ordering can cause great headaches when the same thousands of characters are used by different cultures as in the Hanzi.)
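
To make the notion of a code point concrete, here is a small illustration (not from the manual): Unicode is one character set, and Python’s built-in ord and chr functions map between its characters and their code points.

```python
# Illustrative only: Unicode is one character set; Python's ord() and
# chr() map between its characters and their code points.
print(ord("A"))        # 65
print(ord("2"))        # 50
print(chr(0x30A2))     # KATAKANA LETTER A ('ア')
```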

A code point may be broken into a number of position codes. The number of position codes required to index a particular character in a character set is called the dimension of the character set. For practical purposes, a position code may be thought of as a byte-sized index. The printing characters of ASCII form a relatively small character set of dimension one: each character in the set is indexed using a single position code, in the range 1 through 94. Use of this unusual range, rather than the familiar 33 through 126, is an intentional abstraction; to understand the programming issues you must break the equation between character sets and encodings.
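
The relation between the two ranges is a simple shift of 32. As a sketch (assuming the standard ASCII encoding, which is discussed further below), the position code of a printing character can be recovered like this:

```python
def ascii_position_code(ch):
    """Position code (1..94) of an ASCII printing character.

    Assumes the standard ASCII encoding, which places the printing
    characters at code points 33 through 126 (a shift of 32 from the
    position codes 1 through 94).
    """
    code = ord(ch)
    if not 33 <= code <= 126:
        raise ValueError("not a printing ASCII character")
    return code - 32

print(ascii_position_code("!"))   # 1
print(ascii_position_code("A"))   # 33
print(ascii_position_code("~"))   # 94
```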

JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is of dimension two – every character is indexed by two position codes, each in the range 1 through 94. (This number “94” is not a coincidence; we shall see that the JIS position codes were chosen so that JIS kanji could be encoded without using codes that in ASCII are associated with device control functions.) Note that the choice of the range here is somewhat arbitrary. You could just as easily index the printing characters in ASCII using numbers in the range 0 through 93, 2 through 95, 3 through 96, etc. In fact, the standardized encoding for the ASCII character set uses the range 33 through 126.
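
A hedged sketch of dimension-two indexing: the two JIS X 0208 position codes of a character can be recovered from its EUC-JP encoding (the EUC scheme is described below), which stores each position code p (1 through 94) as the byte p + 0xA0.

```python
# Sketch: recover the two position codes (dimension two) of a JIS X 0208
# character from its EUC-JP encoding, where each position code p (1..94)
# is stored as the byte p + 0xA0.
b = "あ".encode("euc_jp")          # HIRAGANA LETTER A
first, second = b[0] - 0xA0, b[1] - 0xA0
print(first, second)               # 4 2
```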

An encoding is a way of numerically representing characters from one or more character sets as a stream of like-sized numerical values called words; typically these are 8-bit, 16-bit, or 32-bit quantities. If an encoding encompasses only one character set, then the position codes for the characters in that character set could be used directly. (This is the case with the trivial cipher used by children, assigning 1 to ‘A’, 2 to ‘B’, and so on.) However, even with ASCII, other considerations intrude. For example, why are the upper- and lowercase alphabets separated by 32? Why do the digits start with ‘0’ being assigned the code 48? In both cases because semantically interesting operations (case conversion and numerical value extraction) become convenient masking operations. Other artificial aspects (the control characters being assigned to codes 0–31 and 127) are historical accidents. (The use of 127 for ‘DEL’ is an artifact of the “punch once” nature of paper tape, for example.)
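
An illustrative sketch of those masking operations (these helpers are only valid for ASCII letters and digits, respectively):

```python
# The ASCII layout makes case conversion and numerical value extraction
# simple bit manipulations (illustrative sketch, valid only for ASCII
# letters and digits respectively).
def ascii_upcase(ch):
    return chr(ord(ch) & ~0x20)    # clear bit 5: 'a' (0x61) -> 'A' (0x41)

def digit_value(ch):
    return ord(ch) & 0x0F          # mask low nibble: '7' (0x37) -> 7

print(ascii_upcase("a"))           # A
print(digit_value("7"))            # 7
```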

Naive use of the position code is not possible, however, if more than one character set is to be used in the encoding. For example, printed Japanese text typically requires characters from multiple character sets – ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is indexed using one or more position codes in the range 1 through 94, so the position codes could not be used directly or there would be no way to tell which character was meant. Different Japanese encodings handle this differently – JIS uses special escape characters to denote different character sets; EUC sets the high bit of the position codes for JIS X 0208 and JIS X 0212, and puts a special extra byte before each JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings you will encounter in files are 7-bit or 8-bit encodings. There is one common 16-bit encoding, which is Unicode; this strives to represent all the world’s characters in a single large character set. 32-bit encodings are often used internally in programs, such as SXEmacs with MULE support, to simplify the code that manipulates them; however, they are not used externally because they are not very space-efficient.)

A general method of handling text using multiple character sets (whether for multilingual text, or simply text in an extremely complicated single language like Japanese) is defined in the international standard ISO 2022. ISO 2022 will be discussed in more detail later (see ISO 2022), but for now suffice it to say that text needs control functions (at least spacing), and if escape sequences are to be used, an escape sequence introducer. It was decided to make all text streams compatible with ASCII in the sense that the codes 0–31 (and 128–159) would always be control codes, never graphic characters, and where defined by the character set the ‘SPC’ character would be assigned code 32, and ‘DEL’ would be assigned 127. Thus there are 94 code points remaining if 7 bits are used. This is the reason that most character sets are defined using position codes in the range 1 through 94. Then ISO 2022 compatible encodings are produced by shifting the position codes 1 to 94 into character codes 33 to 126, or (if 8-bit codes are available) into character codes 161 to 254.
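
The two shifts can be sketched as follows (illustrative helper names, not from any standard):

```python
# Sketch of the two ISO 2022 shifts. Names to_gl/to_gr are illustrative,
# after ISO 2022's "graphic left" (7-bit) and "graphic right" (8-bit)
# halves of the code table.
def to_gl(p):
    """Shift a position code (1..94) into the 7-bit range 33..126."""
    assert 1 <= p <= 94
    return p + 32

def to_gr(p):
    """Shift a position code (1..94) into the 8-bit range 161..254."""
    assert 1 <= p <= 94
    return p + 160

print(to_gl(1), to_gl(94))   # 33 126
print(to_gr(1), to_gr(94))   # 161 254
```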

Encodings are classified as either modal or non-modal. In a modal encoding, there are multiple states that the encoding can be in, and the interpretation of the values in the stream depends on the current global state of the encoding. Special values in the encoding, called escape sequences, are used to change the global state. JIS, for example, is a modal encoding. The bytes ‘ESC $ B’ indicate that, from then on, bytes are to be interpreted as position codes for JIS X 0208, rather than as ASCII. This effect is cancelled using the bytes ‘ESC ( B’, which mean “switch from whatever the current state is to ASCII”. To switch to JIS X 0212, the escape sequence ‘ESC $ ( D’ is used. (Note that here, as is common, the escape sequences do in fact begin with ‘ESC’. This is not necessarily the case, however. Some encodings use control characters called “locking shifts” (effect persists until cancelled) to switch character sets.)
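
The escape sequences are easy to observe in practice; Python’s standard iso2022_jp codec implements this modal encoding:

```python
# The modal JIS encoding as implemented by Python's iso2022_jp codec.
# ESC $ B (b'\x1b$B') switches the stream to JIS X 0208, and ESC ( B
# (b'\x1b(B') switches it back to ASCII at the end.
b = "Aあ".encode("iso2022_jp")
print(b)
print(b.startswith(b"A\x1b$B"))   # True: ASCII 'A', then ESC $ B
print(b.endswith(b"\x1b(B"))      # True: ESC ( B restores ASCII
```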

A non-modal encoding has no global state that extends past the character currently being interpreted. EUC, for example, is a non-modal encoding. Characters in JIS X 0208 are encoded by setting the high bit of the position codes, and characters in JIS X 0212 are encoded by doing the same but also prefixing the character with the byte 0x8F.
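
The statelessness is visible byte by byte; a sketch using Python’s euc_jp codec:

```python
# Non-modal: in EUC-JP every byte of a JIS X 0208 character has its
# high bit set, so each byte can be classified without global state.
jis = "あい".encode("euc_jp")
print(all(byte & 0x80 for byte in jis))    # True: all high bits set
ascii_part = "AB".encode("euc_jp")
print(all(byte & 0x80 == 0
          for byte in ascii_part))         # True: ASCII stays 7-bit
```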

The advantage of a modal encoding is that it is generally more space-efficient, and is easily extendible because there are essentially an arbitrary number of escape sequences that can be created. The disadvantage, however, is that it is much more difficult to work with if it is not being processed in a sequential manner. In the non-modal EUC encoding, for example, the byte 0x41 always refers to the letter ‘A’; whereas in JIS, it could either be the letter ‘A’, or one of the two position codes in a JIS X 0208 character, or one of the two position codes in a JIS X 0212 character. Determining exactly which one is meant could be difficult and time-consuming if the previous bytes in the string have not already been processed, or impossible if they are drawn from an external stream that cannot be rewound.

Non-modal encodings are further divided into fixed-width and variable-width formats. A fixed-width encoding always uses the same number of words per character, whereas a variable-width encoding does not. EUC is a good example of a variable-width encoding: one to three bytes are used per character, depending on the character set. 16-bit and 32-bit encodings are nearly always fixed-width, and this is in fact one of the main reasons for using an encoding with a larger word size. The advantages of fixed-width encodings should be obvious. The advantages of variable-width encodings are that they are generally more space-efficient and allow for compatibility with existing 8-bit encodings such as ASCII. (For example, in Unicode ASCII characters are simply promoted to a 16-bit representation. That means that every ASCII character contains a ‘NUL’ byte; since the standard C string manipulation functions treat ‘NUL’ as a string terminator, all of them will lose badly in a fixed-width Unicode environment.)
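
A sketch of the ‘NUL’-byte problem, using UTF-16 (big-endian) as the 16-bit fixed-width representation:

```python
# In a 16-bit fixed-width encoding (UTF-16 here), each ASCII character
# occupies two bytes, one of which is zero -- fatal to C-style
# NUL-terminated string handling.
b = "A".encode("utf-16-be")
print(b)            # b'\x00A'
print(0 in b)       # True: the encoded 'A' contains a NUL byte
```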

The bytes in an 8-bit encoding are often referred to as octets rather than simply as bytes. This terminology dates back to the days before 8-bit bytes were universal, when some computers had 9-bit bytes, others had 10-bit bytes, etc.

