Next: , Previous: , Up: Coding Systems   [Contents][Index]


66.6 ISO 2022

This section briefly describes the ISO 2022 encoding standard. A more thorough treatment is available in the original document of ISO 2022 as well as various national standards (such as JIS X 0202).

Character sets (charsets) are classified into the following four categories, according to the number of characters in the charset: 94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means that although an ISO 2022 coding system may have variable width characters, each charset used is fixed-width (in contrast to the MULE character set and UTF-8, for example).

ISO 2022 provides for switching between character sets via escape sequences. This switching is somewhat complicated, because ISO 2022 provides for both legacy applications like Internet mail that accept only 7 significant bits in some contexts (RFC 822 headers, for example), and more modern "8-bit clean" applications. It also provides for compact and transparent representation of languages like Japanese which mix ASCII and a national script (even outside of computer programs).

First, ISO 2022 codified prevailing practice by dividing the code space into "control" and "graphic" regions. The code points 0x00-0x1F and 0x80-0x9F are reserved for "control characters", while "graphic characters" must be assigned to code points in the regions 0x20-0x7F and 0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some circumstances must be assigned the graphic character "ASCII SPACE" and the control character "ASCII DEL" respectively.

The various regions are given the name C0 (0x00-0x1F), GL (0x20-0x7F), C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for "graphic left" and "graphic right", respectively, because of the standard method of displaying graphic character sets in tables with the high byte indexing columns and the low byte indexing rows. I don’t find it very intuitive, but these are called "registers".

An ISO 2022-conformant encoding for a graphic character set must use a fixed number of bytes per character, and the values must fit into a single register; that is, each byte must range over either 0x20-0x7F, or 0xA0-0xFF. It is not allowed to extend the range of the repertoire of a character set by using both ranges at the same. This is why a standard character set such as ISO 8859-1 is actually considered by ISO 2022 to be an aggregation of two character sets, ASCII and LATIN-1, and why it is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a single character’s bytes must all be drawn from the same register; this is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO 2022-compatible encodings.

The reason for this restriction becomes clear when you attempt to define an efficient, robust encoding for a language like Japanese. Like ISO 8859, Japanese encodings are aggregations of several character sets. In practice, the vast majority of characters are drawn from the "JIS Roman" character set (a derivative of ASCII; it won’t hurt to think of it as ASCII) and the JIS X 0208 standard "basic Japanese" character set including not only ideographic characters ("kanji") but syllabic Japanese characters ("kana"), a wide variety of symbols, and many alphabetic characters (Roman, Greek, and Cyrillic) as well. Although JIS X 0208 includes the whole Roman alphabet, as a 2-byte code it is not suited to programming; thus the inclusion of ASCII in the standard Japanese encodings.

For normal Japanese text such as in newspapers, a broad repertoire of approximately 3000 characters is used. Evidently this won’t fit into one byte; two must be used. But much of the text processed by Japanese computers is computer source code, nearly all of which is ASCII. A not insignificant portion of ordinary text is English (as such or as borrowed Japanese vocabulary) or other languages which can represented at least approximately in ASCII, as well. It seems reasonable then to represent ASCII in one byte, and JIS X 0208 in two. And this is exactly what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is invoked to the GL register, and JIS X 0208 is invoked to the GR register. Thus, each byte can be tested for its character set by looking at the high bit; if set, it is Japanese, if clear, it is ASCII. Furthermore, since control characters like newline can never be part of a graphic character, even in the case of corruption in transmission the stream will be resynchronized at every line break, on the order of 60-80 bytes. This coding system requires no escape sequences or special control codes to represent 99.9% of all Japanese text.

Note carefully the distinction between the character sets (ASCII and JIS X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). The JIS X 0208 character set is used in three different encodings for Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is always clear), in EUC-JP it is invoked into GR (setting the high bit in the process), and in Shift JIS the high bit may be set or reset, and the significant bits are shifted within the 16-bit character so that the two main character sets can coexist with a third (the "halfwidth katakana" of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a version of the ISO-2022 coding system.

In order to systematically treat subsidiary character sets (like the "halfwidth katakana" already mentioned, and the "supplementary kanji" of JIS X 0212), four further registers are defined: G0, G1, G2, and G3. Unlike GL and GR, they are not logically distinguished by internal format. Instead, the process of "invocation" mentioned earlier is broken into two steps: first, a character set is designated to one of the registers G0-G3 by use of an escape sequence of the form:

        ESC [I] I F

where I is an intermediate character or characters in the range 0x20 - 0x3F, and F, from the range 0x30-0x7Fm is the final character identifying this charset. (Final characters in the range 0x30-0x3F are reserved for private use and will never have a publicly registered meaning.)

Then that register is invoked to either GL or GR, either automatically (designations to G0 normally involve invocation to GL as well), or by use of shifting (affecting only the following character in the data stream) or locking (effective until the next designation or locking) control sequences. An encoding conformant to ISO 2022 is typically defined by designating the initial contents of the G0-G3 registers, specifying a 7 or 8 bit environment, and specifying whether further designations will be recognized.

Some examples of character sets and the registered final characters F used to designate them:

94-charset

ASCII (B), left (J) and right (I) half of JIS X 0201, ...

96-charset

Latin-1 (A), Latin-2 (B), Latin-3 (C), ...

94x94-charset

GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...

96x96-charset

none for the moment

The meanings of the various characters in these sequences, where not specified by the ISO 2022 standard (such as the ESC character), are assigned by ECMA, the European Computer Manufacturers Association.

The meaning of intermediate characters are:

        $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
        ( [0x28]: designate to G0 a 94-charset whose final byte is F.
        ) [0x29]: designate to G1 a 94-charset whose final byte is F.
        * [0x2A]: designate to G2 a 94-charset whose final byte is F.
        + [0x2B]: designate to G3 a 94-charset whose final byte is F.
        , [0x2C]: designate to G0 a 96-charset whose final byte is F.
        - [0x2D]: designate to G1 a 96-charset whose final byte is F.
        . [0x2E]: designate to G2 a 96-charset whose final byte is F.
        / [0x2F]: designate to G3 a 96-charset whose final byte is F.

The comma may be used in files read and written only by MULE, as a MULE extension, but this is illegal in ISO 2022. (The reason is that in ISO 2022 G0 must be a 94-member character set, with 0x20 assigned the value SPACE, and 0x7F assigned the value DEL.)

Here are examples of designations:

        ESC ( B :              designate to G0 ASCII
        ESC - A :              designate to G1 Latin-1
        ESC $ ( A or ESC $ A : designate to G0 GB2312
        ESC $ ( B or ESC $ B : designate to G0 JISX0208
        ESC $ ) C :            designate to G1 KSC5601

(The short forms used to designate GB2312 and JIS X 0208 are for backwards compatibility; the long forms are preferred.)

To use a charset designated to G2 or G3, and to use a charset designated to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3 into GL. There are two types of invocation, Locking Shift (forever) and Single Shift (one character only).

Locking Shift is done as follows:

        LS0 or SI (0x0F): invoke G0 into GL
        LS1 or SO (0x0E): invoke G1 into GL
        LS2:  invoke G2 into GL
        LS3:  invoke G3 into GL
        LS1R: invoke G1 into GR
        LS2R: invoke G2 into GR
        LS3R: invoke G3 into GR

Single Shift is done as follows:

        SS2 or ESC N: invoke G2 into GL
        SS3 or ESC O: invoke G3 into GL

The shift functions (such as LS1R and SS3) are represented by control characters (from C1) in 8 bit environments and by escape sequences in 7 bit environments.

(#### Ben says: I think the above is slightly incorrect. It appears that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and ESC O behave as indicated. The above definitions will not parse EUC-encoded text correctly, and it looks like the code in mule-coding.c has similar problems.)

Evidently there are a lot of ISO-2022-compliant ways of encoding multilingual text. Now, in the world, there exist many coding systems such as X11’s Compound Text, Japanese JUNET code, and so-called EUC (Extended UNIX Code); all of these are variants of ISO 2022.

In MULE, we characterize a version of ISO 2022 by the following attributes:

  1. The character sets initially designated to G0 thru G3.
  2. Whether short form designations are allowed for Japanese and Chinese.
  3. Whether ASCII should be designated to G0 before control characters.
  4. Whether ASCII should be designated to G0 at the end of line.
  5. 7-bit environment or 8-bit environment.
  6. Whether Locking Shifts are used or not.
  7. Whether to use ASCII or the variant JIS X 0201-1976-Roman.
  8. Whether to use JIS X 0208-1983 or the older version JIS X 0208-1976.

(The last two are only for Japanese.)

By specifying these attributes, you can create any variant of ISO 2022.

Here are several examples:

ISO-2022-JP -- Coding system used in Japanese email (RFC 1463 #### check).
        1. G0 <- ASCII, G1..3 <- never used
        2. Yes.
        3. Yes.
        4. Yes.
        5. 7-bit environment
        6. No.
        7. Use ASCII
        8. Use JIS X 0208-1983
ctext -- X11 Compound Text
        1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
        2. No.
        3. No.
        4. Yes.
        5. 8-bit environment.
        6. No.
        7. Use ASCII.
        8. Use JIS X 0208-1983.
euc-china -- Chinese EUC.  Often called the "GB encoding", but that is
technically incorrect.
        1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
        2. No.
        3. Yes.
        4. Yes.
        5. 8-bit environment.
        6. No.
        7. Use ASCII.
        8. Use JIS X 0208-1983.
ISO-2022-KR -- Coding system used in Korean email.
        1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
        2. No.
        3. Yes.
        4. Yes.
        5. 7-bit environment.
        6. Yes.
        7. Use ASCII.
        8. Use JIS X 0208-1983.

MULE creates all of these coding systems by default.


Next: , Previous: , Up: Coding Systems   [Contents][Index]