SXEmacs User’s Manual: Mule Intro

17.1 What is Mule?

Mule is the MUltiLingual Extension to SXEmacs. It provides facilities not only for handling text written in many different languages, but in fact multilingual texts containing several languages in the same buffer. This goes beyond the simple facilities offered by Unicode for representation of multilingual text. Mule also supports input methods, composing the display from fonts in various encodings, adapting character syntax and other editing facilities to local language usage, and more.

The most obvious problem is that of the different character coding systems used by different languages. ASCII supplies all the characters needed for most computer programming languages and US English (it lacks the currency symbol for British English), but other Western European languages (French, Spanish, German) require more than 96 code positions to accommodate accented characters. In fact, even with 8 bits to represent 96 more characters (including accented characters and symbols such as currency symbols), some languages’ alphabets remain incomplete (Croatian, Polish). (The 64 "missing characters" are reserved for control characters.) Furthermore, many European languages have their own alphabets, which must compete with the accented characters for the upper code positions, since the ASCII characters are still needed for computer interaction (error and log messages are typically in ASCII).
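The coverage gap described above can be checked with a short Python sketch. It is only an illustration: the sample letters are our own choice, and it uses Python's standard codecs rather than anything in SXEmacs. ISO-8859-2 appears here as one example of a separate encoding that supplies the Polish and Croatian letters.

```python
# Sample letters chosen for illustration only; they are not taken from the manual.
SAMPLES = {"French": "é", "German": "ß", "Polish": "ł", "Croatian": "č"}
CODECS = ("ascii", "iso8859_1", "iso8859_2")


def encodable(text, codec):
    """Return True if every character of text has a code position in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# For each codec, record which sample letters it can represent.
coverage = {
    codec: {lang: encodable(ch, codec) for lang, ch in SAMPLES.items()}
    for codec in CODECS
}

for codec, result in coverage.items():
    print(codec, result)
```

ASCII covers none of the samples, ISO-8859-1 covers only the French and German letters, and ISO-8859-2 covers all four.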

For economy of space, historical practice has been for each language to establish its own encoding for the characters it needs. This allows most European languages to be represented with one octet (byte) per character. However, many Asian languages have thousands of characters and require two or more octets per character. For multilingual purposes, the ISO 2022 standard establishes escape codes that allow switching encodings in midstream. (It’s also ISO 2022 that establishes the standard that code points 0-31 and 128-159 are control codes.)
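To make the idea of switching encodings in midstream concrete, here is a minimal Python sketch. It assumes Python's standard iso2022_jp codec, which implements one ISO 2022-based encoding (ISO-2022-JP); it is not how SXEmacs itself handles such text, and the sample string is our own.

```python
# Illustrative text mixing ASCII with two Japanese characters.
text = "Hello, 日本"

# ISO-2022-JP is a 7-bit encoding built on ISO 2022 escape sequences.
encoded = text.encode("iso2022_jp")
print(encoded)

ESC = b"\x1b"
TO_JIS_X_0208 = ESC + b"$B"  # switch to the two-octet Japanese character set
TO_ASCII = ESC + b"(B"       # switch back to ASCII

# Every octet stays below 0x80; the escape sequences carry the switching.
uses_only_7bit = all(octet < 0x80 for octet in encoded)
switch_count = encoded.count(ESC)

# Decoding follows the same escape sequences to recover the text.
decoded = encoded.decode("iso2022_jp")
```

The octets between the two escape sequences mean nothing on their own: a reader that loses track of the current state misreads everything up to the next switch, which is part of why such encodings are awkward to process.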

However, switching encodings in midstream is error-prone and complex for internal processing. For this reason SXEmacs uses an internal coding system which can encode all of the world’s scripts. Unfortunately, for historical reasons, this code is not Unicode, although we are moving in that direction.

SXEmacs translates between the internal character encoding and various other coding systems when reading and writing files, when exchanging data with subprocesses, and (in some cases) in the C-q command (see below). The internal encoding is never visible to the user in a production SXEmacs, but unfortunately the process cannot be completely transparent to the user. This is because the same ranges of octets may represent 1-octet ISO-8859-1 (which is satisfactory for most Western European use prior to the introduction of the Euro currency), 1-octet ISO-8859-15 (which substitutes the Euro for the rarely used "generic currency" symbol), 1-octet ISO-8859-5 (Cyrillic), or multioctet EUC-JP (Japanese). There’s no way to tell without being able to read!
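The ambiguity is easy to demonstrate outside SXEmacs. The following Python sketch decodes one pair of octets under each of the four encodings named above; the octet values are our own choice for illustration, and the codec names are those of Python's standard library.

```python
# Two octets chosen for illustration; nothing in them says which encoding applies.
octets = b"\xa4\xa2"

CODECS = ("iso8859_1", "iso8859_15", "iso8859_5", "euc_jp")

# Decode the same octets under each candidate coding system.
decodings = {codec: octets.decode(codec) for codec in CODECS}

for codec, text in decodings.items():
    # Show the resulting characters and their Unicode code points.
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{codec:11} {text!r:8} {points}")
```

All four decodings succeed: the first octet becomes the generic currency sign under ISO-8859-1, the Euro sign under ISO-8859-15, and a Cyrillic letter under ISO-8859-5, while EUC-JP reads both octets together as a single hiragana character. Since no decoding fails, the octets alone cannot settle the question.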

Mule incorporates a number of heuristics for automatic recognition. There are also facilities for the user to set defaults and, where necessary (rarely, we hope), to set coding systems directly.

The command C-h h (view-hello-file) displays the file etc/HELLO, which shows how to say “hello” in many languages. This illustrates various scripts.

Keyboards, even in the countries where these character sets are used, generally don’t have keys for all the characters in them. So SXEmacs supports various input methods, typically one for each script or language, to make it convenient to type them.

The prefix key C-x RET is used for commands that pertain to world scripts, coding systems, and input methods.