SXEmacs Lisp Reference Manual: Character Type

In XEmacs version 19, and in all versions of FSF GNU Emacs, a character in Lisp is nothing more than an integer. This is yet another holdover from XEmacs Lisp’s derivation from vintage-1980 Lisps; modern versions of Lisp consider this equivalence a bad idea, and have separate character types. In XEmacs version 20, and of course all SXEmacs versions, the modern convention is followed, and characters are their own primitive type. This change was necessary in order for MULE, i.e. Asian-language, support to be correctly implemented.

Even in XEmacs version 20, remnants of the equivalence between characters and integers still exist; this is termed the char-int confoundance disease. In particular, many functions such as eq, equal, and memq have equivalent functions (old-eq, old-equal, old-memq, etc.) that pretend that characters and integers are the same. Byte code compiled under any version 19 Emacs will have all such functions mapped to their old- equivalents when the byte code is read into XEmacs 20. This is to preserve compatibility: Emacs 19 converts all constant characters to the equivalent integer during byte-compilation, and thus there is no other way to preserve byte-code compatibility even if the code has specifically been written with the distinction between characters and integers in mind.
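As a minimal sketch of that difference, assuming an SXEmacs (or XEmacs 20+) session in which the old- functions named above are still defined, eq keeps the two types apart while old-eq does not:

```elisp
;; Assumes SXEmacs or XEmacs 20+, where characters are a distinct type
;; and old-eq is still available for byte-code compatibility.
(eq ?A ?A)        ; ⇒ t     the same character
(eq ?A 65)        ; ⇒ nil   a character is not an integer
(old-eq ?A 65)    ; ⇒ t     old- variant treats them as the same
```

New code should rely on eq and friends; the old- variants are shown here only to illustrate what byte code compiled under Emacs 19 ends up calling.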

Every character has an equivalent integer, called the character code. For example, the character A is represented as the integer 65, following the standard ASCII representation of characters. If SXEmacs was not compiled with MULE support, the range of this integer will always be 0 to 255—eight bits, or one byte. (Integers outside this range are accepted but silently truncated; however, you should most decidedly not rely on this, because it will not work under SXEmacs with MULE support.) When MULE support is present, the range of character codes is much larger. (Currently, 19 bits are used.)
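The relationship between a character and its character code can be sketched as follows. This assumes SXEmacs or XEmacs 20+; char-int appears in the examples in this section, while int-char (the inverse conversion) and characterp are assumed to be available as they are in XEmacs 20+.

```elisp
;; Assumes SXEmacs or XEmacs 20+, where characters are their own type.
(characterp ?A)                    ; ⇒ t
(integerp ?A)                      ; ⇒ nil   not an integer any more
(char-int ?A)                      ; ⇒ 65    the character code
(int-char 65)                      ; ⇒ ?A    back to the character
(eq (int-char (char-int ?A)) ?A)   ; ⇒ t     the round trip is lossless
```

Converting explicitly with char-int, rather than passing a character where an integer is expected, keeps code independent of the char-int confoundance described above.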

FSF GNU Emacs uses kludgy character codes above 255 to represent keyboard input of ASCII characters in combination with certain modifiers. SXEmacs does not use this (a more general mechanism is used that does not distinguish between ASCII keys and other keys), so you will never find character codes above 255 in a non-MULE SXEmacs.

Individual characters are not often used in programs. It is far more common to work with strings, which are sequences composed of characters. See String Type.

The read syntax for characters begins with a question mark, followed by the character (if it’s printable) or some symbolic representation of it. In SXEmacs and XEmacs 20+, where characters are their own type, this is also the print representation. In XEmacs 19, however, where characters are really integers, the printed representation of a character is a decimal number.

In XEmacs 19, a decimal number is therefore also a possible read syntax for a character, but writing characters that way in Lisp programs is a very bad idea. You should always use the special read syntax formats that SXEmacs Lisp provides for characters.

The usual read syntax for alphanumeric characters is a question mark followed by the character; thus, ‘?A’ for the character A, ‘?B’ for the character B, and ‘?a’ for the character a.

;; Under SXEmacs:
?Q ⇒ ?Q    ?q ⇒ ?q
(char-int ?Q) ⇒ 81
;; Under XEmacs 19:
?Q ⇒ 81     ?q ⇒ 113

You can use the same syntax for punctuation characters, but it is often a good idea to add a ‘\’ so that the SXEmacs commands for editing Lisp code don’t get confused. For example, ‘?\ ’ is the way to write the space character. If the character is ‘\’, you must use a second ‘\’ to quote it: ‘?\\’.

SXEmacs always prints punctuation characters with a ‘\’ in front of them, to avoid confusion.

You can express the characters Control-g, backspace, tab, newline, vertical tab, formfeed, return, and escape as ‘?\a’, ‘?\b’, ‘?\t’, ‘?\n’, ‘?\v’, ‘?\f’, ‘?\r’, ‘?\e’, respectively. Their character codes are 7, 8, 9, 10, 11, 12, 13, and 27 in decimal. Thus,

;; Under SXEmacs:
?\a ⇒ ?\^G              ; C-g
(char-int ?\a) ⇒ 7
?\b ⇒ ?\^H              ; backspace, BS, C-h
(char-int ?\b) ⇒ 8
?\t ⇒ ?\t               ; tab, TAB, C-i
(char-int ?\t) ⇒ 9
?\n ⇒ ?\n               ; newline, LFD, C-j
?\v ⇒ ?\^K              ; vertical tab, C-k
?\f ⇒ ?\^L              ; formfeed character, C-l
?\r ⇒ ?\r               ; carriage return, RET, C-m
?\e ⇒ ?\^[              ; escape character, ESC, C-[
?\\ ⇒ ?\\               ; backslash character, \
;; Under XEmacs 19:
?\a ⇒ 7                 ; C-g
?\b ⇒ 8                 ; backspace, BS, C-h
?\t ⇒ 9                 ; tab, TAB, C-i
?\n ⇒ 10                ; newline, LFD, C-j
?\v ⇒ 11                ; vertical tab, C-k
?\f ⇒ 12                ; formfeed character, C-l
?\r ⇒ 13                ; carriage return, RET, C-m
?\e ⇒ 27                ; escape character, ESC, C-[
?\\ ⇒ 92                ; backslash character, \

Sequences that start with a backslash are also known as escape sequences, because backslash plays the role of an escape character; this usage has nothing to do with the character ESC.

Control characters may be represented using yet another read syntax. This consists of a question mark followed by a backslash, caret, and the corresponding non-control character, in either upper or lower case. For example, both ‘?\^I’ and ‘?\^i’ are valid read syntax for the character C-i, the character whose value is 9.

Instead of the ‘^’, you can use ‘C-’; thus, ‘?\C-i’ is equivalent to ‘?\^I’ and to ‘?\^i’:

;; Under SXEmacs:
?\^I ⇒ ?\t   ?\C-I ⇒ ?\t
(char-int ?\^I) ⇒ 9
;; Under XEmacs 19:
?\^I ⇒ 9     ?\C-I ⇒ 9
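The value 9 follows from how ASCII lays out control characters: each one sits 64 below the corresponding uppercase letter. That rule is an observation about ASCII rather than something this section states, so the sketch below, which assumes SXEmacs or XEmacs 20+, should be read as an illustration only.

```elisp
;; Assumes SXEmacs or XEmacs 20+ (char-int returns the character code).
(char-int ?I)            ; ⇒ 73
(- (char-int ?I) 64)     ; ⇒ 9    ASCII: control code = letter code - 64
(char-int ?\^I)          ; ⇒ 9
(eq ?\^I ?\t)            ; ⇒ t    C-i is the tab character
(eq ?\^i ?\C-i)          ; ⇒ t    both spellings read as the same character
```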

There is also a character read syntax beginning with ‘\M-’. This sets the high bit of the character code (same as adding 128 to the character code). For example, ‘?\M-A’ stands for the character with character code 193, or 128 plus 65. You should not use this syntax in your programs. It is a holdover of yet another confoundance disease from earlier Emacsen. (This was used to represent keyboard input with the META key set, thus the ‘M’; however, it conflicts with the legitimate ISO-8859-1 interpretation of the character code. For example, character code 193 is an uppercase ‘A’ with an acute accent, in ISO-8859-1.)
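The arithmetic in that example can be checked directly. The sketch below works only on integers, so it deliberately avoids the ‘\M-’ syntax itself; it assumes that "the high bit" means bit 7 (value 128), which is why logior and addition give the same result for an ASCII code.

```elisp
;; Integer arithmetic only; the discouraged ?\M- syntax is not used.
(char-int ?A)              ; ⇒ 65
(+ 128 (char-int ?A))      ; ⇒ 193   adding 128 ...
(logior 128 (char-int ?A)) ; ⇒ 193   ... is the same as setting bit 7
```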

Finally, the most general read syntax consists of a question mark followed by a backslash and the character code in octal (up to three octal digits); thus, ‘?\101’ for the character A, ‘?\001’ for the character C-a, and ‘?\002’ for the character C-b. Although this syntax can represent any ASCII character, it is preferred only when the precise octal value is more important than the ASCII representation.

;; Under SXEmacs:
?\012 ⇒ ?\n        ?\n ⇒ ?\n        ?\C-j ⇒ ?\n
?\101 ⇒ ?A         ?A ⇒ ?A

;; Under XEmacs 19:
?\012 ⇒ 10         ?\n ⇒ 10         ?\C-j ⇒ 10
?\101 ⇒ 65         ?A ⇒ 65
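To see why ‘?\101’ is A and ‘?\012’ is newline, the octal digits can be expanded by hand. This is a small sketch assuming SXEmacs or XEmacs 20+; the expansion is ordinary base-8 arithmetic.

```elisp
;; Assumes SXEmacs or XEmacs 20+ (char-int returns the character code).
(+ (* 1 64) (* 0 8) 1)   ; ⇒ 65   octal 101 in decimal
(char-int ?\101)         ; ⇒ 65
(eq ?\101 ?A)            ; ⇒ t
(+ (* 0 64) (* 1 8) 2)   ; ⇒ 10   octal 012 in decimal
(eq ?\012 ?\n)           ; ⇒ t
```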

A backslash is allowed, and harmless, preceding any character without a special escape meaning; thus, ‘?\+’ is equivalent to ‘?+’. There is no reason to add a backslash before most characters. However, you should add a backslash before any of the characters ‘()\|;'`"#.,’ to avoid confusing the SXEmacs commands for editing Lisp code. Also add a backslash before whitespace characters such as space, tab, newline and formfeed. However, it is cleaner to use one of the easily readable escape sequences, such as ‘\t’, instead of an actual whitespace character such as a tab.
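A short sketch of the quoting advice above, assuming SXEmacs or XEmacs 20+: the backslash changes how the source text is read by editing commands, not which character results.

```elisp
;; Assumes SXEmacs or XEmacs 20+ (char-int returns the character code).
(eq ?\+ ?+)        ; ⇒ t    a harmless backslash: same character
(char-int ?\()     ; ⇒ 40   open parenthesis, quoted for the editor's sake
(char-int ?\;)     ; ⇒ 59   semicolon, which would otherwise look like a comment
(char-int ?\")     ; ⇒ 34   double quote, which would otherwise look like a string
```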

8.4.3 Character Type