
Foreign Characters

This page describes how to set up your system to use a suitable default encoding and character set, and to configure X to give you a keyboard map which will allow you to enter a selection of symbols and foreign characters.

Character sets, Unicode and UTF-8

First, a bit of background. You can skip this if you don’t want to know the details, but it may help to explain what’s going on.

Computers ultimately store all their information, including text, as numbers. In order to make it easy for people to read these numbers, they typically go through a sequence of conversions. So, the number 65 on my machine represents the capital letter A. When the computer comes to display it on the screen, it looks up the capital-letter-A symbol in a font, and draws the font glyph corresponding to it: A.

The first stage (65 -> A) of this conversion is a character set conversion, and the conversion table is known as a character set (or sometimes a code page). It is this conversion which we are concerned with on this page.
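The number-to-character lookup can be tried directly from a shell. This is a small sketch, assuming a POSIX-style printf (as found in bash or dash) and an ASCII-compatible character set; it only illustrates the idea and is not needed for the rest of the page.

```shell
# Character -> number: a leading quote in a numeric argument makes printf
# print the numeric value of the character that follows it.
printf '%d\n' "'A"                     # prints 65

# Number -> character: convert 65 to octal (101), then let printf
# interpret the resulting \101 escape.
printf "\\$(printf '%03o' 65)\n"       # prints A
```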

If you go back 30 years or so, you will find that most computers used 7-bit ASCII as their character set. This encoded the upper and lower case Roman alphabet (A-Z, a-z), the ten Arabic digits (0-9), and a small set of punctuation and simple mathematical symbols (!?"$%&.,:;+=- among them) as the numbers 0-127. There were also some “special” characters for controlling things like teletype terminals. This set of characters was sufficient for programming in most languages, and also covered the majority of (non-mathematical) text in English. However, it was severely limited for non-English use (and even for some English loan-words, such as naïve or rôle).

The solution to this problem was to extend the 7-bit ASCII table to 8 bits, giving an additional 128 characters possible in the character map. Unfortunately, there are far more than 128 additional characters in use in the world, and this meant that everyone extended their ASCII character map in a different way, and for different regions of the world. This meant that for most languages based on the Roman alphabet, you could see most of the language correctly, but if you used a different character set to the author, you’d see odd symbols stuck in every so often when there was an accented character. Some scripts, such as Japanese or Chinese, couldn’t use this scheme either, as they had far more than even 256 characters in a single script.

The ultimate unifying solution to the problem is Unicode, which is a single character set that has enough space to cover all known written human languages, and a broad variety of different symbols. Unicode gives every character a unique number, called a code point, in the range 0 to 10FFFF (hexadecimal); stored in the simplest way, each code point occupies a 32-bit value. The first 128 characters of Unicode are identical to the traditional 7-bit ASCII encoding. The next 128 characters are identical to characters 128-255 of the ISO Latin-1 (ISO 8859-1) encoding. So, for example, the Unicode character “GREEK SMALL LETTER THETA”, which is usually rendered as θ (if you have the right font), is stored in this raw form as the 4-byte number 000003B8, and is usually written U+03B8 when discussing the Unicode character.

Now, since raw Unicode takes four bytes per character instead of just the one, using raw Unicode takes up a lot more space – especially since for most usages, at least half of the bytes will be zeroes. Thus, to make Unicode text smaller, there are seven methods of encoding a Unicode character in (typically) a smaller number of bytes. The one which is most commonly used is called UTF-8, and it is this which we will be setting up in subsequent sections.
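To make the space saving concrete, here is a minimal sketch of the standard UTF-8 bit layout in shell arithmetic. The function name utf8_hex is made up for this illustration, and the sketch does not reject invalid input such as surrogate code points or values above 10FFFF.

```shell
# Print the UTF-8 bytes (in hex) for a Unicode code point.
# utf8_hex is a hypothetical helper, not a standard command.
utf8_hex() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ]; then
        # 1 byte:  0xxxxxxx
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        printf '%02x %02x\n' \
            $(( 0xC0 | ($cp >> 6) )) \
            $(( 0x80 | ($cp & 0x3F) ))
    elif [ "$cp" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x\n' \
            $(( 0xE0 | ($cp >> 12) )) \
            $(( 0x80 | (($cp >> 6) & 0x3F) )) \
            $(( 0x80 | ($cp & 0x3F) ))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x %02x\n' \
            $(( 0xF0 | ($cp >> 18) )) \
            $(( 0x80 | (($cp >> 12) & 0x3F) )) \
            $(( 0x80 | (($cp >> 6) & 0x3F) )) \
            $(( 0x80 | ($cp & 0x3F) ))
    fi
}

utf8_hex 0x41      # A: 41
utf8_hex 0x3B8     # θ: ce b8
utf8_hex 0x20AC    # €: e2 82 ac
```

So θ, which needs the four bytes 00 00 03 B8 in raw form, takes only two bytes in UTF-8, and a plain A takes one.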

UTF-8 has one major advantage for speakers of Western European languages, which is that for unaccented Roman letters, text in UTF-8 is identical, byte-for-byte, with text in 7-bit ASCII (and all other ISO-8859-x encoded text). This means that systems set up to use UTF-8 will read plain 7-bit ASCII text perfectly, and most Western European text will be broadly comprehensible to humans, albeit with some encoding errors.
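One way to see this is to count the bytes with the top bit set. Text with none is the same byte sequence in 7-bit ASCII, ISO-8859-x and UTF-8. The helper name count_high_bytes is an invention for this sketch, which assumes only the standard od utility.

```shell
# Count the bytes on standard input whose value is 128 or more.
# count_high_bytes is a hypothetical helper name.
count_high_bytes() {
    n=0
    for b in $(od -An -v -tu1); do
        if [ "$b" -ge 128 ]; then
            n=$(( $n + 1 ))
        fi
    done
    echo "$n"
}

printf 'naive'         | count_high_bytes   # 0: plain ASCII, identical in UTF-8
printf 'na\303\257ve'  | count_high_bytes   # 2: ï as two UTF-8 bytes (C3 AF)
printf 'na\357ve'      | count_high_bytes   # 1: ï as one ISO-8859-1 byte (EF)
```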

Setting UTF-8 locales

In Linux, the character set currently in use is governed on a per-user basis by the locale settings. You can check what settings you currently have by typing locale:

hrm@selene:hrm $ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

The above output, from my main desktop machine, shows that I am currently using the British version of the English-language locale for all my settings, and that I am using UTF-8 encodings. These settings can be changed by setting the appropriate environment variable:

hrm@selene:hrm $ export LANG=en_GB
hrm@selene:hrm $ locale
LANG=en_GB
LC_CTYPE="en_GB"
LC_NUMERIC="en_GB"
LC_TIME="en_GB"
LC_COLLATE="en_GB"
LC_MONETARY="en_GB"
LC_MESSAGES="en_GB"
LC_PAPER="en_GB"
LC_NAME="en_GB"
LC_ADDRESS="en_GB"
LC_TELEPHONE="en_GB"
LC_MEASUREMENT="en_GB"
LC_IDENTIFICATION="en_GB"
LC_ALL=

The LC_ variables are locale settings and, unless you have explicitly set one or more of them, they take the same setting as the LANG variable. To ensure that your applications know that you want to use UTF-8, set the variable LANG=en_GB.UTF-8. You can set this in your .bashrc file for a single user, or in /etc/profile for a system-wide configuration, although most distributions provide a more reliable way of setting LANG system-wide. For example, on Debian, you can use dpkg-reconfigure locales to configure the system-wide locale.
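The fallback from the LC_ variables to LANG can be written out as a short shell function. This is a sketch of the usual POSIX lookup order (LC_ALL first, then the specific LC_ variable, then LANG), shown here for LC_CTYPE, the category that selects the character encoding. The function name effective_ctype is made up for the example.

```shell
# In ~/.bashrc you would typically have a line such as:
#   export LANG=en_GB.UTF-8

# Print the value that decides the character-type locale.
# effective_ctype is a hypothetical helper name.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"        # LC_ALL overrides everything
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"      # explicitly set category
    else
        printf '%s\n' "${LANG:-POSIX}" # otherwise fall back to LANG
    fi
}

effective_ctype
```

With only LANG=en_GB.UTF-8 set, this prints en_GB.UTF-8, matching the quoted LC_CTYPE value in the locale output above.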

To configure the system-wide locale on Debian:

1. In a terminal, as root, type dpkg-reconfigure locales. If this returns a message saying that the ‘locales’ package is not installed, install it (e.g. aptitude install locales), then try again.
2. In dpkg-reconfigure, select the locales for which you want to generate locale files. Start with en_GB.UTF-8 so that you have UTF-8 for British English. If you want to be able to read/write in other languages, select the UTF-8 version of those too; for example, to get Norwegian, select no_NO.UTF-8. Press Enter to continue.
3. Specify which of the selected locales you want to use as your default (probably en_GB.UTF-8). Press Enter to continue.
4. Wait while dpkg-reconfigure generates the files for the locales that you selected.
5. When dpkg has finished generating the files, if you selected a default locale, check that it created the file /etc/environment, containing the LANG environment variable; e.g. LANG=en_GB.UTF-8 if en_GB.UTF-8 is your default locale.
6. Reboot the computer.
7. When the computer has rebooted, run the locale command. If all has gone successfully, the response will show LANG=en_GB.UTF-8 (or whatever you set your default locale to).
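After the steps above, the result can be checked from a shell. The following is a rough sketch: locale -a and locale charmap are standard options of the locale command, while is_utf8_locale_name is a helper name invented here, and the exact spelling of generated locale names (for example en_GB.utf8) varies between systems.

```shell
# Succeed if a locale name such as en_GB.UTF-8 or en_GB.utf8 names a
# UTF-8 locale. is_utf8_locale_name is a hypothetical helper.
is_utf8_locale_name() {
    case $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]') in
        *.utf-8 | *.utf8 | *.utf-8@* | *.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v locale >/dev/null 2>&1; then
    # List the generated locales that mention UTF-8.
    locale -a 2>/dev/null | grep -i 'utf' || echo "no UTF-8 locales found"
    # Print the character map of the current locale, e.g. UTF-8.
    locale charmap 2>/dev/null || echo "could not determine the charmap"
fi

if is_utf8_locale_name "${LANG:-}"; then
    echo "LANG selects a UTF-8 locale"
else
    echo "LANG does not select a UTF-8 locale"
fi
```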

After setting your locale to en_GB.UTF-8, applications which are capable of understanding locales will read and write data in UTF-8. Note that some applications (particularly editors) automatically attempt to work out the encoding of a file from its content, and won’t necessarily tell you which encoding they are actually using. Some applications may need to be told explicitly to use UTF-8, even though you have set en_GB.UTF-8 (X-Chat seems to be one of these sometimes).

Using these new characters

Now that your system understands UTF-8 encoded Unicode characters, you need some way of generating them. There are far too many characters in Unicode (tens of thousands when this page was written, and more since) to be able to type all of them from a single keyboard. However, you can configure your system so that you can type a useful subset of them easily.

In X, you can set up your keyboard to give you most Western European symbols (including the Euro € symbol and accented characters) by putting the following in your .xsession file:

setxkbmap -symbols 'en_US(pc105)+gb'

This uses the standard US-layout 105-key keyboard, with standard extensions for a GB-layout keyboard. You can access the additional symbols on the keyboard by holding down the right Alt key (usually marked AltGr, for Alternate Graphic). So, to type æ for example, hold AltGr and hit the A key. ø is AltGr+o. Other useful symbols are the Euro symbol, €, which is AltGr+4, and ©, which is AltGr+Shift+c.
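If the same .xsession is shared between machines, it may be worth guarding the call. The fragment below is only a sketch: apply_keymap is a made-up function name, and the symbols name is copied from the command above, so it may need adjusting if your XKB configuration uses different names.

```shell
# Fragment for ~/.xsession. apply_keymap is a hypothetical helper name.
apply_keymap() {
    # Do nothing when there is no X display or setxkbmap is not installed.
    [ -n "${DISPLAY:-}" ] || return 0
    command -v setxkbmap >/dev/null 2>&1 || return 0
    # Same command as in the text; report failure without aborting the session.
    setxkbmap -symbols 'en_US(pc105)+gb' ||
        echo "setxkbmap failed; check the symbols name" >&2
}

apply_keymap
```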

With this keyboard layout, you can also type accented characters using dead keys. A dead key is one which appears to do nothing, but which modifies the next key to be pressed. So, to type the ï in “naïve”, hold down AltGr, press [, release AltGr, then type i. The AltGr+[ is the dead key for an umlaut or diaeresis, and can be applied to most letters you would expect to see one over. The full set of dead keys in the en_US(pc105)+gb configuration is:

Key combination      Accent                  Examples
AltGr+;              acute accent            á ú
AltGr+'              circumflex accent       ô î
AltGr+#              grave accent            à è
AltGr+[              umlaut                  ö ü
AltGr+]              tilde                   õ ñ
AltGr+=              cedilla                 ç ņ
AltGr+Shift+;        double acute accent     ő
AltGr+Shift+'        caron / hacek           č ň š
AltGr+Shift+#        breve                   ă
AltGr+Shift+[        ring                    å
AltGr+Shift+]        macron / bar            ō ī
AltGr+Shift+=        hook (ogonek)           ę ų į
AltGr+Shift+/        dot over                ė

(Note that some of the characters in the above table may not come out right, since HTML doesn’t have proper character entity references for them).

You can get a PostScript file showing all of the symbols available with the AltGr key using xkbprint:

$ xkbprint -lg 2 :0.0

The resulting file is called server-0_0.ps.

Other keyboard configurations

Greek: setxkbmap -symbols 'en_US(pc105)+el'

Cyrillic: setxkbmap -symbols 'en_US(pc105)+ru'

These both give you an ordinary Roman keyboard, with the additional letters available through the use of AltGr, so they are most useful for occasional use of the characters. If you want to be able to type normally in Cyrillic or Greek, then a full country-specific keyboard map is probably what you want (and is outside the scope of this page).

Note that the Greek keyboard map above positions the Greek letters on the keys with the equivalent Roman letters, but the Cyrillic map uses an entirely different ordering of letters on the keys (possibly equivalent to a standard Russian typewriter keyboard).

Problems

  • I’ve set my locale to en_GB.UTF-8 as described above, but now I only get “3” when I press Shift+3 instead of the pound sterling symbol.

    • The pound-sterling symbol is available with the key combination AltGr+Shift+3.

  • I see the right characters, but they have an Â before them, so I see “Â£” instead of “£”; or accented letters turn into two odd characters, so I see “naÃ¯ve”.

    • The application displaying the character isn’t using UTF-8 properly. It is attempting to display the character as ISO-8859-1 (the usual character encoding in Western Europe), and displaying a single two-byte UTF-8 character as two single-byte ISO-8859-1 characters. Check your locale settings, and check that the application is using UTF-8.
    • If you are logged into another machine and accessing it through a terminal, or a screen session, you will need to ensure that both machines are UTF-8 enabled. screen has a -U option to tell it to use Unicode.

  • I can use the dead keys which don’t need a shift key (acute, grave, circumflex, umlaut, tilde, cedilla), but the ones that need shift don’t work (caron, breve, ring, etc.)
    • You need to press AltGr before pressing the shift or the dead key itself.

  • I get a square with four little numbers in it instead of the letter I wanted.
    • These are generated automatically by X when it can’t find a suitable character for the code you’ve asked for. Try using a different font, or installing a better-internationalised version of the font you are trying to use. With so many Unicode characters defined, not every font has glyphs for every character.
  • I changed my character set to en_GB.UTF-8 and now some characters don’t display properly, and mutt looks like this.

    • This is a problem with the terminal setting on your system. Check the output of echo $TERM. If it says anything other than xterm, try running mutt using a standard xterm and you should see the correct characters.
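The stray Â symptom described in the list above can be reproduced by hand. This sketch treats each byte of some UTF-8 text as a separate ISO-8859-1 character and re-encodes the result as UTF-8, which is what a confused application effectively shows you. The function names are invented for the example; on most systems iconv -f ISO-8859-1 -t UTF-8 performs the same conversion.

```shell
# Emit a single byte given its numeric value.
put_byte() {
    printf "\\$(printf '%03o' "$1")"
}

# Read UTF-8 text on standard input, misinterpret every byte as an
# ISO-8859-1 character, and write the result as UTF-8.
# Both function names are hypothetical helpers.
misread_utf8_as_latin1() {
    for b in $(od -An -v -tu1); do
        if [ "$b" -lt 128 ]; then
            put_byte "$b"
        else
            # Characters 128-255 need two bytes in UTF-8.
            put_byte $(( 0xC0 | ($b >> 6) ))
            put_byte $(( 0x80 | ($b & 0x3F) ))
        fi
    done
}

printf 'na\303\257ve\n' | misread_utf8_as_latin1   # naïve is shown as naÃ¯ve
printf '\302\2435\n'    | misread_utf8_as_latin1   # £5 is shown as Â£5
```

Under that misreading, characters such as £ and © pick up a leading Â, while accented letters such as ï turn into Ã followed by another symbol. The output only looks as described on a terminal that is itself using UTF-8.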

Compose Key

It is possible to use setxkbmap to set up an unused modifier key as a Compose key.

For example,

setxkbmap -option compose:menu

sets the “menu” key on a Windows keyboard to be the Compose key.

Once this is set up, you can press Compose, and the next two character keys you type are merged into a single character.

So to type è (e grave), you type Compose, then e, then backtick (`).

 é = e + ' (the ASCII apostrophe was classically sloped like an acute accent, but isn't on modern systems, so that it can serve as a single quote too)
 è = e + `
 ë = e + "
 ê = e + ^
 æ = a + e
 ß = s + s
 ñ = n + ~
 ç = c + ,
 å = a + * (OK, that one is a bit harder to fathom)

But that’s not all. You can compose almost anything!

  • ² = ^ + 2
  • · (decimal dot) = ^ + .
  • ¿ = ? + ?
  • « = < + <
  • € = C + =
  • © = O + c
  • ÷ = : + -
  • × (cross product) = x + x
  • µ = / + u

It’s incredibly intuitive and you probably won’t need this table of combinations because you can work out many combinations just by examining what keys look like they should work together.
