Encodings and Regimes - Old Content

Revision as of 10:34, 14 July 2005 by Hraban (talk | contribs) (link to Fonts)


The Unicode effort clearly shows that 256 characters cannot possibly cover the world's languages. However, TeX is an old system and (with the exception of modern variants like Omega and XeTeX) will only deal with 256 characters per font. Similarly, many "legacy" file encodings on current operating systems attempt to shoehorn a set of characters into eight bits.

As a result, you need to choose which input encoding (regime) and which font/output encoding (encoding) to use.
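To see why the input encoding matters, here is a small Python sketch that decodes one and the same byte under several 8-bit encodings. The codec names are Python's, not ConTeXt's, and pairing them with the regime names il1, win, cp1251 and grk is an assumption made purely for illustration.

```python
# One byte, several legacy 8-bit encodings: the meaning depends on the regime.
# The codec names below are Python's; matching them to ConTeXt regimes
# (il1, win, cp1251, grk) is an assumption for illustration only.
RAW = bytes([0xE4])  # a single byte with value 228

CODECS = {
    "il1": "latin-1",
    "win": "cp1252",
    "cp1251": "cp1251",
    "grk": "iso8859_7",
}

# Decode the same byte once per codec.
decoded = {regime: RAW.decode(codec) for regime, codec in CODECS.items()}

for regime, char in decoded.items():
    print(f"{regime:7s} byte 0xE4 -> U+{ord(char):04X} {char}")
```

The two Latin codecs read the byte as ä, while the Cyrillic and Greek ones read it as a different letter entirely, which is why the regime has to match the file.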

Available Regimes

ConTeXt name(s) | Official name(s) | Remarks
il1 | ISO-8859-1, ISO Latin 1 | Western European languages
win = windows | Windows CP 1252 (nearly ISO Latin 1) | Western European languages
latin2 | Pseudo ISO Latin 2 | see regi-lat.tex
il9 | ISO-8859-15, ISO Latin 9 | Latin-1 plus Euro, not in default distribution
mac | Mac Roman | Western European languages
ibm | IBM PC DOS | Western European languages
grk | ISO-8859-7 | Greek
utf | UTF-8 | Unicode, see below
vis = viscii | VISCII | Vietnamese
cp1251 | Windows CP 1251 | Cyrillic
cp866, cp866nav | DOS CP 866 | Cyrillic
koi8-r, koi8-u, koi8-ru | KOI8 | Cyrillic (Russian, Ukrainian, mixed)
maccyr, macukr | Mac Cyrillic | Cyrillic (Russian, Ukrainian)
cp855, cp866av, cp866mav, cp866tat, ctt, dbk, iso88595, isoir111, mik, mls, mnk, mos, ncc | (several) | rare Cyrillic encodings, see regi-cyp.tex

A list of available language codes is in mult-sys.tex. You can find the output/font encodings in the enco-*.tex files.

Typesetting in UTF-8

Use

\enableregime[utf]

in order to typeset UTF-8 encoded (Unicode) input under ConTeXt.

How does it work?

Robert Ermers and Adam provided a helpful explanation of how characters are constructed in LaTeX and ConTeXt (in a discussion on the mailing list):

You know that all characters in a font have a number. If you type a, the font mechanism makes sure that you see an a. In reality, the font shows you the character that sits at the numerical position of a. In the font dingbats, for example, the character at that position is not an a, but a symbol.

In LaTeX

the combination \"{a} can mean two things:

  • in most fonts: show the character at a given numerical position, which means that the font contains a single character ä.
  • in some other fonts \"{a} means: combine " with a and make an ä. This means that " is combined with the character at the numerical position of a. TeX does this very well and thus constructs very acceptable accented characters like \"{q}, \d{o}, \v{o}, which do not exist in regular fonts.

If you have a font which contains \"{q}, \d{o} or some other special characters, you may instruct TeX not to create the character, but rather to show the contents of a given numerical position in that font. That's what the .enc and .fd files under LaTeX are for.

That's also the reason there are, or used to be, special fonts for Polish, Czech and other languages: they contain predefined characters in one single numerical position, e.g. \v{s} and \v{c}, that TeX does not have to create anew from two signs.
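The two cases can be illustrated with a toy model in Python. This is only a sketch: the fonts, the slot number and the function name typeset_accent are hypothetical, and real LaTeX .fd/.enc handling is considerably richer.

```python
# Toy model of the two behaviours described above. The fonts and slot
# numbers are hypothetical; real fonts and LaTeX .fd/.enc files are richer.
from typing import Dict, Tuple, Union

# A "font" maps (accent, base) pairs to the slot of a precomposed glyph.
Font = Dict[Tuple[str, str], int]

FONT_WITH_ADIAERESIS: Font = {('"', "a"): 228}  # has a ready-made glyph
FONT_WITHOUT_ACCENTS: Font = {}                 # has only plain letters


def typeset_accent(font: Font, accent: str, base: str) -> Union[int, Tuple[str, str]]:
    """Return a slot number if the font has the glyph, else the pieces to combine."""
    slot = font.get((accent, base))
    if slot is not None:
        return slot          # case 1: show the glyph at that position
    return (accent, base)    # case 2: build the glyph from accent + base


print(typeset_accent(FONT_WITH_ADIAERESIS, '"', "a"))  # precomposed slot
print(typeset_accent(FONT_WITHOUT_ACCENTS, '"', "a"))  # must be composed
print(typeset_accent(FONT_WITH_ADIAERESIS, '"', "q"))  # no such glyph: composed
```

A font made for a particular language simply has more entries in its table, so fewer characters need to be composed.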

In ConTeXt

the combination \"{a} means one thing: \adiaeresis (see enco-acc). This \adiaeresis can mean one of two things, depending on the encoding:

  • Numerical position, or
  • The fallback case (defined in enco-def), where a diaeresis/umlaut is placed atop an a glyph. This has hyphenation implications, as Hans described.

The interesting/helpful thing about ConTeXt is that internally, that glyph is given a consistent name, no matter how it is input or output. So, if you type ä in your given input regime, and that encoding is properly set, that numerical ä (e.g., character #228 in the windows regime) is mapped to \adiaeresis.
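The idea of one internal name behind many input bytes can be sketched in Python as follows. Python codecs stand in for ConTeXt's regime files here; that pairing (in particular cp437 for ibm) and the helper glyph_name are assumptions for illustration, not ConTeXt's actual implementation.

```python
# Sketch of "input byte -> glyph name" for several regimes. Python codecs
# stand in for ConTeXt regime files; that pairing and the helper names are
# assumptions for illustration.
CODECS = {"il1": "latin-1", "win": "cp1252", "mac": "mac_roman", "ibm": "cp437"}

# Per regime: which input byte means "adiaeresis" (U+00E4 is ä).
REGIMES = {
    regime: {"\u00e4".encode(codec)[0]: "adiaeresis"}
    for regime, codec in CODECS.items()
}


def glyph_name(regime: str, byte: int) -> str:
    """Map an input byte to its regime-independent internal name."""
    return REGIMES[regime][byte]


for regime, table in REGIMES.items():
    (byte,) = table  # the single byte registered for this regime
    print(f"{regime}: byte {byte} -> \\{glyph_name(regime, byte)}")
```

The byte value differs between regimes, but everything downstream only ever sees the name.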

Wanna know what happens in UTF-8? Here's my 'simplified' explanation: In a UTF-8 bytestream, that character is signified by two bytes: 0xC3, 0xA4. That first byte triggers a conversion of both bytes into two different bytes, the actual Unicode number, 0x00 0xE4 (or: 0, 228). ConTeXt then looks into internal hashes set up (in this case, the unic-000 vector), looks at the 228th element, and sees that it's \adiaeresis. Things then proceed as normal. :)
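The byte arithmetic in that explanation can be checked with a few lines of Python. UNIC_000 below is a one-entry stand-in for ConTeXt's unic-000 vector, and decode_two_byte_utf8 is a hypothetical helper name; the sketch only covers two-byte sequences.

```python
# Decode the two-byte UTF-8 sequence by hand, then look up the glyph name.
# UNIC_000 is a tiny stand-in for ConTeXt's unic-000 vector (hypothetical
# contents, one entry only).
UNIC_000 = {228: "adiaeresis"}


def decode_two_byte_utf8(lead: int, cont: int) -> int:
    """Turn a 110xxxxx 10xxxxxx byte pair into a Unicode code point."""
    if (lead & 0xE0) != 0xC0 or (cont & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


code_point = decode_two_byte_utf8(0xC3, 0xA4)
high, low = divmod(code_point, 256)  # the "two different bytes": 0x00, 0xE4
print(high, low, UNIC_000[low])      # prints: 0 228 adiaeresis
```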

(It's also interesting to note that for PostScript and TrueType fonts, that number > name > number (glyph) mapping happens yet again in the driver. But all that is outside of TeX proper, so to say any more would be confusing.)