Difference between revisions of "Encodings and Regimes - Old Content"


Revision as of 19:21, 14 January 2006


The Unicode effort clearly shows that 256 characters cannot possibly cover the world's languages. However, TeX is an old system and (with the exception of modern variants like Omega and XeTeX) will only deal with 256 characters per font. Similarly, many "legacy" file encodings on current operating systems attempt to shoehorn a set of characters into eight bits.

As a result, you need to choose which input encoding (regime) and which font/output encoding (encoding) to use.
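
The 256-slot limit can be illustrated outside TeX. The following Python sketch is only an analogy: it uses Python's built-in codecs (not ConTeXt's regime or encoding files) to check whether a text fits into a given 8-bit encoding. The function name `fits` is made up for this example.

```python
def fits(text: str, codec: str) -> bool:
    """Return True if every character of text has a slot in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# 'č' has no slot in ISO Latin 1, but it has one in ISO Latin 2.
print(fits("č", "latin_1"))    # False
print(fits("č", "iso8859_2"))  # True
print(fits("ä", "latin_1"))    # True
```

A text that mixes characters from several such sets fits into none of them, which is why the choice of regime and encoding depends on the language you typeset.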

Encodings

LaTeX users will probably know them under the name fontenc (\usepackage[T1]{fontenc} for example). As TeX can only handle 256 characters at once, it is important to choose an encoding which covers all the characters of your language; otherwise hyphenation won't work for words with composite characters, and most probably you won't be able to simply extract text from the resulting PDFs.

To enable ec encoding in Latin Modern for example, you can type:

\usetypescript[modern][ec]
\setupbodyfont[10pt,rm]

Some good choices for encodings are:

  • texnansi for Western European languages with only a small subset of additional accented characters (includes many other important glyphs)
  • ec for European languages with many accented characters
  • qx as a compromise between the two above, supposed to cover most Central European languages (more accented characters than texnansi and more additional glyphs in comparison to ec)
  • t5 for Vietnamese
  • cyr, t2a, t2b, t2c, ... (?) for Cyrillic
  • iso-8859-7/greek/grk (?) for Greek (see Greek for more details)

Users of il2 and pl0 should consider moving to qx.

A simple overview of which characters are present in some of the most common encodings (ec, texnansi, 8r and 8a): http://fun.contextgarden.net/encodingtable/enctable.rb?ec,texnansi,8r,8a


TODO: I hope that the content of this section will soon move to a page of its own with a more comprehensive overview of different encodings. (See: To-Do List)


A note about the ec encoding

The ec encoding is also known under the names cork or T1 (\usepackage[T1]{fontenc} in LaTeX). Its old version was dc (should not be used any more). Some of the glyph names in ec are old and deprecated; tex256 uses the same set of glyphs, but its glyph names are compatible with Adobe's. See also tex256.enc and the Adobe Glyph List.

Searching for non-ASCII characters in Adobe Reader

Some characters (\ccaron - 'č' being one of them, for example) are not properly recognized by Adobe (Acrobat) Reader (especially by older versions) when searching or copying text from PDF documents. In order to help Acrobat recognize the glyphs and treat them properly, add this piece of code to your source:

\input enco-pfr
\startencoding [ec]
  \usepdffontresource ec
\stopencoding

At the time of writing this article, only il2 and ec are supported, but support for other encodings can be added.

See also:

Available Regimes

ConTeXt name(s) | Official name(s) | Remarks
il1 | ISO-8859-1, ISO Latin 1 | western european languages
win = windows | Windows CP 1252 (nearly ISO Latin 1) | western european languages
latin2 | Pseudo ISO Latin 2 | see regi-lat.tex
il9 | ISO-8859-15, ISO Latin 9 | Latin-1 plus Euro
mac | Mac Roman | western european languages
ibm | IBM PC DOS | western european languages
grk | ISO-8859-7 | Greek
utf | UTF-8 | Unicode, see below
vis = viscii | VISCII | Vietnamese
cp1251 | Windows CP 1251 | cyrillic
cp866, cp866nav | DOS CP 866 | cyrillic
koi8-r, koi8-u, koi8-ru | KOI8 | cyrillic (russian, ukrainian, mixed)
maccyr, macukr | Mac Cyrillic | cyrillic (russian, ukrainian)
cp855, cp866av, cp866mav, cp866tat, ctt, dbk, iso88595, isoir111, mik, mls, mnk, mos, ncc | (several) | rare cyrillic encodings, see regi-cyp.tex

A list of available language codes is in mult-sys.tex. You find output/font encodings in enco-*.tex files.

See ISO 8859 for ISO standards.
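
If a file was saved in one of the legacy regimes and you would rather work in UTF-8, it can be re-encoded outside ConTeXt. The following Python sketch shows one way to do that. The correspondence between ConTeXt regime names and Python codec names is an assumption made for this example (it is not taken from the ConTeXt sources), and only a few rows of the table are covered.

```python
# Assumed correspondence between some ConTeXt regime names and Python codec
# names. This table is an illustration only, not part of ConTeXt.
REGIME_TO_CODEC = {
    "il1": "latin_1",
    "win": "cp1252",
    "il9": "iso8859_15",
    "mac": "mac_roman",
    "grk": "iso8859_7",
    "utf": "utf_8",
    "cp1251": "cp1251",
    "cp866": "cp866",
    "koi8-r": "koi8_r",
    "maccyr": "mac_cyrillic",
}


def to_utf8(raw: bytes, regime: str) -> bytes:
    """Re-encode bytes written in the given regime as UTF-8."""
    return raw.decode(REGIME_TO_CODEC[regime]).encode("utf_8")


# Byte 228 is 'ä' in the windows regime; in UTF-8 it becomes two bytes.
print(to_utf8(bytes([228]), "win"))  # b'\xc3\xa4'
```

The same byte means something else in another regime (in cp1251 it is a Cyrillic letter), so the regime has to be known; it cannot be guessed reliably from the bytes alone.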

Typesetting in UTF-8

Use

\enableregime[utf]

in order to be able to typeset in Unicode under ConTeXt.

Unfortunately you must save your UTF-8 encoded files without a BOM (byte order mark), because ConTeXt (or pdfTeX) does not skip the BOM but typesets its bytes as characters.
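
If your editor insists on writing a BOM, it can be removed afterwards. A minimal Python sketch, assuming the file content has already been read as bytes; the function name `strip_utf8_bom` is made up for this example:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data


print(strip_utf8_bom(b"\xef\xbb\xbfHello"))  # b'Hello'
print(strip_utf8_bom(b"Hello"))              # b'Hello'
```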

Using non-ASCII characters

As a TeX/LaTeX user you were probably told to use the accents in the following way (the example is taken from the TeXbook, page 24):

Once upon a time, in a distant
  galaxy called \"O\"o\c c
there lived a computer
named R.~J. Drofnats.

The galaxy name will be shown as Ööç.

In ConTeXt, please try to avoid that backslashed character composition if possible (there are several good reasons for it - hyphenation etc.).

You have two alternatives:

Type the characters as you do in any other text editor

\enableregime[utf] % or any other supported regime

...

Once upon a time, in a distant
  galaxy called Ööç

Once you figure out what regime you need, you can simply type the characters as you do in any text editor. (See above for the list of available regimes; some more will probably be added in the near future. If you don't find the one you would like to use, please ask on the mailing list.)

Use glyph names

If you don't have the letter on your keyboard (or if you want some strange letters not supported by the regime you use, for example Greek or Cyrillic), you can access the glyphs by their names:

Once upon a time, in a distant
  galaxy called \Odiaeresis\odiaeresis\ccedilla

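How do I know which glyph name to use?

  • use \showcharacters
  • Adobe glyph list: http://partners.adobe.com/public/developer/en/opentype/aglfn13.txt
  • browse the ConTeXt source (enco-acc.tex)
  • ask someone to put the list of the available glyphs on the Wiki :-) (or simply volunteer for that!)

TODO: list of the available glyphs (See: To-Do List)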


How does it work?

Robert Ermers and Adam provided a helpful explanation of how characters are constructed in LaTeX and ConTeXt (in some discussion on the mailing list):

You know that all characters in a font have a number. If you type a, the font mechanism makes sure that you see an a. In reality the font shows you the character that is put on the numerical position of a. In the font Dingbats, for example, the character on that position is not an a, but a symbol.

In LaTeX

The combination \"{a} can mean two things:

  • in most fonts: show the character on a given numerical position, which means that there is one character ä.
  • in some other fonts \"{a} means: combine " with a and make an ä. This means that " is combined with the character on the numerical position of a. TeX does this very well and thus constructs very acceptable diacritical signs like \"{q}, \d{o}, \v{o}, which do not exist in regular fonts.

If you have a font which contains q̈ (\"{q}), ọ (\d{o}) or some other special characters, you may instruct TeX not to create the character, but rather to show the contents of a given numerical position in that font. That's what the .enc and .fd files under LaTeX are for.

That's also the reason there are, or used to be, special fonts for Polish and Czech and other languages: they contain predefined characters in one single numerical position, e.g. \v{s} and \v{c}, that TeX does not have to create anew from two signs.
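
The difference between a predefined character and one built from two signs has a close parallel in Unicode, which may make it easier to picture. The Python sketch below is an analogy only and does not reproduce what TeX or a font does: it asks whether a base letter plus a combining accent has a single precomposed code point. The function name `precomposed` is made up for this example.

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS
CARON = "\u030c"      # COMBINING CARON


def precomposed(base: str, combining: str) -> bool:
    """True if base + combining mark normalizes to a single code point."""
    return len(unicodedata.normalize("NFC", base + combining)) == 1


print(precomposed("a", DIAERESIS))  # True: one character, U+00E4
print(precomposed("s", CARON))      # True: one character, U+0161
print(precomposed("q", DIAERESIS))  # False: must be built from two signs
```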

In ConTeXt

The combination \"{a} means one thing: \adiaeresis (see enco-acc). This \adiaeresis can mean one of two things, depending on the encoding:

  • Numerical position, or
  • The fallback case (defined in enco-def), where a diaeresis/umlaut is placed atop an a glyph. This has the hyphenation implications that Hans described.

The interesting/helpful thing about ConTeXt is that internally, that glyph is given a consistent name, no matter how it is input or output. So, if you type ä in your given input regime, and that encoding is properly set, that numerical ä (e.g., character #228 in the windows regime) is mapped to \adiaeresis.
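
The idea of one internal name for many input bytes can be sketched in a few lines of Python. This is not how ConTeXt implements it: the lookup table below is a hypothetical excerpt that only reuses glyph names quoted in this article, and Python codecs stand in for the regimes.

```python
# Hypothetical excerpt of a "Unicode number -> glyph name" table, using the
# glyph names quoted in this article. Illustration only.
GLYPH_NAMES = {
    0xC7: "Ccedilla",
    0xD6: "Odiaeresis",
    0xE4: "adiaeresis",
    0xE7: "ccedilla",
    0xF6: "odiaeresis",
}


def glyph_name(byte: int, codec: str) -> str:
    """Map one input byte, read in the given codec, to an internal glyph name."""
    char = bytes([byte]).decode(codec)
    return GLYPH_NAMES[ord(char)]


# Different bytes in different input encodings end up as the same name.
print(glyph_name(228, "cp1252"))       # adiaeresis
print(glyph_name(0x8A, "mac_roman"))   # adiaeresis
```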

Wanna know what happens in UTF-8? Here's a 'simplified' explanation: In a UTF-8 bytestream, the character ä is signified by two bytes: 0xC3, 0xA4. That first byte triggers a conversion of both bytes into two different bytes, the actual Unicode number, 0x00 0xE4 (or: 0, 228). ConTeXt then looks into internal hashes set up (in this case, the unic-000 vector), looks at the 228th element, and sees that it's \adiaeresis. Things then proceed as normal. :)
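
The arithmetic behind that conversion is short enough to write out. The Python sketch below decodes a two-byte UTF-8 sequence by hand; it shows the standard UTF-8 bit layout, not ConTeXt's actual macros, and the function name is made up for this example.

```python
def decode_two_byte_utf8(first: int, second: int) -> int:
    """Decode a two-byte UTF-8 sequence into its Unicode number."""
    # A two-byte sequence has the bit pattern 110xxxxx 10yyyyyy.
    if first & 0xE0 != 0xC0 or second & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the first byte and the 6 of the second.
    return ((first & 0x1F) << 6) | (second & 0x3F)


code_point = decode_two_byte_utf8(0xC3, 0xA4)
print(code_point, hex(code_point))  # 228 0xe4
```

The high byte of the result is 0 and the low byte is 228, which matches the "0, 228" pair mentioned in the explanation.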

(It's also interesting to note that for PostScript and TrueType fonts, that number -> name -> number (glyph) mapping happens yet again in the driver. But all that is outside of TeX proper, so to say any more would be confusing.)

External links