Difference between revisions of "XML"

m (→‎General Information: update link to manual page)
m (→‎Specific Commands: correct typo)
 
(27 intermediate revisions by 7 users not shown)
Line 1: Line 1:
< [[Main Page]] | [[DocBook]] | [[MathML]] | [[Formatting Objects]] >
+
__TOC__
  
If you want to get more from your code than just a PDF (or DVI) output, e.g. HTML, or if you need a good typesetting machine for your XML code, you're right with ConTeXt.
+
=Introduction=
  
Handling XML in ConTeXt has improved dramatically with the advent of MKIV. A new infrastructure, based on Lua, makes typesetting, manipulating, filtering, reusing XML much much easier than before. Unfortunately, this means that most of the existing documentation is now obsolete. As a rule of thumb: in general, the "old" MKII code uses upper-case <tt>XML</tt> in its commands, the new MKIV code uses lower-case <tt>xml</tt>. 
+
Handling XML in ConTeXt has improved dramatically with the advent of MkIV.
  
Here are some links to existing docs:
+
The new Lua–based infrastructure makes typesetting, manipulating, filtering, and reusing XML much easier than before.
  
==Documents about XML in MKIV==
+
Unfortunately, this means that most of the existing documentation is now obsolete.
===General Information===
 
*[http://pragma-ade.com/show-man-44.htm xml-mkiv.pdf]
 
* [[TEI_xml| TEI xml]] (typesetting editions encoded in TEI xml)
 
  
===Processing XML with lua===
+
In general, old MkII code includes the uppercase <tt>XML</tt> string in its commands (as in {{cmd|getXMLcode|[name]}}), while new MkIV code uses lowercase <tt>xml</tt> (as in {{cmd|xmlflush|{#1}}}).
* [[XML_Lua| XML in Lua]] (manipulating xml in Lua)
 
===XHTML in MKIV===
 
* [http://dl.contextgarden.net/myway/tas/xhtml.pdf Thomas' MyWay on processing XHTML with MKIV]
 
  
 +
==Before You Start==
  
==Documents about XML in MKII (obsolete)==
+
It might be obvious, but there are two basic requirements to typeset XML sources with ConTeXt:
===XML/ConTeXt in general===
 
* [[manual:example.pdf|XML in ConTeXt]] by Pragma (2001)
 
* [http://www.leverkruid.eu/context/index.html XML DocBook in ConTeXt] by Simon Pepping
 
* [http://getfo.sourceforge.net/context_xml/index.html XML ConTeXt] by Paul Tremblay
 
* [http://www.pragma-ade.com/show-mag-9.htm Dealing with XML] by Pragma (about XML, XSLT and typesetting without TeX code)
 
* XML Basics: [[Mixing_XML_and_ConTeXt]] using the pre-defined ContML vocabulary
 
  
===Additions and Details of XML/ConTeXt===
+
# Familiarity with XML. You don’t have to type XML directly, but ConTeXt isn’t able to compile XML that is not well–formed.<ref>If this is all Greek to you, consider it as incorrect XML.</ref>
 +
# At least some knowledge of ConTeXt commands, since otherwise formatting what you select from the XML source would be impossible.
 +
 
 +
XML is far more than a source format for typesetting with ConTeXt, and the two are completely independent of each other. It is important to deal with XML first, without seeing it through ConTeXt lenses.
 +
 
 +
As an alternative to typing XML sources directly, there are some lightweight tagging (or markup) languages, such as AsciiDoc or Markdown.<ref>For a detailed list, see [https://en.wikipedia.org/wiki/Lightweight_markup_language#Comparison_of_language_features a feature comparison list in Wikipedia].</ref> There are tools ([https://pandoc.org Pandoc] being just one of them) that generate XML from these lightweight markup formats.
 +
In some cases these tools might generate incorrect XML (due to bugs in them). In that case, you will have to find out what is wrong with your XML source.<ref>ConTeXt will complain with a message in the PDF document starting with “invalid xml file”.</ref>
 +
 
 +
Knowing ConTeXt is required too, because typesetting XML may be explained as having two parts:
 +
 
 +
* Selecting what you want from the XML file(s).
 +
* Defining how you want your selections in the final PDF document.
 +
 
 +
It is better to start learning standard ConTeXt first (if required) and then acquire some experience with XML files. 
 +
 
 +
=First Example=
 +
 
 +
==Sample XML Source==
 +
 
 +
An XML sample borrowed and adapted from the net reads:
 +
 
 +
<xmlcode>
 +
<TEI xml:lang="en">
 +
  <teiHeader>
 +
    <!-- stuff omitted here -->
 +
  </teiHeader>
 +
  <text>
 +
    <body>
 +
      <div type="essay">
 +
        <head>An Essay on Summer</head>
 +
        <p>Summer school in <date when="1990">MCMXC</date> was never easy;
 +
        it went by too quickly and left us wanting more.</p>
 +
        <p>But, as my friend <name type="person">Peter</name> said with his
 +
        inimitable <foreign xml:lang="fr">je ne sais quoi</foreign>,
 +
        <said>It never pays to think too hard</said>. Or, as I would rather
 +
        put it, <quote xml:lang="it">Que sera, sera</quote>.</p>
 +
      </div>
 +
      <div type="essay">
 +
        <head>An Essay on Winter</head>
 +
        <p xml:lang="es">¡Hasta la vista…!</p>
 +
      </div>
 +
    </body>
 +
  </text>
 +
</TEI>
 +
</xmlcode>
 +
 
 +
===Only XML Required===
 +
 
 +
This previous sample is written using the TEI markup. It is correct XML and valid (TEI) XML.
 +
 
 +
You might think of XML correctness<ref>I’m aware that the technical term is well–formedness, but I could not resist looking for a more expressive replacement. Correctness seems to be a suitable candidate.</ref> as the set of orthographic rules common to all European languages. Some of these rules may be:<ref>This is no more than a fanciful example, in no way an exhaustive description (or list).</ref>
 +
 
 +
* All words are separated using at least a blank space.
 +
* Single dots mark different sentences.
 +
* Blank vertical space separates paragraphs (when available).
 +
 
 +
XML rules describe how the tags inside the characters {{code|<…>}} are to be used. To these rules belong:
 +
 
 +
* Markup is defined by the string inside the characters {{code|< >}}.
 +
* Any blank space separates attributes (<code><element attribute="value" attribute1="value1"></code>).
 +
* The name is the only required part for the {{code|<…>}} tag.
 +
* Elements have an opening tag and a matching closing tag ({{code|<…>}} and {{code|</…>}}); otherwise the opening tag must close itself ({{code|<…/>}})<ref>With or without space before the slash.</ref>.
 +
* The name must come first in the tag (before the first space, if any attribute is given).
 +
* Attributes have their values assigned with the equals sign (blank space before or after the sign is allowed, but customarily omitted).
 +
* Attributes have their values enclosed in quotes (single or double).
 +
 
 +
Validity is related to a document type. XML validity is properly the document validity.
 +
 
 +
A document type (such as XHTML or TEI) defines a limited set of elements (of element names). Each element may contain one or more attributes with different values.
 +
 
 +
This specification of XML is called the document type definition. You may consider it as the set of grammar rules of each European language.
 +
 
 +
For example, {{code|<whatever>}} is a correct element name in pure XML, but it is not a valid XHTML or TEI element.
 +
 
 +
An even more extreme sample of correct XML would read:
 +
 
 +
<xmlcode>
 +
<τεχτ>
 +
  <βοδυ>
 +
    <διβ type="essay">
 +
      <ἡαδ>An Essay on Summer</ἡαδ>
 +
      <π>Summer school in <δατη when="1990">MCMXC</δατη> was never easy;
 +
      it went by too quickly and left us wanting more.</π>
 +
      <π>But, as my friend <ναμη type="person">Peter</ναμη> said with his
 +
      inimitable <ξένον xml:lang="fr">je ne sais quoi</ξένον>,
 +
      <ἔφα>It never pays to think too hard</ἔφα>. Or, as I would rather
 +
      put it, <λεγόμενον xml:lang="it">Que sera, sera</λεγόμενον>.</π>
 +
    </διβ>
 +
    <διβ type="essay">
 +
      <ἡαδ>An Essay on Winter</ἡαδ>
 +
      <π xml:lang="es">¡Hasta la vista…!</π>
 +
    </διβ>
 +
  </βοδυ>
 +
</τεχτ>
 +
</xmlcode>
 +
 
 +
This is invalid TEI. But ConTeXt only requires correct (or valid, as it describes it) XML sources to compile them.
 +
 
 +
==Sample Environment==
 +
 
 +
A minimal configuration file or environment to typeset the previous sample may read:
 +
 
 +
<texcode>
 +
\startxmlsetups xml:presets:all
 +
  \xmlsetsetup {#1} {*} {xml:*}
 +
\stopxmlsetups
 +
 
 +
\xmlregistersetup{xml:presets:all}
 +
 
 +
\startxmlsetups xml:TEI
 +
  \mainlanguage[\xmlatt{#1}{xml:lang}]
 +
  \xmlflush{#1}
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:body
 +
  \xmlflush{#1}
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:date
 +
  \xmlflush{#1}
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:div
 +
  \startchapter[title=\xmltext{#1}{head}]
 +
    \xmlflush{#1}
 +
  \stopchapter
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:foreign
 +
  \bgroup\language[\xmlatt{#1}{xml:lang}]\em\xmlflush{#1}\egroup
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:name
 +
  \bgroup\sc\xmlflush{#1}\egroup
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:p
 +
  \startparagraph
 +
    \xmlflush{#1}
 +
  \stopparagraph
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:p:date
 +
  \xmlflush{#1}
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:quote
 +
  \bgroup\language[\xmlatt{#1}{xml:lang}]\quotation{\xmlflush{#1}}\egroup
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:said
 +
  \xmlflush{#1}
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:teiHeader
 +
  \xmlflush{#1}
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:text
 +
  \xmlflush{#1}
 +
\stopxmlsetups
 +
</texcode>
 +
 
 +
A proper explanation follows in [[#XML_Typesetting|XML Typesetting]].
 +
 
 +
===How Does It Work?===
 +
 
 +
The XML source may be saved as <code>source.xml</code> and the environment or configuration file could be saved as <code>environment.tex</code>.<ref>Of course, file names will differ in real documents. Although it is not mandatory (as far as I can recall), it is a good idea to keep a different file extension for each file format: {{code|.xml}} for XML files and {{code|.tex}} for ConTeXt files.</ref>
 +
 
 +
<pre>context --environment=environment.tex source.xml</pre>
 +
 
 +
This invocation will generate an output file named {{code|source.pdf}}.
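
For quick experiments, the XML source and the setups may also live in a single ConTeXt file. The following is only a minimal sketch, not part of the sample above: the buffer name {{code|demo}} and the document identifier {{code|main}} are arbitrary names chosen for illustration, and {{cmd|xmlprocessbuffer}} is taken from the MkIV XML manual rather than from this page.

<texcode>
% map every element to the setup named after it (same as in the sample)
\startxmlsetups xml:presets:all
  \xmlsetsetup{#1}{*}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:presets:all}

% root element: just pass its content on
\startxmlsetups xml:text
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:p
  \startparagraph
    \xmlflush{#1}
  \stopparagraph
\stopxmlsetups

\startxmlsetups xml:name
  \bgroup\sc\xmlflush{#1}\egroup
\stopxmlsetups

% the XML source, kept in a buffer instead of a separate file
\startbuffer[demo]
<text>
  <p>But, as my friend <name type="person">Peter</name> said.</p>
</text>
\stopbuffer

\starttext
  % arguments: document identifier, buffer name, initial setup (left empty)
  \xmlprocessbuffer{main}{demo}{}
\stoptext
</texcode>

Saved as, say, {{code|demo.tex}}, this file would be compiled with a plain <code>context demo.tex</code>. Keeping source and environment in separate files, as shown above, remains the better choice for real documents.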
 +
 
 +
==XML Typesetting==
 +
 
 +
Formatting XML sources with ConTeXt (or properly typesetting them) requires:
 +
 
 +
* Selecting which parts you want to be typeset. At least, these selections will cover elements by their name.
 +
* Assigning these parts to single configuration commands (otherwise all will be displayed the same).
 +
 
 +
In practice, the ConTeXt configuration for XML (or environment file) contains:
 +
 
 +
# A set of XML (node) selections mapped or assigned to ConTeXt setups (or configurations).
 +
# The registration of this mapping (or assignation set).
 +
# The configuration of each setup.
 +
 
 +
A basic skeleton showing the three tasks would read:
 +
 
 +
<texcode>
 +
\startxmlsetups xml:whatever
 +
  \xmlsetsetup {#1} {*} {xml:*}
 +
\stopxmlsetups
 +
 
 +
\xmlregistersetup{xml:whatever}
 +
 
 +
\startxmlsetups xml:body
 +
  \xmlflush{#1}
 +
\stopxmlsetups
 +
% and so many definitions as XML selections
 +
</texcode>
 +
 
 +
The two blank lines separate the three parts listed above.
 +
 
 +
===Mapping Selections===
 +
 
 +
The first thing to define is a list of selections from the XML source linked to individual ConTeXt configurations.
 +
 
 +
This minimal sample contains it:
 +
 
 +
<texcode>
 +
\startxmlsetups xml:whatever
 +
  \xmlsetsetup {#1} {*} {xml:*}
 +
\stopxmlsetups
 +
</texcode>
 +
 
 +
# The first line {{cmd|startxmlsetups}} creates the list (named {{code|xml:whatever}}).
 +
## The same identifier will be required to register the list.
 +
## It is customary to use {{code|xml:}} as namespace, but any character string (such as {{code|οὑδέν:}}) would do.
 +
## Both parts of the name are free, but the identifier should match completely in the registration.
 +
# The third line {{cmd|stopxmlsetups}} closes the {{cmd|startxmlsetups}} (as customary in ConTeXt).
 +
# The second line {{cmd|xmlsetsetup}} maps individual XML selections to ConTeXt setups.
 +
## In {{cmd|xmlsetsetup}}, the second pair of braces defines the individual XML selection, the third pair of braces defines the ConTeXt setup.
 +
## The content of the first pair of braces ({{cmd|xmlsetsetup|{#1}}}) is required in all cases. 
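
To illustrate that the namespace is a free choice, here is the same skeleton with a different prefix. The names {{code|tei:map}} and {{code|tei:}} are hypothetical, picked only for this sketch; what matters is that the list identifier is repeated exactly in the registration and that each setup name carries the prefix used in the mapping.

<texcode>
% the list is named tei:map and maps every element to tei:<element name>
\startxmlsetups tei:map
  \xmlsetsetup{#1}{*}{tei:*}
\stopxmlsetups

% same identifier as in \startxmlsetups above
\xmlregistersetup{tei:map}

% the setup for <body> must now be named tei:body, not xml:body
\startxmlsetups tei:body
  \xmlflush{#1}
\stopxmlsetups
</texcode>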
 +
 
 +
====XML Paths (or LPaths)====
 +
 
 +
You define what you want from the XML sources using XML Paths, known as XPaths. Since ConTeXt accesses these paths using Lua, they are LPaths.
 +
 
 +
We are handling the contents of the second pair of braces from the command:
 +
 
 +
<texcode>
 +
\xmlsetsetup{#1}{*}{xml:*}
 +
</texcode>
 +
 
 +
The most basic path is the one used in the sample {{code|{*}&#8203;}}, which stands for any XML element.
 +
 
 +
Other path types may be:
 +
 
 +
* {{code|{element[@attribute]}}}, selects <code><element attribute="…"></code> ({{code|<element>}} with {{code|attribute}} set, regardless of its value).
 +
 
 +
* {{code|1={element[@attribute='value']}}} selects <code><element attribute="value"></code>, but not <code><element attribute="value1"></code> (or even <code><element attribute="value another-value"></code>).
 +
 
 +
* {{code|{container/element}}} selects all {{code|<element>}} children (or direct descendants) of {{code|<container>}}.
 +
 
 +
There are a bunch of other possibilities and a separate page on [[LPaths]] would make more sense.
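
As a sketch of how the path types listed above might be combined in one mapping list, consider the following. The setup names {{code|xml:essay}} and {{code|xml:name:typed}} are hypothetical and would need their own definitions; {{code|xml:p:date}} is the one defined in the sample environment. The sketch assumes that a later assignment takes precedence over an earlier one for the same element, which is why the general rule comes first.

<texcode>
\startxmlsetups xml:presets:all
  % general rule: every element gets the setup named after it
  \xmlsetsetup{#1}{*}{xml:*}
  % <date> only when it is a child of <p>
  \xmlsetsetup{#1}{p/date}{xml:p:date}
  % <div> only when its type attribute is exactly "essay"
  \xmlsetsetup{#1}{div[@type='essay']}{xml:essay}
  % <name> only when it has a type attribute, whatever its value
  \xmlsetsetup{#1}{name[@type]}{xml:name:typed}
\stopxmlsetups

\xmlregistersetup{xml:presets:all}
</texcode>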
 +
 
 +
====Defining ConTeXt Setups====
 +
 
 +
The third and last pair of braces from {{cmd|xmlsetsetup|{#1}{*}{xml:*}}} defines the matching setup for the given element.
 +
 
 +
If you use the wildcard ({{code|*}}), the setup name takes the element name from the path (when a path is selected).
 +
 
 +
It is up to you which namespace you use to name ConTeXt setups,<ref>The part of the identifier with the form {{code|xml:}}, which may contain any string of letters (no digits).</ref> but they must match the individual formatting command.
 +
 
 +
A way of getting rid of some content (which otherwise would be selected) is to map its path to a non–existing setup.<ref>This is exactly what happens with the {{code|<head>}} element in the sample. There is no defined <texcode>\startxmlsetups xml:head
 +
  \xmlflush{#1}
 +
\stopxmlsetups</texcode>It would be redundant (appearing twice in the output document), since it is already included with {{code|xml:div}} with {{cmd|xmltext|{#1}{head}}}.</ref>
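
An alternative that makes the intention visible is a setup that exists but flushes nothing. This is only a sketch and not what the sample environment does (there, {{code|xml:teiHeader}} does flush its content):

<texcode>
\startxmlsetups xml:teiHeader
  % intentionally empty: without \xmlflush nothing inside <teiHeader> is typeset
\stopxmlsetups
</texcode>

The explicit empty setup documents that the element is dropped on purpose, rather than forgotten.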
 +
 
 +
===Registering Maps===
 +
 
 +
After defining the list of XML setups (XML paths matched with ConTeXt setups), it must be registered. The registration command reads:
 +
 
 +
<texcode>
 +
\xmlregistersetup{xml:whatever}
 +
</texcode>
 +
 
 +
The only requirement is that the identifier ({{code|xml:whatever}} in the sample) is exactly the same as the one defined in {{cmd|startxmlsetups}}.
 +
 
 +
===Formatting===
 +
 
 +
Last (but not least, as they say) comes the formatting of the XML selections. Without this step, the selections will be lost in the transition to the output document.
 +
 
 +
As already explained in [[#Defining_ConTeXt_Setups|Defining ConTeXt Setups]], these names (contained in the last pair of braces of {{cmd|xmlsetsetup}}) should match each individual setup configuration.
 +
 
 +
For a setup named in the selection mapping {{code|{xml:body}}}, its configuration may read:
 +
 +
<texcode>
 +
\startxmlsetups xml:body
 +
  \xmlflush{#1}
 +
\stopxmlsetups
 +
</texcode>
 +
 
 +
Flushing the contents of the element (the node) is the most basic operation.
 +
 
 +
Without it, the child elements of the node would not reach the output.
 +
 
 +
Flushing only adds the content of the element; formatting it requires standard ConTeXt commands.
 +
 
 +
Compare the previous setup to these other ones:
 +
 
 +
<texcode>
 +
\startxmlsetups xml:p
 +
  \startparagraph
 +
    \xmlflush{#1}
 +
  \stopparagraph
 +
\stopxmlsetups
 +
 
 +
\startxmlsetups xml:name
 +
  \bgroup\sc\xmlflush{#1}\egroup
 +
\stopxmlsetups
 +
</texcode>
 +
 
 +
The {{code|xml:p}} setup adds the required commands so that {{code|&lt;p&gt;}} elements are typeset as paragraphs.
 +
 
 +
For {{code|xml:name}}, small caps are added. {{code|\bgroup…\egroup}} is similar to enclosing its contents in braces (but more explicit and readable).
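
For comparison, this is how the same setup would look with plain braces instead of {{code|\bgroup…\egroup}}. It is shown only to make the equivalence concrete; the grouping keeps the small caps switch local in both variants.

<texcode>
\startxmlsetups xml:name
  % the braces limit the effect of \sc to the flushed content
  {\sc\xmlflush{#1}}
\stopxmlsetups
</texcode>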
 +
 
 +
====Specific Commands====
 +
 
 +
As mentioned, {{cmd|xmlflush|{#1}}} flushes the current selection (or node).
 +
 
 +
This is the most basic operation, but there are other commands as well.
 +
 
 +
{{cmd|xmltext}} adds the text from an element.
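
The two other commands already used in the sample environment may be compared side by side. The first setup is copied from the sample; the second is a hypothetical variant of {{code|xml:date}}, written only for this sketch, that is meant to print the {{code|when}} attribute in parentheses after the element content.

<texcode>
\startxmlsetups xml:div
  % \xmltext{#1}{head}: the text of the <head> child of the current <div>
  \startchapter[title=\xmltext{#1}{head}]
    \xmlflush{#1}
  \stopchapter
\stopxmlsetups

\startxmlsetups xml:date
  % \xmlatt{#1}{when}: the value of the when attribute of the current <date>
  \xmlflush{#1} (\xmlatt{#1}{when})
\stopxmlsetups
</texcode>

In short, {{cmd|xmlflush}} works on the current node, {{cmd|xmltext}} on a path relative to it, and {{cmd|xmlatt}} on one of its attributes.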
 +
 
 +
=Documents about XML in MkIV=
 +
 
 +
==General Information==
 +
 
 +
* [[manual:xml-mkiv.pdf|''Dealing with XML in ConTEXt MkIV'']]: the official manual that explains everything. Too hard to be a good starting point (unless you are confident [or at least familiar] with XPath).
 +
* [[manual:math-mkiv.pdf|''MathML in MkIV'']]: also an official document, on typesetting math from XML sources.
 +
* [[TEI_xml| TEI XML]]: example of TEI–encoded source typeset with ConTeXt.
 +
* [[DocBook]]: example of how to typeset DocBook sources with help of a ConTeXt module.
 +
* [[Formatting Objects|Formatting XML Objects]]: not (yet?) available for MkIV.
 +
* [[Verbatim_XML | Verbatim in XML]]: how to typeset XML sources verbatim in final text.
 +
* [[xtables#XML | Processing XML tables as Extreme Tables]]: example about XML tables as ConTeXt tables (extreme or natural).
 +
* [https://wiki.contextgarden.net/images/8/8c/xhtml.pdf XHTML in MkIV]: ''Getting Web Content and pdf-Output from One Source'', by Thomas Schmitz.
 +
* [[Ctx| Processing of Ctx XML files]]
 +
 
 +
==Processing XML with lua==
 +
* [[XML_Lua| XML in Lua]] (manipulating XML in Lua)
 +
 
 +
==XHTML in MKIV==
 +
* [https://wiki.contextgarden.net/images/8/8c/xhtml.pdf Thomas’ ''My Way'' on processing XHTML with MKIV]: ''Getting Web Content and pdf-Output from One Source'' (already mentioned).
 +
 
 +
=Documents about XML in MkII (obsolete)=
 +
 
 +
==XML/ConTeXt in general==
 +
* [[manual:example.pdf|''XML in ConTeXt'']] by Hans Hagen and Ton Otten (2001)
 +
* [https://tug.org/TUGboat/tb24-3/pepping.pdf ''Docbook In ConTEXt, a ConTEXt XML Mapping for DocBook Documents''] by Simon Pepping
 +
* [http://getfo.sourceforge.net/context_xml/index.html ConTeXt–XML] by Paul Tremblay
 +
* [https://www.pragma-ade.nl/general/magazines/mag-0008.pdf ''Dealing with XML''] by Hans Hagen (about XML, XSLT and typesetting without TeX code)
 +
* XML Basics: [[Mixing_XML_and_ConTeXt|Mixing XML and ConTeXt]] using the pre-defined ContML vocabulary
 +
 
 +
==Additions and Details of XML/ConTeXt==
 
* [[manual:xfigures-p.pdf|Figures (XML image databases)]] ([[manual:xfigures-s.pdf|screen]]) by Pragma (2001); see [[Image Database]]
 
* [[Two pass tag processing example]] (float and figure tags)
Line 36: Line 368:
 
* [[manual:xcorresp.pdf|Serial Letters]] (using a XML database) by Pragma (2003)
  
===eXaMpLe framework===  
+
==eXaMpLe framework==  
 
(batch processing)
 
* [[manual:ex-ample.pdf|Example Interface]] (empty)
Line 42: Line 374:
 
* [[manual:ex-imple.pdf|Eximple Toolkit]] (simple subset of Example)
  
===MathML===
+
==MathML==
 
* [[manual:pre-mml.pdf|MathML Intro presentation]] by Pragma
 
* [[manual:mmlprime.pdf|MathML manual]] by Pragma (2001)
Line 50: Line 382:
 
* [[manual:xphysml-p.pdf|PhysML (MathML extension for physics)]] ([[manual:xphysml-s.pdf|screen]]) by Pragma
  
===XSL/FO===
+
==XSL/FO==
 
* XSL/FO: [[Formatting Objects]]
 
* [[ConTeXt FO and XML]] is a tutorial with a view to presenting ConTeXt from the XSL-FO mindset.
 +
 +
=Notes=
  
 
[[Category:XML]]

Latest revision as of 18:06, 10 June 2024

Introduction

Handling XML in ConTeXt has improved dramatically with the advent of MkIV.

The new Lua–based infrastructure makes typesetting, manipulating, filtering, and reusing XML much easier than before.

Unfortunately, this means that most of the existing documentation is now obsolete.

In general, old MkII code includes the uppercase XML string in its commands (as in \getXMLcode[name]), while new MkIV code uses lowercase xml (as in \xmlflush{#1}).

Before You Start

It might be obvious, but there are two basic requirements to typeset XML sources with ConTeXt:

  1. Familiarity with XML. You don’t have to type XML directly, but ConTeXt isn’t able to compile XML that is not well–formed.[1]
  2. At least some knowledge of ConTeXt commands, since otherwise formatting what you select from the XML source would be impossible.

XML is far more than a source format for typesetting with ConTeXt, and the two are completely independent of each other. It is important to deal with XML first, without seeing it through ConTeXt lenses.

As an alternative to typing XML sources directly, there are some lightweight tagging (or markup) languages, such as AsciiDoc or Markdown.[2] There are tools (Pandoc being just one of them) that generate XML from these lightweight markup formats. In some cases these tools might generate incorrect XML (due to bugs in them). In that case, you will have to find out what is wrong with your XML source.[3]

Knowing ConTeXt is required too, because typesetting XML may be explained as having two parts:

  • Selecting what you want from the XML file(s).
  • Defining how you want your selections in the final PDF document.

It is better to start learning standard ConTeXt first (if required) and then acquire some experience with XML files.

First Example

Sample XML Source

An XML sample borrowed and adapted from the net reads:

<TEI xml:lang="en">
  <teiHeader>
    <!-- stuff omitted here -->

  </teiHeader>
  <text>
    <body>
      <div type="essay">
        <head>An Essay on Summer</head>
        <p>Summer school in <date when="1990">MCMXC</date> was never easy; 
        it went by too quickly and left us wanting more.</p>
        <p>But, as my friend <name type="person">Peter</name> said with his 
        inimitable <foreign xml:lang="fr">je ne sais quoi</foreign>, 
        <said>It never pays to think too hard</said>. Or, as I would rather 
        put it, <quote xml:lang="it">Que sera, sera</quote>.</p>
      </div>
      <div type="essay">
        <head>An Essay on Winter</head>
        <p xml:lang="es">¡Hasta la vista…!</p>
      </div>
    </body>
  </text>
</TEI>

Only XML Required

This previous sample is written using the TEI markup. It is correct XML and valid (TEI) XML.

You might think of XML correctness[4] as the set of orthographic rules common to all European languages. Some of these rules may be:[5]

  • All words are separated using at least a blank space.
  • Single dots mark different sentences.
  • Blank vertical space separates paragraphs (when available).

XML rules describe how the tags inside the characters <…> are to be used. To these rules belong:

  • Markup is defined by the string inside the characters < >.
  • Any blank space separates attributes (<element attribute="value" attribute1="value1">).
  • The name is the only required part for the <…> tag.
  • Elements have an opening tag and a matching closing tag (<…> and </…>); otherwise the opening tag must close itself (<…/>)[6].
  • The name must come first in the tag (before the first space, if any attribute is given).
  • Attributes have their values assigned with the equals sign (blank space before or after the sign is allowed, but customarily omitted).
  • Attributes have their values enclosed in quotes (single or double).

Validity is related to a document type. XML validity is properly the document validity.

A document type (such as XHTML or TEI) defines a limited set of elements (of element names). Each element may contain one or more attributes with different values.

This specification of XML is called the document type definition. You may consider it as the set of grammar rules of each European language.

For example, <whatever> is a correct element name in pure XML, but it is not a valid XHTML or TEI element.

An even more extreme sample of correct XML would read:

<τεχτ>
  <βοδυ>
    <διβ type="essay">
      <ἡαδ>An Essay on Summer</ἡαδ>
      <π>Summer school in <δατη when="1990">MCMXC</δατη> was never easy;
      it went by too quickly and left us wanting more.</π>
      <π>But, as my friend <ναμη type="person">Peter</ναμη> said with his
      inimitable <ξένον xml:lang="fr">je ne sais quoi</ξένον>,
      <ἔφα>It never pays to think too hard</ἔφα>. Or, as I would rather
      put it, <λεγόμενον xml:lang="it">Que sera, sera</λεγόμενον>.</π>
    </διβ>
    <διβ type="essay">
      <ἡαδ>An Essay on Winter</ἡαδ>
      <π xml:lang="es">¡Hasta la vista…!</π>
    </διβ>
  </βοδυ>
</τεχτ>

This is invalid TEI. But ConTeXt only requires correct (or valid, as it describes it) XML sources to compile them.

Sample Environment

A minimal configuration file or environment to typeset the previous sample may read:

\startxmlsetups xml:presets:all
  \xmlsetsetup {#1} {*} {xml:*}
\stopxmlsetups

\xmlregistersetup{xml:presets:all}

\startxmlsetups xml:TEI
  \mainlanguage[\xmlatt{#1}{xml:lang}]
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:body
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:date
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:div
  \startchapter[title=\xmltext{#1}{head}]
    \xmlflush{#1}
  \stopchapter
\stopxmlsetups

\startxmlsetups xml:foreign
  \bgroup\language[\xmlatt{#1}{xml:lang}]\em\xmlflush{#1}\egroup
\stopxmlsetups

\startxmlsetups xml:name
  \bgroup\sc\xmlflush{#1}\egroup
\stopxmlsetups

\startxmlsetups xml:p
  \startparagraph
    \xmlflush{#1}
  \stopparagraph
\stopxmlsetups

\startxmlsetups xml:p:date
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:quote
  \bgroup\language[\xmlatt{#1}{xml:lang}]\quotation{\xmlflush{#1}}\egroup
\stopxmlsetups

\startxmlsetups xml:said
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:teiHeader
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:text
  \xmlflush{#1}
\stopxmlsetups

A proper explanation follows in XML Typesetting.

How Does It Work?

The XML source may be saved as source.xml and the environment or configuration file could be saved as environment.tex.[7]

context --environment=environment.tex source.xml

This invocation will generate an output file named source.pdf.

XML Typesetting

Formatting XML sources with ConTeXt (or properly typesetting them) requires:

  • Selecting which parts you want to be typeset. At least, these selections will cover elements by their name.
  • Assigning these parts to single configuration commands (otherwise all will be displayed the same).

In practice, the ConTeXt configuration for XML (or environment file) contains:

  1. A set of XML (node) selections mapped or assigned to ConTeXt setups (or configurations).
  2. The registration of this mapping (or assignation set).
  3. The configuration of each setup.

A basic skeleton showing the three tasks would read:

\startxmlsetups xml:whatever
  \xmlsetsetup {#1} {*} {xml:*}
\stopxmlsetups

\xmlregistersetup{xml:whatever}

\startxmlsetups xml:body
  \xmlflush{#1}
\stopxmlsetups
% and so many definitions as XML selections

The two blank lines separate the three parts listed above.

Mapping Selections

The first thing to define is a list of selections from the XML source linked to individual ConTeXt configurations.

This minimal sample contains it:

\startxmlsetups xml:whatever
  \xmlsetsetup {#1} {*} {xml:*}
\stopxmlsetups
  1. The first line \startxmlsetups creates the list (named xml:whatever).
    1. The same identifier will be required to register the list.
    2. It is customary to use xml: as namespace, but any character string (such as οὑδέν:) would do.
    3. Both parts of the name are free, but the identifier should match completely in the registration.
  2. The third line \stopxmlsetups closes the \startxmlsetups (as customary in ConTeXt).
  3. The second line \xmlsetsetup maps individual XML selections to ConTeXt setups.
    1. In \xmlsetsetup, the second pair of braces defines the individual XML selection, the third pair of braces defines the ConTeXt setup.
    2. The content of the first pair of braces (\xmlsetsetup{#1}) is required in all cases.

XML Paths (or LPaths)

You define what you want from the XML sources using XML Paths, known as XPaths. Since ConTeXt accesses these paths using Lua, they are LPaths.

We are handling the contents of the second pair of braces from the command:

\xmlsetsetup{#1}{*}{xml:*}

The most basic path is the one used in the sample {*}​, which stands for any XML element.

Other path types may be:

  • {element[@attribute]}, selects <element attribute="…"> (<element> with attribute set, regardless of its value).
  • {element[@attribute='value']} selects <element attribute="value">, but not <element attribute="value1"> (or even <element attribute="value another-value">).
  • {container/element} selects all <element> children (or direct descendants) of <container>.

There are a bunch of other possibilities and a separate page on LPaths would make more sense.

Defining ConTeXt Setups

The third and last pair of braces from \xmlsetsetup{#1}{*}{xml:*} defines the matching setup for the given element.

If you use the wildcard (*), the setup name takes the element name from the path (when a path is selected).

It is up to you which namespace you use to name ConTeXt setups,[8] but they must match the individual formatting command.

A way of getting rid of some content (which otherwise would be selected) is to map its path to a non–existing setup.[9]

Registering Maps

After defining the list of XML setups (XML paths matched with ConTeXt setups), it must be registered. The registration command reads:

\xmlregistersetup{xml:whatever}

The only requirement is that the identifier (xml:whatever in the sample) is exactly the same as the one defined in \startxmlsetups.

Formatting

Last (but not least, as they say) comes the formatting of the XML selections. Without this step, the selections will be lost in the transition to the output document.

As already explained in Defining ConTeXt Setups, these names (contained in the last pair of braces of \xmlsetsetup) should match each individual setup configuration.

For a setup named in the selection mapping {xml:body}, its configuration may read:

\startxmlsetups xml:body
  \xmlflush{#1}
\stopxmlsetups

Flushing the contents of the element (the node) is the most basic operation.

Without it, the child elements of the node would not reach the output.

Flushing only adds the content of the element; formatting it requires standard ConTeXt commands.

Compare the previous setup to these other ones:

\startxmlsetups xml:p
  \startparagraph
    \xmlflush{#1}
  \stopparagraph
\stopxmlsetups

\startxmlsetups xml:name
  \bgroup\sc\xmlflush{#1}\egroup
\stopxmlsetups

The xml:p setup adds the required commands so that <p> elements are typeset as paragraphs.

For xml:name, small caps are added. \bgroup…\egroup is similar to enclosing its contents in braces (but more explicit and readable).

Specific Commands

As mentioned, \xmlflush{#1} flushes the current selection (or node).

This is the most basic operation, but there are other commands as well.

\xmltext adds the text from an element.

Documents about XML in MkIV

General Information

Processing XML with lua

XHTML in MKIV

Documents about XML in MkII (obsolete)

XML/ConTeXt in general

Additions and Details of XML/ConTeXt

eXaMpLe framework

(batch processing)

MathML

XSL/FO

Notes

  1. If this is all Greek to you, consider it as incorrect XML.
  2. For a detailed list, see a feature comparison list in Wikipedia.
  3. ConTeXt will complain with a message in the PDF document starting with “invalid xml file”.
  4. I’m aware that the technical term is well–formedness, but I could not resist looking for a more expressive replacement. Correctness seems to be a suitable candidate.
  5. This is no more than a fanciful example, in no way an exhaustive description (or list).
  6. With or without space before the slash.
  7. Of course, file names will differ in real documents. Although it is not mandatory (as far as I can recall), it is a good idea to keep a different file extension for each file format: .xml for XML files and .tex for ConTeXt files.
  8. The part of the identifier with the form xml:, which may contain any string of letters (no digits).
  9. This is exactly what happens with the <head> element in the sample. There is no defined
    \startxmlsetups xml:head
      \xmlflush{#1}
    \stopxmlsetups
    It would be redundant (appearing twice in the output document), since it is already included with xml:div with \xmltext{#1}{head}.